Big Data vs. Data Science Practically

I have seen many theoretical explanations of the difference between big data, machine learning, data science and so on. What follows is based on what I see in practice.

Let me elaborate on why roughly 18 months of effort is required to become a data scientist:

1 month for SAS / R: to be able to

o  Read / write data

o  Filter / sort / merge / append data

o  Derive / format new fields

o  Combine a set of commands for various use cases
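
As a rough illustration of these data-handling skills, here is a minimal sketch. It uses Python's standard library rather than SAS or R, purely as an assumption for the example; the same steps carry over to either tool. The data and the field names (`cust_id`, `name`, `city`, `amount`, `total`) are invented.

```python
import csv
import io

# Hypothetical in-memory CSV data; real work would read from files.
customers_csv = "cust_id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n3,Meena,Pune\n"
orders_csv = "cust_id,amount\n1,250\n1,100\n3,400\n"

# Read: parse each CSV into a list of dict rows.
customers = list(csv.DictReader(io.StringIO(customers_csv)))
orders = list(csv.DictReader(io.StringIO(orders_csv)))

# Derive: total order amount per customer (a new field).
totals = {}
for row in orders:
    totals[row["cust_id"]] = totals.get(row["cust_id"], 0) + int(row["amount"])

# Merge: left join customers with totals; customers without orders get 0.
merged = [dict(c, total=totals.get(c["cust_id"], 0)) for c in customers]

# Filter and sort: keep Pune customers, largest total first.
pune = sorted(
    (r for r in merged if r["city"] == "Pune"),
    key=lambda r: r["total"],
    reverse=True,
)

# Write: serialise the result back to CSV text.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["cust_id", "name", "city", "total"])
writer.writeheader()
writer.writerows(pune)
result_csv = out.getvalue()
print(result_csv)
```

The point is less the syntax than the habit of chaining read, derive, merge, filter, sort and write into one repeatable script.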

6 months for basic statistics: to become comfortable with

o  Probability, central tendency & dispersion around the centre

o  Normal distribution & Central Limit Theorem

o  Sampling distributions of the mean and proportion

o  Hypothesis testing (p-value, one- / two-tailed tests, type I / II errors, power of a test)

o  Linear regression (coefficient of determination, regression coefficient)

o  ANOVA (one way / two way)

o  Categorical data analysis (chi-square tests of contingency tables)

o  Non-parametric tests (runs test / Spearman rank correlation)
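
Two of the ideas above, the Central Limit Theorem and a two-tailed test, can be sketched in a few lines. This is only an illustration under stated assumptions: the population is exponential with mean 1, the z-test assumes a known population standard deviation, and all the numbers (`mu0`, `sigma`, the sample mean) are invented.

```python
import math
import random
import statistics
from statistics import NormalDist

random.seed(0)  # fixed seed so the run is repeatable

# Central Limit Theorem: means of samples drawn from a skewed population
# (exponential with mean 1 and sd 1) cluster around 1 with spread sigma / sqrt(n).
n = 50            # size of each sample
n_samples = 2000  # number of samples drawn
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(n_samples)
]
observed_se = statistics.stdev(sample_means)
theoretical_se = 1.0 / math.sqrt(n)  # sigma / sqrt(n)
print(f"standard error: observed {observed_se:.3f}, theory {theoretical_se:.3f}")


def two_tailed_z_test(sample_mean, mu0, sigma, n):
    """Return (z, p) for H0: population mean == mu0, with known sigma."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # area in both tails
    return z, p


z, p = two_tailed_z_test(sample_mean=103.0, mu0=100.0, sigma=15.0, n=100)
print(f"z = {z:.2f}, p = {p:.4f}")  # z = 2.00, p = 0.0455
```

At a 5% significance level this p-value would lead to rejecting the null hypothesis, though only just. That is exactly the kind of borderline case where understanding type I error matters.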

6 months on machine learning

o  Decision tree techniques

  1. CHAID (chi-square automatic interaction detection)
  2. CART (classification and regression trees): classification trees for a categorical outcome, regression trees for a continuous outcome
  3. ID3: another decision tree algorithm, which uses entropy (information gain) to choose splits
  4. C4.5 (the successor to ID3)
  5. Random Forest method

o  Logistic regression

  1. Model design
  2. Variable selection
  3. Dealing with multicollinearity
  4. Strength of the model (KS / Gini etc.)
  5. Model validation

o  Cluster analysis

  1. Hierarchical clustering
  2. Non-hierarchical (k-means) clustering
  3. Dealing with practical challenges of clustering
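
To make the k-means idea concrete, here is a minimal sketch of the usual assign-then-update loop (Lloyd's algorithm) on 2-D points. The data are invented, and the initial centroids are picked by hand so the run is deterministic. That is an assumption of this example: practical implementations normally use random or k-means++ initialisation.

```python
import math


def k_means(points, initial_centroids, n_iter=10):
    """Plain Lloyd's algorithm on 2-D points; returns (centroids, labels)."""
    centroids = list(initial_centroids)
    k = len(centroids)
    labels = []
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = k_means(points, initial_centroids=[points[0], points[-1]])
print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)
```

Sensitivity to the starting centroids and the need to choose k up front are two of the practical challenges you run into on real data.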

6 months for knowing the applicability of different techniques, valid model design, proper model validation etc.
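
As one small example of validation, the KS statistic mentioned under logistic regression is typically checked on a holdout sample the model has not seen. The sketch below assumes the model has already produced scores; the scores and outcomes are invented, and `ks_statistic` is a hypothetical helper, not a function from any library.

```python
def ks_statistic(scores, labels):
    """Largest gap between cumulative event and non-event rates by score.

    `labels` holds 1 for an event and 0 for a non-event.
    """
    n_events = sum(labels)
    n_non_events = len(labels) - n_events
    pairs = sorted(zip(scores, labels), reverse=True)  # highest score first
    cum_events = cum_non_events = 0
    ks = 0.0
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            cum_events += 1
        else:
            cum_non_events += 1
        # Compare the two cumulative rates only once all rows sharing
        # this score have been counted, so ties are handled together.
        last_of_tie = i + 1 == len(pairs) or pairs[i + 1][0] != score
        if last_of_tie:
            gap = abs(cum_events / n_events - cum_non_events / n_non_events)
            ks = max(ks, gap)
    return ks


# Hypothetical holdout sample: model scores and observed outcomes.
holdout_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
holdout_labels = [1, 1, 0, 1, 0, 0, 1, 0]
ks = ks_statistic(holdout_scores, holdout_labels)
print(f"KS = {ks:.2f}")  # KS = 0.50
```

A KS that drops sharply between the development sample and the holdout sample is a common sign of overfitting.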
