I have seen many theoretical explanations of the difference between big data, machine learning, data science, etc. What follows is based on what I see in practice.
Let me elaborate on why 18 months of effort is required to become a data scientist:
1 month for SAS / R : to be able to
o Read and write data
o Filter / sort / merge / append data
o Derive / format new fields
o Apply a set of commands together for various use cases
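To give a feel for what these data-handling steps look like when chained together, here is a minimal sketch. It is written in Python with only the standard library, purely as a stand-in; the same steps would be done with DATA steps / PROC SQL in SAS or with data frames in R. The sample data, field names and the `high_value` flag are all made up for illustration.

```python
import csv
import io

# Hypothetical sample data standing in for two input files.
customers_csv = "id,name,age\n1,Asha,34\n2,Ben,27\n3,Chen,45\n"
orders_csv = "id,amount\n1,120.5\n3,80.0\n"

# Read: parse CSV text into lists of dicts (all values arrive as strings).
customers = list(csv.DictReader(io.StringIO(customers_csv)))
orders = list(csv.DictReader(io.StringIO(orders_csv)))

# Append: add a new row to the existing table.
customers.append({"id": "4", "name": "Dee", "age": "31"})

# Filter and sort: keep customers aged 30 or more, ordered by age.
adults = sorted(
    (c for c in customers if int(c["age"]) >= 30),
    key=lambda c: int(c["age"]),
)

# Merge: left join on id; customers without an order get amount 0.0.
amount_by_id = {o["id"]: float(o["amount"]) for o in orders}
merged = [{**c, "amount": amount_by_id.get(c["id"], 0.0)} for c in adults]

# Derive: add a new field computed from an existing one.
for row in merged:
    row["high_value"] = row["amount"] > 100

# Write: serialise the result back to CSV text.
out = io.StringIO()
writer = csv.DictWriter(
    out, fieldnames=["id", "name", "age", "amount", "high_value"]
)
writer.writeheader()
writer.writerows(merged)
result_csv = out.getvalue()
```

The point is less the syntax than the habit: read, reshape, derive, write, in one repeatable script.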
6 months for basic statistics : to become comfortable with
o Probability, central tendency & dispersion around the centre
o Normal distribution & Central Limit Theorem
o Sampling distributions of the mean and proportion
o Hypothesis testing (p-value, one- / two-tailed tests, Type I / II errors, power of a test)
o Linear regression (coefficient of determination, regression coefficient)
o ANOVA (one way / two way)
o Categorical data analysis (chi square tests of contingency tables)
o Non-parametric tests (runs test / Spearman rank correlation)
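As one small illustration of the hypothesis-testing item above, here is a sketch of a two-tailed one-sample z-test in Python (standard library only). It assumes the population standard deviation is known, which is rarely true in practice; the sample values, `mu0`, `sigma` and the 0.05 cut-off are all made up for the example.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_test_two_tailed(sample, mu0, sigma):
    """Return (z, p) for H0: population mean == mu0, with sigma assumed known."""
    n = len(sample)
    # Standard error of the mean is sigma / sqrt(n).
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Two-tailed p-value: probability of a result at least this extreme
    # in either direction under the standard normal distribution.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Made-up measurements; H0 says the true mean is 50.
sample = [52, 48, 55, 51, 49, 53, 54, 50, 56, 52]
z, p = z_test_two_tailed(sample, mu0=50, sigma=3)

# Reject H0 at the 5% significance level if p is below 0.05.
reject = p < 0.05
```

Being comfortable with statistics means being able to say what `z`, `p` and `reject` mean here, and what a Type I error would be in this setting, not just being able to run the function.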
6 months on machine learning
o Decision tree techniques
- CHAID (chi square automatic interaction detector)
- CART – classification and regression trees: classification tree (for a categorical outcome) and regression tree (for a continuous outcome variable)
- ID3 – another decision tree algorithm, which uses entropy (information gain) to choose splits
- C4.5
- Random Forest method
- Model design
- Variable selection
- Dealing with multicollinearity
- Strength of the model (KS / Gini etc.)
- Model validation
- Hierarchical clustering
- Non-hierarchical (k-means) clustering
- Dealing with practical challenges of clustering
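To make the entropy idea mentioned under ID3 a bit more concrete, here is a minimal sketch of how an entropy-based split criterion (information gain) can be computed. It is Python with only the standard library, not a full ID3 implementation; the toy rows and the field names `outlook`, `windy` and `play` are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, feature, target):
    """Entropy of the target minus the weighted entropy after splitting on feature."""
    parent = entropy([r[target] for r in rows])
    # Group the target labels by each distinct value of the feature.
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted


# Toy data: "outlook" separates the classes perfectly, "windy" not at all.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rain", "windy": "no", "play": "no"},
    {"outlook": "rain", "windy": "yes", "play": "no"},
]

gains = {f: information_gain(rows, f, "play") for f in ("outlook", "windy")}
# An entropy-based tree would split first on the feature with the highest gain.
best_feature = max(gains, key=gains.get)
```

Repeating this choice on each resulting subset is, roughly, how an entropy-based tree gets grown.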