1/ Data science has four dimensions: execution skills, basic statistics, machine learning and business knowledge. If you know all four areas, you will be able to make a big impact in a business scenario using data science.
2/ You will also enjoy your data science work more because you will understand the end-to-end impact of the data. Otherwise, at times, you will find your work monotonous / meaningless. Execution skills require SAS / R / Python / SQL etc.
3/ These execution skills cover two kinds of jobs:
- Getting data from various systems / merging datasets / creating aggregate fields / creating derived variables / cleaning data etc.
- Applying machine learning / statistical procedures to the data
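The first kind of job can be sketched in a few lines. This is a minimal illustration in plain Python, assuming two hypothetical extracts (`customers`, `transactions`) and an assumed default age for imputation; in practice you would do the same in SQL, SAS, R or pandas.

```python
from collections import defaultdict

# Hypothetical toy data: a customer master and a transaction extract.
customers = [
    {"cust_id": 1, "age": 34, "credit_limit": 5000.0},
    {"cust_id": 2, "age": None, "credit_limit": 2000.0},  # missing age
]
transactions = [
    {"cust_id": 1, "amount": 120.0},
    {"cust_id": 1, "amount": 80.0},
    {"cust_id": 2, "amount": 500.0},
]

def build_analysis_table(customers, transactions, default_age=40):
    """Merge, aggregate, derive and clean into one row per customer."""
    # Aggregate fields: total spend and transaction count per customer.
    total = defaultdict(float)
    count = defaultdict(int)
    for t in transactions:
        total[t["cust_id"]] += t["amount"]
        count[t["cust_id"]] += 1

    table = []
    for c in customers:
        cid = c["cust_id"]
        row = dict(c)  # merge: start from the customer record
        row["total_spend"] = total[cid]
        row["txn_count"] = count[cid]
        # Derived variable: share of the credit limit that was spent.
        row["utilisation"] = total[cid] / c["credit_limit"]
        # Cleaning: impute a missing age with an assumed default value.
        if row["age"] is None:
            row["age"] = default_age
        table.append(row)
    return table

analysis_table = build_analysis_table(customers, transactions)
```

The output is one row per customer, which is the shape most modelling procedures expect.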
4/ Basic statistics forms the basis of all analysis. Sum / count / average / standard deviation are still among the most used methods today. You must know the normal distribution / central limit theorem / hypothesis testing to enjoy analysing the data.
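To see the central limit theorem and a hypothesis test at work, here is a small sketch using only Python's standard library. The skewed "spend" population, the sample sizes and the helper names are assumptions made up for illustration.

```python
import math
import random
import statistics

def sample_means(population, sample_size, n_samples, seed=0):
    """Draw repeated samples (with replacement) and return the mean of each."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]

def one_sample_z_test(sample, hypothesised_mean):
    """Two-sided z-test using the sample standard deviation.

    This is a large-sample approximation; for small samples a t-test
    would be the usual choice.
    """
    n = len(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    z = (statistics.fmean(sample) - hypothesised_mean) / std_error
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical skewed population: exponential spend amounts, mean about 100.
rng = random.Random(42)
population = [rng.expovariate(1 / 100) for _ in range(10_000)]

# Central limit theorem: although the population is skewed, the sample
# means cluster around the population mean, with a spread close to
# sigma / sqrt(sample_size).
means = sample_means(population, sample_size=50, n_samples=2_000)
spread_of_means = statistics.stdev(means)
expected_spread = statistics.pstdev(population) / math.sqrt(50)
```

Comparing `spread_of_means` with `expected_spread` is a quick way to convince yourself that the theorem holds on your own data.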
5/ Machine learning is essentially applied statistics. There are three types of machine learning (supervised / unsupervised / reinforcement). However, in the business scenario, supervised machine learning is the most used technique.
6/ Supervised machine learning refers to a scenario where there is a known dependent variable. The whole purpose is to find independent variables that predict the dependent variable. For example, finding patterns to detect write-off / loss-giving customers.
7/ Here the write-off event is the dependent variable. The customer's demographic profile / payment characteristics / delinquency patterns / purchase patterns etc., mostly from six to 18 months prior to the write-off event, make up the independent variables.
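A rough sketch of that data design: take the event month, look back over a window of 18 to 6 months before it, and summarise behaviour inside the window. Months are integer indices here, and the field names (`payment`, `days_past_due`) and the helper functions are hypothetical. Customers without a write-off would need a comparable reference month, which this sketch does not cover.

```python
def observation_window(event_month, start_lag=18, end_lag=6):
    """Months from 18 to 6 months before the event, as integer month indices."""
    return range(event_month - start_lag, event_month - end_lag + 1)

def build_features(monthly_history, event_month):
    """Summarise payment behaviour inside the observation window."""
    window = observation_window(event_month)
    rows = [monthly_history[m] for m in window if m in monthly_history]
    return {
        "months_observed": len(rows),
        "avg_payment": sum(r["payment"] for r in rows) / len(rows) if rows else 0.0,
        "max_days_past_due": max((r["days_past_due"] for r in rows), default=0),
    }

# Hypothetical customer: pays 100 every month, 30 days past due from month 10 on.
monthly_history = {
    m: {"payment": 100.0, "days_past_due": 0 if m < 10 else 30}
    for m in range(24)
}
features = build_features(monthly_history, event_month=24)  # write-off in month 24
```

Keeping the last few months before the event out of the window means the model only sees information that would have been available early enough to act on.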
8/ Selecting some 5-15 independent variables to predict the dependent variable is a dimensionality reduction exercise. It requires studying the patterns in the data and applying techniques like information value / Fisher's linear discriminant ratio / principal component analysis / stepwise regression / chi-square test of independence etc.
9/ Other supervised machine learning examples: who will roll to the next stage of delinquency / who will accept the offer / who will revolve next month / who will attrite from your base / who will not survive next month / who will claim insurance etc.
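Of the variable-selection techniques listed above, information value is the easiest to show. A minimal sketch, assuming the candidate variable has already been binned and that you have good / bad counts per bin (the counts below are invented):

```python
import math

def information_value(bins):
    """Information value of one binned variable.

    bins: list of (good_count, bad_count) tuples, one per bin.
    Assumption: every bin holds at least one good and one bad; otherwise
    the weight of evidence is undefined and bins should be merged first.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    iv = 0.0
    for good, bad in bins:
        pct_good = good / total_good  # share of all goods in this bin
        pct_bad = bad / total_bad     # share of all bads in this bin
        weight_of_evidence = math.log(pct_good / pct_bad)
        iv += (pct_good - pct_bad) * weight_of_evidence
    return iv

# Hypothetical delinquency variable split into three bins.
delinquency_bins = [(400, 10), (300, 30), (300, 60)]
iv_delinquency = information_value(delinquency_bins)

# A variable whose bins all have the same good / bad mix carries no information.
iv_flat = information_value([(500, 50), (500, 50)])
```

You would compute this for every candidate variable and keep those with the highest values, subject to business sense.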
10/ Unsupervised learning is the scenario where you have no dependent variable. It tries to create homogeneous groups from the data, in which each element is very similar to the other elements of the same group. However, we want each group to be very dissimilar from the other groups.
11/ Cluster analysis makes up the majority of unsupervised machine learning. K-means is the most popular clustering technique. Cluster analysis requires careful selection of independent variables. More often than not, we start with 30 to 40 variables
12/ that make business sense and then apply dimensionality reduction techniques to select the final variables for the process. In general we try to create 8 to 15 segments, which are homogeneous within and heterogeneous across.
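For intuition, here is a bare-bones k-means (Lloyd's algorithm) in plain Python. It is a sketch under assumptions: the toy points, the helper names and the optional `init` argument are made up for illustration, and real work would use a library implementation on standardised variables.

```python
import random

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))

def kmeans(points, k, n_iter=20, seed=0, init=None):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Start from the given centroids, or from k randomly chosen points.
    centroids = list(init) if init is not None else rng.sample(points, k)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster went empty
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Hypothetical customers described by two already-scaled variables.
customers = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (9.0, 9.0), (9.2, 8.8), (8.9, 9.1)]
centroids, labels = kmeans(customers, k=2)
```

The result depends on the starting centroids, so it is common to run the algorithm several times with different seeds and keep the most compact solution.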
13/ The last dimension of #datascience is business knowledge. We must understand the business model. How does the business earn its revenue and how does it incur losses / expenses? How do you design data for a supervised / unsupervised machine learning scenario? How do you make sure that the data is useful for the purpose?
14/ One must understand the focus of one's own vertical. Also, how do different verticals collaborate for the organizational goal (in theory at least; at times each vertical also thinks about its own objective)?
15/ One must learn how different statistical & machine learning techniques are applied to solve business problems. This will ensure that you are able to use models in a better way. Also, you will be able to design better test & control setups to measure the true impact of a model.
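A test & control read-out can be as simple as the difference in response rates plus a significance check. A minimal sketch, assuming random assignment to the two groups; the function name and the campaign numbers are hypothetical.

```python
import math
from statistics import NormalDist

def measure_lift(test_resp, test_n, control_resp, control_n):
    """Incremental response rate of test over control, with a two-proportion z-test."""
    p_test = test_resp / test_n
    p_control = control_resp / control_n
    # Pooled rate under the null hypothesis that both groups respond alike.
    pooled = (test_resp + control_resp) / (test_n + control_n)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / test_n + 1 / control_n))
    z = (p_test - p_control) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"lift": p_test - p_control, "z": z, "p_value": p_value}

# Hypothetical campaign: 10,000 model-selected customers got the offer,
# 1,000 comparable customers were held out as control.
result = measure_lift(test_resp=600, test_n=10_000, control_resp=50, control_n=1_000)
```

In this made-up case the lift is one percentage point, but the control group is too small for the difference to be statistically convincing, which is exactly the kind of thing a good test design should anticipate.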
16/ Model performance measurement is a very important part of getting benefits from the model. When the performance of a model starts degrading, you either refit it (i.e. re-estimate the equation on the same set of independent variables) or redevelop the model afresh.
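One common way to track performance is the KS statistic, the largest gap between the cumulative score distributions of bads and goods. The sketch below swaps in KS as an example measure; the 20% relative-drop trigger is an assumed threshold, not a standard.

```python
def ks_statistic(scores, labels):
    """Largest gap between cumulative bad and good distributions.

    scores: model scores; labels: 1 for bad, 0 for good.
    Assumes both classes are present.
    """
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    cum_bad = cum_good = 0
    ks = 0.0
    pairs = sorted(zip(scores, labels))
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            cum_bad += 1
        else:
            cum_good += 1
        # Only compare the distributions once a run of tied scores is complete.
        if i + 1 == len(pairs) or pairs[i + 1][0] != score:
            ks = max(ks, abs(cum_bad / n_bad - cum_good / n_good))
    return ks

def needs_attention(ks_development, ks_recent, max_relative_drop=0.2):
    """Flag the model when KS has fallen by more than the assumed tolerance."""
    return ks_recent < ks_development * (1 - max_relative_drop)

# Hypothetical scores: perfectly separated at development, mixed on recent data.
ks_dev = ks_statistic([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
ks_now = ks_statistic([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
flag = needs_attention(ks_dev, ks_now)
```

A flag like this only tells you that something changed; whether a refit is enough or a full redevelopment is needed is a separate judgment.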
17/ Model redevelopment is as good as developing a new model. However, careful data design is very important for this. Also, you might need to do reject inference if you have rejected some through-the-door applications based on the existing score.
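Reject inference can be done in several ways. One simple approach, shown here only as an illustration, is fuzzy augmentation: each rejected application enters the development sample twice, once as a bad and once as a good, weighted by an estimated probability of being bad. The `predict_bad_prob` function and the field names are hypothetical stand-ins for a model built on accepted applications.

```python
def fuzzy_augment(rejects, predict_bad_prob):
    """Turn each reject into a weighted bad row and a weighted good row."""
    augmented = []
    for application in rejects:
        p_bad = predict_bad_prob(application)
        augmented.append({**application, "bad": 1, "weight": p_bad})
        augmented.append({**application, "bad": 0, "weight": 1 - p_bad})
    return augmented

# Hypothetical rejects and a placeholder probability estimate.
rejects = [{"app_id": 7, "score": 580}, {"app_id": 8, "score": 610}]

def predict_bad_prob(application):
    # Assumption for illustration only: lower scores imply higher bad probability.
    return 0.4 if application["score"] < 600 else 0.25

inferred_rows = fuzzy_augment(rejects, predict_bad_prob)
```

The weighted rows are then appended to the accepted applications, so the redeveloped model is fitted on the full through-the-door population rather than only on those the old score let in.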