Which machine learning technique do you use?

Machine learning offers plenty of techniques. In practice, however, some are used in day-to-day work far more often than others, while some are rarely used or rarely result in actionable analytics.

The details below reflect my own understanding. Please feel free to comment and let me know if you use other techniques to deliver actionable analytics.

Supervised Machine Learning – classification type – is the most used technique (Roughly 60% of all use cases)

Most situations (who will default, which transaction is fraudulent, who will respond, who will attrite …) require classification. The following algorithms in this category are quite popular.

  1. XGBoost classifier – when consistent power of separation matters most and the situation does not demand that only a few required columns be brought in for scoring
  2. Random Forest Classifier – when consistent power of separation matters most and the situation does not demand that only a few required columns be brought in for scoring
  3. Logistic regression – when parsimony of the model is of prime importance
  4. Classification tree – when understanding of the model is most important
  5. Artificial Neural Network / Deep Learning – when power of separation matters most
  6. Support Vector Machine
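
To make the classification idea concrete, here is a minimal sketch of a one-variable logistic regression fitted by plain gradient descent. It is not how any particular library implements it. The data, the feature (a hypothetical credit utilisation ratio), the learning rate and the number of iterations are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log loss with respect to w and b
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_proba(x, w, b):
    return sigmoid(w * x + b)

# Hypothetical data: credit utilisation ratio -> defaulted (1) or not (0)
utilisation = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
defaulted = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(utilisation, defaulted)
high_risk = predict_proba(0.9, w, b)   # probability of default at 90% utilisation
low_risk = predict_proba(0.1, w, b)    # probability of default at 10% utilisation
```

The model has only two numbers to explain (w and b), which is the parsimony that makes logistic regression attractive when the model must be easy to justify.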


Supervised Machine Learning – regression type – is the second most used technique (Roughly 10% of all use cases)

Questions like how much the income is, how much the balance will be, what the pocket size is, or what the quality score will be require regression. The following algorithms in this category are quite popular.

  1. Linear Regression
  2. XGBoost Regressor
  3. Random Forest Regressor
  4. Regression Tree – when understanding of the model is most important
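
As a small illustration of the simplest entry in the list, the sketch below fits a one-predictor linear regression with the closed-form least-squares formulas. The data (years of experience and income in thousands) is made up for the example and lies exactly on a line, so the fitted values are easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: income = 2 * years + 30 (in thousands)
years = [1, 2, 3, 4, 5]
income = [32, 34, 36, 38, 40]

slope, intercept = fit_line(years, income)
predicted = slope * 6 + intercept   # predicted income at 6 years of experience
```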


Unsupervised Machine Learning – cluster analysis (Roughly 7% of all use cases)

This applies when you have no objective function (no target variable) but want to learn more about a group of individuals and see whether they form naturally occurring groups that are homogeneous within and heterogeneous across, so that you can develop customized offers for each group.

Though many algorithms are of conceptual interest and clusters can be developed in many ways, it is almost invariably K-means clustering that is used in industrial situations.

Occasionally, one needs a variable selection algorithm that works without a dependent variable, and then develops clusters based on the selected variables. Varclus is a common technique for this: from each variable cluster, select the variable with the lowest value of (1 - R-square within) / (1 - R-square outside), that is, the variable most correlated with its own cluster and least correlated with the nearest other cluster.

Clustering usually requires standardizing the variables, then developing the clusters, and then describing each cluster by its means in terms of the original variables.
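
A minimal sketch of those three steps (standardize, cluster, then profile in original units), written in plain Python rather than with any particular library. The customer data, the choice of two clusters and the starting centers are illustrative assumptions.

```python
import math

def standardize(rows):
    """Z-score each column; also return the column means and standard deviations."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(cols, means)]
    scaled = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
    return scaled, means, stds

def kmeans(points, init_centers, iters=20):
    """Plain Lloyd's algorithm; returns a cluster label for every point."""
    centers = [list(c) for c in init_centers]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center
        labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Update step: each center moves to the mean of its members
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical customers: (age in years, annual spend)
customers = [[25, 500], [27, 650], [24, 550], [52, 4800], [55, 5200], [50, 5000]]

scaled, means, stds = standardize(customers)
labels = kmeans(scaled, init_centers=[scaled[0], scaled[3]])

# Profile each cluster using means of the ORIGINAL (unscaled) variables
profiles = {}
for k in set(labels):
    members = [c for c, lab in zip(customers, labels) if lab == k]
    profiles[k] = [sum(col) / len(members) for col in zip(*members)]
```

Without the standardization step, annual spend would dominate the distance calculation simply because its numbers are larger than ages.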


Time series analysis (Roughly 8% of all use cases)

When you have a time period field and a value (such as sales volume over time or purchase quantity over time), time series analysis is used. It is most widely used for forecasting, planning, etc.

It works best with a field that is not much affected by inflation and similar factors (for example, quantity sold is preferred to revenue, as revenue can vary with competition, discounts, etc.).

The most popular techniques are:

  1. Decomposition techniques – which break the data into trend (T), cyclical (C), seasonal (S) and irregular (I) components.
  2. ARIMA – where autoregression (AR) on past values, integration (I, which creates a new differenced series Z(t) = Y(t) - Y(t-1)) and moving-average (MA) terms based on past forecast errors are used to forecast the value
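
Two of the building blocks above can be sketched in a few lines: the differencing behind the "I" in ARIMA, and a trailing moving average of the kind used to smooth a series when estimating its trend in decomposition. The smoothing average here is not the MA term of ARIMA, which works on forecast errors. The monthly quantities are made up for illustration.

```python
def difference(series):
    """First difference: Z(t) = Y(t) - Y(t-1)."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]

def moving_average(series, window):
    """Trailing moving average over `window` periods (a simple trend estimate)."""
    return [sum(series[t - window + 1 : t + 1]) / window
            for t in range(window - 1, len(series))]

# Hypothetical monthly quantity sold, growing a little faster each month
quantity = [100, 104, 109, 115, 122, 130]

z1 = difference(quantity)        # first difference:  [4, 5, 6, 7, 8]
z2 = difference(z1)              # second difference: [1, 1, 1, 1]
trend = moving_average(quantity, 3)
```

The second difference is constant, which is the kind of stable series the AR and MA parts of the model are then fitted to.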


Association Rules Mining analysis (Roughly 3% of all use cases)

In this situation, you look for rules of the form "A implies B" where support (which measures frequency), confidence (which measures the conditional probability of B given A) and lift (confidence relative to the overall frequency of B) matter.

Association rules are used to design the layout of a superstore, for example. At times they are also used to validate the layout of websites, and to recommend products to users if you are a mega store like Amazon / Flipkart.
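
As a rough sketch, support, confidence and lift for a single rule can be computed directly from a list of baskets. The baskets and the rule (bread implies butter) are hypothetical.

```python
def rule_metrics(baskets, a, b):
    """Support, confidence and lift for the rule A -> B."""
    n = len(baskets)
    count_a = sum(1 for basket in baskets if a in basket)
    count_b = sum(1 for basket in baskets if b in basket)
    count_ab = sum(1 for basket in baskets if a in basket and b in basket)
    support = count_ab / n              # P(A and B)
    confidence = count_ab / count_a     # P(B | A)
    lift = confidence / (count_b / n)   # P(B | A) / P(B)
    return support, confidence, lift

# Hypothetical shopping baskets
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"butter"},
]

support, confidence, lift = rule_metrics(baskets, "bread", "butter")
```

A lift above 1 means the two items appear together more often than they would if they were bought independently.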


Collaborative filtering (Roughly 3% of all use cases)

This is used frequently in the entertainment industry to recommend movies to users.

It is all about finding the users who are most similar to a given user, say A. Usually, people prefer centered cosine similarity for this, which centers each user's ratings at a mean of 0 and then measures the cosine similarity.

After that, it is all about finding new movies that similar users have viewed and liked but that user A has not seen.
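
The two steps above can be sketched as follows. The users, movies, ratings and the "liked" threshold of 4 are all assumptions made for the example, and unrated movies are treated as 0 after centering, which is one common convention rather than the only one.

```python
import math

def centered(ratings):
    """Subtract the user's mean rating so the ratings average to 0."""
    mean = sum(ratings.values()) / len(ratings)
    return {movie: r - mean for movie, r in ratings.items()}

def centered_cosine(u, v):
    """Cosine similarity of mean-centered ratings; unrated movies count as 0."""
    cu, cv = centered(u), centered(v)
    dot = sum(cu[m] * cv[m] for m in cu if m in cv)
    norm_u = math.sqrt(sum(x * x for x in cu.values()))
    norm_v = math.sqrt(sum(x * x for x in cv.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical ratings on a 1-5 scale
ratings = {
    "A": {"Movie1": 5, "Movie2": 4, "Movie3": 1},
    "B": {"Movie1": 5, "Movie2": 5, "Movie3": 1, "Movie4": 5},
    "C": {"Movie1": 1, "Movie2": 2, "Movie3": 5, "Movie5": 5},
}

target = "A"
others = [u for u in ratings if u != target]
most_similar = max(others, key=lambda u: centered_cosine(ratings[target], ratings[u]))

# Movies the most similar user liked (rated 4 or more) that A has not seen
recommendations = [m for m, r in ratings[most_similar].items()
                   if r >= 4 and m not in ratings[target]]
```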


Text Mining (Roughly 4% of all use cases)

Text mining is all about finding the words that are most common. It requires several pre-processing steps:

  • Removal of stop words – words like the, a, an etc.
  • Stemming – reduce words to their root, e.g. "running" to "run", "postponed" to "postpone"
  • Case correction – all words are treated in the same case
  • Punctuation removal – so that exclamation marks, question marks, apostrophes etc. have no impact
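
The four steps can be sketched in plain Python as below. The stop-word list is a tiny illustrative one, and `crude_stem` is a hypothetical stand-in that only strips a few suffixes; a real project would use a proper stemmer from a text-processing library.

```python
import string

STOP_WORDS = {"the", "a", "an", "is", "was", "and", "to"}   # tiny illustrative list

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant left behind, e.g. "runn" -> "run"
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

def preprocess(text):
    text = text.lower()                                                # case correction
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    words = [w for w in text.split() if w not in STOP_WORDS]           # stop-word removal
    return [crude_stem(w) for w in words]                              # stemming

tokens = preprocess("The meeting was postponed! Is the team still running?")
# tokens is ["meet", "postpon", "team", "still", "run"]
```

Note that this crude rule turns "postponed" into "postpon" rather than "postpone". Stems do not have to be dictionary words, as long as related word forms end up mapped to the same token.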

Subsequent processes start only after the above four steps.

  • Supervised Text Mining – When the data has different classes, it finds the words associated with each class. Usually a Naive Bayes classifier is used to find the words and their associated odds for a given class
  • Unsupervised Text Mining – When there are no classes, it is all about finding the words that are most frequent
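
For the supervised case, the sketch below computes a smoothed per-word ratio P(word | class) / P(word | other classes), which is the kind of per-word evidence a multinomial Naive Bayes classifier combines. It is only that building block, not a full classifier. The documents, class names and the add-one smoothing value are assumptions for illustration.

```python
from collections import Counter

def word_odds(docs_by_class, target, smoothing=1.0):
    """Smoothed ratio P(word | target class) / P(word | all other classes)."""
    in_counts = Counter(w for doc in docs_by_class[target] for w in doc)
    out_counts = Counter(
        w
        for cls, docs in docs_by_class.items()
        if cls != target
        for doc in docs
        for w in doc
    )
    vocab = set(in_counts) | set(out_counts)
    # Add-one (Laplace) smoothing so unseen words do not give zero probabilities
    in_total = sum(in_counts.values()) + smoothing * len(vocab)
    out_total = sum(out_counts.values()) + smoothing * len(vocab)
    return {
        w: ((in_counts[w] + smoothing) / in_total) / ((out_counts[w] + smoothing) / out_total)
        for w in vocab
    }

# Hypothetical documents, already pre-processed into word lists
docs = {
    "complaint": [["late", "delivery", "refund"], ["refund", "broken", "item"]],
    "praise": [["great", "delivery", "fast"], ["great", "item", "love"]],
}

odds = word_odds(docs, "complaint")
top_word = max(odds, key=odds.get)   # word most strongly associated with complaints
```

Words that appear equally often in both classes, such as "delivery" here, get a ratio of 1 and carry no information about the class.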


Do you use any other technique more frequently? Please feel free to comment and let others know.
