33 Most Common Data Science Procedures

  1. Aggregation:
    • Summarizing data with SQL aggregate functions (SUM, COUNT, AVG, MIN, MAX), typically paired with a GROUP BY clause (see the aggregation sketch after this list)
  2. Sorting:
    • Sorting data by one or multiple columns
  3. Joining/Merging:
    • Combining datasets using keys (e.g., inner join, outer join, left join, right join); joining and appending are sketched together after this list
  4. Appending:
    • Concatenating datasets vertically
  5. Dropping Columns/Rows:
    • Removing unnecessary columns or rows
  6. Viewing Subsets of Data:
    • Viewing a few records (head, tail)
    • Viewing specific columns
  7. Data Type Conversion:
    • Changing the format of fields (e.g., int to float, string to datetime)
  8. String Operations:
    • Concatenation
    • Splitting strings by delimiter (text-to-columns)
    • Substring extraction
    • Finding the last n characters (these string operations are sketched after this list)
  9. Mathematical Operations:
    • Addition, subtraction, multiplication, division of columns
    • Calculating differences between data points
  10. Date and Time Extraction:
    • Extracting year, month, quarter, day, etc., from date values (see the date-extraction sketch after this list)
  11. Finding Extremes:
    • Finding maximum or minimum values across multiple fields for each row
  12. Binning:
    • Creating bins of numeric variables (e.g., using quantiles or fixed intervals); binning and flooring/capping are sketched together after this list
  13. Custom Grouping:
    • Creating customized groupings from character variables
  14. Flooring and Capping:
    • Setting minimum (flooring) and maximum (capping) limits for values
  15. Missing Value Treatment:
    • Handling missing data (e.g., imputation, removal); missing-value treatment and dummy creation are sketched together after this list
  16. Dummy Variable Creation:
    • Creating binary/indicator variables for categorical data
  17. Target Encoding:
    • Encoding categorical variables based on the target variable’s mean or other statistics (see the target-encoding sketch after this list)
  18. Normalization and Standardization:
    • Scaling features to a specific range or to zero mean and unit variance (see the scaling sketch after this list)
  19. Pivoting:
    • Transforming data from long format to wide format (pivot tables); pivoting and unpivoting are sketched together after this list
  20. Unpivoting:
    • Transforming data from wide format to long format (melt)
  21. Filtering:
    • Subsetting data based on conditions
  22. Window Functions:
    • Applying operations over a window of rows (e.g., rolling average, cumulative sum); window functions and ranking are sketched together after this list
  23. Rank and Percentile Calculation:
    • Ranking data and calculating percentiles
  24. Correlation and Covariance Calculation:
    • Computing the correlation and covariance between variables
  25. Feature Engineering:
    • Creating new features from existing data (e.g., polynomial features, interaction terms)
  26. Text Processing:
    • Removing stopwords, stemming, lemmatization
  27. Outlier Detection and Treatment:
    • Identifying and handling outliers (see the IQR sketch after this list)
  28. Sampling:
    • Drawing samples from the dataset (e.g., random sampling, stratified sampling)
  29. Merging Data from Different Sources:
    • Integrating data from multiple sources (e.g., databases, APIs)
  30. Data Encryption and Decryption:
    • Encrypting sensitive data fields for security
  31. Data Validation and Cleaning:
    • Ensuring data quality by validating data types, consistency, and accuracy
  32. Visualization:
    • Creating plots and charts to visualize data trends and distributions
  33. Pareto Analysis:
    • Sorting categories into a bar chart with a cumulative-percentage line on a secondary y-axis, highlighting the few categories that drive most of the effect (see the Pareto sketch after this list)
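
The sketches below illustrate several of the items above in Python with pandas. Every DataFrame, column name, and value is hypothetical, invented purely for illustration; adapt the snippets to your own schema.

Aggregation (item 1): a minimal sketch of grouping a toy sales table and computing several aggregates at once, mirroring SQL's SUM/COUNT/AVG/MIN/MAX over a GROUP BY.

```python
import pandas as pd

# Hypothetical sales data.
df = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "revenue": [100.0, 150.0, 200.0, 120.0, 80.0],
})

# Group by region and apply several aggregate functions at once.
summary = df.groupby("region")["revenue"].agg(["sum", "count", "mean", "min", "max"])
print(summary)
```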
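
Joining and appending (items 3 and 4): key-based merging and vertical concatenation, assuming hypothetical orders and customers tables.

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [250, 100, 75]})
customers = pd.DataFrame({"customer_id": [1, 2, 4], "name": ["Ana", "Bo", "Cy"]})

# Inner join on the key; "left", "right", or "outer" are selected
# the same way via the `how` argument.
joined = orders.merge(customers, on="customer_id", how="inner")

# Appending: stack datasets with the same columns vertically.
more_orders = pd.DataFrame({"customer_id": [4], "amount": [300]})
all_orders = pd.concat([orders, more_orders], ignore_index=True)
```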
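
String operations (item 8): concatenation, delimiter splitting, substring extraction, and last-n characters, on a hypothetical full_name column.

```python
import pandas as pd

s = pd.DataFrame({"full_name": ["Doe, Jane", "Roe, Richard"]})

# Split by delimiter (text-to-columns).
s[["last", "first"]] = s["full_name"].str.split(", ", expand=True)

s["display"] = s["first"] + " " + s["last"]   # concatenation
s["initial"] = s["first"].str[0]              # substring extraction
s["last3"] = s["full_name"].str[-3:]          # last 3 characters
```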
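
Date and time extraction (item 10), together with the string-to-datetime conversion from item 7, on a hypothetical order_date column.

```python
import pandas as pd

d = pd.DataFrame({"order_date": ["2024-01-15", "2024-07-03"]})

# Convert string to datetime first (item 7), then extract parts.
d["order_date"] = pd.to_datetime(d["order_date"])
d["year"] = d["order_date"].dt.year
d["month"] = d["order_date"].dt.month
d["quarter"] = d["order_date"].dt.quarter
d["day_name"] = d["order_date"].dt.day_name()
```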
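
Binning and flooring/capping (items 12 and 14): fixed-interval bins, quantile bins, then capping at the 5th and 95th percentiles (winsorizing); the income column and cut points are assumptions.

```python
import pandas as pd

x = pd.DataFrame({"income": [12_000, 35_000, 58_000, 91_000, 250_000]})

# Fixed-interval bins and quantile (quartile) bins.
x["bracket"] = pd.cut(
    x["income"],
    bins=[0, 30_000, 60_000, 100_000, float("inf")],
    labels=["low", "mid", "high", "top"],
)
x["quartile"] = pd.qcut(x["income"], q=4, labels=False)

# Flooring and capping at the 5th/95th percentiles.
lo, hi = x["income"].quantile([0.05, 0.95])
x["income_capped"] = x["income"].clip(lower=lo, upper=hi)
```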
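
Missing value treatment and dummy variables (items 15 and 16): median imputation for a numeric column, a sentinel for a categorical one, then indicator columns; the age/city columns are hypothetical.

```python
import pandas as pd

m = pd.DataFrame({"age": [25, None, 40], "city": ["Pune", "Delhi", None]})

# Impute: numeric with the median, categorical with a sentinel value.
m["age"] = m["age"].fillna(m["age"].median())
m["city"] = m["city"].fillna("unknown")

# Binary/indicator (dummy) variables for the categorical column.
m = pd.get_dummies(m, columns=["city"], prefix="city")
```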
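
Target encoding (item 17): replacing each category with the mean of a binary target; channel and converted are invented names. In practice the means should be computed on training data only (and often smoothed or cross-validated) to avoid target leakage.

```python
import pandas as pd

t = pd.DataFrame({
    "channel": ["web", "store", "web", "app", "store"],
    "converted": [1, 0, 1, 0, 1],
})

# Mean of the target per category, mapped back onto each row.
means = t.groupby("channel")["converted"].mean()
t["channel_te"] = t["channel"].map(means)
```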
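
Normalization and standardization (item 18): min-max scaling to [0, 1] and a z-score, with score as a placeholder column; scikit-learn's MinMaxScaler and StandardScaler do the same behind a fit/transform interface.

```python
import pandas as pd

n = pd.DataFrame({"score": [10.0, 20.0, 30.0, 40.0]})

# Min-max normalization to the [0, 1] range.
n["score_minmax"] = (n["score"] - n["score"].min()) / (n["score"].max() - n["score"].min())

# Standardization to zero mean and unit variance (z-score).
n["score_z"] = (n["score"] - n["score"].mean()) / n["score"].std()
```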
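
Pivoting and unpivoting (items 19 and 20): long-to-wide with a pivot table and back again with melt, on a hypothetical region/quarter/revenue table.

```python
import pandas as pd

long = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 150, 200, 120],
})

# Long -> wide: one column per quarter.
wide = long.pivot_table(index="region", columns="quarter", values="revenue")

# Wide -> long again (melt/unpivot).
back = wide.reset_index().melt(id_vars="region", value_name="revenue")
```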
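
Window functions and ranking (items 22 and 23): rolling average, cumulative sum, rank, and percentile over a toy daily-sales series.

```python
import pandas as pd

w = pd.DataFrame({"day": range(1, 8), "sales": [5, 7, 6, 10, 8, 12, 9]})

# Operations over a window of rows.
w["sales_3d_avg"] = w["sales"].rolling(window=3).mean()  # rolling average
w["sales_cum"] = w["sales"].cumsum()                     # cumulative sum

# Rank and percentile.
w["rank"] = w["sales"].rank(ascending=False)
w["percentile"] = w["sales"].rank(pct=True)
```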
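
Outlier detection and treatment (item 27): the common 1.5 × IQR rule, with capping as one treatment option; z-scores or model-based detectors are equally valid choices.

```python
import pandas as pd

o = pd.DataFrame({"value": [10, 12, 11, 13, 95, 12, 11]})

# Flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = o["value"].quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = o[(o["value"] < low) | (o["value"] > high)]

# One treatment option: cap values into the acceptable band.
o["value_treated"] = o["value"].clip(low, high)
```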
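
Pareto analysis (item 33): sorted bars with a cumulative-percentage line on a secondary y-axis, using matplotlib and an invented defect-count table.

```python
import pandas as pd
import matplotlib.pyplot as plt

p = pd.DataFrame({
    "defect": ["scratch", "dent", "crack", "stain", "other"],
    "count": [48, 27, 12, 8, 5],
}).sort_values("count", ascending=False)
p["cum_pct"] = p["count"].cumsum() / p["count"].sum() * 100

# Bar chart of counts plus a cumulative-% line on a twin axis.
fig, ax = plt.subplots()
ax.bar(p["defect"], p["count"])
ax2 = ax.twinx()
ax2.plot(p["defect"], p["cum_pct"], color="red", marker="o")
ax2.set_ylabel("cumulative %")
plt.show()
```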

By using these operations, data analysts can clean, transform, and analyze datasets effectively, gaining insights and preparing the data for further modeling or reporting tasks.

Can you believe that Extreme-ML does all of these? Contact us to see a demo.
