Best Resources to Learn Feature Engineering for Machine Learning

Feature Engineering — Image by the author — All rights reserved

Data in its raw format is almost never ready to be used to train machine learning models. But we can transform the data to build suitable features for machine learning. The process of transforming variables and creating new features is called feature engineering, and it is typically the stage where data scientists devote most of their effort in a machine learning project.

As Pedro Domingos said in the article “A few useful things to know about machine learning”:

“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used”.

Some aspects of feature engineering are domain-specific: we need to know a few things about the data, the business area or the organization's purpose to derive useful features. But a big chunk of feature engineering is also quite repetitive and can be automated.

Many of the techniques used for the more repetitive aspects of feature engineering are shared across organizations and data science competitions, and they include procedures to handle missing data, encode categorical variables or extract features from text, to name a few. More and more, feature engineering practices are being consolidated, and many organizations adopt similar practices to clean and prepare their data.

Surprisingly, even though feature engineering is a crucial part of any machine learning pipeline, and also the most time-consuming, it is barely covered in the extensive catalogue of machine learning online courses.

In this article, I will discuss the best, and potentially the only, available resources to learn more about feature engineering.

Should I learn feature engineering?

Once you have made a start with data science and machine learning courses, are familiar with off-the-shelf machine learning algorithms like regression, decision trees and random forests, and are relatively comfortable programming in Python or R, one of the logical next steps is to gain exposure to feature engineering techniques.

In fact, after you have worked on a few data science projects, either in your organization or in a data competition, you will soon realize how much needs to be done before the data can be used to train an algorithm. So sooner or later, you will need to become familiar with various feature engineering techniques.

What exactly do I need to learn?

There are a few fundamental aspects of feature engineering. The first one is missing data imputation. Missing data is the lack of values for some observations. Libraries like Scikit-learn do not support missing data as input, so we need to replace missing values with a number.

If you are an R user, many R machine learning packages will allow you to pass data with missing values, but certainly not all of them, so having a few missing data imputation techniques at hand is quite useful.
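
As a minimal sketch of what this looks like in practice, here is median imputation with Scikit-learn's SimpleImputer (the data is made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with missing values
X = pd.DataFrame({"age": [25, np.nan, 40, 31],
                  "income": [50_000, 62_000, np.nan, 48_000]})

# Replace missing values with each column's median
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)  # returns a NumPy array
```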

In categorical variables, the values are strings instead of numbers. Some libraries, like Scikit-learn, do not support strings as input, so it is also useful to have a few categorical encoding techniques in your tool belt.
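
For illustration, a quick way to one-hot encode a string variable with pandas (hypothetical column):

```python
import pandas as pd

df = pd.DataFrame({"city": ["London", "Paris", "London", "Berlin"]})

# Expand the string variable into one binary column per category
df_encoded = pd.get_dummies(df, columns=["city"])
```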

Often, a better spread of the values improves the performance of some machine learning models. In fact, many variables show extremely skewed distributions. Thus, we often apply mathematical transformations or discretization to obtain a more homogeneous value spread.
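
As a small illustration with made-up values, a logarithmic transformation and an equal-frequency discretization with NumPy and pandas:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 5, 10, 100, 1_000, 10_000])

# Logarithmic transformation to compress a skewed value range
s_log = np.log1p(s)

# Equal-frequency discretization: sort the values into 3 bins
s_binned = pd.qcut(s, q=3, labels=False)
```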

Some machine learning algorithms are sensitive to feature magnitude, for example linear models, support vector machines, neural networks, and distance-based algorithms like PCA and k-means clustering. In these cases, we tend to scale the variables as well.
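
A minimal sketch of the two most common scaling procedures with Scikit-learn:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]

# Standardization: zero mean, unit variance
X_std = StandardScaler().fit_transform(X)

# MinMax scaling: values squeezed into [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)
```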

Many datasets contain dates as variables. Date or datetime variables are not fed as such into machine learning models; instead, we derive more useful information by extracting new features from them, like, for example, the time elapsed between two dates.
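
For example, with pandas we can extract date parts and the time elapsed between two dates (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "signed_up":  pd.to_datetime(["2020-01-15", "2020-03-02"]),
    "last_login": pd.to_datetime(["2020-02-01", "2020-03-20"]),
})

# Extract parts of the date, and the time elapsed between two dates
df["signup_month"] = df["signed_up"].dt.month
df["days_active"] = (df["last_login"] - df["signed_up"]).dt.days
```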

In some specific cases, some of the variables are GIS coordinates, which, again, provide more information if we pre-process them. We may also find transaction data or time series, which we tend to aggregate into a single view per customer, extracting point-in-time features from the temporal series.
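
As a rough sketch of this kind of aggregation with pandas (toy transaction data):

```python
import pandas as pd

tx = pd.DataFrame({"customer_id": [1, 1, 2, 2, 2],
                   "amount": [10.0, 25.0, 5.0, 7.5, 12.0]})

# Collapse the transaction history into one row of features per customer
features = tx.groupby("customer_id")["amount"].agg(["mean", "max", "count"])
```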

Finally, our data are often text or images, and we can create features from these as well, to use in machine learning.

So in a nutshell, feature engineering refers to techniques to perform:

  1. Missing data imputation
  2. Categorical variable encoding
  3. Numerical variable transformation
  4. Discretization
  5. Engineering of datetime variables
  6. Engineering of coordinates — GIS data
  7. Feature extraction from text
  8. Feature extraction from images
  9. Feature extraction from time series
  10. New feature creation by combining existing variables

That sounds like a lot to take on, so…

Where can I learn about feature engineering?

In this post, I describe the best, and perhaps the only, available resources to learn feature engineering for machine learning. Some of these resources are very comprehensive, covering almost every aspect of feature engineering and therefore providing the student with a wide repertoire of techniques suitable for different scenarios, algorithms and datasets. Other resources focus on the more mainstream techniques, aiming to quickly empower the student to crack on with data pre-processing.

Let’s dive in…

Disclaimer: Two of the recommendations in this article are my own: the Udemy course “Feature Engineering for Machine Learning” and the Packt book “Python Feature Engineering Cookbook”.

The opinions in this article are my own, and I do not receive financial compensation from any of the links included in it (except for those mentioned in the preceding paragraph). The article does not contain affiliate links.

Contents

Online Courses

  1. Feature Engineering for Machine Learning - Udemy
  2. How to Win a Data Science Competition: Learn from top Kagglers — Coursera
  3. Feature Engineering for Machine Learning in Python - Datacamp
  4. Feature Engineering — Coursera

Books

  1. Python Feature Engineering Cookbook - Packt
  2. Feature Engineering for Machine Learning Models — O’Reilly
  3. Feature Engineering Made Easy - Packt
  4. Feature Engineering and Selection: A Practical Approach for Predictive Models

Articles and other free resources

  1. The 2009 Knowledge Discovery in Data Competition (KDD Cup 2009)
  2. Beating Kaggle the Easy Way

Feature Engineering libraries

  1. Scikit-learn
  2. Feature-engine
  3. Category encoders
  4. Featuretools

Online Articles about Feature Engineering

Online Courses

1. Feature Engineering for Machine Learning - Udemy

Feature Engineering for Machine Learning is the most comprehensive online course on feature engineering to date. It covers almost every aspect of feature engineering, discussing the best-known and most widely used techniques, as well as alternative techniques used in data competitions and in organizations.

In the course Feature Engineering for Machine Learning you will learn multiple procedures for:

  1. Missing data imputation: like mean, median, mode, arbitrary, end-of-tail and random sample imputation. You will also learn multivariate imputation.
  2. Categorical variable encoding: like one-hot, ordinal and mean encoding, and also weight of evidence, binarization and feature hashing.
  3. Numerical variable transformation: like logarithmic, reciprocal, exponential, Box-Cox and Yeo-Johnson transformations.
  4. Variable discretization: like equal-width and equal-frequency discretization, discretization using k-means, and discretization with decision trees.
  5. Outlier handling: removal, capping and Winsorization.
  6. Feature scaling: standardization, MinMax scaling, robust scaling, norm scaling and more.
  7. Engineering of datetime variables: including extracting features from the day, month and year parts, and capturing elapsed time, also across time zones.
  8. Engineering of mixed numerical and categorical variables.
  9. Compared code implementations using different open-source Python packages, like Scikit-learn, Feature-engine and Category encoders.

Feature Engineering for Machine Learning teaches multiple techniques for each of the topics mentioned above, discussing the assumptions made by each technique, its advantages and limitations, and providing Python code to implement the techniques with open-source libraries like pandas, NumPy, Scikit-learn, Feature-engine and Category encoders.

Feature Engineering for Machine Learning — Online Course — Image by the author — All rights reserved

Feature Engineering for Machine Learning starts by addressing the characteristics of variables and how these may affect the performance of multiple machine learning algorithms. It then discusses the assumptions made by various machine learning algorithms, and how feature transformations can improve model performance.

Feature Engineering for Machine Learning then introduces the multiple techniques for feature engineering. Finally, the course puts everything together into end-to-end pipelines of feature transformation.
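
To give an idea of what such an end-to-end pipeline can look like, here is a minimal sketch with Scikit-learn; the column names are made up, and the course covers far more elaborate versions:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names
numeric = ["age", "income"]
categorical = ["city"]

# Impute and scale the numbers; impute and one-hot encode the strings
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

# The full pipeline: feature transformation followed by a model
model = Pipeline([("preprocess", preprocess),
                  ("clf", LogisticRegression())])
# model.fit(X_train, y_train) would then run the whole chain
```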

Feature Engineering for Machine Learning includes videos where the instructor discusses the advantages, limitations and implications of each technique, accompanied by Jupyter Notebooks with the code to implement these techniques in Python.

Courses on Udemy are not free, but you can get them at a discounted price using the frequently released vouchers.

2. How to Win a Data Science Competition: Learn from Top Kagglers - Coursera

How to Win a Data Science Competition: Learn from Top Kagglers is tailored to students seeking to enter and win data science competitions. The authors themselves have won various competitions in Kaggle, and in the course they explain several of the techniques they used to engineer and select their variables, and build and tune their machine learning models.

How to Win a Data Science Competition: Learn from Top Kagglers includes three sections on feature engineering. In the first section, the instructors describe basic techniques to impute missing data, transform numerical variables, encode categorical variables and work with dates and coordinates. They also teach how to create bag-of-words representations from text and how to create features with Word2vec and CNNs.

In the second section, the authors cover in depth a categorical encoding technique called mean (or target) encoding. And in the final section on feature engineering, they move on to describe how to capture feature interactions, and how to create new statistical and distance-based features.
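
To give a flavour of the idea behind mean encoding, here is a toy sketch with pandas; in practice, the mappings should be learned from the training data only, to avoid overfitting:

```python
import pandas as pd

df = pd.DataFrame({"city": ["London", "Paris", "London", "Berlin"],
                   "target": [1, 0, 1, 0]})

# Replace each category with the mean of the target for that category
mapping = df.groupby("city")["target"].mean()
df["city_encoded"] = df["city"].map(mapping)
```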

Courses on Coursera can be audited for free, or you can pay a fee if you want a certificate and access to the full material and practice exercises.

3. Feature Engineering for Machine Learning in Python — Datacamp

Feature Engineering for Machine Learning in Python is a hands-on course that teaches many aspects of feature engineering for categorical and continuous variables and text data. The course discusses techniques for variable discretization, missing data imputation and categorical variable encoding. It also covers procedures for variable transformation, feature scaling and outlier removal.

Feature Engineering for Machine Learning in Python is composed of 4 chapters. The first chapter is available for free, but the remaining 3 require payment of a fee.

4. Feature Engineering - Coursera

The course Feature Engineering on Coursera introduces a few feature engineering techniques, focusing mainly on how to implement them using the Google Cloud Platform, how to select good features, and how to do feature pre-processing at scale.

Some students have complained that some of the notebooks do not run as expected; however, the overall rating of the course is very good.

Courses on Coursera can be audited for free, or you can pay a fee if you want a certificate and access to the full material and practice exercises.

Books

1. Python Feature Engineering Cookbook, Packt

In the book Python Feature Engineering Cookbook, I provide the most extensive battery of feature engineering techniques, focusing on the practical implementation in Python and leveraging the power of pandas, Scikit-learn’s newer tools for feature transformation, the open-source package Feature-engine and other powerful Python packages for feature engineering like Category Encoders and Featuretools.

Python Feature Engineering Cookbook is a recipe book, and thus dives straight into the code implementation of the feature engineering techniques.

Specifically, the book covers:

  1. Missing data imputation: including mean, median, mode, arbitrary, end-of-tail, random and multivariate imputation.
  2. Categorical variable encoding: the widest battery of encoding techniques, including rare label encoding, one-hot, ordinal and mean encoding, weight of evidence, binarization and feature hashing.
  3. Numerical variable transformation: including the most common mathematical transformations, as well as the Box-Cox and Yeo-Johnson transformations.
  4. Discretization: including equal-width and equal-frequency, k-means and tree-derived discretization, as well as discretization into arbitrary buckets.
  5. Feature scaling: including standardization, MinMax, robust and maximum absolute scaling, and other techniques.
  6. Text preprocessing: including the creation of features that capture text complexity, like counting characters, words, unique words and lexical diversity, as well as bag-of-words and TF-IDF with or without n-grams, and text cleaning techniques.
  7. Time series and transaction data: including the extraction of features that capture signal complexity, or that aggregate the history at a single time point.
  8. Feature creation: including the creation of features through mathematical combinations, PCA and polynomial expansion.

All techniques are implemented in various open-source Python packages for comparison, including pandas, NumPy, Scikit-learn, SciPy, Feature-engine, Category encoders and Featuretools, whenever possible.

2. Feature Engineering for Machine Learning Models, O’Reilly​

In the book Feature Engineering for Machine Learning Models, the authors teach various feature engineering techniques, focusing on the practical application with exercises in Python using Pandas, NumPy, Scikit-learn and Matplotlib.

Specifically, the book covers:

  1. Numerical variables: discretization, scaling, log and power transforms
  2. Categorical variables: one hot encoding, feature hashing and bin-counting
  3. Text: bag-of-words, n-grams, and phrase detection
  4. PCA
  5. Creating features with k-means
  6. Extracting features from images

3. Feature Engineering Made Easy, Packt

Feature Engineering Made Easy covers various aspects of feature engineering, including imputation of missing data, categorical encoding, numerical feature transformation, extraction of features from text and images, and feature creation with PCA.

Feature Engineering Made Easy includes examples that guide the reader through the implementation of these techniques in Python.

Feature Engineering Made Easy capitalizes on understanding the data at hand, so it also includes a few chapters on data exploration.

4. Feature Engineering and Selection: A Practical Approach for Predictive Models

For R users, the book Feature Engineering and Selection: A Practical Approach for Predictive Models is a good alternative. The book covers many aspects of feature engineering, including imputing missing data, categorical encoding and numerical feature transformation.

I personally find the book a bit text-heavy, as the authors focus on sharing their experience with these methods, at the expense of code examples showing how to implement the techniques.

Articles and other free resources

1. The 2009 Knowledge Discovery in Data Competition (KDD Cup 2009)

The 2009 Knowledge Discovery in Data Competition (KDD Cup 2009) is a series of articles published after the 2009 KDD competition, in which the winners and runners-up describe the data pre-processing techniques they used to prepare the data and build their machine learning models.

The competition aimed to predict a highly imbalanced target, and the data contained a multitude of categorical variables, many with high cardinality, as well as features with missing values. The different articles therefore describe multiple creative solutions to tackle these data issues.

For the imputation of missing data, the authors used mean and median imputation together with a binary missing indicator. The authors also discuss several different ways of coping with categorical encoding and the high cardinality of the variables. A few of the solutions used discretization, including one that sorted the data into buckets using decision trees and created new features by combining variables, also with decision trees, to capture feature interactions.

Certainly a very interesting series of articles for those dealing with datasets of thousands of variables, mixing categorical and numerical features with missing information.

2. Beating Kaggle the Easy Way

Beating Kaggle the Easy Way is the master's thesis of a student at the Technische Universität Darmstadt, in which the student explores multiple feature engineering techniques across various data science competitions available on Kaggle.

The goal of the thesis was to get the best possible results, with minimal effort, across various data competitions, by re-using the feature engineering pipeline built for the first competition. This may sound a bit cheeky; however, in practice we do use the same, or very similar, techniques across projects to make our machine learning models more predictive.

In the thesis, the student describes various data pre-processing and data cleaning techniques, and feature transformations that they used across competitions.

Although reading a thesis may sound a bit daunting, this work is actually quite approachable and easy to follow, and quite enlightening as well if you are just starting out as a data scientist, so I highly recommend you give it a go.

Feature Engineering libraries

1. Scikit-learn

Scikit-learn, the industry standard Python library for machine learning, has recently released multiple transformers or classes for feature engineering, including transformers for missing data imputation, categorical encoding, discretization and variable transformation.

With the SimpleImputer class, we can perform mean, median, mode and arbitrary value imputation, while the IterativeImputer class (still experimental) allows us to perform multivariate imputation.

Scikit-learn also includes the OneHotEncoder for one-hot encoding and the LabelEncoder to replace categories with integers. With the KBinsDiscretizer we can discretize our variables, and with the PowerTransformer we can apply the Yeo-Johnson and Box-Cox transformations.
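
A minimal sketch of the discretization and power transformation classes mentioned above, on made-up values:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, PowerTransformer

X = np.array([[1.0], [2.0], [5.0], [10.0], [50.0], [100.0]])

# Sort the values into 3 equal-frequency bins
X_binned = KBinsDiscretizer(
    n_bins=3, encode="ordinal", strategy="quantile"
).fit_transform(X)

# Yeo-Johnson transformation (the default) to reduce skew
X_power = PowerTransformer().fit_transform(X)
```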

As with any other Scikit-learn transformer, the feature engineering classes do not let us select which variables to process with each technique. But the developers have also released the ColumnTransformer class, which can be used to do exactly that.

2. Feature-engine

Feature-engine is an open-source Python package that was originally created to support the Udemy course Feature Engineering for Machine Learning. Feature-engine has, however, incorporated many techniques beyond the scope of the course, and is supported by a growing number of contributors.

Feature-engine contains multiple transformers, or classes, for missing data imputation, categorical encoding, discretization and variable transformation. Feature-engine's battery of transformers is more extensive than the one offered by Scikit-learn, and it has a few nice additional perks:

  1. It returns a pandas dataframe, so you can easily apply a feature transformation and continue with your data exploration.
  2. It is very user friendly, allowing you to specify within the transformer which variables you want to pre-process.
  3. It can be integrated into the Scikit-learn pipeline with a single line of code.
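
A small sketch of what this looks like, assuming a recent (1.x) version of Feature-engine, whose import paths have changed between releases:

```python
import pandas as pd
from feature_engine.imputation import MeanMedianImputer

df = pd.DataFrame({"age": [25.0, None, 40.0, 31.0],
                   "city": ["London", "Paris", None, "Berlin"]})

# Impute only the variables we name; the output remains a pandas DataFrame
imputer = MeanMedianImputer(imputation_method="median", variables=["age"])
df_t = imputer.fit_transform(df)
```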

More details into what is unique about Feature-engine can be found here:

3. Category encoders

Category encoders is the most extensive Python package for categorical variable encoding, including common procedures like one-hot encoding and weight of evidence, as well as more complex ways of encoding variables like BaseN and feature hashing.
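
For illustration, a short sketch with the category_encoders package, using BaseN encoding with base 2 (i.e., binary encoding) on made-up data:

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"city": ["London", "Paris", "London", "Berlin"]})

# BaseN encoding with base 2 encodes the categories as binary digits
encoder = ce.BaseNEncoder(cols=["city"], base=2)
df_encoded = encoder.fit_transform(df)
```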

If you want to learn more about these particular categorical encoding techniques, there is a good explanation in the blog Smarter Ways to Encode Categorical Data for Machine Learning.

For a comparison of code implementations using these Python libraries, visit this article:

4. Featuretools

Featuretools is an open-source Python library that facilitates the pre-processing of transaction and time-series data. With Featuretools, we need only determine a time window, and the package will derive new features from different aggregation procedures over the time series or transactions. These new features come from mathematical operations, like finding the maximum and minimum values, mean, median and standard deviation, among others, and also from determining the time elapsed since the last event, or between events.

Featuretools also allows the user to define functions that create bespoke features, which can then be applied to aggregate the data over fixed time windows.
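
A minimal sketch of this workflow, assuming the Featuretools 1.x API (method names differ in earlier releases) and toy transaction data:

```python
import pandas as pd
import featuretools as ft

transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "customer_id": [1, 1, 2, 2],
    "amount": [10.0, 25.0, 5.0, 7.5],
    "time": pd.to_datetime(["2020-01-01", "2020-01-05",
                            "2020-01-02", "2020-01-08"]),
})

# Register the transactions table and derive a customers table from it
es = ft.EntitySet(id="shop")
es = es.add_dataframe(dataframe_name="transactions", dataframe=transactions,
                      index="transaction_id", time_index="time")
es = es.normalize_dataframe(base_dataframe_name="transactions",
                            new_dataframe_name="customers",
                            index="customer_id")

# Deep Feature Synthesis: aggregate each customer's transactions into features
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="customers")
```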

Online Articles about Feature Engineering

If you want to read more about feature engineering, you can check my other articles on Medium and TowardsDataScience:

Lead Data Scientist, author of “Python Feature Engineering Cookbook”, instructor of online courses on machine learning and developer of open-source Python code.
