Feature Engineering for Machine Learning: A Comprehensive Overview


Feature engineering is the process of using domain knowledge of the data to transform existing features or to create new variables from existing ones, for use in machine learning.

Data in its raw format is almost never suitable for training machine learning algorithms. Instead, data scientists devote a substantial amount of time to pre-processing the variables so that they can be used in machine learning.

Why do we need to engineer features?

There are various reasons why we engineer features:

  1. Some machine learning libraries, such as Scikit-learn, do not support missing values or strings as inputs.
  2. Some machine learning models are sensitive to the magnitude of the features, for example linear models, SVMs, neural networks, and all distance-based algorithms like PCA and nearest neighbours.
  3. Some algorithms are sensitive to outliers, for example linear models and AdaBoost.
  4. Some variables provide almost no information in their raw format, for example dates.
  5. Often variable pre-processing allows us to capture more information, which can boost algorithm performance, for example target mean encoding of categorical variables.
  6. Frequently, variable combinations are more predictive than variables in isolation, for example the sum or the mean of a group of variables.
  7. Some variables contain information about transactions, providing time-stamped data, and we may want to aggregate them into a static view.

As you can see, feature engineering is an umbrella term that includes multiple techniques to perform everything from filling missing values, to encoding categorical variables, to variable transformation, to creating new variables from existing ones.

In this article, I highlight the main feature engineering techniques to process the data and leave it ready to use for machine learning. I describe what each technique entails, and say a few words about when we should use each technique.

For code and step-by-step tutorials with the advantages and shortcomings of each technique, check out the online course “Feature Engineering for Machine Learning” or the book “Python Feature Engineering Cookbook”.

You can also check this article for a comparison of code implementations and a discussion of available open-source packages for feature engineering.

Table of Contents

  1. Missing Data Imputation
  2. Categorical Encoding
  3. Variable Transformation
  4. Discretization
  5. Outlier Engineering
  6. Feature Scaling
  7. Date and Time Engineering
  8. Feature Creation
  9. Aggregating Transaction Data

1. Missing Data Imputation

Imputation is the act of replacing missing data with statistical estimates of the missing values. The goal of any imputation technique is to produce a complete data set that can be used to train machine learning models.

There are multiple techniques for missing data imputation:

  1. Complete Case Analysis
  2. Mean / Median / Mode Imputation
  3. Random Sample Imputation
  4. Replacement by Arbitrary Value
  5. End of Distribution Imputation
  6. Missing Value Indicator

Techniques for missing data imputation — Image by the author

Complete case analysis implies analyzing only those observations in the data set that contain values in all the variables. In other words, in complete case analysis we remove all observations with missing values. This procedure is suitable when there are few observations with missing data in the data set. But, if the data set contains missing data across multiple variables, or some variables contain a high proportion of missing observations, we can easily remove a big chunk of the data set, which is undesirable.
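As a minimal sketch with pandas, on a made-up data frame:

```python
import numpy as np
import pandas as pd

# Hypothetical data set with missing values in two variables
df = pd.DataFrame({
    "age":    [25.0, np.nan, 40.0, 31.0],
    "income": [50000.0, 60000.0, np.nan, 45000.0],
})

# Complete case analysis: keep only the rows with no missing values
complete_cases = df.dropna()
```

Here two of the four rows survive; with many variables containing missing data, the surviving fraction can shrink dramatically.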

We can replace missing values with the mean, the median or the mode of the variable. Mean / median / mode imputation is widely adopted in organizations and data competitions.

Although in practice these techniques are used in almost every situation, the procedure is only suitable if data is missing at random and in small proportions. If there are a lot of missing observations, however, we will distort the distribution of the variable, as well as its relationship with other variables in the data set. Distortion of the variable distribution may affect the performance of linear models.

For categorical variables, replacement by the mode is also known as replacement by the most frequent category.
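A minimal sketch of median and mode imputation with pandas, on a toy data frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [20.0, 30.0, np.nan, 40.0],           # numerical
    "city": ["London", "London", None, "Paris"],  # categorical
})

# Numerical variable: replace missing values with the median
df["age"] = df["age"].fillna(df["age"].median())

# Categorical variable: replace missing values with the mode (most frequent category)
df["city"] = df["city"].fillna(df["city"].mode()[0])
```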

Random sample imputation refers to randomly selecting values from the variable to replace the missing data. This technique preserves the variable distribution, and is well suited for data missing at random. But, we need to account for randomness by adequately setting a seed. Otherwise, the same missing observation could be replaced by different values in different code runs, and therefore lead to different model predictions. This is not desired when using our models within an organization.
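One way to sketch this in pandas, on hypothetical data, fixing the seed via `random_state` so the imputation is reproducible:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [20.0, 25.0, np.nan, 40.0, np.nan]})

# Draw as many values from the observed data as there are missing slots.
# random_state fixes the seed, so repeated runs impute the same values.
missing = df["age"].isna()
sample = df["age"].dropna().sample(missing.sum(), replace=True, random_state=0)
sample.index = df.index[missing]  # align the sampled values with the missing slots
df.loc[missing, "age"] = sample
```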

Replacement by an arbitrary value, as its name indicates, refers to replacing missing data by any, arbitrarily determined value, using the same value for all missing data.

Replacement by an arbitrary value is suitable if data is not missing at random, or if there is a huge proportion of missing values. If all values are positive, a typical replacement is -1. Alternatively, replacing by 999 or -999 is common practice. We need to make sure that these arbitrary values do not occur naturally in the variable.

Replacement by arbitrary values, however, may not be suited for linear models, as it will most likely distort the distribution of the variables, and therefore model assumptions may not be met.

For categorical variables, this is the equivalent of replacing missing observations with the label “Missing”, which is a widely adopted procedure.
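Both flavours described above, an arbitrary number for numerical variables and the label “Missing” for categorical ones, can be sketched on a toy data frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25.0, np.nan, 40.0],
    "city": ["London", None, "Paris"],
})

# Numerical: replace missing data with an arbitrary value not present in the variable
df["age"] = df["age"].fillna(-999)

# Categorical: replace missing data with the label "Missing"
df["city"] = df["city"].fillna("Missing")
```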

End of tail imputation involves replacing missing values by a value at the far end of the variable distribution. This technique is similar in essence to imputing by an arbitrary value. However, by placing the value at the end of the distribution, we need not look at each variable distribution individually, as the algorithm does it automatically for us. This imputation technique tends to work well with tree-based algorithms, but it may affect the performance of linear models, as it distorts the variable distribution.
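A minimal sketch of end of tail imputation, placing the value at the far right of a hypothetical variable's distribution using the mean plus 3 standard deviations:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [20.0, 25.0, 30.0, 35.0, np.nan]})

# Derive the imputation value automatically from the distribution:
# mean plus 3 times the standard deviation (right tail)
tail_value = df["age"].mean() + 3 * df["age"].std()
df["age"] = df["age"].fillna(tail_value)
```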

The missing indicator technique involves adding a binary variable to indicate whether the value is missing for a certain observation. This variable takes the value 1 if the value is missing for that observation, or 0 otherwise.

One thing to notice is that we still need to replace the missing values in the original variable, which we tend to do with mean or median imputation. By using these 2 techniques together, if the missing value has predictive power, it will be captured by the missing indicator, and if it doesn't, it will be masked by the mean / median imputation.

These 2 techniques in combination tend to work well with linear models. But, adding a missing indicator expands the feature space and, as multiple variables tend to have missing values for the same observations, many of these newly created binary variables could be identical or highly correlated.
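A minimal pandas sketch of the missing indicator combined with median imputation, on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [20.0, np.nan, 40.0]})

# Add a binary indicator: 1 if the value was missing, 0 otherwise
df["age_missing"] = df["age"].isna().astype(int)

# Then impute the original variable, here with the median
df["age"] = df["age"].fillna(df["age"].median())
```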

There are, in addition, multivariate techniques for missing data imputation, like MICE (Multivariate Imputation with Chained Equations) that I will not cover in this post, but are covered in the course “Feature Engineering for Machine Learning” and in the book “Python Feature Engineering Cookbook”.

2. Categorical Encoding

Categorical variable encoding is an umbrella term for techniques used to transform the strings or labels of categorical variables into numbers. There are multiple techniques available to us:

  1. One hot encoding
  2. Count and Frequency encoding
  3. Target encoding / Mean encoding
  4. Ordinal encoding
  5. Weight of Evidence
  6. Rare label encoding

Categorical encoding techniques — Image by the author

One hot encoding (OHE) creates a binary variable for each one of the different categories present in a variable. These binary variables take the value 1 if the observation shows a certain category, or 0 otherwise.

OHE is suitable for linear models. But, OHE expands the feature space quite dramatically if the categorical variables are highly cardinal, or if there are many categorical variables. In addition, many of the derived dummy variables could be highly correlated.
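A minimal sketch of one hot encoding with pandas, on a toy variable:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["blue", "red", "blue"]})

# One binary column per category present in the variable
dummies = pd.get_dummies(df["colour"])
```

For linear models, `drop_first=True` returns k-1 dummies instead of k, avoiding the redundant (perfectly correlated) column.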

In count encoding we replace the categories by the count of the observations that show that category in the data set. Similarly, we can replace the category by the frequency (or percentage) of observations in the data set. That is, if 10 of our 100 observations show the colour blue, we would replace blue by 10 if doing count encoding, or by 0.1 if encoding by the frequency.

These techniques capture the representation of each label in a data set, but the encoding may not necessarily be predictive of the outcome. These are however, very popular encoding methods in Kaggle competitions.
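Both variants can be sketched with pandas on hypothetical data matching the 10-out-of-100 example above:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["blue"] * 10 + ["red"] * 90})

# Count encoding: replace each category by how often it appears
counts = df["colour"].value_counts()
df["colour_count"] = df["colour"].map(counts)

# Frequency encoding: replace each category by its fraction of the data set
freqs = df["colour"].value_counts(normalize=True)
df["colour_freq"] = df["colour"].map(freqs)
```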

In target encoding, also called mean encoding, we replace each category of a variable by the mean value of the target for the observations that show that category. For example, say we have the categorical variable “city”, and we want to predict whether the customer will buy a TV if we send them a letter. If 30 percent of the people in the city “London” buy the TV, we would replace London with 0.3.

This technique has 3 advantages:

  1. it does not expand the feature space,
  2. it captures some information regarding the target at the time of encoding the category, and
  3. it creates a monotonic relationship between the variable and the target.

Monotonic relationships between variable and target tend to improve linear model performance.
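A minimal sketch of target mean encoding with pandas, reusing a made-up version of the “city” example:

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["London"] * 10 + ["Paris"] * 10,
    "bought": [1] * 3 + [0] * 7 + [1] * 5 + [0] * 5,  # 30% buy in London, 50% in Paris
})

# Replace each category by the mean of the target for that category
city_means = df.groupby("city")["bought"].mean()
df["city_encoded"] = df["city"].map(city_means)
```

Note that in practice the means should be learned on the training set only, to avoid leaking the target into the encoding.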

In ordinal encoding we replace the categories by digits, either arbitrarily or in an informed manner. If we encode categories arbitrarily, we assign an integer per category from 1 to n, where n is the number of unique categories. If instead, we assign the integers in an informed manner, we observe the target distribution: we order the categories from 1 to n, assigning 1 to the category for which the observations show the highest mean of target value, and n to the category with the lowest target mean value.
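The informed variant can be sketched with pandas on toy data, ordering categories by target mean as described above:

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["London", "London", "Paris", "Paris", "Rome", "Rome"],
    "bought": [1, 1, 1, 0, 0, 0],
})

# Order categories by target mean (highest first) and assign integers 1..n
order = df.groupby("city")["bought"].mean().sort_values(ascending=False)
mapping = {cat: i + 1 for i, cat in enumerate(order.index)}
df["city_ordinal"] = df["city"].map(mapping)
```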

Weight of evidence (WOE) is a technique used to encode categorical variables for classification. WOE is the natural logarithm of the probability of the target being 1 divided by the probability of the target being 0.

WOE has the property that its value will be 0 if the phenomenon is random; it will be greater than 0 if the probability of the target being 1 is greater, and it will be smaller than 0 when the probability of the target being 0 is greater.

WOE transformation creates a nice visual representation of the variable, because by looking at the WOE encoded variable, we can see, category by category, whether it favors the outcome of 0, or of 1. In addition, WOE creates a monotonic relationship between variable and target, and leaves all the variables within the same value range.
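Following the simplified per-category definition above, a WOE sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city":   ["London"] * 4 + ["Paris"] * 4,
    "target": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Per category: WOE = ln( P(target = 1) / P(target = 0) )
p1 = df.groupby("city")["target"].mean()
woe = np.log(p1 / (1 - p1))
df["city_woe"] = df["city"].map(woe)
```

London (75% of targets are 1) gets a positive WOE; Paris (25%) a negative one, making it easy to see which categories favor each outcome.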

In rare label encoding, categories that are present in only a small proportion of the observations are grouped into an umbrella category like “Other” or “Rare”. This procedure tends to improve machine learning model generalization, in particular for tree-based methods, and also simplifies the operationalization of the models in production.

Grouping of infrequent categories — Image by the author
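A minimal sketch of grouping infrequent categories with pandas, using a made-up 10% frequency threshold:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["London"] * 8 + ["Paris"] * 8 + ["Oslo", "Rome", "Lima", "Kiev"],
})

# Group categories appearing in less than 10% of observations into "Rare"
freqs = df["city"].value_counts(normalize=True)
rare = freqs[freqs < 0.10].index
df["city"] = df["city"].where(~df["city"].isin(rare), "Rare")
```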

There are additional methods of categorical encoding, like Binary Encoding and Feature Hashing, which I will not cover in this post, but are covered in the course “Feature Engineering for Machine Learning”. For more information on these techniques you can also visit Will McGinnis’ blog.

3. Variable Transformation

Some machine learning models may benefit from a more homogeneous spread of values across the value range. If variables are not normally distributed, we can apply a mathematical transformation to “enforce” a more normal distribution.

Typically used mathematical transformations are:

  1. Logarithm transformation — log(x)
  2. Reciprocal transformation — 1 / x
  3. Square root transformation — sqrt(x)
  4. Exponential transformation — exp(x)
  5. Yeo-Johnson transformation
  6. Box-Cox transformation

Mathematical transformations may improve the spread of values — Image by the author

Box-Cox and Yeo-Johnson are adaptations of exponential transformations that span over several exponents, and are therefore more likely to achieve the desired result. You can find the transformation formulas in this article.

When applying mathematical transformations we need to be mindful of the variable values. For example, the logarithm is only defined for positive values, the square root for non-negative values, and the reciprocal transformation is not defined for 0.
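The simple transformations can be sketched with NumPy on a made-up, strictly positive, right-skewed variable:

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])  # strictly positive, right-skewed

log_x = np.log(x)    # logarithm: positive values only
sqrt_x = np.sqrt(x)  # square root: non-negative values only
recip_x = 1 / x      # reciprocal: not defined for 0
```

For Box-Cox and Yeo-Johnson, `scipy.stats.boxcox` and `scipy.stats.yeojohnson` estimate the exponent from the data.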

4. Discretization

Discretization refers to sorting the values of the variable into bins or intervals, also called buckets. There are multiple ways to discretize variables:

  1. Equal width discretization
  2. Equal frequency discretization
  3. Discretization using decision trees

Discretization of variables — Image by the author

In equal width discretization, the bins or interval limits are determined so that each interval is of the same width. This is accomplished by subtracting the minimum value from the maximum value of the variable, and dividing that range by the number of bins desired, say 10. Next, we sort the observations into those bins. Note however, that if the distribution is skewed, this technique does not improve the spread of the values.

In equal frequency discretization, the boundaries of the intervals are determined so that each bin contains the same number of observations. This is a better solution if we want to spread the values evenly across all bins. The usual approach is to use the percentiles, or quartiles to determine the intervals.
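Both approaches can be sketched with pandas' `cut` (equal width) and `qcut` (equal frequency) on a made-up skewed variable:

```python
import numpy as np
import pandas as pd

# A right-skewed variable: many small values, few large ones
x = pd.Series(np.exp(np.linspace(0, 5, 100)))

equal_width = pd.cut(x, bins=5)  # 5 intervals of identical width
equal_freq = pd.qcut(x, q=5)     # 5 intervals holding 20 observations each
```

On skewed data like this, equal width piles most observations into the first bin, while equal frequency spreads them evenly.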

Discretization with decision trees involves sorting the observations into the terminal leaves of a decision tree trained on the variable. Different leaves contain different numbers of observations, so unlike equal frequency discretization, this technique does not preserve frequency. Also, each leaf outputs a prediction value rather than an interval.

Discretization with decision trees can improve model performance by creating monotonic relationships that already capture some of the predictive power of the variable. This engineering technique was used in a KDD competition in 2009.

5. Outlier Engineering

Outliers are values that are unusually high or unusually low with respect to the rest of the observations of the variable. There are a few techniques for outlier handling:

  1. Outlier removal
  2. Treating outliers as missing values
  3. Top / bottom / zero coding — censoring outliers
  4. Discretization

Outlier removal refers to removing outlier observations from the data set. Outliers, by nature are not abundant, so this procedure should not distort the data set dramatically. But if there are outliers across multiple variables, we may end up removing a big portion of the data set.

We can treat outliers as missing information, and carry on any of the imputation methods described earlier in the post.

Top or bottom coding is also known as Winsorization, outlier capping or outlier censoring. The procedure involves capping the maximum and minimum values at a predefined value. This predefined value can be arbitrary, or it can be derived from the variable distribution.

How can we derive the maximum and minimum values? If the variable is normally distributed we can cap the maximum and minimum values at the mean plus or minus 3 times the standard deviation. If the variable is skewed, we can use the inter-quartile range proximity rule or cap at the top and bottom percentiles.
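The inter-quartile range proximity rule mentioned above can be sketched with pandas, on a toy variable with one obvious outlier:

```python
import pandas as pd

s = pd.Series([2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 50.0])  # 50 is an outlier

# IQR proximity rule: cap values beyond 1.5 * IQR from the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
capped = s.clip(lower, upper)
```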

Discretization handles outliers automatically, as outliers are sorted into the terminal bins, together with the other higher or lower value observations. The best approaches are equal frequency and tree based discretization.

6. Feature Scaling

Many machine learning algorithms are sensitive to the magnitude of the variables, therefore it is common practice to set all features within the same scale. There are multiple ways of feature scaling:

Feature standardization involves subtracting the mean from each value and dividing by the standard deviation. Feature standardization makes the variables have zero mean and unit variance, and it is suitable if the variables are normally distributed.

Min-Max Scaling, or Min-Max normalization, consists of re-scaling the variable to 0–1, which is achieved by subtracting the minimum from each value and dividing by the value range. The value range is calculated as the maximum minus the minimum value of the variable. Min-Max Scaling offers a good alternative to Standardization when variables are skewed.

Maximum Absolute Scaling involves scaling the features to the -1 to 1 range, by dividing each value of the variable by the maximum absolute value.

Robust Scaling involves subtracting the median from each value and dividing by the inter-quartile range, which is given by the difference between the 75th and 25th percentiles. The procedure is similar in essence to Min-Max Scaling, but offers a better value spread for highly skewed variables.

In mean normalization, we subtract the mean from each value, and divide by the value range, that is, the difference between the maximum and minimum values.

Scaling to unit length refers to transforming the values of the variable so that the complete variable vector has length one. In scaling to unit length, we divide each value of the variable by the Euclidean length, or the norm, of the variable.
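The first two scaling methods can be sketched with plain NumPy on a toy variable (Scikit-learn's `StandardScaler` and `MinMaxScaler` implement the same formulas):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# Standardization: zero mean, unit variance
standardized = (x - x.mean()) / x.std()

# Min-Max scaling: values re-scaled to the 0-1 range
min_max = (x - x.min()) / (x.max() - x.min())
```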

7. Date and Time Engineering

When dealing with date and time variables, we normally extract information like the year, month, day, day of the week, whether it is a weekend, the time of day, whether it is morning or afternoon, among others. In addition, we often extract information from multiple date time variables in combination, for example the age from the date of birth and the date of transaction, or the time elapsed between 2 dates, just to name a few.
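A minimal sketch of these extractions with pandas' `dt` accessor, on a hypothetical `purchase_date` column:

```python
import pandas as pd

df = pd.DataFrame({"purchase_date": pd.to_datetime(["2021-06-15", "2021-06-19"])})

# Extract calendar features from the datetime variable
df["year"] = df["purchase_date"].dt.year
df["month"] = df["purchase_date"].dt.month
df["dayofweek"] = df["purchase_date"].dt.dayofweek  # Monday = 0
df["is_weekend"] = (df["dayofweek"] >= 5).astype(int)
```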

8. Feature Creation

Feature creation refers to creating new features from existing ones. This can generally be done by aggregating features using the mean, maximum and minimum values, sums and differences. We could also perform polynomial and other non-linear combinations of the features.
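For example, with a made-up group of debt variables, the sum and mean combinations mentioned above look like:

```python
import pandas as pd

df = pd.DataFrame({
    "debt_1": [100.0, 0.0],
    "debt_2": [200.0, 50.0],
    "debt_3": [300.0, 100.0],
})

# New features from combinations of a group of existing variables
debt_cols = ["debt_1", "debt_2", "debt_3"]
df["total_debt"] = df[debt_cols].sum(axis=1)
df["mean_debt"] = df[debt_cols].mean(axis=1)
```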

Much of feature creation involves knowledge of the variables at hand to derive new features that are meaningful to people, if they are to be used in organizations. For data competitions, any brute force approach to create variables that are not necessarily comprehensible may give us an edge in the competition.

Feature creation is more commonly seen in Natural Language Processing, when creating bag of words or frequency tables from the words that appear in the text.

9. Aggregating Transaction Data

Transaction data refers to information recorded from transactions. For example, we can keep records of every sale done in a shop, or the balances in our bank and credit accounts throughout the months of the year. In order to use transaction data to predict static outcomes, we normally aggregate these variables into a static view.

Common ways of aggregating these variables include determining a time window, for example the last 6 months, and finding the maximum value transaction, the minimum value transaction, the mean, the sum, the standard deviation, among others.
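A minimal sketch of this aggregation with pandas, on made-up transaction records for two customers:

```python
import pandas as pd

transactions = pd.DataFrame({
    "customer": ["A", "A", "A", "B", "B"],
    "amount":   [10.0, 30.0, 20.0, 5.0, 15.0],
})

# Aggregate each customer's transactions into one static row
static = transactions.groupby("customer")["amount"].agg(["max", "min", "mean", "sum"])
```

In practice, one would first filter the transactions to the chosen time window (for example the last 6 months) before aggregating.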


Lead Data Scientist, author of “Python Feature Engineering Cookbook”, instructor of online courses on machine learning and developer of open-source Python code.
