Feature-engine: A new open source Python package for feature engineering

Feature-engine — Python open source — Image by the author

Feature-engine is an open source Python library with the most exhaustive battery of transformers to engineer features for use in machine learning models. Feature-engine simplifies and streamlines the implementation of an end-to-end feature engineering pipeline by allowing the selection of feature subsets within its transformers and by returning dataframes for easy data exploration. Feature-engine’s transformers preserve Scikit-learn functionality with the methods fit() and transform(), which learn parameters from the data and then transform it.

Feature engineering

Feature engineering is the process of using domain knowledge of the data to transform existing features or to create new variables from existing ones, for use in machine learning. Feature engineering includes procedures to impute missing data, encode categorical variables, transform or discretize numerical variables, put features in the same scale, combine features into new variables, extract information from dates, aggregate transaction data, or to derive features from time series, text or even images.

There are many techniques that we can use at each of these feature engineering steps, and our choice depends on the characteristics of the variables in our data set, as well as on the algorithms we intend to use.

Challenges of deploying a machine learning pipeline

Machine learning models take a set of input variables and output a prediction. Yet the raw data collected and stored by organizations is almost never suitable to be fed directly into a machine learning model. Instead, we perform an extensive series of transformations to leave the variables in a form that these algorithms can understand. This collection of variable transformations is commonly referred to as feature engineering.

By using well-established open source Python libraries, we can make model development and deployment more efficient and reproducible. Established open source packages provide quality tools whose use takes much of the coding off our hands, improving team performance and collaboration. In addition, open source packages tend to be extensively tested, which reduces the risk of introducing bugs and supports reproducibility.

Feature-engine

Feature-engine is an open source Python library that simplifies and streamlines the implementation of an end-to-end feature engineering pipeline. Feature-engine preserves Scikit-learn functionality with the methods fit() and transform(), which learn parameters from the data and then transform it.

Many feature engineering techniques need to learn parameters from the data, like statistical values or encoding mappings, in order to transform incoming data. Because it follows the Scikit-learn convention of fit() and transform() methods, Feature-engine is easy to learn and easy to use. Feature-engine’s transformers also store the learned parameters, and can be used within the Scikit-learn Pipeline.
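
To make the fit() and transform() pattern concrete, here is a minimal sketch of a transformer that learns medians during fit(), stores them in an attribute, and reuses them during transform(). The class name MedianImputerSketch and the attribute medians_ are hypothetical and exist only for this illustration; this is not Feature-engine's implementation.

```python
import numpy as np
import pandas as pd


class MedianImputerSketch:
    """Hypothetical, minimal imputer illustrating the fit/transform pattern."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # learn one parameter per selected variable from the training data
        self.medians_ = X[self.variables].median().to_dict()
        return self

    def transform(self, X):
        # reuse the stored parameters and return a new dataframe
        X = X.copy()
        for var, value in self.medians_.items():
            X[var] = X[var].fillna(value)
        return X


train = pd.DataFrame(
    {"age": [20.0, 30.0, np.nan, 40.0], "city": ["a", "b", "a", None]}
)
test = pd.DataFrame({"age": [np.nan, 50.0], "city": ["b", "a"]})

imputer = MedianImputerSketch(variables=["age"]).fit(train)
test_t = imputer.transform(test)
```

The median is learned from the train set only, so the test set is imputed with a value it did not contribute to.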

Feature-engine supports multiple transformers for missing data imputation, categorical variable encoding, discretization, variable transformation and outlier handling, thus providing the most exhaustive array of techniques for feature engineering. More specifically, Feature-engine supports the following techniques for each engineering aspect of a variable:

Missing data imputation

Categorical encoding

Discretization methods

Variable transformation methods

Outlier handling methods (Exclusive)

Feature Creation

Feature Selection

What is unique about Feature-engine?

Feature-engine has the following characteristics that differentiate it from other available open source packages:

  1. Feature-engine contains the most exhaustive battery of feature engineering transformations
  2. Feature-engine allows us to select the variables to transform within the transformer
  3. Feature-engine takes in a dataframe and returns a dataframe suitable both for data exploration and production or deployment
  4. Feature-engine is compatible with the Scikit-learn pipeline, thus all engineering transformations can be stored in a single Python pickle
  5. Feature-engine automatically recognizes numerical and categorical variables
  6. Feature-engine will alert when transformations are not possible, for example if applying logarithm to negative variables or divisions by variables with 0s as values

Feature-engine’s exhaustive variable transformation toolkit

Feature-engine hosts all-round transformations to leave the data ready for machine learning. In addition to the widely used imputation techniques like mean, median, mode and arbitrary imputation, which are also supported by Scikit-learn, Feature-engine also supports imputation with values at the end of the distribution, and imputation by random sampling.
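
As an illustration of the two less common imputation techniques mentioned above, the following pandas sketch replaces missing values with a value at the right end of the distribution (here assumed to be the mean plus 3 standard deviations) and, separately, with values drawn at random from the observed data. The column name and the factor of 3 are assumptions chosen for the example, and the code is not taken from Feature-engine.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"income": [10.0, 12.0, 14.0, np.nan, 16.0, 18.0]})

# --- imputation with a value at the end of the distribution ---
# learn the replacement value from the observed training values only
mean = train["income"].mean()
std = train["income"].std()
end_tail_value = mean + 3 * std  # assumed rule: mean + 3 standard deviations

train_tail = train.fillna({"income": end_tail_value})

# --- imputation by random sampling ---
rng = np.random.default_rng(0)
observed = train["income"].dropna().to_numpy()
missing_mask = train["income"].isna()

# draw one observed value for every missing entry
random_values = rng.choice(observed, size=int(missing_mask.sum()), replace=True)

train_random = train.copy()
train_random.loc[missing_mask, "income"] = random_values
```

End-of-tail imputation places missing observations far from the bulk of the data, whereas random sampling aims to preserve the original distribution of the variable.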

Feature-engine also offers a variety of exclusive techniques for categorical variable encoding. On top of the widely used one hot encoding and ordinal encoding, supported by Scikit-learn, and of target mean encoding and weight of evidence, supported by category encoders, Feature-engine also offers count and frequency encoding, monotonic ordinal encoding, probability ratio encoding and encoding with decision trees.

Feature-engine also offers functionality to handle rare labels, like one hot encoding of frequent categories or grouping infrequent categories under a common new label defined by the user.
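
A minimal pandas sketch of grouping infrequent categories could look as follows. The 20% threshold and the label "Rare" are assumptions made for the example; the user would choose both.

```python
import pandas as pd

train = pd.DataFrame({"city": ["London"] * 6 + ["Paris"] * 3 + ["Oslo"]})

tol = 0.2  # assumed threshold: categories below 20% of observations are rare

# fraction of observations per category, learned from the train set
freq = train["city"].value_counts(normalize=True)
frequent = freq[freq >= tol].index

# keep frequent categories, replace everything else with a common label
train_t = train.copy()
train_t["city"] = train_t["city"].where(train_t["city"].isin(frequent), "Rare")
```

Here "Oslo" accounts for 10% of the observations and is therefore replaced by the new label, while "London" and "Paris" are kept.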

Feature-engine hosts most mathematical transformations and discretization techniques available in Scikit-learn, and it has the additional functionality to use decision trees to transform a variable into discrete numbers. Finally, Feature-engine is, to the best of our knowledge, the only open source library with functionality to remove or censor outliers.

In one of the latest releases, Feature-engine included an extensive battery of feature selection tools, in addition to those already available in Scikit-learn, providing hybrid methods of feature selections, developed in the industry or reported in data science competitions.

Feature-engine allows the selection of variables within the transformer

One of the reasons why Feature-engine’s transformers are so convenient is that they allow us to select which variables we wish to transform with each technique, directly in the transformer.

This way, we can specify the group of variables which, for example, we want to impute with the mean, and the group of variables to impute with the mode, directly within these transformers, without the need to slice the dataframe manually or use alternative transformers. Code examples will follow later on in this article.

Feature-engine returns a dataframe

All Feature-engine transformers return dataframes as outputs. This means that after transforming our data set, we do not need to worry about variable names and column order as we would do with the NumPy arrays returned by Scikit-learn.
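
The difference is easy to see with a small example. The sketch below uses Scikit-learn's SimpleImputer with its default output, which is a NumPy array, and then restores the column names and index by hand. The toy data is made up for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X = pd.DataFrame({"age": [20.0, np.nan, 40.0], "fare": [7.5, 8.0, np.nan]})

# by default, Scikit-learn transformers return a NumPy array:
# the column names and the index are lost
arr = SimpleImputer(strategy="mean").fit_transform(X)

# restoring the column names and the index manually
X_t = pd.DataFrame(arr, columns=X.columns, index=X.index)
```

With a transformer that returns a dataframe, the manual reconstruction step is not needed. Recent Scikit-learn releases also offer a set_output option that can return dataframes, if your installed version supports it.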

With Feature-engine, we can continue to leverage the power of pandas for data analysis and visualization even after transforming our data set, allowing for data exploration before and after transforming the variables.

Feature-engine is compatible with the Scikit-learn pipeline

Feature-engine transformers are compatible with the Scikit-learn pipeline. This allows the implementation of many feature engineering steps within a single Scikit-learn pipeline prior to training a machine learning algorithm, or obtaining its predictions from raw data.

With Feature-engine, we can store an entire series of machine learning transformations in a single object that can be saved and retrieved at a later stage, or kept in memory for live scoring. Code examples follow later in this article.

Feature-engine automatically recognizes numerical and categorical variables

Feature-engine automatically recognizes numerical and categorical variables, thus reducing the risk of inadvertently applying categorical encoding to numerical variables, or numerical imputation techniques to categorical variables.

This functionality also allows us to run the transformers without indicating which variables to transform: Feature-engine transformers apply numerical transformations to numerical variables and categorical transformations to categorical variables. As a result, we can obtain a benchmark machine learning pipeline on a given data set very quickly and without a lot of data manipulation.
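
One simple way to separate the two groups of variables is to rely on the column data types. The pandas sketch below illustrates the idea only; how Feature-engine performs the detection internally is not covered in this article.

```python
import pandas as pd

X = pd.DataFrame({
    "age": [22, 38, 26],
    "fare": [7.25, 71.28, 7.92],
    "sex": ["male", "female", "female"],
    "cabin": ["C", "B", "C"],
})

# numerical variables: integer and float columns
numerical = X.select_dtypes(include="number").columns.tolist()

# categorical variables: everything that is not numerical
categorical = X.select_dtypes(exclude="number").columns.tolist()
```

Note that a variable such as a passenger class stored as integers would land in the numerical group with this rule, which is why the examples later in this article cast such variables to object type.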

Feature-engine alerts when transformations are not possible for certain variables

Feature-engine will alert when transformations are not possible. For categorical encoding, for example, Feature-engine will signal the unexpected / unintended introduction of missing values. For variable transformations, Feature-engine will alert when logarithm is being applied on negative variables or when reciprocal transformations are applied on variables with 0s as values.

This way, Feature-engine helps identify issues with the variables early on during the development of a machine learning engineering pipeline, so that we can choose a more suitable technique.
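
The kind of check described above can be sketched in a few lines. The helper safe_log below is hypothetical and written for this illustration; it refuses to apply the logarithm when a selected variable contains zero or negative values, instead of silently producing NaN or infinite values.

```python
import numpy as np
import pandas as pd


def safe_log(X, variables):
    """Hypothetical helper: log-transform variables, refusing non-positive values."""
    if (X[variables] <= 0).any().any():
        raise ValueError(
            "Some variables contain zero or negative values; "
            "the logarithm cannot be applied."
        )
    X = X.copy()
    X[variables] = np.log(X[variables])
    return X


good = pd.DataFrame({"area": [1.0, 10.0, 100.0]})
bad = pd.DataFrame({"area": [1.0, -5.0, 100.0]})

good_t = safe_log(good, ["area"])

try:
    safe_log(bad, ["area"])
    raised = False
except ValueError:
    raised = True
```

Failing early with an explicit message makes the problem visible during development, rather than at prediction time.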

How to use Feature-engine

In the rest of the article, we will show examples of how to use Feature-engine transformers for missing data imputation, categorical encoding, discretization and variable transformation. Let’s begin with missing data imputation, which is typically the first step of a machine learning pipeline.

Feature-engine transformers learn parameters from data when the method fit() is used, and store these parameters within their attributes. These values can then be retrieved to transform new data. In the following sections, we will show how to instantiate and fit a transformer, and how to use a trained transformer to transform a train and a test set.

Missing data imputation

Missing data imputation refers to replacing missing observations by a statistical parameter derived from the available values of the variable. As an example of Feature-engine’s imputation capabilities, we will perform median imputation. Feature-engine’s MeanMedianImputer automatically selects all numerical variables in the data set for imputation, ignoring the categorical variables. The transformer also offers the option to select the variables to impute, as we will show below.

In the walk-through below, you can see the implementation of the imputer using the median as the imputation_method on the predictor variables of both the train and test sets. Mean imputation can be implemented similarly by replacing “median” with “mean” for imputation_method.

If you wish to run the code below, first download and prepare the data set as indicated here.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.imputation import MeanMedianImputer

# Load dataset
data = pd.read_csv('creditApprovalUCI.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1),
    data['A16'],
    test_size=0.3,
    random_state=0,
)

# Set up the imputer
median_imputer = MeanMedianImputer(
    imputation_method='median',
    variables=['A2', 'A3', 'A8', 'A11', 'A15'],
)
# fit the imputer
median_imputer.fit(X_train)

# transform the data
X_train = median_imputer.transform(X_train)
X_test = median_imputer.transform(X_test)

After running the above code, the train and test sets will not contain missing values in the variables A2, A3, A8, A11 and A15. The output is a dataframe, which allows us to continue with data exploration, for example to understand the effect of this transformation on the variables’ distributions.

Categorical encoding

Categorical encoding includes techniques to transform variables that contain strings as values, into numerical variables. To demonstrate how to use Feature-engine’s categorical encoders, we will perform Count encoding, that is, we will replace the categories by the number of times they appear in the train set. We will use the titanic data set, which is publicly available in OpenML.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine import encoding as ce

# Load dataset
def load_titanic():
    data = pd.read_csv(
        'https://www.openml.org/data/get_csv/16826755/phpMYEkMl'
    )
    data = data.replace('?', np.nan)
    # keep only the first letter of the cabin
    data['cabin'] = data['cabin'].astype(str).str[0]
    # cast pclass as object so that it is treated as categorical
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'] = data['embarked'].fillna('C')
    return data

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'],
    test_size=0.3,
    random_state=0,
)

# set up the encoder
# 'count' replaces each category by its number of observations;
# use 'frequency' to replace it by the fraction of observations instead
encoder = ce.CountFrequencyEncoder(
    encoding_method='count',
    variables=['cabin', 'pclass', 'embarked'],
)

# fit the encoder
encoder.fit(X_train)

# transform the data
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)

Feature-engine learns the category-to-number mappings from the train set, and stores them in the attribute encoder_dict_. The output is a dataframe, where the variables cabin, pclass and embarked are now numbers instead of strings.
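
To see what such a mapping looks like, here is a plain pandas sketch of count encoding on a toy variable. It also shows why an encoder should warn about unexpected missing values: a category that was not present in the train set has no mapping. The data is made up for the illustration.

```python
import pandas as pd

train = pd.DataFrame({"embarked": ["S", "S", "C", "S", "Q", "C"]})
test = pd.DataFrame({"embarked": ["C", "S", "X"]})  # "X" never appears in train

# learn the category-to-count mapping from the train set only
mapping = train["embarked"].value_counts().to_dict()

# replace categories by their counts
train_t = train["embarked"].map(mapping)
test_t = test["embarked"].map(mapping)

# categories unseen during fit have no mapping and become NaN
n_unseen = int(test_t.isna().sum())
```

In this sketch the category "X" of the test set is turned into a missing value, which is the situation the alert described earlier is meant to catch.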

Discretization

Discretization involves sorting the values of continuous variables into discrete intervals, also called bins or buckets. Here, we will show how to perform discretization using decision trees, a technique supported exclusively by Feature-engine.

We will use the house prices data set, which is available on Kaggle.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine import discretisation as dsc

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0,
)

# set up the discretisation transformer
disc = dsc.DecisionTreeDiscretiser(
    cv=3,
    scoring='neg_mean_squared_error',
    variables=['LotArea', 'GrLivArea'],
    regression=True,
)

# fit the transformer
disc.fit(X_train, y_train)

# transform the data
train_t = disc.transform(X_train)
test_t = disc.transform(X_test)

The output of the transformation is a discrete variable, where each discrete value is the prediction returned by the decision tree based on the variable’s original value.
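
The idea can be reproduced with Scikit-learn alone. The sketch below fits a shallow regression tree on a single synthetic variable and replaces each value with the prediction of the leaf it falls into. The tree depth is fixed at 2 for brevity; the cv and scoring parameters in the example above suggest that the transformer selects the tree through cross-validation instead.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# synthetic data: one continuous predictor and a noisy linear target
rng = np.random.default_rng(0)
X_train = pd.DataFrame({"area": rng.uniform(50, 300, size=200)})
y_train = 1000 * X_train["area"] + rng.normal(0, 5000, size=200)

# a shallow tree fitted on a single variable
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X_train[["area"]], y_train)

# each observation is replaced by the prediction of its leaf
area_disc = pd.Series(tree.predict(X_train[["area"]]), index=X_train.index)
n_bins = area_disc.nunique()
```

A tree of depth 2 has at most 4 leaves, so the transformed variable takes at most 4 distinct values.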

Mathematical Transformation

Mathematical transformations refer to the transformation of the original variable by applying any mathematical function, typically to try and obtain a Gaussian distribution. Here, we will demonstrate how to implement the Box-Cox transformation with Feature-engine:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine import transformation as vt

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0,
)

# set up the variable transformer
tf = vt.BoxCoxTransformer(
    variables=['LotArea', 'GrLivArea']
)

# fit the transformer
tf.fit(X_train)

# transform the data
train_t = tf.transform(X_train)
test_t = tf.transform(X_test)
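
For readers who want to see the learn-then-apply logic behind a Box-Cox transformation, here is a sketch with scipy.stats.boxcox on synthetic, strictly positive data. The exponent lambda is estimated on the train set and then reused on the test set. This illustrates the technique and is not a description of Feature-engine's internals.

```python
import numpy as np
from scipy import stats

# synthetic data: strictly positive and right-skewed
rng = np.random.default_rng(0)
train = rng.lognormal(mean=0.0, sigma=1.0, size=500)
test = rng.lognormal(mean=0.0, sigma=1.0, size=100)

# learn lambda on the train set
train_t, lmbda = stats.boxcox(train)

# reuse the same lambda on the test set
test_t = stats.boxcox(test, lmbda=lmbda)

# skewness before and after the transformation
skew_before = stats.skew(train)
skew_after = stats.skew(train_t)
```

Box-Cox is only defined for positive values, which is one of the cases where an alert about unsuitable variables is useful.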

Outlier Handling

Outliers are those values of a variable that are extremely unusual given the rest of the values of said variable. Among its functionality, Feature-engine allows us to remove or censor outliers, based on the Gaussian approximation, the inter-quartile range proximity rule or the percentiles. Here, we will demonstrate how to censor outliers by finding the variable limits using the IQR:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine import outliers as outr

# Load dataset
def load_titanic():
    data = pd.read_csv(
        'https://www.openml.org/data/get_csv/16826755/phpMYEkMl'
    )
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'] = data['embarked'].fillna('C')
    # cast to float and fill missing values with the median
    data['fare'] = data['fare'].astype('float')
    data['fare'] = data['fare'].fillna(data['fare'].median())
    data['age'] = data['age'].astype('float')
    data['age'] = data['age'].fillna(data['age'].median())
    return data

data = load_titanic()

# Separate into train and test sets
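X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'],
    test_size=0.3,
    random_state=0,
)

# set up the capper
# capping_method='iqr' finds the limits with the inter-quartile range rule
# (older releases of Feature-engine named this parameter 'distribution')
capper = outr.Winsorizer(
    capping_method='iqr',
    tail='right',
    fold=3,
    variables=['age', 'fare'],
)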

# fit the capper
capper.fit(X_train)

# transform the data
train_t = capper.transform(X_train)
test_t = capper.transform(X_test)

The output is a dataframe, where the values of the variables age and fare that were beyond the boundaries of the distribution determined by the IQR, are now replaced by those boundaries.
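
The inter-quartile range proximity rule itself is short enough to write out. The pandas sketch below computes the upper boundary as the 75th percentile plus a multiple of the IQR, learned on a toy train set, and caps the right tail of both sets at that boundary. The data is made up for the illustration.

```python
import pandas as pd

train = pd.DataFrame({"fare": [7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 500.0]})
test = pd.DataFrame({"fare": [9.0, 1000.0]})

fold = 3  # same multiplier as in the example above

# learn the upper boundary from the train set
q1 = train["fare"].quantile(0.25)
q3 = train["fare"].quantile(0.75)
upper = q3 + fold * (q3 - q1)

# right tail only: values above the boundary are replaced by the boundary
train_t = train["fare"].clip(upper=upper)
test_t = test["fare"].clip(upper=upper)
```

Values within the boundary are left untouched; only the extreme fares of 500 and 1000 are replaced.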

Assembling Feature-engine transformers into the Scikit-learn pipeline

In the preceding sections, we showed how to implement each technique individually. When we build machine learning models, we usually perform various transformations to the variables. We can place all Feature-engine transformers within a Scikit-learn pipeline, to streamline data transformation and algorithm training, as well as to easily score new raw data.

In the following code snippet, we apply a complete feature engineering pipeline to the house prices data set, and then build a Lasso regression to predict house price, leveraging the power of the Scikit-learn pipeline:

import pandas as pd
import numpy as np

from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline as pipe
from sklearn.preprocessing import MinMaxScaler

from feature_engine import encoding as ce
from feature_engine import discretisation as dsc
from feature_engine import imputation as mdi

# load dataset
data = pd.read_csv('houseprice.csv')

# drop some variables
data.drop(
    labels=['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'Id'],
    axis=1,
    inplace=True,
)

# make a list of categorical variables
categorical = [
    var for var in data.columns if data[var].dtype == 'O'
]

# make a list of numerical variables
numerical = [
    var for var in data.columns if data[var].dtype != 'O'
]

# make a list of discrete variables
discrete = [
    var for var in numerical if len(data[var].unique()) < 20
]

# categorical encoders work only with object type variables
# to treat numerical variables as categorical, we need to
# re-cast them
data[discrete] = data[discrete].astype('O')

# continuous variables
numerical = [
    var for var in numerical
    if var not in discrete and var not in ['Id', 'SalePrice']
]

# separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['SalePrice'], axis=1),
    data.SalePrice,
    test_size=0.1,
    random_state=0,
)

# set up the pipeline
price_pipe = pipe([
    # add a binary missing indicator
    ('continuous_var_imputer', mdi.AddMissingIndicator(
        variables=['LotFrontage'])),

    # replace NA by the median
    ('continuous_var_median_imputer', mdi.MeanMedianImputer(
        imputation_method='median',
        variables=['LotFrontage', 'MasVnrArea'])),

    # replace NA by adding the label "Missing"
    ('categorical_imputer', mdi.CategoricalImputer(
        variables=categorical)),

    # discretise continuous variables using trees
    ('numerical_tree_discretiser', dsc.DecisionTreeDiscretiser(
        cv=3,
        scoring='neg_mean_squared_error',
        variables=numerical,
        regression=True)),

    # group rare labels in categorical and discrete variables
    ('rare_label_encoder', ce.RareLabelEncoder(
        tol=0.03,
        n_categories=1,
        variables=categorical + discrete)),

    # encode categorical and discrete variables using the target mean
    ('categorical_encoder', ce.MeanEncoder(
        variables=categorical + discrete)),

    # scale features
    ('scaler', MinMaxScaler()),

    # Lasso
    ('lasso', Lasso(random_state=2909, alpha=0.005)),
])

# train feature engineering transformers and Lasso
price_pipe.fit(X_train, np.log(y_train))

# predict
pred_train = price_pipe.predict(X_train)
pred_test = price_pipe.predict(X_test)

Note in the code above how we indicate which variables to transform within each of the Feature-engine transformers. Note also how easy it is to train the algorithm and to obtain predictions once all transformers are assembled within a pipeline. If we want to deploy this pipeline, we need only place one Python object in memory to do the job, or save and retrieve a single Python pickle that contains the entire, pre-trained machine learning pipeline.
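
Saving and restoring a fitted pipeline could look like the sketch below. To keep the example self-contained, it uses a small stand-in pipeline with Scikit-learn steps only and synthetic data; a pipeline that contains Feature-engine transformers should be handled in the same way, since it is also a Scikit-learn Pipeline object.

```python
import pickle

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# synthetic data for the illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# a small stand-in pipeline: scaling followed by a Lasso regression
stand_in_pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("lasso", Lasso(alpha=0.005, random_state=0)),
])
stand_in_pipe.fit(X, y)

# serialize the fitted pipeline to bytes and restore it
payload = pickle.dumps(stand_in_pipe)
restored = pickle.loads(payload)

# the restored pipeline returns the same predictions
same = np.allclose(stand_in_pipe.predict(X), restored.predict(X))
```

In practice the bytes would be written to a file with pickle.dump() and loaded with pickle.load() in the scoring environment. Only unpickle files from sources you trust, and keep the library versions consistent between training and scoring.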

Bonus: Scikit-learn wrapper

Scikit-learn transformers like the SimpleImputer or any of the variable scalers like the StandardScaler or the MinMaxScaler, transform the entire input dataset and return a NumPy array. If we want to apply these transformers to a subset of features, we can use the Scikit-learn wrapper available in Feature-engine.

Here is how to do it:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.wrappers import SklearnTransformerWrapper

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0,
)

# set up the wrapper with the SimpleImputer
imputer = SklearnTransformerWrapper(
    transformer=SimpleImputer(strategy='mean'),
    variables=['LotFrontage', 'MasVnrArea'],
)

# fit the wrapper + SimpleImputer
imputer.fit(X_train)

# transform the data
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

Feature-engine’s Scikit-learn wrapper allows the application of most Scikit-learn transformers to a selected feature subspace, returning a dataframe.
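
Conceptually, applying a Scikit-learn transformer to a subset of features and getting a dataframe back amounts to the following steps, shown here by hand with a StandardScaler on toy data. This is a sketch of the idea and not the wrapper's actual code.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({
    "LotArea": [8000.0, 9500.0, 11000.0],
    "GrLivArea": [1500.0, 1700.0, 1900.0],
    "Street": ["Pave", "Pave", "Grvl"],
})
variables = ["LotArea", "GrLivArea"]

# fit the Scikit-learn transformer on the selected columns only
scaler = StandardScaler().fit(X[variables])

# transform the selected columns and write them back into a copy,
# leaving the remaining columns untouched
X_t = X.copy()
X_t[variables] = scaler.transform(X[variables])
```

The result keeps the original column names and order, with only the selected variables scaled.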

Closing remarks

Feature engineering is the process of taking a data set and constructing explanatory variables, or predictor features, that are then passed onto the prediction model to train a machine learning algorithm. It is a crucial step in all machine learning models, but can be challenging and time consuming if you aren’t already deeply familiar with the knowledge domain.

Open source libraries with off-the-shelf algorithms for feature engineering and data transformation have a major edge over manually coding the transformation steps, as they enhance reproducibility while minimizing the amount of coding required by the data scientist.

In this blog post, we explored the main characteristics of Feature-engine and its exhaustive battery of techniques for missing data imputation, categorical variable encoding, variable transformation, discretization and outlier handling, and provided a few examples that show how easy it is to use.

Additional Resources

Lead Data Scientist, author of “Python Feature Engineering Cookbook”, instructor of online courses on machine learning and developer of open-source Python code.