Feature-engine: A new open source Python package for feature engineering


Feature-engine is an open source Python library with the most exhaustive battery of transformers to engineer features for use in machine learning models. Feature-engine simplifies and streamlines the implementation of an end-to-end feature engineering pipeline by allowing the selection of feature subsets within its transformers and by returning dataframes for easy data exploration. Feature-engine’s transformers preserve Scikit-learn functionality with the methods fit() and transform() to learn parameters from the data and then transform it.

Feature engineering

There are many techniques that we can use at each step of the feature engineering process, and our choice depends on the characteristics of the variables in our data set, as well as on the algorithms we intend to use.


Challenges of deploying a machine learning pipeline

By using well-established open source Python libraries, we can make model development and deployment more efficient and reproducible. Established open source packages provide quality tools whose use takes the task of coding these tools off our hands, improving team performance and collaboration. In addition, open source packages tend to be extensively tested, which helps prevent the introduction of bugs and guarantees reproducibility.


Feature-engine

Many feature engineering techniques need to learn parameters from the data, like statistical values or encoding mappings, to transform incoming data. The Scikit-learn-like functionality, with the fit and transform methods, makes Feature-engine easy to use and easy to learn. Feature-engine’s transformers also store the learned parameters, and can be used within the Scikit-learn Pipeline.

Feature-engine supports multiple transformers for missing data imputation, categorical variable encoding, discretization, variable transformation and outlier handling, thus providing the most exhaustive array of techniques for feature engineering. More specifically, Feature-engine supports the following techniques for each engineering aspect of a variable:

  - Missing data imputation
  - Categorical encoding
  - Discretization methods
  - Variable transformation methods
  - Outlier handling methods (exclusive)
  - Feature creation
  - Feature selection

What is unique about Feature-engine?

  1. Feature-engine contains the most exhaustive battery of feature engineering transformations
  2. Feature-engine allows us to select the variables to transform within the transformer
  3. Feature-engine takes in a dataframe and returns a dataframe suitable both for data exploration and production or deployment
  4. Feature-engine is compatible with the Scikit-learn pipeline, thus all engineering transformations can be stored in a single Python pickle
  5. Feature-engine automatically recognizes numerical and categorical variables
  6. Feature-engine will alert us when a transformation is not possible, for example when applying the logarithm to variables with negative values or dividing by variables that contain zeros

Feature-engine’s exhaustive variable transformation toolkit

Feature-engine offers a variety of exclusive techniques for categorical variable encoding. On top of the widely used one hot encoding and ordinal encoding, supported by Scikit-learn, and of target mean encoding and weight of evidence, supported by Category Encoders, Feature-engine also offers count and frequency encoding, monotonic ordinal encoding, probability ratio encoding and encoding with decision trees.

Feature-engine also offers functionality to handle rare labels, like one hot encoding of frequent categories or grouping infrequent categories under a common new label defined by the user.

Feature-engine hosts most mathematical transformations and discretization techniques available in Scikit-learn, and it has the additional functionality to use decision trees to transform a variable into discrete numbers. Finally, Feature-engine is, to the best of our knowledge, the only open source library with functionality to remove or censor outliers.

In one of its latest releases, Feature-engine added an extensive battery of feature selection tools that complement those already available in Scikit-learn, including hybrid feature selection methods developed in industry or reported in data science competitions.

Feature-engine allows the selection of variables within the transformer

This way, we can specify, directly within the transformer, the group of variables to impute with, for example, the mean, and the group of variables to impute with the mode, without the need to slice the dataframe manually or resort to alternative transformers. Code examples will follow later on in this article.
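
As a quick preview, here is a minimal sketch with an invented toy dataframe (the column names are made up for illustration, and the class names assume Feature-engine 1.0 or later). Each imputer receives its own list of variables:

import numpy as np
import pandas as pd
from feature_engine.imputation import CategoricalImputer, MeanMedianImputer

# toy dataframe with two numerical variables and one categorical variable
df = pd.DataFrame({
    'age': [20, np.nan, 31, 45],
    'income': [1000, 2500, np.nan, 3000],
    'city': ['London', np.nan, 'Madrid', 'London'],
})

# impute the numerical variables with the mean...
mean_imputer = MeanMedianImputer(imputation_method='mean',
                                 variables=['age', 'income'])

# ...and the categorical variable with the most frequent category (the mode)
mode_imputer = CategoricalImputer(imputation_method='frequent',
                                  variables=['city'])

df = mean_imputer.fit_transform(df)
df = mode_imputer.fit_transform(df)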

Feature-engine returns a dataframe

With Feature-engine, we can continue to leverage the power of pandas for data analysis and visualization even after transforming our data set, allowing for data exploration before and after transforming the variables.

Feature-engine is compatible with the Scikit-learn pipeline

With Feature-engine, we can store an entire series of machine learning transformations in a single object that can be saved and retrieved at a later stage, or kept in memory, for live scoring. Code examples follow later on in this article.

Feature-engine automatically recognizes numerical and categorical variables

This functionality also allows us to run the transformers without indicating which variables to transform: Feature-engine’s transformers are intelligent enough to apply numerical transformations to numerical variables and categorical transformations to categorical variables, so we can put together a benchmark machine learning pipeline on a given data set very quickly and with little data manipulation.
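
As a small sketch (again with an invented toy dataframe), leaving the variables argument empty lets the transformer find the pertinent variables on its own; in recent versions the selected variables are stored in the variables_ attribute after fit:

import numpy as np
import pandas as pd
from feature_engine.imputation import CategoricalImputer

df = pd.DataFrame({
    'age': [20, 31, np.nan, 45],                       # numerical
    'city': ['London', np.nan, 'Madrid', 'London'],    # categorical
})

# no variables indicated: the imputer finds the categorical variables itself
imputer = CategoricalImputer(imputation_method='frequent')
imputer.fit(df)
print(imputer.variables_)   # only the categorical variable 'city' was selected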

Feature-engine alerts when transformations are not possible for certain variables

This way, Feature-engine helps identify issues with the variables early on during the development of a machine learning pipeline, so that we can choose a more suitable technique.
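
As an illustrative sketch with a toy dataframe containing a negative value, the log transformer refuses to fit rather than silently producing NaNs (in recent Feature-engine versions this surfaces as a ValueError):

import pandas as pd
from feature_engine.transformation import LogTransformer

toy = pd.DataFrame({'x': [1.0, 2.0, -3.0]})

try:
    LogTransformer(variables=['x']).fit(toy)
except ValueError as err:
    # Feature-engine raises an error because the logarithm
    # is undefined for zero or negative values
    print(err)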

How to use Feature-engine

Feature-engine transformers learn parameters from the data when the method fit() is used, and store these parameters in their attributes. The stored values are then used to transform new data. In the following sections, we will show how to instantiate and fit a transformer, and how to use a fitted transformer to transform the train and test sets.

Missing data imputation

In the walk-through below, you can see how to set up the imputer with the median as the imputation_method, and how to impute the predictor variables in both the train and test sets. Mean imputation can be implemented similarly, by simply replacing “median” with “mean” in imputation_method.

If you wish to run the code below, first download and prepare the data set as indicated here.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.imputation import MeanMedianImputer

# Load dataset
data = pd.read_csv('creditApprovalUCI.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1),
    data['A16'],
    test_size=0.3,
    random_state=0
)

# Set up the imputer
median_imputer = MeanMedianImputer(
    imputation_method='median',
    variables=['A2', 'A3', 'A8', 'A11', 'A15']
)
# fit the imputer
median_imputer.fit(X_train)

# transform the data
X_train = median_imputer.transform(X_train)
X_test = median_imputer.transform(X_test)

After running the above code, the training set will no longer contain missing values in the variables A2, A3, A8, A11 and A15, and the output will be a dataframe that allows us to continue with data exploration, for example to understand the effect of the transformation on the variables’ distributions.
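
For example (a short sketch that assumes matplotlib is installed), we can check and visualize the imputed variables straight away with the usual pandas tools:

import matplotlib.pyplot as plt

# the imputer returned a dataframe, so pandas methods keep working
print(X_train[['A2', 'A3', 'A8', 'A11', 'A15']].isnull().sum())  # all zeros now

X_train['A2'].hist(bins=30)   # inspect the distribution after imputation
plt.title('A2 after median imputation')
plt.show()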

Categorical encoding

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine import encoding as ce

# Load dataset
def load_titanic():
    data = pd.read_csv(
        'https://www.openml.org/data/get_csv/16826755/phpMYEkMl'
    )
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'].fillna('C', inplace=True)
    return data

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'],
    test_size=0.3,
    random_state=0
)

# set up the encoder
encoder = ce.CountFrequencyEncoder(
    encoding_method='frequency',
    variables=['cabin', 'pclass', 'embarked']
)

# fit the encoder
encoder.fit(X_train)

# transform the data
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)

Feature-engine learns the category-to-number mappings from the train set, and stores them in the attribute encoder_dict_. The output is a dataframe, where the variables cabin, pclass and embarked are now numbers instead of strings.
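
We can inspect these mappings directly in the fitted encoder (the values shown in the comment are purely illustrative and depend on the train split):

# one dictionary of category-to-frequency mappings per encoded variable
print(encoder.encoder_dict_)
# e.g. {'cabin': {'n': 0.76, 'C': 0.07, ...}, 'pclass': {...}, 'embarked': {...}}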

Discretization

We will use the house prices data set, which is available on Kaggle.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine import discretisation as dsc

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0
)

# set up the discretisation transformer
disc = dsc.DecisionTreeDiscretiser(
    cv=3,
    scoring='neg_mean_squared_error',
    variables=['LotArea', 'GrLivArea'],
    regression=True
)

# fit the transformer
disc.fit(X_train, y_train)

# transform the data
train_t = disc.transform(X_train)
test_t = disc.transform(X_test)

The output of the transformation is a discrete variable, where each discrete value is the prediction returned by the decision tree based on the variable’s original value.
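
A quick check with the objects from the block above shows the effect: the transformed variable now takes only a handful of distinct values, one per terminal node of the fitted tree:

# distinct values before and after the tree-based discretisation
print(X_train['GrLivArea'].nunique())   # hundreds of original values
print(train_t['GrLivArea'].nunique())   # a small number of tree predictions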

Mathematical Transformation

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine import transformation as vt

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0
)

# set up the variable transformer
tf = vt.BoxCoxTransformer(
    variables=['LotArea', 'GrLivArea']
)

# fit the transformer
tf.fit(X_train)

# transform the data
train_t = tf.transform(X_train)
test_t = tf.transform(X_test)
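
The transformation returns a dataframe with the Box-Cox transformed variables. If needed, the exponent learned for each variable during fit can be inspected afterwards (a small sketch; in recent Feature-engine versions the fitted exponents are stored in the lambda_dict_ attribute):

# the Box-Cox lambda learned for each variable during fit
print(tf.lambda_dict_)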

Outlier Handling

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine import outliers as outr

# Load dataset
def load_titanic():
    data = pd.read_csv(
        'https://www.openml.org/data/get_csv/16826755/phpMYEkMl'
    )
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'].fillna('C', inplace=True)
    data['fare'] = data['fare'].astype('float')
    data['fare'].fillna(data['fare'].median(), inplace=True)
    data['age'] = data['age'].astype('float')
    data['age'].fillna(data['age'].median(), inplace=True)
    return data

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'],
    test_size=0.3,
    random_state=0
)

# set up the capper
capper = outr.Winsorizer(
    capping_method='gaussian',
    tail='right',
    fold=3,
    variables=['age', 'fare']
)

# fit the capper
capper.fit(X_train)

# transform the data
train_t = capper.transform(X_train)
test_t = capper.transform(X_test)

The output is a dataframe where the values of the variables age and fare that were above the upper boundary of the distribution, estimated as the mean plus 3 times the standard deviation (the Gaussian approximation set up in the capper), are now replaced by that boundary.
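
The learned boundaries are stored by the capper, so we can verify the capping with the objects above (a small sketch):

# upper boundaries learned during fit: mean + 3 standard deviations per variable
print(capper.right_tail_caps_)

# after transforming, no value exceeds those boundaries
print(train_t[['age', 'fare']].max())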

Assembling Feature-engine transformers into the Scikit-learn pipeline

In the following code snippet, we perform a complete feature engineering pipeline to the house prices data set, and then build a Lasso regression to predict house price, leveraging the power of the Scikit-learn pipeline:

import pandas as pd
import numpy as np

from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

from feature_engine import encoding as ce
from feature_engine import discretisation as dsc
from feature_engine import imputation as mdi

# load dataset
data = pd.read_csv('houseprice.csv')

# drop some variables
data.drop(
    labels=['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'Id'],
    axis=1,
    inplace=True
)

# make a list of categorical variables
categorical = [
    var for var in data.columns if data[var].dtype == 'O'
]

# make a list of numerical variables
numerical = [
    var for var in data.columns if data[var].dtype != 'O'
]

# make a list of discrete variables
discrete = [
    var for var in numerical if len(data[var].unique()) < 20
]

# categorical encoders work only with object type variables
# to treat numerical variables as categorical, we need to
# re-cast them
data[discrete] = data[discrete].astype('O')

# continuous variables
numerical = [
    var for var in numerical if var not in discrete
    and var not in ['Id', 'SalePrice']
]

# separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['SalePrice'], axis=1),
    data.SalePrice,
    test_size=0.1,
    random_state=0
)

# set up the pipeline
price_pipe = Pipeline([
    # add a binary missing indicator
    ('continuous_var_imputer', mdi.AddMissingIndicator(
        variables=['LotFrontage'])),

    # replace NA by the median
    ('continuous_var_median_imputer', mdi.MeanMedianImputer(
        imputation_method='median',
        variables=['LotFrontage', 'MasVnrArea'])),

    # replace NA by adding the label "Missing"
    ('categorical_imputer', mdi.CategoricalImputer(
        variables=categorical)),

    # discretise continuous variables using trees
    ('numerical_tree_discretiser', dsc.DecisionTreeDiscretiser(
        cv=3,
        scoring='neg_mean_squared_error',
        variables=numerical,
        regression=True)),

    # remove rare labels in categorical and discrete variables
    ('rare_label_encoder', ce.RareLabelEncoder(
        tol=0.03,
        n_categories=1,
        variables=categorical + discrete)),

    # encode categorical and discrete variables using the target mean
    ('categorical_encoder', ce.MeanEncoder(
        variables=categorical + discrete)),

    # scale features
    ('scaler', MinMaxScaler()),

    # Lasso
    ('lasso', Lasso(random_state=2909, alpha=0.005))
])

# train feature engineering transformers and Lasso
price_pipe.fit(X_train, np.log(y_train))

# predict
pred_train = price_pipe.predict(X_train)
pred_test = price_pipe.predict(X_test)

Note in the code above how we indicate which variables to transform within each of Feature-engine’s transformers. Also note how easy it is to train the algorithm and obtain predictions once all the transformers are assembled within a pipeline. If we want to deploy this pipeline, we only need to place a single Python object in memory to do the job, or save and retrieve a single Python pickle that contains the entire, pre-trained machine learning pipeline.
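
For example, a minimal sketch of persisting and re-loading the fitted pipeline with joblib (the file name is arbitrary):

import joblib

# persist the entire fitted pipeline: imputers, discretiser, encoders,
# scaler and Lasso, all in one object
joblib.dump(price_pipe, 'price_pipeline.joblib')

# later, for example in a scoring service, retrieve it and predict directly
loaded_pipe = joblib.load('price_pipeline.joblib')
preds = loaded_pipe.predict(X_test)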

Bonus: Scikit-learn wrapper

Feature-engine’s SklearnTransformerWrapper applies a Scikit-learn transformer to a selected group of variables and returns a dataframe. Here is how to use it:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.wrappers import SklearnTransformerWrapper

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0
)

# set up the wrapper with the SimpleImputer
imputer = SklearnTransformerWrapper(
    transformer=SimpleImputer(strategy='mean'),
    variables=['LotFrontage', 'MasVnrArea']
)

# fit the wrapper + SimpleImputer
imputer.fit(X_train)

# transform the data
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

Feature-engine’s Scikit-learn wrapper allows the application of most Scikit-learn transformers to a selected subset of the features, returning a dataframe.
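
The same pattern works with other Scikit-learn transformers; for example, a sketch (not part of the original walk-through) that scales just two columns and leaves the rest of the dataframe untouched:

from sklearn.preprocessing import StandardScaler
from feature_engine.wrappers import SklearnTransformerWrapper

# scale only the selected variables; all other columns pass through unchanged
scaler = SklearnTransformerWrapper(
    transformer=StandardScaler(),
    variables=['LotFrontage', 'MasVnrArea']
)

scaler.fit(X_train)                  # X_train was already imputed above
X_train = scaler.transform(X_train)  # still a pandas dataframe
X_test = scaler.transform(X_test)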

Closing remarks

Open source libraries with off-the-shelf algorithms for feature engineering and data transformation have a major edge over manually coding the transformation steps, as they enhance reproducibility while minimizing the amount of coding required from the data scientist.

In this blog, we presented Feature-engine and its exhaustive battery of techniques for missing data imputation, categorical variable encoding, variable transformation, discretization and outlier handling, and provided a few examples that show how easy it is to use.
