Image from Pixabay

According to a survey by Forbes, data scientists and machine learning engineers spend around 60% of their time preparing data for analysis and machine learning. A large chunk of that time is spent on feature engineering.

Feature engineering is the process of taking a data set and constructing explanatory variables, or predictor features, that are then passed on to the prediction model to train a machine learning algorithm. It is a crucial step in all machine learning models, but it can often be challenging and very time-consuming.

Feature engineering involves aspects such as imputing missing values, encoding categorical variables, transforming and…


As companies and researchers rush to implement more and more machine learning practices in their organizations, they occasionally sacrifice an understanding of the complexities of statistical practice in order to achieve results faster. People rush to implement statistical methods without fully understanding the intricacies of the methods themselves, or what they sacrifice by rushing through the processes without putting the right controls in place.

Subsequently, the public’s wariness of manipulated statistics has increased, and reproducibility in any methodology has become extremely important. …


Getting Started

Find out the challenges encountered when developing machine learning pipelines for deployment and learn how open source software can help.

Streamlining Feature Engineering Pipelines — Image from Pixabay, no attribution required

In many organizations, we create machine learning models that process a group of input variables to output a prediction. Some of these models predict, for example, the likelihood of a loan being repaid, the probability of an application being fraudulent, whether a car should be repaired or replaced after an accident, or whether a customer is going to churn.

The raw data collected and stored by the organizations is almost never suitable to train new machine learning models, or be consumed by existing models to produce a prediction. Instead, we perform an extensive amount of transformation before the…


Logo for the Online Course Feature Selection for Machine Learning — Created by the Author

When building a machine learning model in a business setting, it’s rare that all the variables encompassing the available data will need to be incorporated in the model. Sure, adding more variables rarely makes a model less accurate, but there are certain disadvantages to including an excess of features.

In this article, I hope to discuss the importance of feature selection and how it works. We’ll go through the categories of feature selection and the most popular methods of each. …


Feature Engineering — Image from the author — All rights reserved

Data in its raw format is almost never ready to be used to train machine learning models, but we can transform the data to build suitable features for machine learning. The process of transforming the variables and creating new features is called feature engineering, and it is typically the stage where data scientists devote most of their effort in a machine learning project.

As Pedro Domingos said in the article “A few useful things to know about machine learning”:

“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? …


Feature Engineering for Machine Learning — Online Course — Image by the author

Feature engineering is the process of using domain knowledge of the data to transform existing features or to create new variables from existing ones, for use in machine learning.

Data in its raw format is almost never suitable for training machine learning algorithms. Instead, data scientists devote a substantial amount of time to pre-processing the variables before using them in machine learning.

Why do we need to engineer features?

There are various reasons why we engineer features:

  1. Some machine learning libraries, for example Scikit-learn, do not support missing values or strings as inputs.
  2. Some machine learning models are sensitive to the magnitude of the features…
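The two reasons listed above can be illustrated with a minimal pure-Python sketch. The helper names `ordinal_encode` and `min_max_scale` and the toy values are hypothetical, chosen only for illustration; in practice a library's encoders and scalers would normally do this work.

```python
def ordinal_encode(values):
    """Map each distinct string to an integer, in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return [mapping[v] for v in values], mapping


def min_max_scale(values):
    """Rescale numbers to the [0, 1] range so that magnitudes are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy data (assumed for illustration only).
colors = ["red", "blue", "red", "green"]
incomes = [20000, 50000, 35000, 80000]

encoded, mapping = ordinal_encode(colors)  # strings become integers
scaled = min_max_scale(incomes)            # values now lie in [0, 1]
```

Ordinal encoding is only one of several ways to turn strings into numbers, and min-max scaling is only one way to control magnitude; which is appropriate depends on the model and the variable.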

Feature-engine — Python open source — Image by the author

Feature-engine is an open source Python library with the most exhaustive battery of transformers to engineer features for use in machine learning models. Feature-engine simplifies and streamlines the implementation of an end-to-end feature engineering pipeline by allowing the selection of feature subsets within its transformers and by returning dataframes for easy data exploration. Feature-engine’s transformers preserve Scikit-learn functionality with the methods fit() and transform(), which learn parameters from the data and then transform it.
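The fit() and transform() pattern mentioned above can be sketched in plain Python. The `MeanImputer` class below is a hypothetical stand-in written for illustration, not Feature-engine's own API: it assumes rows are dictionaries, takes a subset of variables to act on, learns their means in fit(), and fills missing values in transform().

```python
from statistics import mean


class MeanImputer:
    """Hypothetical transformer: learns means in fit(), fills None in transform()."""

    def __init__(self, variables):
        # Subset of variables to impute; all others pass through untouched.
        self.variables = variables

    def fit(self, rows):
        # Learn one parameter per selected variable from the training rows.
        self.means_ = {
            var: mean(r[var] for r in rows if r[var] is not None)
            for var in self.variables
        }
        return self

    def transform(self, rows):
        # Return new rows; missing values in the selected variables are replaced.
        return [
            {
                k: (self.means_[k] if k in self.means_ and v is None else v)
                for k, v in r.items()
            }
            for r in rows
        ]


# Toy training data (assumed for illustration only).
train = [
    {"age": 20, "income": 100},
    {"age": None, "income": None},
    {"age": 40, "income": 300},
]

imputer = MeanImputer(variables=["age"]).fit(train)
result = imputer.transform(train)
```

Because the means are stored at fit() time, the same learned values can later be applied to new data, which is the point of separating the two methods.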

Feature engineering

Feature engineering is the process of using domain knowledge of the data to transform existing features or to create new variables from existing ones, for use in machine learning…


Photo by Luis Gomes from Pexels

For years, businesses and developers have understood the importance of testing software before deployment. Before it can interface with customers in real time, a business naturally wants the software to function as expected. With the increasing demand for machine learning implemented in business, it’s reasonable to expect that machine learning models deployed into production need to be tested just as rigorously.

However, for many organizations, machine learning model deployments are relatively new, and some don’t have sufficient knowledge or a foundation in place to test them as rigorously as they test software. …


Image from Pixabay — no attribution required

You must have heard about data science a lot nowadays. And why not? It is one of the hottest jobs of the 21st century.

Do you want to be a data scientist? There are tons of courses available online that will help you to get started. But which one to choose? In this article we recommend some great courses and books that will help you become a data scientist. For our recommendation, we considered the cost of the course and the knowledge you get from the course, prioritizing those that can be taken for free.

What is Data Science?

Data Science is a multidisciplinary…


You are thinking about becoming a data scientist. You have looked around and soon enough realized that R is one of the languages you need to learn. But what is R?

R is a free software environment designed for statistical computing and data analysis, widely used among statisticians, data miners and data scientists, both in industry and academia. In fact, R is one of the most widely used languages for data science, along with Python.

Why is R good for Data Science?

R was designed for statistical analysis and therefore the repertoire of statistical tests available in R is unbeatable. In addition, there is a growing number…

Soledad Galli, PhD

Lead Data Scientist, author of “Python Feature Engineering Cookbook”, instructor of online courses on machine learning and developer of open-source Python code.
