How To Build And Deploy A Reproducible Machine Learning Pipeline
As companies and researchers rush to implement more and more machine learning practices in their organizations, they occasionally sacrifice understanding the complexities of statistical practice in order to achieve results faster. People rush to implement statistical methods without fully understanding the intricacies of the methods themselves, or what they sacrifice by rushing through the process without putting the right controls in place.
Subsequently, the public’s wariness of manipulated statistics has increased, and reproducibility in any methodology has become extremely important. Though on the surface reproducibility in machine learning pipelines might seem as simple as documenting the processes, actually deploying these pipelines introduces many unforeseen but non-negligible challenges to achieving true reproducibility.
In this blog post, I will give a brief introduction to what deploying reproducible machine learning pipelines actually means, the main hindrances, and one proposed solution for overcoming these challenges.
For details about the technical implementation, visit our online course Deployment of Machine Learning Models.
What Is Model Deployment?
Deployment of machine learning models refers to making the models available in a production environment. Here, the models can provide predictions to other software systems and clients; they can take live data as inputs and give the most updated results.
Machine learning models can only add value once they are deployed into production. Before deployment, their reach is small, and their value is limited to only a few isolated predictions. With the time and effort required to generate models in the first place, naturally we’d want to maximize the benefits achieved by them. This is, however, potentially the most challenging step in a machine learning pipeline primarily because it involves coordination between the data scientists, IT teams, DevOps, software developers, and business professionals to get the model from research to production. The transfer between all these different teams and environments prompts the challenges of reproducibility in machine learning pipelines, as I’ll discuss later in this blog post.
Machine Learning Pipelines vs. Models
A machine learning pipeline encompasses all the steps required to get a prediction from data. A machine learning model, however, is only a piece of this pipeline. While a model describes a specific algorithm or method for using patterns in data to generate predictions, a pipeline outlines all the steps involved in a machine learning process, from gathering data to acquiring predictions.
The first step in the pipeline is to make data available to both the data scientists and the software developers who build or implement the models. The data is generally gathered from in-house databases, the cloud, or third-party APIs.
Typically, the data is not immediately suitable for machine learning models, as it comes directly from the data sources. The pipeline needs a feature engineering process, where any cleaning or transforming of the data will occur so that it can be utilized properly. This can include removing or treating missing values, transforming variables, or even creating new features from existing ones.
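As a sketch of this step (the income and debt columns are hypothetical), the snippet below imputes a missing value with the column mean, transforms a skewed variable, and derives a new feature from existing ones:

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Clean and transform raw data into model-ready features."""
    out = df.copy()
    # Treat missing values: replace with the column mean
    out["income"] = out["income"].fillna(out["income"].mean())
    # Transform a skewed variable
    out["log_income"] = np.log1p(out["income"])
    # Create a new feature from existing ones
    out["debt_to_income"] = out["debt"] / out["income"]
    return out

raw = pd.DataFrame({"income": [40000.0, None, 60000.0],
                    "debt": [10000.0, 5000.0, 12000.0]})
features = engineer_features(raw)
```

Keeping each of these operations inside one function makes the transformation easy to reuse verbatim in a different environment.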
Once this is complete, feature selection and model building make up the final step in the pipeline. Rather than these processes taking place separately, they tend to weave together into one iterative step. Feature selection and model building describe the act of selecting the most predictive features for the model and using historic data to train it. Narrowing down the features used in the final model makes it more robust, and multiple iterations of model building can help identify which subset of features gives the most accurate predictions without over-fitting.
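A minimal sketch of this iterative loop, assuming Scikit-learn and a synthetic data set standing in for real historical data, tries several feature-subset sizes and keeps the most accurate:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the historical training set
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=42)

# Iterate over candidate subset sizes; feature selection and model
# building weave together into one step via cross-validation
best_score, best_n = 0.0, None
for n_features in (2, 4, 6, 8):
    selector = RFE(LogisticRegression(max_iter=1000),
                   n_features_to_select=n_features)
    score = cross_val_score(selector, X, y, cv=5).mean()
    if score > best_score:
        best_score, best_n = score, n_features
```

The cross-validated score guards against over-fitting while narrowing down the feature set.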
The pipeline results in gaining predictions from the input data, hopefully producing the desired outcome and satisfying the required regulations. Reproducing the pipeline, however, can be a little tricky, as we’ll see in the remainder of this blog post.
Importance of Reproducibility in Machine Learning
Reproducibility, in this sense, refers to the ability to duplicate a model precisely, so that given the exact same raw data as an input, the reproduced model will return the same output.
Lack of reproducibility can have numerous negative effects. From a financial standpoint, if one were to invest significant resources into creating a model in a research environment but were unable to reproduce it in a production environment, little benefit would come of that model and its predictions. Beyond the financial loss, this would also mean wasted time and effort. The machine learning model serves little use outside of the production environment.
Most importantly, however, reproducibility is crucial for replicating prior results. Without it, one could never accurately determine whether new models exceed previous ones. Consequently, deploying an unverified machine learning algorithm risks damaging one’s reputation: if the model can’t be reliably reproduced, users may come to distrust its results.
Challenges to Reproducibility
Creating and developing a machine learning pipeline involves working across multiple environments: typically the research environment, the development environment, and the production environment.
In the research environment, data scientists analyze the data, build the models, and evaluate them to ensure they produce the desired outcome. Thus, they build the first machine learning pipeline.
Afterwards, the software developers reproduce this same machine learning pipeline in the development environment. This incorporates the steps required to send the data into the proper software and take the prediction out of the model.
In the final step, the model goes into the production environment. At long last, the model is able to serve the customers. As depicted in the image below, each of these environments has its own pipeline that replicates the original, but each has different processes for getting there.
The transfer of these pipelines causes significant hindrances to reproducibility. Due to the multiple pipelines in different environments throughout the deployment process, challenges in reproducibility arise in almost every individual step of a pipeline between data gathering and prediction.
Since the machine learning model is highly dependent on the data used to train it, data gathering is one of the most significant and difficult challenges to address when it comes to reproducibility. A model will never come out exactly the same unless the exact same data and processes are used to train it.
Data scientists may use one training data set in the research environment, but often, the programmers trying to implement the model in the production environment won’t have access to the same data. Some databases update constantly and overwrite older records, making the training data for a given model unavailable after enough time has passed. Additionally, if data scientists use SQL to query data, the rows come back in no guaranteed order unless explicitly sorted, so this must be accounted for when moving between pipelines.
A couple of current solutions exist to address these issues, but neither is completely reliable in all situations. Data scientists could save a snapshot of the data used to train the model, loading it from the database and storing it elsewhere for future use. However, this won’t be an option if the data is too large. Additionally, new regulations on data use, like the European GDPR, might not permit storing the data elsewhere, so this solution is far from foolproof.
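Where a snapshot is feasible, a sketch of the idea is to save the training data to a file and keep a content hash so the snapshot can be verified later (the file name and columns here are illustrative):

```python
import hashlib
import tempfile
from pathlib import Path

import pandas as pd

def snapshot_training_data(df: pd.DataFrame, path: Path) -> str:
    """Save a copy of the training data and return a content hash
    so the snapshot can be verified in another environment."""
    csv_bytes = df.to_csv(index=False).encode("utf-8")
    path.write_bytes(csv_bytes)
    return hashlib.sha256(csv_bytes).hexdigest()

df = pd.DataFrame({"x": [1, 2, 3], "y": [0, 1, 0]})
snapshot_path = Path(tempfile.mkdtemp()) / "train_snapshot.csv"
digest = snapshot_training_data(df, snapshot_path)

# Later, in another environment, confirm the snapshot is unchanged
restored = snapshot_path.read_bytes()
assert hashlib.sha256(restored).hexdigest() == digest
```

Storing the digest under version control alongside the model gives a cheap check that both environments trained on the same bytes.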
The ideal approach would be to design the data sources with accurate timestamps, which could then clearly identify the data used for training. However, if the business’s databases are not designed to capture timestamps, redesigning them to do so may require a large effort. Nor does this address data that is constantly updated and replaced.
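As an illustration, assuming a table with a created_at column (shown here with SQLite and made-up rows), a timestamped and explicitly ordered query can pin down exactly the rows used for training:

```python
import sqlite3

# In-memory database standing in for a real, timestamped data source
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, value REAL, created_at TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                 [(1, 0.5, "2024-01-01"),
                  (2, 0.7, "2024-02-01"),
                  (3, 0.9, "2024-03-01")])

TRAINING_CUTOFF = "2024-02-15"
# Select only rows that existed at training time, with a deterministic order
rows = conn.execute(
    "SELECT id, value FROM events WHERE created_at <= ? ORDER BY id",
    (TRAINING_CUTOFF,)).fetchall()
```

Recording the cutoff alongside the model makes the same training set recoverable later, provided the rows are never overwritten.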
Most of the problems with feature creation in reproducibility propagate from the problems with data gathering in the first place. Any parameters derived from the data will not remain the same unless the data is identical in both environments.
For example, a common method of missing-value treatment is to replace missing data with the mean of that particular variable. In this scenario, if the training data differs between environments, the difference carries over into the mean as well. Problems such as these disappear as long as the training data is reproducible, but as discussed above, that problem may not be easy to solve.
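One way to keep such derived parameters consistent, sketched here with Scikit-learn's SimpleImputer, is to learn the mean once from the training data and reuse the fitted object in production rather than recomputing it from live data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [np.nan], [3.0], [5.0]])
X_live = np.array([[np.nan], [2.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)  # learns mean = 3.0 from the training data only

# In production, reuse the stored training mean instead of recomputing
# it from whatever data happens to be flowing through live
X_live_filled = imputer.transform(X_live)
```

Persisting the fitted imputer (for example with joblib) ties the imputation constant to the model rather than to the environment.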
More complicated hindrances to reproducibility in feature creation occur when extracting a feature requires more complex calculations. For example, if a certain feature is an aggregate of another over time, one can’t recalculate its values without all the past data present in each environment. Hyperparameters need to remain constant between environments as well to ensure reproducibility. The most common example occurs when features rely on random samples; in this case, reproducibility won’t hold unless the same seed is set in every environment and the random samples are generated in the same order and manner. Programmers and data scientists can solve these problems (besides those that depend solely on reproducible training data) by tracking how they create features under version control. The three following principles outline best practices for feature provenance, and abiding by them will mitigate most risks to reproducibility.
One, data scientists should generate each feature with independent code. Two, implemented features should be immutable: once created, they should not change; instead, any new feature that depends on existing ones should occupy a new column. Three, exercise caution when stacking multiple models or creating an ensemble.
These guidelines boil down to the general principle that minimizing dependencies will help ensure uniformity between environments and reduce errors when replicating pipelines in different environments. In the ideal situation with identical training data, abiding by these guidelines should support reproducible feature creation.
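The first two principles might look like the following sketch, where each feature has its own independent function and derived features always occupy new columns (the age-based features are hypothetical):

```python
import pandas as pd

# Each feature has its own independent function; existing columns are
# never overwritten, so features remain immutable once created.
def add_age_squared(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(age_squared=df["age"] ** 2)

def add_is_senior(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(is_senior=df["age"] >= 65)

df = pd.DataFrame({"age": [30, 70]})
df = add_is_senior(add_age_squared(df))
```

Because each function depends only on its inputs, the same version-controlled code produces the same features in any environment.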
The machine learning algorithms themselves also cause significant challenges to reproducibility. Similarly to some instances of feature creation, certain machine learning models require randomness for training. Common examples of this scenario include tree ensembles, cross validation, and neural networks. Tree ensembles require random feature and data extraction, cross validation relies on random data partitions, and neural networks use randomness to initialize their weights. The randomness causes slight differences between models, even ones with the same training data; these models then won’t meet the requirements of reproducibility.
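For instance, with a tree ensemble in Scikit-learn, fixing the random_state makes two independently trained models identical; a minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# With the same data and the same seed, the two forests are identical,
# despite the random feature and data extraction during training
model_a = RandomForestClassifier(n_estimators=50, random_state=7).fit(X, y)
model_b = RandomForestClassifier(n_estimators=50, random_state=7).fit(X, y)
assert (model_a.predict(X) == model_b.predict(X)).all()
```

The same pattern applies to cross-validation splitters and neural-network weight initialization: fix the seed, and record it with the model.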
Another potential problem arises when working with arrays. Certain APIs used to build models take arrays rather than data frames. Unlike data frames, arrays don’t have named features, so column order is the only reliable way to identify them. In these cases, programmers and data scientists need to take extra care to ensure the features are always passed in the correct order.
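One simple safeguard, sketched below with a recorded feature order (the column names are illustrative), is to convert data frames to arrays through a single function that always applies the training-time ordering:

```python
import numpy as np
import pandas as pd

# The column order recorded at training time, kept under version control
FEATURE_ORDER = ["age", "income", "tenure"]

def to_model_input(df: pd.DataFrame) -> np.ndarray:
    """Convert a data frame to the array layout the model was trained on."""
    return df[FEATURE_ORDER].to_numpy()

# Live data may arrive with columns in any order; the conversion fixes it
live = pd.DataFrame({"income": [50000.0], "tenure": [3.0], "age": [41.0]})
X = to_model_input(live)
```

Routing every environment through the same conversion function removes the chance of silently scrambled columns.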
As with feature creation, simple solutions can combat most of these threats to reproducibility. Data scientists must take extra care to record the order in which they pass features, record the hyperparameters used, set seeds where needed, and mind the structure if the final model is an ensemble of models.
Many of the challenges to reproducibility in model deployment revolve around incomplete or erroneous integration with other business systems. The data scientists creating the initial models and the programmers implementing the final ones often don’t fully understand each other’s needs or processes, which results in discrepancies on both sides about what the model integration requires. One of the most common manifestations: the population used to train the models doesn’t quite reflect what the data streams in the live environment provide. Either the live or research environment uses filters no one was aware of when building the models, or the data scientists creating the models don’t fully understand how the business systems will consume them.
Mismatches in programming languages between environments also create obstacles to reproducibility. When engineers receive code to implement in a separate language, the likelihood of human error and deployment error increases significantly while transferring the work. If programmers and data scientists use the same language throughout, and consequently the same code, they can avoid most of these errors.
Similarly, slightly different software versions between environments occasionally cause small differences in the pipelines. Though less common than most of the other challenges, these are more difficult to troubleshoot. Therefore, it’s best to ensure the exact same software versions across every environment the pipeline passes through.
With so many challenges to reproducibility, it may seem like an impossible task to eliminate all of them. Some of these obstacles have simple solutions, but others might require significant internal changes; however, all need to be addressed to achieve true reproducibility.
Architecture for Reproducible Machine Learning Pipelines
Is attaining true reproducibility challenging? Certainly. But is it impossible? No. With the following proposed architecture, one can maximize reproducibility in the deployment of machine learning pipelines. This is by no means the only solution, but it tends to work well in finance, insurance, and healthcare.
Reproducibility is the primary requirement, but other important ones include: compatibility with external libraries, generality, and scalability. I will address these briefly without lingering too much on them, as the main focus here is reproducibility.
Modularity, or splitting the project into components, helps ensure generality and scalability. For compatibility, the architecture follows Scikit-learn’s API conventions, which are considered the industry standard. On their own, such machine learning packages aren’t sufficient for reproducibility, as they are limited to a single training model. This proposed architecture, however, reaches beyond their scope, covering data provenance, feature provenance, ensemble models, and the preservation of software environments.
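As an example of following those conventions, a custom transformer that implements Scikit-learn's fit/transform interface can be dropped into any pipeline and reused unchanged across environments; this mean-imputing transformer is a minimal sketch:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanImputer(BaseEstimator, TransformerMixin):
    """A minimal transformer following Scikit-learn's fit/transform API,
    so it composes with Pipeline and works identically in research
    and production."""

    def fit(self, X, y=None):
        # Learn per-column means from the training data only
        self.means_ = np.nanmean(np.asarray(X, dtype=float), axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float).copy()
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = np.take(self.means_, cols)
        return X

X_train = np.array([[1.0, 10.0], [np.nan, 30.0]])
imp = MeanImputer().fit(X_train)
filled = imp.transform(X_train)
```

Because the class honors the standard interface, it can be serialized with the rest of a fitted pipeline and shipped between environments as a single artifact.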
As mentioned above, the entire architecture centers around the need for reproducibility. Its central element is that the primary components are used in both the research and production environments.
This includes everything from libraries and packages to their exact versions. Additionally, any new software created to satisfy the business needs will be available in the research environment. This way, one can push the pipeline directly between environments, which ensures reproducibility and subsequently increases re-usability and development speed.
The data layer provides access to all the necessary data sources, which will then serve to train the models in both the research and development environment.
Within the feature layer, feature engineering and creation occurs. This will contain a version-controlled collection of features with clearly defined requirements and transformations. The scoring layer performs the final feature transformations. It also builds the models and generates the predictions.
Finally, the evaluation layer assesses each model’s performance and compares it against other models. This doesn’t require creating the whole pipeline from scratch; instead, it utilizes libraries already in place, such as Scikit-learn, TensorFlow, and others. We can use Docker to containerize the packages used and Git for version control. Through the shared infrastructure in research and production, we maximize the probability of obtaining a reproducible pipeline.
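One lightweight way to preserve the software environment, sketched in Python, is to record the exact versions of key libraries alongside the trained model and fail fast in production if they differ (the pinned package list is illustrative):

```python
import importlib.metadata as md
import json
import tempfile
from pathlib import Path

# Record exact versions of key libraries alongside the trained model
PINNED = ["numpy", "scikit-learn"]
versions = {pkg: md.version(pkg) for pkg in PINNED}

env_file = Path(tempfile.mkdtemp()) / "environment.json"
env_file.write_text(json.dumps(versions))

# In production, fail fast if any version differs from the research one
recorded = json.loads(env_file.read_text())
for pkg, ver in recorded.items():
    assert md.version(pkg) == ver, f"{pkg} version mismatch"
```

A Docker image pinned to the same versions achieves the stronger guarantee; this check is a cheap safety net on top.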
Assessment of Reproducibility
To test the reproducibility of the architecture, one can refer back to the basic definition of reproducibility in the case of machine learning pipelines. Given the same inputs into the production and the research layer, both environments’ models should return the same outputs. If this is not the case, hopefully one can troubleshoot and rework the architecture until it is.
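A minimal sketch of this test, using a Scikit-learn pipeline with all seeds fixed: build the pipeline identically twice, as the research and production environments would, and assert that the predictions match:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

def build_pipeline() -> Pipeline:
    # Identical construction in both environments, with all seeds fixed
    return Pipeline([("scale", StandardScaler()),
                     ("model", GradientBoostingRegressor(random_state=11))])

research = build_pipeline().fit(X, y)
production = build_pipeline().fit(X, y)

# The basic reproducibility test: same inputs, same outputs
assert np.array_equal(research.predict(X), production.predict(X))
```

Running this check in continuous integration catches any environment drift before the model reaches customers.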
Once satisfied with the results, the next step is to push these pipeline modules into an API. There, the pipeline can receive live data and respond to client requests with live predictions. In this way, the company or its clients can realize the full benefits of the machine learning model.
Through its reproducibility requirement, the architecture maximizes the model’s effectiveness and allows reproducibility to be demonstrated. This makes for more consistent, reliable, and verifiable machine learning pipelines. Though in some cases restructuring won’t be easy, the long-term benefits will meet the increasingly stringent data and statistics standards set by society as the availability of, and demand for, machine learning continues to grow.
References and additional resources