Machine Learning Tutorial on Pipelines

A Machine Learning Pipeline performs a complete workflow as an ordered sequence of the steps involved in a Machine Learning task. In most Machine Learning projects, the raw data you work with is rarely in the right format to train a model at its best performance.

There are several steps in preparing data for a machine learning model, such as encoding categorical variables, feature scaling, and normalization. The preprocessing package of Scikit-Learn provides all of these functions, and they can easily be used as transformations.

You could, of course, write a function that applies all the transformations and reuse it on the original data, but you would still need to run it first and then call the model separately. Machine Learning Pipelines solve exactly this problem by chaining the transformations and the model into a single estimator. The most important benefits that Machine Learning Pipelines provide are:

  • Pipelines make the workflow of your task much easier to read and understand.
  • Pipelines enforce a robust, consistent implementation of the steps involved in your task.
  • In the end, they make your work more reproducible.
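To see the idea in isolation before the full project: the sketch below chains a scaler and a classifier into one estimator on small synthetic data (the data and the LogisticRegression choice are just illustrative assumptions, not part of the project that follows).

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data: 8 samples, 2 features on very different scales, binary labels
X = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 160.0], [4.0, 140.0],
              [5.0, 120.0], [6.0, 100.0], [7.0, 80.0], [8.0, 60.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# The pipeline chains scaling and fitting into a single estimator
model = Pipeline(steps=[('scaler', StandardScaler()),
                        ('clf', LogisticRegression())])
model.fit(X, y)     # runs scaler.fit_transform, then clf.fit
preds = model.predict(X)  # runs scaler.transform, then clf.predict
```

One `fit` call and one `predict` call replace the manual transform-then-fit sequence, which is the whole point of the sections below.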

In this article, I will take you through the implementation of Machine Learning Pipelines in a Machine Learning Project. First, I will transform the dataset according to our needs; then, I will move towards the implementation of the Machine Learning Pipelines.

Data Preparation (Transformation)

First, I will load the data using the pandas package in Python and inspect the type of each column:

import pandas as pd
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train = train.drop('Loan_ID', axis=1)
print(train.dtypes)

Gender object
Married object
Dependents object
Education object
Self_Employed object
ApplicantIncome int64
CoapplicantIncome float64
LoanAmount float64
Loan_Amount_Term float64
Credit_History float64
Property_Area object
Loan_Status object
dtype: object

Before building a Machine Learning Pipeline, I will split the training data into train and test sets to validate the performance of our model.

X = train.drop('Loan_Status', axis=1)
y = train['Loan_Status']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Building Machine Learning Pipelines

The first step in building a pipeline is to define each transformer. In simple words, this means creating one transformer per column type: one for the numeric features and one for the categorical features.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

Now I will use a Column Transformer to apply all the transformations to their respective columns in the dataframe.

numeric_features = train.select_dtypes(include=['int64', 'float64']).columns
categorical_features = train.select_dtypes(include=['object']).drop(['Loan_Status'], axis=1).columns
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])
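To make what the ColumnTransformer does concrete, here is a self-contained sketch on a hypothetical two-column frame (the column names 'income' and 'area' are made up for illustration): the numeric column is imputed and scaled, the categorical column is imputed and one-hot encoded, and the results are stacked side by side.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Hypothetical mini-frame: one numeric and one categorical column,
# each with a missing value
df = pd.DataFrame({'income': [1000.0, 2000.0, np.nan, 4000.0],
                   'area': ['Urban', np.nan, 'Rural', 'Urban']})

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, ['income']),
    ('cat', categorical_transformer, ['area'])])

out = preprocessor.fit_transform(df)
# 1 scaled numeric column + 3 one-hot columns ('Rural', 'Urban', 'missing')
```

Each sub-pipeline only ever sees its own columns, so the numeric imputer never touches strings and the encoder never sees numbers.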

Fitting Classifiers in Machine Learning Pipelines

The next step is to build a pipeline that can easily combine the transformations created above with a Classifier. In this task, I will choose a Random Forest Classifier.

from sklearn.ensemble import RandomForestClassifier
rf = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier', RandomForestClassifier())])

Now you can call the fit() method on the raw data; all the preprocessing steps will be applied before the classifier is trained:

rf.fit(X_train, y_train)

Predicting on new data is just as straightforward: call the predict() method, and all the preprocessing steps will be applied to the data first:

y_pred = rf.predict(X_test)
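To quantify how well the fitted pipeline predicts, the predictions can be compared against the held-out labels with accuracy_score. A minimal self-contained sketch on synthetic data (the data, the scaler-only preprocessing, and the random seeds are illustrative assumptions, since the loan dataset is not reproduced here):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # simple learnable rule

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

rf = Pipeline(steps=[('scaler', StandardScaler()),
                     ('classifier', RandomForestClassifier(random_state=0))])
rf.fit(X_train, y_train)           # preprocessing + training in one call
y_pred = rf.predict(X_test)        # preprocessing + prediction in one call
print("accuracy: %.3f" % accuracy_score(y_test, y_pred))
```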

Model Selection with Machine Learning Pipelines

Pipelines can also be used for model selection. Below, I will loop over a number of classification models provided by Scikit-Learn, applying the same transformations to each and training it:

from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="rbf", C=0.025, probability=True),
    NuSVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier()]
for classifier in classifiers:
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', classifier)])
    pipe.fit(X_train, y_train)
    print(classifier)
    print("model score: %.3f" % pipe.score(X_test, y_test))
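Instead of only printing the scores, it can be useful to collect them and pick the winner programmatically. The sketch below shows that pattern on synthetic data with a small, assumed subset of classifiers (the data, seeds, and scaler-only preprocessing are stand-ins for the loan dataset and ColumnTransformer above):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scores = {}
for clf in [KNeighborsClassifier(),
            DecisionTreeClassifier(random_state=42),
            RandomForestClassifier(random_state=42)]:
    pipe = Pipeline(steps=[('scaler', StandardScaler()),
                           ('classifier', clf)])
    pipe.fit(X_train, y_train)
    # key each score by the classifier's class name
    scores[type(clf).__name__] = pipe.score(X_test, y_test)

best = max(scores, key=scores.get)
print(best, scores[best])
```

Because every candidate runs through an identical pipeline, the comparison is fair: no classifier can accidentally benefit from different preprocessing.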

Pipelines can also be used to find the best-performing hyperparameters with the grid search algorithm: the parameters of a pipeline step are addressed as stepname__parameter. Now I will apply the pipeline with grid search:

param_grid = {
    'classifier__n_estimators': [200, 500],
    # 'auto' was removed in newer scikit-learn releases; use 'sqrt' there
    'classifier__max_features': ['auto', 'sqrt', 'log2'],
    'classifier__max_depth': [4, 5, 6, 7, 8],
    'classifier__criterion': ['gini', 'entropy']}
from sklearn.model_selection import GridSearchCV
CV = GridSearchCV(rf, param_grid, n_jobs=1)
CV.fit(X_train, y_train)
print(CV.best_params_)
print(CV.best_score_)

{'classifier__criterion': 'gini', 'classifier__max_depth': 4, 'classifier__max_features': 'auto', 'classifier__n_estimators': 200}
0.8124922696351268
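After the search finishes, GridSearchCV exposes the winning configuration as a fully fitted pipeline via best_estimator_, which can be used for prediction directly. A self-contained sketch on synthetic data with an assumed, smaller grid (data, seeds, and the scaler-only pipeline are illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = (X[:, 0] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

rf = Pipeline(steps=[('scaler', StandardScaler()),
                     ('classifier', RandomForestClassifier(random_state=1))])
param_grid = {'classifier__n_estimators': [50, 100],
              'classifier__max_depth': [3, 5]}

CV = GridSearchCV(rf, param_grid, n_jobs=1)
CV.fit(X_train, y_train)
print(CV.best_params_)
# best_estimator_ is a whole pipeline, refit on all the training data,
# so preprocessing is still applied automatically at prediction time
y_pred = CV.best_estimator_.predict(X_test)
```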

I work on a lot of Machine Learning projects. At the start of my career, I used to skip pipelines in my tasks, but since I started using them in my models, I find it much easier to work whenever I see the same kind of dataset. I hope you liked this article on Machine Learning Pipelines. Feel free to ask your valuable questions in the comments section below.
