
Chaining Transformers (Pipelines)

What is a Pipeline?

A pipeline chains together multiple data processing steps (transformers) and a final estimator. Calling fit or predict on the pipeline runs the data through each step in order, which makes your workflow cleaner, more reproducible, and easier to tune.

Why Use Pipelines?

  • Organize preprocessing and modeling steps
  • Avoid data leakage
  • Make hyperparameter tuning easier
  • Cache steps for faster runs
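
The leakage point above is worth seeing in code: when a pipeline is cross-validated, the scaler is re-fitted on the training portion of each fold, so no statistics from the held-out fold leak into training. A minimal sketch (the LogisticRegression choice here is just for illustration):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Passing the whole pipeline to cross_val_score means the scaler is
# fitted only on each fold's training split, never on the test split.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000))
])
scores = cross_val_score(pipe, X, y, cv=5)
print("CV scores:", scores)
```

If you scaled X once up front and then cross-validated, the scaler would have seen the test folds, which is exactly the leakage pipelines prevent.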

Example: Creating a Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)

# Create a pipeline with scaling, PCA, and a classifier
pipe = Pipeline([
  ('scaler', StandardScaler()),
  ('pca', PCA(n_components=2)),
  ('clf', RandomForestClassifier())
])

# Fit the pipeline
pipe.fit(X_train, y_train)

# Predict
preds = pipe.predict(X_test)
print("Predictions:", preds)
  • Pipeline([...]): Chains steps together.
  • StandardScaler(): Scales features.
  • PCA(n_components=2): Reduces dimensions.
  • RandomForestClassifier(): Final model.
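
After fitting, you can inspect any individual step through the pipeline's named_steps attribute. A small sketch, reusing the same steps as above (random_state is added only to make the run repeatable):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', RandomForestClassifier(random_state=0))
])
pipe.fit(X, y)

# named_steps exposes each fitted step by the name you gave it
print(pipe.named_steps['pca'].explained_variance_ratio_)
```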

Caching Transformers for Speed

You can cache fitted transformers so that repeated fits with identical data and parameters (for example, during a grid search) reuse earlier results instead of recomputing them:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
import joblib

pipe = Pipeline([
  ('scaler', StandardScaler()),
  ('pca', PCA(n_components=2)),
  ('clf', RandomForestClassifier())
], memory=joblib.Memory(location='cache_dir'))
  • memory=joblib.Memory(location='cache_dir'): Caches transformers in the given directory.
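
The memory argument also accepts a plain directory path, in which case scikit-learn creates the joblib.Memory object for you. A self-contained sketch using a temporary directory so the example cleans up after itself:

```python
import tempfile
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
cache_dir = tempfile.mkdtemp()  # throwaway cache directory for this run

# memory can be a path string; sklearn wraps it in joblib.Memory internally
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', RandomForestClassifier(random_state=0))
], memory=cache_dir)

# The first fit caches the transformer results; a later fit with the
# same data and parameters loads them from cache_dir instead.
pipe.fit(X, y)
```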

Hyperparameter Search with Pipelines

You can search parameters for any step in the pipeline:

from sklearn.model_selection import GridSearchCV

param_grid = {
  'pca__n_components': [2, 3],
  'clf__n_estimators': [50, 100]
}
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
  • Use the step__param syntax (step name, double underscore, parameter name) to target parameters of individual steps in the pipeline.
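
If you are unsure what the valid step__param names are, get_params() lists every tunable parameter of the pipeline in exactly that form. A quick sketch:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', RandomForestClassifier())
])

# get_params() returns a dict keyed by step__param names
names = sorted(pipe.get_params().keys())
print([n for n in names if n.startswith('pca__')])
```

Any key printed here can go straight into a GridSearchCV param_grid.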

Visual: Pipeline Flow

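In a Jupyter notebook, scikit-learn can render a pipeline as an interactive diagram showing how data flows through the steps. A minimal sketch, assuming scikit-learn 0.23 or later (which provides set_config(display='diagram') and estimator_html_repr):

```python
from sklearn import set_config
from sklearn.utils import estimator_html_repr
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

set_config(display='diagram')  # notebooks now display estimators as diagrams

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', RandomForestClassifier())
])

# Outside a notebook, you can grab the diagram's HTML directly:
html = estimator_html_repr(pipe)
print(html[:60])
```

Simply evaluating pipe as the last expression in a notebook cell then shows the diagram.
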
When to Use

  • You have multiple preprocessing steps
  • You want to tune parameters for all steps
  • You want to avoid data leakage

Summary Table

Feature                  What it Does
Pipeline                 Chains steps together
Caching                  Speeds up repeated runs
Hyperparameter Search    Tunes all steps at once
Visualization            Shows the flow of data through the pipeline

Pipelines make your machine learning workflow clean, fast, and easy to manage.