Chaining Transformers (Pipelines)

What is a Pipeline?

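A pipeline chains multiple data processing steps (transformers) and a final model into a single estimator. Calling fit on the pipeline fits each transformer in order and passes its output to the next step, which makes your workflow cleaner, more reproducible, and easier to tune.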

Why Use Pipelines?

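Organize preprocessing and modeling steps in a single object
Avoid data leakage: transformers are fitted on the training data only, including inside each cross-validation fold
Make hyperparameter tuning easier: parameters of every step can be searched together
Cache fitted transformers for faster repeated runs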

Example: Creating a Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)

# Create a pipeline with scaling, PCA, and a classifier
pipe = Pipeline([
  ('scaler', StandardScaler()),
  ('pca', PCA(n_components=2)),
  ('clf', RandomForestClassifier(random_state=42))  # fixed seed for reproducible results
])

# Fit the pipeline
pipe.fit(X_train, y_train)

# Predict
preds = pipe.predict(X_test)
print("Predictions:", preds)

Pipeline([...]): Chains steps together.
StandardScaler(): Standardizes each feature to zero mean and unit variance.
PCA(n_components=2): Projects the features onto 2 principal components.
RandomForestClassifier(): Final model.
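
To make the chaining concrete, here is a minimal sketch of what the pipeline does on your behalf, written out by hand. It assumes the same Iris setup as above; the fixed random_state=0 and the variable names (X_train_t, manual_preds, same_predictions) are choices made for this illustration only, so that the two versions can be compared.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pipeline version. The seed is fixed only so both versions are comparable.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", RandomForestClassifier(random_state=0)),
])
pipe.fit(X_train, y_train)
pipe_preds = pipe.predict(X_test)

# Manual version: fit every transformer on the training data only,
# then apply the already-fitted transformers (transform, not fit_transform)
# to the test data before predicting.
scaler = StandardScaler()
pca = PCA(n_components=2)
clf = RandomForestClassifier(random_state=0)

X_train_t = pca.fit_transform(scaler.fit_transform(X_train))
clf.fit(X_train_t, y_train)
manual_preds = clf.predict(pca.transform(scaler.transform(X_test)))

same_predictions = np.array_equal(pipe_preds, manual_preds)
print("Same predictions:", same_predictions)
```

The manual version needs five separate calls in the right order, and it is easy to call fit_transform on the test data by mistake. The pipeline collapses this into one fit and one predict.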

Caching Transformers for Speed

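You can cache fitted transformers so that a step is not refitted when it is run again with the same parameters and the same input data, for example across the candidates of a grid search: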

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
import joblib

pipe = Pipeline([
  ('scaler', StandardScaler()),
  ('pca', PCA(n_components=2)),
  ('clf', RandomForestClassifier())
], memory=joblib.Memory(location='cache_dir'))

memory=joblib.Memory(location='cache_dir'): Caches the fitted transformers in the given directory. The final estimator is never cached.
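
As a rough sketch of caching in action, the example below fits a cached pipeline twice and then looks at the cache directory. The temporary directory, the small n_estimators=10, and the variable names are assumptions made for this illustration. It also passes the cache location as a plain path string, which Pipeline accepts in place of a joblib.Memory object.

```python
import os
import shutil
import tempfile

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Temporary cache directory (an assumption for this sketch; any writable path works).
cache_dir = tempfile.mkdtemp()

pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=2)),
        ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
    ],
    memory=cache_dir,  # a path string works as well as a joblib.Memory object
)

pipe.fit(X, y)  # first fit: transformers are fitted and the results are written to the cache
pipe.fit(X, y)  # same data and parameters: fitted transformers are loaded from the cache

cache_entries = os.listdir(cache_dir)
print("Cache directory contents:", cache_entries)

# With caching enabled, the transformers are cloned before fitting,
# so inspect the fitted versions through named_steps.
pca_is_fitted = hasattr(pipe.named_steps["pca"], "components_")
print("PCA fitted:", pca_is_fitted)

# Remove the temporary cache once it is no longer needed.
shutil.rmtree(cache_dir)
```

Caching pays off mainly when fitting the transformers is expensive. On a dataset as small as Iris the disk overhead can outweigh the time saved.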

Hyperparameter Search with Pipelines

You can search parameters for any step in the pipeline:

from sklearn.model_selection import GridSearchCV

param_grid = {
  'pca__n_components': [2, 3],
  'clf__n_estimators': [50, 100]
}
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)

Use step__param (the step name, two underscores, then the parameter name) to set parameters for steps in the pipeline.
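
The same step__param naming works outside a grid search. The short sketch below uses get_params and set_params on a pipeline built like the one above; the variable names are chosen for this illustration.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", RandomForestClassifier()),
])

# Every tunable parameter is listed under its step__param name.
param_names = sorted(pipe.get_params().keys())
print([name for name in param_names if name.startswith("pca__")])

# Set parameters on individual steps through the pipeline.
pipe.set_params(pca__n_components=3, clf__n_estimators=50)

n_components = pipe.named_steps["pca"].n_components
n_estimators = pipe.named_steps["clf"].n_estimators
print("pca n_components:", n_components)
print("clf n_estimators:", n_estimators)
```

Printing the keys of get_params() is a quick way to check the exact names to use in a param_grid.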

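Visual: Pipeline Flow

Input features -> scaler (StandardScaler) -> pca (PCA) -> clf (RandomForestClassifier) -> predictions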
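
One way to inspect the flow yourself is to walk through the pipeline's steps attribute, as in this minimal sketch. The flow variable is a name chosen for this illustration. The set_config call at the end is optional and only has a visible effect in environments that render HTML, such as Jupyter notebooks.

```python
from sklearn import set_config
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", RandomForestClassifier()),
])

# Plain-text view of the flow: step names in execution order.
flow = " -> ".join(name for name, _ in pipe.steps)
print(flow)  # prints: scaler -> pca -> clf

# Show which estimator class sits behind each step name.
for name, step in pipe.steps:
    print(f"{name}: {type(step).__name__}")

# In a notebook, this setting makes `pipe` render as an HTML diagram
# when it is the last expression in a cell.
set_config(display="diagram")
```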

When to Use

You have multiple preprocessing steps
You want to tune parameters for all steps
You want to avoid data leakage
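
To illustrate the data leakage point, here is a short sketch that passes a whole pipeline to cross_val_score. Because the scaler and PCA are part of the estimator being cross-validated, they are refitted on the training portion of each fold and never see that fold's held-out samples. The choice of cv=5 and the fixed random_state are assumptions for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Each fold fits scaler, PCA, and classifier on that fold's training split only,
# then scores on the held-out split.
scores = cross_val_score(pipe, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Scaling the full dataset first and cross-validating only the classifier would let statistics from the held-out samples influence preprocessing, which is the leak the pipeline prevents.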

Summary Table

Feature	What it Does
Pipeline	Chains steps together
Caching	Speeds up repeated runs
Hyperparameter Search	Tunes all steps at once
Visualization	Shows the flow of data through the pipeline

Pipelines make your machine learning workflow clean, fast, and easy to manage.
