Chaining Transformers (Pipelines)
What is a Pipeline?
A pipeline lets you chain together multiple data processing steps (transformers) and a final model. This makes your workflow cleaner, more reproducible, and easier to tune.
Why Use Pipelines?
- Organize preprocessing and modeling steps
- Avoid data leakage: transformers are fit on the training data only, never on held-out data
- Make hyperparameter tuning easier
- Cache steps for faster runs
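The data leakage point can be sketched as follows. This is a minimal illustration, not part of the original example: because the scaler is a pipeline step, cross_val_score refits it on the training portion of each split, so the held-out fold never influences the scaling statistics. LogisticRegression, max_iter=1000, and the name leak_free are assumptions chosen for this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler is a pipeline step, so every CV split refits it
# on that split's training fold only (no statistics from the test fold).
leak_free = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),  # illustrative choice of model
])

scores = cross_val_score(leak_free, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Scaling the full dataset before splitting would instead let test-fold statistics leak into training, which is the situation the pipeline avoids.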
Example: Creating a Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load data
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)
# Create a pipeline with scaling, PCA, and a classifier
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', RandomForestClassifier())
])
# Fit the pipeline
pipe.fit(X_train, y_train)
# Predict
preds = pipe.predict(X_test)
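print("Predictions:", preds)
- Pipeline([...]): Chains the steps together, in the order listed.
- StandardScaler(): Standardizes each feature to zero mean and unit variance.
- PCA(n_components=2): Reduces the data to 2 principal components.
- RandomForestClassifier(): The final model, trained on the transformed data.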
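Once fitted, the individual steps remain accessible. The sketch below rebuilds the same pipeline and then inspects it through named_steps and score. The random_state=0 on the classifier and the names fitted_pca, accuracy, and reduced are additions for this illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=42
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', RandomForestClassifier(random_state=0)),  # seed added for repeatability
])
pipe.fit(X_train, y_train)

# named_steps maps each step name to its fitted estimator
fitted_pca = pipe.named_steps['pca']
print("Explained variance ratio:", fitted_pca.explained_variance_ratio_)

# score() pushes X_test through the transformers, then scores the classifier
accuracy = pipe.score(X_test, y_test)
print("Test accuracy:", accuracy)

# Slicing off the last step gives a transformer-only pipeline
reduced = pipe[:-1].transform(X_test)
print("Shape after scaling and PCA:", reduced.shape)
```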
Caching Transformers for Speed
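You can cache fitted transformers to speed up repeated runs. This pays off when the same transformer is refit with the same parameters on the same data, as often happens during a hyperparameter search: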
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
import joblib
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', RandomForestClassifier())
], memory=joblib.Memory(location='cache_dir'))
- memory=joblib.Memory(location='cache_dir'): Caches transformers in the given directory.
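A self-contained variant of the caching setup might look like the sketch below. It assumes a throwaway temporary directory (created with tempfile.mkdtemp and removed afterwards) rather than the cache_dir used above, and passes the path as a plain string, which the memory parameter also accepts. The names cached_pipe and cached_preds are illustrative.

```python
import shutil
import tempfile

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

cache_dir = tempfile.mkdtemp()  # temporary location, assumed for this sketch
try:
    cached_pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=2)),
        ('clf', RandomForestClassifier(random_state=0)),
    ], memory=cache_dir)  # a directory path works as well as a joblib.Memory

    cached_pipe.fit(X, y)  # first fit: transformers are fitted and stored
    cached_pipe.fit(X, y)  # same data and parameters: transformers come from the cache
    cached_preds = cached_pipe.predict(X)

    # With caching on, transformers are cloned before fitting,
    # so inspect the fitted ones through named_steps.
    print(cached_pipe.named_steps['pca'].components_.shape)
finally:
    shutil.rmtree(cache_dir)  # remove the cache when done
```

Only the transformers are cached; the final estimator is refit each time.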
Hyperparameter Search with Pipelines
You can search parameters for any step in the pipeline:
from sklearn.model_selection import GridSearchCV
param_grid = {
    'pca__n_components': [2, 3],
    'clf__n_estimators': [50, 100]
}
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
- Use step__param (the step name, two underscores, then the parameter name, e.g. clf__n_estimators) to set parameters for steps in the pipeline.
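To find out which step__param names are valid, one option is to ask the pipeline itself. The sketch below uses get_params and set_params, which follow the same naming scheme as the parameter grid. The variable nested_names is introduced here for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', RandomForestClassifier()),
])

# get_params() lists every settable name, including nested step__param keys
nested_names = sorted(name for name in pipe.get_params() if '__' in name)
print(nested_names)

# set_params() accepts the same names that a param_grid would use
pipe.set_params(pca__n_components=3, clf__n_estimators=50)
print(pipe.named_steps['pca'].n_components)
print(pipe.named_steps['clf'].n_estimators)
```

A misspelled name in a parameter grid raises an error at fit time, so checking against this list first can save a failed search.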
Visual: Pipeline Flow
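A text version of the flow can be produced from the pipeline's steps attribute. This is a rough sketch that assumes the same three steps as in the example above; the variable flow is illustrative.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', RandomForestClassifier()),
])

# pipe.steps is a list of (name, estimator) pairs in execution order
flow = " -> ".join(f"{name} ({type(step).__name__})" for name, step in pipe.steps)
print("input -> " + flow + " -> predictions")
# prints: input -> scaler (StandardScaler) -> pca (PCA) -> clf (RandomForestClassifier) -> predictions
```

Data passes through each transformer in order, and the output of the last transformer is what the final model sees.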
When to Use
- You have multiple preprocessing steps
- You want to tune parameters for all steps
- You want to avoid data leakage
Summary Table
| Feature | What it Does |
|---|---|
| Pipeline | Chains steps together |
| Caching | Speeds up repeated runs |
| Hyperparameter Search | Tunes all steps at once |
| Visualization | Shows the flow of data through the pipeline |
Pipelines make your machine learning workflow clean, fast, and easy to manage.