Feature Scaling
What is Feature Scaling?
Feature scaling is the process of normalizing the range of features in a dataset. Because many machine learning algorithms rely on distance calculations or gradient-based optimization, features with larger ranges can dominate those with smaller ranges, even when they are not more important for prediction.
Why Scale Features?
| Benefit | Description |
|---|---|
| Algorithm Performance | Many algorithms like SVM, KNN, and neural networks perform better with scaled features |
| Convergence Speed | Gradient descent converges faster when features are on similar scales |
| Equal Importance | Prevents features with larger values from dominating smaller but equally important features |
| Numerical Stability | Avoids computational issues with very large or small numbers |
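To see why this matters, here is a minimal sketch (with made-up age and income values) showing how an unscaled feature with a large range dominates a Euclidean distance calculation:
import numpy as np
# Two people: similar income, different age (hypothetical values)
person_a = np.array([30, 50000])   # [age in years, income in dollars]
person_b = np.array([55, 52000])
# Without scaling, the income difference swamps the age difference
raw_distance = np.linalg.norm(person_a - person_b)
print(f"Unscaled distance: {raw_distance:.1f}")   # ~2000, driven almost entirely by income
# After dividing each feature by an assumed range, both contribute comparably
ranges = np.array([60, 100000])
scaled_distance = np.linalg.norm((person_a - person_b) / ranges)
print(f"Scaled distance: {scaled_distance:.3f}")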
Common Scaling Techniques
Scaling Methods Comparison
| Method | Formula | Output Range | Preserves Distribution | Handles Outliers |
|---|---|---|---|---|
| StandardScaler | z = (x - μ) / σ | Unbounded | Yes | No |
| MinMaxScaler | x' = (x - min) / (max - min) | [0, 1] | No | No |
| MaxAbsScaler | x' = x / max(|x|) | [-1, 1] | No | No |
| RobustScaler | z = (x - median) / IQR | Unbounded | Yes | Yes |
StandardScaler (Standardization)
Transforms features to have mean=0 and standard deviation=1 (z-score normalization).
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Sample data with different scales
data = {
'height': [165, 180, 175, 160, 185], # in cm
'weight': [60, 85, 75, 55, 90], # in kg
'age': [25, 30, 35, 40, 45] # in years
}
df = pd.DataFrame(data)
# Create and apply scaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
# Convert to DataFrame for better display
scaled_df = pd.DataFrame(
scaled_data,
columns=df.columns
)
print("Original data:")
print(df)
print("
Scaled data:")
print(scaled_df)
print(f"
Mean: {scaled_df.mean()}")
print(f"Std: {scaled_df.std()}")Formula:
z = (x - μ) / σWhere:
- z is the standardized value
- x is the original value
- μ is the mean of the feature
- σ is the standard deviation of the feature
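As a quick sanity check, the z-score can be computed by hand with NumPy. Note that StandardScaler uses the population standard deviation (ddof=0), while pandas' .std() defaults to the sample version, so the Std printed above is close to, but not exactly, 1. A minimal sketch reusing the height values from the example:
import numpy as np
heights = np.array([165, 180, 175, 160, 185])
mu = heights.mean()
sigma = heights.std()            # population std (ddof=0), matching StandardScaler
z = (heights - mu) / sigma
print(z.round(3))                # same values StandardScaler produces for the height column
print(z.mean().round(3), z.std().round(3))   # ~0 and 1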
MinMaxScaler (Normalization)
Scales features to a specific range, typically [0,1].
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd
# Sample data with different scales
data = {
'height': [165, 180, 175, 160, 185], # in cm
'weight': [60, 85, 75, 55, 90], # in kg
'age': [25, 30, 35, 40, 45] # in years
}
df = pd.DataFrame(data)
# Create and apply scaler
min_max_scaler = MinMaxScaler()
normalized = min_max_scaler.fit_transform(df)
# Convert to DataFrame for better display
normalized_df = pd.DataFrame(
normalized,
columns=df.columns
)
print("Original data:")
print(df)
print("
Normalized data:")
print(normalized_df)
print(f"
Min: {normalized_df.min()}")
print(f"Max: {normalized_df.max()}")Formula:
MaxAbsScaler
Scales features by dividing by the maximum absolute value in each feature. Preserves zero values and does not shift/center the data.
from sklearn.preprocessing import MaxAbsScaler
import numpy as np
import pandas as pd
# Sample data with zeros and negative values
data = {
'feature1': [1, -2, 3, -4, 5],
'feature2': [0, 10, -10, 20, -20]
}
df = pd.DataFrame(data)
# Create and apply scaler
max_abs_scaler = MaxAbsScaler()
scaled = max_abs_scaler.fit_transform(df)
# Convert to DataFrame for better display
scaled_df = pd.DataFrame(
scaled,
columns=df.columns
)
print("Original data:")
print(df)
print("
Max Abs scaled data:")
print(scaled_df)
print(f"
Max absolute values: {max_abs_scaler.max_abs_}")Formula:
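Because MaxAbsScaler neither centers nor shifts the data, it can also be applied to sparse matrices without destroying their sparsity; a minimal sketch using SciPy (hypothetical values, assumes scipy is installed):
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler
# Mostly-zero matrix stored in sparse format
X_sparse = csr_matrix([[0.0, 2.0],
                       [4.0, 0.0],
                       [0.0, -8.0]])
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X_sparse)   # output stays sparse; zeros remain zero
print(type(X_scaled))
print(X_scaled.toarray())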
RobustScaler
Uses statistics that are robust to outliers (median and interquartile range).
from sklearn.preprocessing import RobustScaler
import numpy as np
import pandas as pd
# Sample data with outliers
data = {
'salary': [50000, 55000, 60000, 65000, 500000], # last value is outlier
'age': [25, 30, 35, 40, 90] # last value is outlier
}
df = pd.DataFrame(data)
# Create and apply scaler
robust_scaler = RobustScaler()
robust_scaled = robust_scaler.fit_transform(df)
# Convert to DataFrame for better display
robust_df = pd.DataFrame(
robust_scaled,
columns=df.columns
)
print("Original data:")
print(df)
print("
Robust scaled data:")
print(robust_df)
print(f"
Center (median): {robust_scaler.center_}")
print(f"Scale (IQR): {robust_scaler.scale_}")Formula:
z = (x - median) / IQRWhere:
- IQR is the interquartile range (75th percentile - 25th percentile)
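To connect the formula to the output above, the median and IQR can be computed by hand with NumPy; a small sketch using the salary column from the example:
import numpy as np
salary = np.array([50000, 55000, 60000, 65000, 500000])
median = np.median(salary)                      # 60000
q25, q75 = np.percentile(salary, [25, 75])      # 55000 and 65000
iqr = q75 - q25                                 # 10000
z = (salary - median) / iqr
print(z)   # [-1.  -0.5  0.   0.5  44.] — matches RobustScaler's output for this column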
Comparing Scaling Methods
Effect of Scaling on Outliers
| Value | Original | StandardScaler | MinMaxScaler | RobustScaler |
|---|---|---|---|---|
| Normal | 1.5 | 0.13 | 0.55 | 0.20 |
| Normal | 2.5 | 0.65 | 0.75 | 0.80 |
| Outlier | 100 | 33.2 | 1.00 | 24.5 |
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Generate sample data with outliers
np.random.seed(42)
data = np.random.normal(loc=0, scale=1, size=100)
data = np.append(data, [10, -10, 15, -15]) # Add outliers
# Reshape for sklearn
X = data.reshape(-1, 1)
# Apply different scalers
scalers = {
'Standard': StandardScaler(),
'MinMax': MinMaxScaler(),
'MaxAbs': MaxAbsScaler(),
'Robust': RobustScaler()
}
scaled_data = {}
for name, scaler in scalers.items():
scaled_data[name] = scaler.fit_transform(X).flatten()
# Create DataFrame for comparison
results = pd.DataFrame({
'Original': data,
**scaled_data
})
# Print statistics
print("Data statistics:")
print(results.describe().round(2))
# Print how outliers are handled
print("
Outlier values after scaling:")
outlier_idx = np.abs(data) > 5
print(results.loc[outlier_idx].round(2))

When to Use Each Scaler
| Scaler | Best For | Preserves Zero | Handles Outliers | Range |
|---|---|---|---|---|
| StandardScaler | Normal distributions, PCA, clustering | No | No | Unbounded |
| MinMaxScaler | Neural networks, algorithms requiring bounded values | No | No | [0, 1] or custom |
| MaxAbsScaler | Sparse data with zeros | Yes | No | [-1, 1] |
| RobustScaler | Data with outliers | No | Yes | Unbounded |
Algorithm-Specific Recommendations
| Algorithm | Recommended Scaler | Reason |
|---|---|---|
| Linear Regression | StandardScaler | Puts coefficients on a comparable scale and improves numerical conditioning |
| Logistic Regression | StandardScaler | Sensitive to feature scales |
| SVM | StandardScaler or MinMaxScaler | Distance-based algorithm |
| Neural Networks | MinMaxScaler | Bounded activation functions work better with [0,1] data |
| K-means | StandardScaler | Distance-based algorithm |
| PCA | StandardScaler | Variance-based technique |
| Decision Trees | No scaling needed | Invariant to feature scales |
| Random Forest | No scaling needed | Invariant to feature scales |
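To illustrate the table, here is a small sketch comparing a distance-based model (KNN) on the breast cancer dataset with and without StandardScaler; exact accuracies will vary, but the scaled version typically wins:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# KNN on raw features: distances are dominated by large-valued features
knn_raw = KNeighborsClassifier().fit(X_train, y_train)
# Same model after standardizing with a scaler fit on the training data only
scaler = StandardScaler().fit(X_train)
knn_scaled = KNeighborsClassifier().fit(scaler.transform(X_train), y_train)
print(f"KNN without scaling: {knn_raw.score(X_test, y_test):.3f}")
print(f"KNN with scaling:    {knn_scaled.score(scaler.transform(X_test), y_test):.3f}")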
Feature Scaling in Pipelines
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42
)
# Create pipeline with scaling
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=5)),
('classifier', RandomForestClassifier(random_state=42))
])
# Train pipeline
pipeline.fit(X_train, y_train)
# Evaluate
score = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {score:.4f}")
# Access transformed data
X_scaled = pipeline.named_steps['scaler'].transform(X_test[:5])
print("
Scaled data (first 5 samples, first 3 features):")
print(X_scaled[:, :3].round(2))Common Mistakes with Feature Scaling
Correct vs. Incorrect Scaling Workflow
| Step | Correct Approach | Incorrect Approach |
|---|---|---|
| 1 | Split data into train/test | Scale the entire dataset |
| 2 | Fit scaler on training data | Split into train/test |
| 3 | Transform training data | Train model on scaled data |
| 4 | Transform test data using same scaler | Make predictions |
| 5 | Train model on scaled training data | |
| 6 | Make predictions on scaled test data |
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
# Generate data
np.random.seed(42)
X = np.random.normal(loc=0, scale=10, size=(1000, 5))
y = np.random.randint(0, 2, size=1000)
# CORRECT: Split first, then scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit scaler on training data only
scaler = StandardScaler()
scaler.fit(X_train)
# Transform both training and test data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("CORRECT approach:")
print(f"Training data mean: {X_train_scaled.mean(axis=0)[:3].round(3)}")
print(f"Test data mean: {X_test_scaled.mean(axis=0)[:3].round(3)}")
# INCORRECT: Scale before splitting (data leakage)
X_scaled_full = StandardScaler().fit_transform(X)
X_train_wrong, X_test_wrong, _, _ = train_test_split(X_scaled_full, y, test_size=0.2, random_state=42)
print("
INCORRECT approach:")
print(f"Training data mean: {X_train_wrong.mean(axis=0)[:3].round(3)}")
print(f"Test data mean: {X_test_wrong.mean(axis=0)[:3].round(3)}")2. Forgetting to Scale New Data
When making predictions on new data, you must use the same scaler that was fit on the training data.
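One way to keep the training-time scaler available for new data is to persist it alongside the model, for example with joblib; a minimal sketch (the file name is just an example):
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler
# Fit the scaler on training data and save it
X_train = np.random.normal(loc=0, scale=10, size=(100, 3))
scaler = StandardScaler().fit(X_train)
joblib.dump(scaler, "scaler.joblib")
# Later, at prediction time: load the SAME scaler and transform new data
loaded_scaler = joblib.load("scaler.joblib")
X_new = np.array([[1.0, -5.0, 12.0]])            # hypothetical new sample
print(loaded_scaler.transform(X_new).round(3))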
3. Scaling Target Variables
In regression, be careful when scaling target variables, as you'll need to inverse-transform predictions back to the original scale.
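For example, if the target is standardized before fitting, predictions come back on the standardized scale and must be mapped back with inverse_transform; a minimal sketch with synthetic data:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
# Synthetic regression data with a large-scale target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 100000 + 5000 * X[:, 0] + rng.normal(scale=1000, size=200)
# Scale the target (sklearn scalers expect 2D arrays, hence the reshape)
y_scaler = StandardScaler()
y_scaled = y_scaler.fit_transform(y.reshape(-1, 1)).ravel()
model = LinearRegression().fit(X, y_scaled)
# Predictions are on the scaled target's scale; map them back to the original units
preds_scaled = model.predict(X[:5])
preds = y_scaler.inverse_transform(preds_scaled.reshape(-1, 1)).ravel()
print(preds.round(0))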
Best Practices
Checklist for Feature Scaling
| Task | Description |
|---|---|
| ✓ Analyze your data | Check for outliers and distribution shapes |
| ✓ Choose appropriate scaler | Based on data characteristics and algorithm requirements |
| ✓ Split data first | Always split before scaling to prevent data leakage |
| ✓ Use pipelines | Ensure consistent preprocessing for all data |
| ✓ Save your scaler | Store the fitted scaler for future predictions |
| ✓ Check scaled data | Verify scaling worked as expected |
| ✓ Handle outliers | Consider RobustScaler or removing outliers if appropriate |