
Feature Scaling

What is Feature Scaling?

Feature scaling is the process of normalizing the range of features in a dataset. Because many machine learning algorithms rely on distance calculations or gradient-based optimization, features with larger ranges can dominate those with smaller ranges, even when they are no more important for prediction.

Why Scale Features?

| Benefit | Description |
| --- | --- |
| Algorithm Performance | Many algorithms, such as SVM, KNN, and neural networks, perform better with scaled features |
| Convergence Speed | Gradient descent converges faster when features are on similar scales |
| Equal Importance | Prevents features with larger values from dominating smaller but equally important features |
| Numerical Stability | Avoids computational issues with very large or small numbers |
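
For instance, in a Euclidean-distance computation an income measured in dollars will swamp an age measured in years. The sketch below (feature values are purely illustrative) shows the distance before and after standardization:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two samples: [age in years, income in dollars]
a = np.array([30.0, 50000.0])
b = np.array([60.0, 51000.0])

# Unscaled distance: the 1000-dollar income gap dominates the 30-year age gap
print(np.linalg.norm(a - b))  # ~1000.4

# After standardizing over a small sample, each feature contributes
# in proportion to how unusual the difference is for that feature
X = np.array([[30.0, 50000.0], [60.0, 51000.0], [45.0, 70000.0], [25.0, 30000.0]])
X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))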

Common Scaling Techniques

Scaling Methods Comparison

| Method | Formula | Output Range | Preserves Distribution | Handles Outliers |
| --- | --- | --- | --- | --- |
| StandardScaler | z = (x - μ) / σ | Unbounded | Yes | No |
| MinMaxScaler | x' = (x - min) / (max - min) | [0, 1] | No | No |
| MaxAbsScaler | x' = x / max(|x|) | [-1, 1] | No | No |
| RobustScaler | z = (x - median) / IQR | Unbounded | Yes | Yes |

StandardScaler (Standardization)

Transforms features to have mean=0 and standard deviation=1 (z-score normalization).

from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

# Sample data with different scales
data = {
  'height': [165, 180, 175, 160, 185],  # in cm
  'weight': [60, 85, 75, 55, 90],       # in kg
  'age': [25, 30, 35, 40, 45]           # in years
}
df = pd.DataFrame(data)

# Create and apply scaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

# Convert to DataFrame for better display
scaled_df = pd.DataFrame(
  scaled_data,
  columns=df.columns
)

print("Original data:")
print(df)
print("
Scaled data:")
print(scaled_df)
print(f"
Mean: {scaled_df.mean()}")
print(f"Std: {scaled_df.std()}")

Formula:

z = (x - μ) / σ

Where:

  • z is the standardized value
  • x is the original value
  • μ is the mean of the feature
  • σ is the standard deviation of the feature
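
As a quick check, the transformation can be reproduced by hand with NumPy. The sketch below reuses the height values from the example above; note that StandardScaler uses the population standard deviation (ddof=0):

import numpy as np
from sklearn.preprocessing import StandardScaler

height = np.array([165, 180, 175, 160, 185], dtype=float)

# Manual z-score with the population standard deviation (ddof=0)
manual = (height - height.mean()) / height.std(ddof=0)

# Same result from StandardScaler
scaled = StandardScaler().fit_transform(height.reshape(-1, 1)).ravel()
print(np.allclose(manual, scaled))  # True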

MinMaxScaler (Normalization)

Scales features to a specific range, typically [0,1].

from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd

# Sample data with different scales
data = {
  'height': [165, 180, 175, 160, 185],  # in cm
  'weight': [60, 85, 75, 55, 90],       # in kg
  'age': [25, 30, 35, 40, 45]           # in years
}
df = pd.DataFrame(data)

# Create and apply scaler
min_max_scaler = MinMaxScaler()
normalized = min_max_scaler.fit_transform(df)

# Convert to DataFrame for better display
normalized_df = pd.DataFrame(
  normalized,
  columns=df.columns
)

print("Original data:")
print(df)
print("
Normalized data:")
print(normalized_df)
print(f"
Min: {normalized_df.min()}")
print(f"Max: {normalized_df.max()}")

Formula:

x_scaled = (x - min(x)) / (max(x) - min(x))
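
The target interval does not have to be [0, 1]; MinMaxScaler accepts a feature_range parameter. A minimal sketch mapping the height values above to [-1, 1]:

from sklearn.preprocessing import MinMaxScaler
import numpy as np

heights = np.array([[165.0], [180.0], [175.0], [160.0], [185.0]])

# Scale to [-1, 1] instead of the default [0, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
print(scaler.fit_transform(heights).ravel())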

MaxAbsScaler

Scales features by dividing by the maximum absolute value in each feature. Preserves zero values and does not shift/center the data.

from sklearn.preprocessing import MaxAbsScaler
import numpy as np
import pandas as pd

# Sample data with zeros and negative values
data = {
  'feature1': [1, -2, 3, -4, 5],
  'feature2': [0, 10, -10, 20, -20]
}
df = pd.DataFrame(data)

# Create and apply scaler
max_abs_scaler = MaxAbsScaler()
scaled = max_abs_scaler.fit_transform(df)

# Convert to DataFrame for better display
scaled_df = pd.DataFrame(
  scaled,
  columns=df.columns
)

print("Original data:")
print(df)
print("
Max Abs scaled data:")
print(scaled_df)
print(f"
Max absolute values: {max_abs_scaler.max_abs_}")

Formula:

x_scaled = x / max(|x|)
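
Because MaxAbsScaler only divides and never shifts the data, it can be applied to sparse matrices without destroying sparsity. A minimal sketch using a SciPy CSR matrix (the values are illustrative):

from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# Mostly-zero matrix stored in sparse format
X_sparse = csr_matrix([[0.0, 4.0], [2.0, 0.0], [0.0, -8.0]])

scaled = MaxAbsScaler().fit_transform(X_sparse)
print(scaled.toarray())  # zeros stay zero; each column is divided by its max |value|
print(type(scaled))      # the result is still a sparse matrix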

RobustScaler

Uses statistics that are robust to outliers (median and interquartile range).

from sklearn.preprocessing import RobustScaler
import numpy as np
import pandas as pd

# Sample data with outliers
data = {
  'salary': [50000, 55000, 60000, 65000, 500000],  # last value is outlier
  'age': [25, 30, 35, 40, 90]                      # last value is outlier
}
df = pd.DataFrame(data)

# Create and apply scaler
robust_scaler = RobustScaler()
robust_scaled = robust_scaler.fit_transform(df)

# Convert to DataFrame for better display
robust_df = pd.DataFrame(
  robust_scaled,
  columns=df.columns
)

print("Original data:")
print(df)
print("
Robust scaled data:")
print(robust_df)
print(f"
Center (median): {robust_scaler.center_}")
print(f"Scale (IQR): {robust_scaler.scale_}")

Formula:

z = (x - median) / IQR

Where:

  • IQR is the interquartile range (75th percentile - 25th percentile)
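
The calculation is easy to verify by hand, and the percentiles that define the "IQR" can be changed through RobustScaler's quantile_range parameter (default (25.0, 75.0)). A minimal sketch reusing the salary values from the example above:

import numpy as np
from sklearn.preprocessing import RobustScaler

salary = np.array([50000, 55000, 60000, 65000, 500000], dtype=float)

# Manual version: subtract the median, divide by the interquartile range
q75, q25 = np.percentile(salary, [75, 25])
manual = (salary - np.median(salary)) / (q75 - q25)

scaled = RobustScaler().fit_transform(salary.reshape(-1, 1)).ravel()
print(np.allclose(manual, scaled))  # True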

Comparing Scaling Methods

Effect of Scaling on Outliers

| Value | Original | StandardScaler | MinMaxScaler | RobustScaler |
| --- | --- | --- | --- | --- |
| Normal | 1.5 | 0.13 | 0.55 | 0.20 |
| Normal | 2.5 | 0.65 | 0.75 | 0.80 |
| Outlier | 100 | 33.2 | 1.00 | 24.5 |

from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler
import numpy as np
import pandas as pd

# Generate sample data with outliers
np.random.seed(42)
data = np.random.normal(loc=0, scale=1, size=100)
data = np.append(data, [10, -10, 15, -15])  # Add outliers

# Reshape for sklearn
X = data.reshape(-1, 1)

# Apply different scalers
scalers = {
  'Standard': StandardScaler(),
  'MinMax': MinMaxScaler(),
  'MaxAbs': MaxAbsScaler(),
  'Robust': RobustScaler()
}

scaled_data = {}
for name, scaler in scalers.items():
  scaled_data[name] = scaler.fit_transform(X).flatten()

# Create DataFrame for comparison
results = pd.DataFrame({
  'Original': data,
  **scaled_data
})

# Print statistics
print("Data statistics:")
print(results.describe().round(2))

# Print how outliers are handled
print("
Outlier values after scaling:")
outlier_idx = np.abs(data) > 5
print(results.loc[outlier_idx].round(2))

When to Use Each Scaler

| Scaler | Best For | Preserves Zero | Handles Outliers | Range |
| --- | --- | --- | --- | --- |
| StandardScaler | Normal distributions, PCA, clustering | No | No | Unbounded |
| MinMaxScaler | Neural networks, algorithms requiring bounded values | No | No | [0, 1] or custom |
| MaxAbsScaler | Sparse data with zeros | Yes | No | [-1, 1] |
| RobustScaler | Data with outliers | No | Yes | Unbounded |

Algorithm-Specific Recommendations

| Algorithm | Recommended Scaler | Reason |
| --- | --- | --- |
| Linear Regression | StandardScaler | Keeps coefficients comparable; important with regularization or gradient-based solvers |
| Logistic Regression | StandardScaler | Sensitive to feature scales |
| SVM | StandardScaler or MinMaxScaler | Distance-based algorithm |
| Neural Networks | MinMaxScaler | Bounded activation functions work better with [0, 1] inputs |
| K-means | StandardScaler | Distance-based algorithm |
| PCA | StandardScaler | Variance-based technique |
| Decision Trees | No scaling needed | Invariant to feature scales |
| Random Forest | No scaling needed | Invariant to feature scales |
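
The practical impact on a distance-based model is easy to see. The sketch below (using the breast cancer dataset and a KNN classifier purely for illustration) compares test accuracy with and without standardization:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# KNN on raw features: distances are dominated by the largest-valued features
raw_knn = KNeighborsClassifier().fit(X_train, y_train)
print(f"Unscaled accuracy: {raw_knn.score(X_test, y_test):.3f}")

# KNN with standardization applied inside a pipeline
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_train, y_train)
print(f"Scaled accuracy:   {scaled_knn.score(X_test, y_test):.3f}")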

Feature Scaling in Pipelines

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
  data.data, data.target, test_size=0.2, random_state=42
)

# Create pipeline with scaling
pipeline = Pipeline([
  ('scaler', StandardScaler()),
  ('pca', PCA(n_components=5)),
  ('classifier', RandomForestClassifier(random_state=42))
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Evaluate
score = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {score:.4f}")

# Access transformed data
X_scaled = pipeline.named_steps['scaler'].transform(X_test[:5])
print("
Scaled data (first 5 samples, first 3 features):")
print(X_scaled[:, :3].round(2))
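
A further benefit of keeping the scaler inside the pipeline is that cross-validation refits it on every training fold, so hyperparameter searches stay leakage-free. A minimal sketch, continuing from the pipeline and data defined above (the parameter values are illustrative):

from sklearn.model_selection import GridSearchCV

# The scaler and PCA are refit on each CV fold automatically
param_grid = {'pca__n_components': [2, 5, 10]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print(f"Best params: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.4f}")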

Common Mistakes with Feature Scaling

Correct vs. Incorrect Scaling Workflow

| Step | Correct Approach | Incorrect Approach |
| --- | --- | --- |
| 1 | Split data into train/test | Scale the entire dataset |
| 2 | Fit scaler on training data | Split into train/test |
| 3 | Transform training data | Train model on scaled data |
| 4 | Transform test data using same scaler | Make predictions |
| 5 | Train model on scaled training data | — |
| 6 | Make predictions on scaled test data | — |

1. Scaling Before Splitting (Data Leakage)

The example below contrasts the two workflows; fitting the scaler on the full dataset leaks test-set statistics into training.

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

# Generate data
np.random.seed(42)
X = np.random.normal(loc=0, scale=10, size=(1000, 5))
y = np.random.randint(0, 2, size=1000)

# CORRECT: Split first, then scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit scaler on training data only
scaler = StandardScaler()
scaler.fit(X_train)

# Transform both training and test data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("CORRECT approach:")
print(f"Training data mean: {X_train_scaled.mean(axis=0)[:3].round(3)}")
print(f"Test data mean: {X_test_scaled.mean(axis=0)[:3].round(3)}")

# INCORRECT: Scale before splitting (data leakage)
X_scaled_full = StandardScaler().fit_transform(X)
X_train_wrong, X_test_wrong, _, _ = train_test_split(X_scaled_full, y, test_size=0.2, random_state=42)

print("
INCORRECT approach:")
print(f"Training data mean: {X_train_wrong.mean(axis=0)[:3].round(3)}")
print(f"Test data mean: {X_test_wrong.mean(axis=0)[:3].round(3)}")

2. Forgetting to Scale New Data

When making predictions on new data, you must use the same scaler that was fit on the training data.
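
A simple way to guarantee this is to persist the fitted scaler (or the whole pipeline) and reload it at prediction time. A minimal sketch using joblib; the filename and data are illustrative:

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit the scaler on training data and save it
X_train = np.random.normal(loc=0, scale=10, size=(100, 3))
scaler = StandardScaler().fit(X_train)
joblib.dump(scaler, 'scaler.joblib')

# Later, at prediction time: reload and apply the SAME fitted scaler
loaded_scaler = joblib.load('scaler.joblib')
X_new = np.random.normal(loc=0, scale=10, size=(5, 3))
X_new_scaled = loaded_scaler.transform(X_new)  # transform only, never fit again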

3. Scaling Target Variables

In regression, be careful when scaling target variables, as you'll need to inverse transform predictions.
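
One option is scikit-learn's TransformedTargetRegressor, which scales the target during fitting and automatically inverse-transforms predictions back to the original units. A minimal sketch with synthetic data:

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem with a large-scale target
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = 1000.0 * X[:, 0] + 50.0 * rng.normal(size=100)

# The target is standardized before fitting; predictions come back
# in the original units via the inverse transform
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=StandardScaler()
)
model.fit(X, y)
print(model.predict(X[:3]))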

Best Practices

Checklist for Feature Scaling

| Task | Description |
| --- | --- |
| ✓ Analyze your data | Check for outliers and distribution shapes |
| ✓ Choose appropriate scaler | Based on data characteristics and algorithm requirements |
| ✓ Split data first | Always split before scaling to prevent data leakage |
| ✓ Use pipelines | Ensure consistent preprocessing for all data |
| ✓ Save your scaler | Store the fitted scaler for future predictions |
| ✓ Check scaled data | Verify scaling worked as expected |
| ✓ Handle outliers | Consider RobustScaler or removing outliers if appropriate |