Scikit-learn/Data Preprocessing
Handling Missing Data
What are Missing Values?
Missing values are data points that have no recorded value. They appear as NaN (Not a Number), None, or blank cells in datasets. Missing values can occur due to:
- Data entry errors
- Sensor malfunctions
- Non-response (subjects declining to provide information)
- Data corruption
For pandas-specific approaches to handling missing data, see the Pandas Missing Data section.
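Before imputing anything, it helps to check where the gaps actually are. Below is a minimal sketch of spotting missing markers with NumPy and pandas; the small frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): NaN and None both count as missing
df = pd.DataFrame({"age": [25.0, np.nan, 31.0], "city": ["Oslo", None, "Lima"]})

print(df.isna())        # boolean mask of missing cells
print(df.isna().sum())  # number of missing values per column
```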
Imputation Methods
SimpleImputer
SimpleImputer is a basic but effective tool for handling missing data by replacing missing values with a calculated placeholder.
Key Parameters:
- strategy: Method for imputation ('mean', 'median', 'most_frequent', 'constant')
- missing_values: What to consider as missing (default is np.nan)
- fill_value: Value to use when strategy is 'constant'
```python
from sklearn.impute import SimpleImputer
import numpy as np

# Sample data with missing values
X = np.array([
    [1, 2, np.nan],
    [3, np.nan, 0],
    [np.nan, 4, 5]
])

# Replace missing values with mean
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```
When to Use Different Strategies
- Mean: Best for normally distributed numerical data
- Median: Better for numerical data with outliers
- Most Frequent: Good for categorical data or when you want the most common value
- Constant: When you want to fill with a specific value (like -1 or "Unknown"); see the sketch after this list
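As a rough illustration of the last two strategies, here is a minimal sketch on a made-up categorical column; the color values and the "Unknown" placeholder are just examples.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up categorical feature with one missing entry
X_cat = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)

# Fill the gap with the most common category ("red")
print(SimpleImputer(strategy='most_frequent').fit_transform(X_cat))

# Or fill it with an explicit placeholder value
print(SimpleImputer(strategy='constant', fill_value='Unknown').fit_transform(X_cat))
```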
KNNImputer
KNNImputer uses k-Nearest Neighbors to fill in missing values: for each sample with a gap, it finds the k closest samples and averages those neighbors' values for the missing feature.
Key Parameters:
- n_neighbors: Number of neighbors to use (default is 5)
- weights: Weight function ('uniform', 'distance', or callable)
- metric: Distance metric to use (default is 'nan_euclidean')
```python
from sklearn.impute import KNNImputer
import numpy as np

# Sample data with missing values
X = np.array([
    [1, 2, np.nan, 0],
    [3, np.nan, 0, 1],
    [np.nan, 4, 5, 2],
    [2, 3, 1, 8]
])

# Replace missing values using KNN
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print("Original data:")
print(X)
print("\nImputed data:")
print(X_imputed)
```
When to Use KNNImputer
- When there are relationships between features
- When you want to preserve the distribution of the data
- For datasets where similar samples have similar values
- When simple statistical measures (mean, median) don't capture the complexity of your data
Comparing Imputation Methods
| Method | Advantages | Disadvantages | Best For |
|---|---|---|---|
| SimpleImputer (mean) | Fast, easy to understand | Ignores correlations, can distort distribution | Simple datasets, normally distributed data |
| SimpleImputer (median) | Robust to outliers | Ignores correlations | Data with outliers |
| SimpleImputer (most_frequent) | Works with categorical data | May not represent true distribution | Categorical data, skewed distributions |
| KNNImputer | Preserves relationships between features | Computationally expensive, sensitive to feature scaling | Complex datasets with correlations between features |
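To make the difference in the table concrete, here is a minimal sketch (with made-up numbers) comparing mean and KNN imputation on two correlated features.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Made-up data: the second feature is roughly twice the first
X = np.array([
    [1.0, 2.1],
    [2.0, 4.0],
    [3.0, 5.9],
    [4.0, np.nan]
])

# Mean imputation ignores the correlation and fills in the column mean (4.0)
print(SimpleImputer(strategy='mean').fit_transform(X))

# KNN imputation averages the two nearest rows (about 4.95),
# which better reflects this row's large first-feature value
print(KNNImputer(n_neighbors=2).fit_transform(X))
```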
Handling Missing Data in Pipelines
```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
import numpy as np

# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Introduce some missing values for demonstration
rng = np.random.RandomState(42)
mask = rng.rand(*X_train.shape) < 0.1
X_train[mask] = np.nan

# Create pipeline with KNNImputer
pipeline = Pipeline([
    ('imputer', KNNImputer(n_neighbors=5)),
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Evaluate
score = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {score:.4f}")
```
Best Practices for Handling Missing Data
- Understand why data is missing - Is it missing completely at random, missing at random, or missing not at random?
- Visualize missing data patterns - Tools like missingno can help identify patterns
- Consider the impact of imputation - Different methods can lead to different model outcomes
- Use domain knowledge - Sometimes the best imputation strategy comes from understanding the data
- Validate your approach - Compare model performance with different imputation strategies, as in the sketch below
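To make the last point concrete, here is a minimal sketch that scores one pipeline per imputation strategy with cross-validation, reusing the breast cancer data with roughly 10% of entries artificially removed; the particular strategies and classifier are just example choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Breast cancer data with ~10% of entries knocked out, as in the pipeline example
X, y = load_breast_cancer(return_X_y=True)
rng = np.random.RandomState(42)
X[rng.rand(*X.shape) < 0.1] = np.nan

# Cross-validate one pipeline per imputation strategy
for name, imputer in [('mean', SimpleImputer(strategy='mean')),
                      ('median', SimpleImputer(strategy='median')),
                      ('knn', KNNImputer(n_neighbors=5))]:
    pipe = Pipeline([('imputer', imputer),
                     ('scaler', StandardScaler()),
                     ('classifier', RandomForestClassifier(random_state=0))])
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: {scores.mean():.4f}")
```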