
Handling Missing Data

What are Missing Values?

Missing values are data points that have no recorded value. They appear as NaN (Not a Number), None, or blank cells in datasets. Missing values can occur due to:

  • Data entry errors
  • Sensor malfunctions
  • Subjects declining to provide information
  • Data corruption

For pandas-specific approaches to handling missing data, see the Pandas Missing Data section.

Imputation Methods

SimpleImputer

SimpleImputer is a basic but effective tool for handling missing data by replacing missing values with a calculated placeholder.

Key Parameters:

  • strategy: Method for imputation ('mean', 'median', 'most_frequent', 'constant')
  • missing_values: What to consider as missing (default is np.nan)
  • fill_value: Value to use when strategy is 'constant'

from sklearn.impute import SimpleImputer
import numpy as np

# Sample data with missing values
X = np.array([
  [1, 2, np.nan],
  [3, np.nan, 0],
  [np.nan, 4, 5]
])

# Replace missing values with mean
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
print(X_imputed)

When to Use Different Strategies

  • Mean: Best for normally distributed numerical data
  • Median: Better for numerical data with outliers
  • Most Frequent: Good for categorical data or when you want the most common value
  • Constant: When you want to fill with a specific value (like -1 or "Unknown")
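
To make the trade-offs concrete, here is a small sketch (the data is made up for illustration) showing the median, most_frequent, and constant strategies side by side:

from sklearn.impute import SimpleImputer
import numpy as np

# Numerical column with an outlier: the median (2.5) is more robust
# than the mean (26.5) as a fill value
X_num = np.array([[1.0], [2.0], [3.0], [100.0], [np.nan]])
print(SimpleImputer(strategy='median').fit_transform(X_num))

# Categorical column stored as an object array: most_frequent fills with
# the mode, constant fills with an explicit placeholder
X_cat = np.array([['red'], ['blue'], ['red'], [np.nan]], dtype=object)
print(SimpleImputer(strategy='most_frequent').fit_transform(X_cat))
print(SimpleImputer(strategy='constant', fill_value='Unknown').fit_transform(X_cat))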

KNNImputer

KNNImputer uses k-Nearest Neighbors to fill in missing values: for each sample with a missing entry, it finds the k most similar samples (based on the features that are present) and averages their values for the missing feature.

Key Parameters:

  • n_neighbors: Number of neighbors to use (default is 5)
  • weights: Weight function ('uniform', 'distance', or callable)
  • metric: Distance metric to use (default is 'nan_euclidean')

from sklearn.impute import KNNImputer
import numpy as np

# Sample data with missing values
X = np.array([
  [1, 2, np.nan, 0],
  [3, np.nan, 0, 1],
  [np.nan, 4, 5, 2],
  [2, 3, 1, 8]
])

# Replace missing values using KNN
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print("Original data:")
print(X)
print("
Imputed data:")
print(X_imputed)

When to Use KNNImputer

  • When there are relationships between features
  • When you want to preserve the distribution of the data
  • For datasets where similar samples have similar values
  • When simple statistical measures (mean, median) don't capture the complexity of your data
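
As a sketch of how neighbor weighting shapes the result (toy data invented for this example), switching weights from the default 'uniform' to 'distance' lets the closest samples dominate the imputed value:

from sklearn.impute import KNNImputer
import numpy as np

X = np.array([
  [1.0, 2.0, np.nan],
  [1.1, 2.1, 3.0],
  [8.0, 9.0, 10.0]
])

# With uniform weights the fill would be the plain average of the two
# neighbors (6.5); with distance weights the nearly identical second row
# dominates, so the imputed value lands close to 3.0
imputer = KNNImputer(n_neighbors=2, weights='distance')
print(imputer.fit_transform(X))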

Comparing Imputation Methods

Method | Advantages | Disadvantages | Best For
SimpleImputer (mean) | Fast, easy to understand | Ignores correlations, can distort distribution | Simple datasets, normally distributed data
SimpleImputer (median) | Robust to outliers | Ignores correlations | Data with outliers
SimpleImputer (most_frequent) | Works with categorical data | May not represent true distribution | Categorical data, skewed distributions
KNNImputer | Preserves relationships between features | Computationally expensive, sensitive to feature scaling | Complex datasets with correlations between features
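
The contrast between the first and last rows of the table can be seen on a tiny made-up dataset with two correlated features (the second is roughly double the first):

from sklearn.impute import SimpleImputer, KNNImputer
import numpy as np

X = np.array([
  [1.0, 2.0],
  [2.0, np.nan],
  [3.0, 6.1],
  [10.0, 19.8]
])

# Mean imputation fills in the column mean (9.3), which ignores the
# row's own pattern entirely
print(SimpleImputer(strategy='mean').fit_transform(X))

# KNN imputation averages the two rows with the closest first feature,
# giving about 4.05, close to the expected value of roughly 4
print(KNNImputer(n_neighbors=2).fit_transform(X))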

Handling Missing Data in Pipelines

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
  data.data, data.target, test_size=0.2, random_state=42
)

# Introduce some missing values for demonstration
import numpy as np
rng = np.random.RandomState(42)
mask = rng.rand(*X_train.shape) < 0.1
X_train[mask] = np.nan

# Create pipeline with KNNImputer
pipeline = Pipeline([
  ('imputer', KNNImputer(n_neighbors=5)),
  ('scaler', StandardScaler()),
  ('classifier', RandomForestClassifier())
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Evaluate
score = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {score:.4f}")

Best Practices for Handling Missing Data

  1. Understand why data is missing - Is it missing completely at random, missing at random, or missing not at random?
  2. Visualize missing data patterns - Tools like missingno can help identify patterns
  3. Consider the impact of imputation - Different methods can lead to different model outcomes
  4. Use domain knowledge - Sometimes the best imputation strategy comes from understanding the data
  5. Validate your approach - Compare model performance with different imputation strategies
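
A minimal sketch of practice 5, reusing the breast cancer data from the pipeline example with artificially masked entries, is to cross-validate the same pipeline with different imputers and compare the scores:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer
import numpy as np

# Load data and mask ~10% of the entries to simulate missing values
data = load_breast_cancer()
X, y = data.data.copy(), data.target
rng = np.random.RandomState(42)
X[rng.rand(*X.shape) < 0.1] = np.nan

# Same downstream model, different imputation strategies
imputers = {
  'mean': SimpleImputer(strategy='mean'),
  'median': SimpleImputer(strategy='median'),
  'knn': KNNImputer(n_neighbors=5)
}

for name, imputer in imputers.items():
  pipeline = Pipeline([
    ('imputer', imputer),
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=0))
  ])
  scores = cross_val_score(pipeline, X, y, cv=5)
  print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")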