Scikit-learn/Data Preprocessing
Handling Missing Data
What are Missing Values?
Missing values are data points that have no recorded value. They appear as NaN (Not a Number), None, or blank cells in datasets. Missing values can occur due to:
- Data entry errors
- Sensor malfunctions
- Non-response (subjects declining to provide information)
- Data corruption
For pandas-specific approaches to handling missing data, see the Pandas Missing Data section.
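Before imputing anything, it helps to check where the gaps actually are. Below is a minimal sketch of spotting missing markers with NumPy and pandas; the small frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): NaN and None both count as missing
df = pd.DataFrame({"age": [25.0, np.nan, 31.0], "city": ["Oslo", None, "Lima"]})

print(df.isna())        # boolean mask of missing cells
print(df.isna().sum())  # number of missing values per column
```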
Imputation Methods
SimpleImputer
SimpleImputer is a basic but effective tool for handling missing data by replacing missing values with a calculated placeholder.
Key Parameters:
- strategy: Method for imputation ('mean', 'median', 'most_frequent', 'constant')
- missing_values: What to consider as missing (default is np.nan)
- fill_value: Value to use when strategy is 'constant'
```python
from sklearn.impute import SimpleImputer
import numpy as np

# Sample data with missing values
X = np.array([
    [1, 2, np.nan],
    [3, np.nan, 0],
    [np.nan, 4, 5]
])

# Replace missing values with mean
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```
When to Use Different Strategies
- Mean: Best for normally distributed numerical data
- Median: Better for numerical data with outliers
- Most Frequent: Good for categorical data or when you want the most common value
- Constant: When you want to fill with a specific value (like -1 or "Unknown"); see the sketch after this list
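As a rough illustration of the last two strategies, here is a minimal sketch on a made-up categorical column; the color values and the "Unknown" placeholder are just examples.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up categorical feature with one missing entry
X_cat = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)

# Fill the gap with the most common category ("red")
print(SimpleImputer(strategy='most_frequent').fit_transform(X_cat))

# Or fill it with an explicit placeholder value
print(SimpleImputer(strategy='constant', fill_value='Unknown').fit_transform(X_cat))
```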
KNNImputer
KNNImputer uses k-Nearest Neighbors to fill in missing values: for each sample with a gap, it finds the k closest samples and averages those neighbors' values for the missing feature.
Key Parameters:
- n_neighbors: Number of neighbors to use (default is 5)
- weights: Weight function ('uniform', 'distance', or callable)
- metric: Distance metric to use (default is 'nan_euclidean')
```python
from sklearn.impute import KNNImputer
import numpy as np

# Sample data with missing values
X = np.array([
    [1, 2, np.nan, 0],
    [3, np.nan, 0, 1],
    [np.nan, 4, 5, 2],
    [2, 3, 1, 8]
])

# Replace missing values using KNN
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print("Original data:")
print(X)
print("\nImputed data:")
print(X_imputed)
```
When to Use KNNImputer
- When there are relationships between features
- When you want to preserve the distribution of the data
- For datasets where similar samples have similar values
- When simple statistical measures (mean, median) don't capture the complexity of your data
Comparing Imputation Methods
| Method | Advantages | Disadvantages | Best For |
|---|---|---|---|
| SimpleImputer (mean) | Fast, easy to understand | Ignores correlations, can distort distribution | Simple datasets, normally distributed data |
| SimpleImputer (median) | Robust to outliers | Ignores correlations | Data with outliers |
| SimpleImputer (most_frequent) | Works with categorical data | May not represent true distribution | Categorical data, skewed distributions |
| KNNImputer | Preserves relationships between features | Computationally expensive, sensitive to feature scaling | Complex datasets with correlations between features |
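To make the difference in the table concrete, here is a minimal sketch (with made-up numbers) comparing mean and KNN imputation on two correlated features.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Made-up data: the second feature is roughly twice the first
X = np.array([
    [1.0, 2.1],
    [2.0, 4.0],
    [3.0, 5.9],
    [4.0, np.nan]
])

# Mean imputation ignores the correlation and fills in the column mean (4.0)
print(SimpleImputer(strategy='mean').fit_transform(X))

# KNN imputation averages the two nearest rows (about 4.95),
# which better reflects this row's large first-feature value
print(KNNImputer(n_neighbors=2).fit_transform(X))
```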
Handling Missing Data in Pipelines
```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
import numpy as np

# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Introduce some missing values for demonstration
rng = np.random.RandomState(42)
mask = rng.rand(*X_train.shape) < 0.1
X_train[mask] = np.nan

# Create pipeline with KNNImputer
pipeline = Pipeline([
    ('imputer', KNNImputer(n_neighbors=5)),
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Evaluate
score = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {score:.4f}")
```
Best Practices for Handling Missing Data
- Understand why data is missing - Is it missing completely at random, missing at random, or missing not at random?
- Visualize missing data patterns - Tools like missingno can help identify patterns
- Consider the impact of imputation - Different methods can lead to different model outcomes
- Use domain knowledge - Sometimes the best imputation strategy comes from understanding the data
- Validate your approach - Compare model performance with different imputation strategies, as in the sketch below
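To make the last point concrete, here is a minimal sketch that scores one pipeline per imputation strategy with cross-validation, reusing the breast cancer data with roughly 10% of entries artificially removed; the particular strategies and classifier are just example choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Breast cancer data with ~10% of entries knocked out, as in the pipeline example
X, y = load_breast_cancer(return_X_y=True)
rng = np.random.RandomState(42)
X[rng.rand(*X.shape) < 0.1] = np.nan

# Cross-validate one pipeline per imputation strategy
for name, imputer in [('mean', SimpleImputer(strategy='mean')),
                      ('median', SimpleImputer(strategy='median')),
                      ('knn', KNNImputer(n_neighbors=5))]:
    pipe = Pipeline([('imputer', imputer),
                     ('scaler', StandardScaler()),
                     ('classifier', RandomForestClassifier(random_state=0))])
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: {scores.mean():.4f}")
```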