Pandas
Filling Missing Data
1. Identifying Missing Data
import pandas as pd
import numpy as np
# Sample dataframe with missing values
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': ['x', 'y', np.nan, 'z']
})
# Check for missing values
df.isna() # Boolean mask
df.isna().sum() # Count per column
df.isnull().mean() # Percentage missing2. Basic Imputation Methods
Simple Fill Methods
# Fill with specific value
df.fillna(0)
# Forward fill (last valid observation)
df.ffill()
# Backward fill (next valid observation)
df.bfill()
# Fill with column mean
df.fillna(df.mean())
# Fill with column median
df.fillna(df.median())
# Different fill per column
df.fillna({'A': df['A'].mean(), 'B': 0, 'C': 'missing'})Interpolation Methods
# Linear interpolation (for numeric series)
df['A'].interpolate()
# Time-based interpolation (for datetime indexes)
df['A'].interpolate(method='time')3. Custom Imputation Functions
# Custom function to fill with mean of top 3 values
def top3_mean(series):
top3 = series.nlargest(3)
return top3.mean()
df['A'].fillna(top3_mean(df['A']))
# Fill using group means
df['B'] = df.groupby('C')['B'].transform(
lambda x: x.fillna(x.mean())
)4. Advanced Pandas-Only Techniques
Conditional Imputation
# Fill based on other column's value
df.loc[df['A'].isna() & (df['C'] == 'x'), 'A'] = df['A'].mean()Rolling Window Imputation
# Fill with rolling window average
df['A'] = df['A'].fillna(
df['A'].rolling(window=2, min_periods=1).mean()
)5. Dropping Missing Data
# Drop rows with any missing values
df.dropna()
# Drop columns with any missing values
df.dropna(axis=1)
# Drop rows where specific columns are missing
df.dropna(subset=['A', 'B'])
# Drop only if all values are missing
df.dropna(how='all')6. Creating Missing Data Indicators
# Add columns indicating missingness
for col in df.columns:
df[f'{col}_missing'] = df[col].isna().astype(int)7. Working with Nulls in Operations
# Arithmetic with nulls
df['A'] + df['B'] # Any null makes result null
# Skipping nulls in calculations
df['A'].sum(skipna=True)
df['B'].mean(skipna=False)Best Practices Cheat Sheet
- Always check
df.info()first to see null counts - For numeric data:
.fillna(df.mean())is good first approach - For categorical data:
.fillna('Missing')or.fillna(df.mode()[0]) - Consider keeping null indicators with
isna().astype(int) - Document your null handling strategy for reproducibility
Pro Tip: Chain operations for clean code:
df = (df
.fillna({'A': df['A'].median()})
.ffill()
.dropna(subset=['B']))