MLP FU
Pandas

Filling Missing Data

1. Identifying Missing Data

import pandas as pd
import numpy as np

# Sample dataframe with missing values
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': ['x', 'y', np.nan, 'z']
})

# Check for missing values
df.isna()          # Boolean mask
df.isna().sum()    # Count per column
df.isnull().mean() # Percentage missing

2. Basic Imputation Methods

Simple Fill Methods

# Fill with specific value
df.fillna(0)

# Forward fill (last valid observation)
df.ffill()

# Backward fill (next valid observation)
df.bfill()

# Fill with column mean
df.fillna(df.mean())

# Fill with column median
df.fillna(df.median())

# Different fill per column
df.fillna({'A': df['A'].mean(), 'B': 0, 'C': 'missing'})

Interpolation Methods

# Linear interpolation (for numeric series)
df['A'].interpolate()

# Time-based interpolation (for datetime indexes)
df['A'].interpolate(method='time')

3. Custom Imputation Functions

# Custom function to fill with mean of top 3 values
def top3_mean(series):
top3 = series.nlargest(3)
return top3.mean()

df['A'].fillna(top3_mean(df['A']))

# Fill using group means
df['B'] = df.groupby('C')['B'].transform(
lambda x: x.fillna(x.mean())
)

4. Advanced Pandas-Only Techniques

Conditional Imputation

# Fill based on other column's value
df.loc[df['A'].isna() & (df['C'] == 'x'), 'A'] = df['A'].mean()

Rolling Window Imputation

# Fill with rolling window average
df['A'] = df['A'].fillna(
df['A'].rolling(window=2, min_periods=1).mean()
)

5. Dropping Missing Data

# Drop rows with any missing values
df.dropna()

# Drop columns with any missing values
df.dropna(axis=1)

# Drop rows where specific columns are missing
df.dropna(subset=['A', 'B'])

# Drop only if all values are missing
df.dropna(how='all')

6. Creating Missing Data Indicators

# Add columns indicating missingness
for col in df.columns:
df[f'{col}_missing'] = df[col].isna().astype(int)

7. Working with Nulls in Operations

# Arithmetic with nulls
df['A'] + df['B']  # Any null makes result null

# Skipping nulls in calculations
df['A'].sum(skipna=True)
df['B'].mean(skipna=False)

Best Practices Cheat Sheet

  1. Always check df.info() first to see null counts
  2. For numeric data: .fillna(df.mean()) is good first approach
  3. For categorical data: .fillna('Missing') or .fillna(df.mode()[0])
  4. Consider keeping null indicators with isna().astype(int)
  5. Document your null handling strategy for reproducibility

Pro Tip: Chain operations for clean code:

df = (df
.fillna({'A': df['A'].median()})
.ffill()
.dropna(subset=['B']))