
Categorical Transformers

What are Categorical Features?

Categorical features represent qualitative data that can be divided into groups or categories. Examples include:

  • Colors (red, green, blue)
  • Product types (electronics, clothing, food)
  • Education levels (high school, bachelor's, master's)

Machine learning algorithms require numerical input, so categorical features must be transformed.

Encoding Methods

Ordinal Encoding

Ordinal encoding transforms categories into ordered integers. Use this when categories have a natural order.

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

# Sample data
data = {
  'size': ['small', 'medium', 'large', 'medium', 'small'],
  'quality': ['low', 'high', 'medium', 'high', 'medium']
}
df = pd.DataFrame(data)

# Define category orders (index = encoded value)
size_order = ['small', 'medium', 'large']   # small=0, medium=1, large=2
quality_order = ['low', 'medium', 'high']   # low=0, medium=1, high=2

# Create and fit encoder; the order of lists matches the column order in df
encoder = OrdinalEncoder(categories=[size_order, quality_order])
encoded_data = encoder.fit_transform(df)

print("Original data:")
print(df)
print("
Ordinal encoded data:")
print(encoded_data)

When to use: For ordinal categories where order matters (small < medium < large).
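Because the fitted encoder stores the category order, encoded values can also be mapped back to labels with inverse_transform. A minimal sketch, reusing the illustrative size ordering from above:

```python
from sklearn.preprocessing import OrdinalEncoder
import numpy as np

# Fit on a single ordered column
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoded = encoder.fit_transform(np.array([['medium'], ['large'], ['small']]))
print(encoded.ravel())  # [1. 2. 0.]

# inverse_transform recovers the original string labels
recovered = encoder.inverse_transform(encoded)
print(recovered.ravel())
```

This round trip is handy for turning model output or encoded predictions back into human-readable categories.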

One-Hot Encoding

One-hot encoding creates binary columns for each category. Each column represents the presence (1) or absence (0) of a category.

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Sample data
data = {
  'color': ['red', 'green', 'blue', 'red', 'green']
}
df = pd.DataFrame(data)

# Create and fit encoder
encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(df[['color']])

# Create DataFrame with encoded values
encoded_df = pd.DataFrame(
  encoded_data,
  columns=encoder.get_feature_names_out(['color'])
)

print("Original data:")
print(df)
print("
One-hot encoded data:")
print(encoded_df)

When to use: For nominal categories where order doesn't matter (red, green, blue).
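As a side note, pandas can produce the same binary columns in one line with pd.get_dummies. The scikit-learn encoder is usually preferred inside pipelines because it remembers the fitted categories for later transforms, but for quick exploration the pandas version is convenient. A small sketch with the same illustrative colors:

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green']})

# One-liner one-hot encoding; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(df['color'], prefix='color', dtype=int)
print(dummies)
```

Columns come out in alphabetical category order (color_blue, color_green, color_red).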

Dummy Encoding (Drop First)

Dummy encoding works like one-hot encoding but drops the first category to avoid the "dummy variable trap" (perfect multicollinearity among the encoded columns).

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Sample data
data = {
  'color': ['red', 'green', 'blue', 'red', 'green']
}
df = pd.DataFrame(data)

# Create and fit encoder with drop='first'
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_data = encoder.fit_transform(df[['color']])

# Create DataFrame with encoded values
encoded_df = pd.DataFrame(
  encoded_data,
  columns=encoder.get_feature_names_out(['color'])
)

print("Original data:")
print(df)
print("
Dummy encoded data (first category dropped):")
print(encoded_df)

When to use: With models sensitive to multicollinearity, such as linear regression.

Adding Dummy Features

Sometimes adding a binary feature that indicates the presence of a certain condition can be useful.

import pandas as pd
import numpy as np

# Sample data
data = {
  'age': [25, 30, 45, 40, 35],
  'income': [50000, 70000, 90000, 65000, 80000]
}
df = pd.DataFrame(data)

# Add dummy feature for high income (>70000)
df['high_income'] = (df['income'] > 70000).astype(int)

# Add dummy feature for age group
df['young_adult'] = ((df['age'] >= 18) & (df['age'] < 35)).astype(int)

print("Data with dummy features:")
print(df)

When to use: To highlight specific conditions or thresholds that might be important for your model.

Handling High Cardinality

When a categorical feature has many unique values (high cardinality), one-hot encoding creates too many columns.

Target Encoding

Replace categories with the mean of the target variable for that category.

import pandas as pd
import numpy as np

# Sample data with high cardinality
data = {
  'zipcode': ['10001', '20001', '30001', '10001', '20001', '40001', '50001'],
  'price': [100, 150, 200, 120, 160, 180, 220]
}
df = pd.DataFrame(data)

# Calculate mean target value per category
target_means = df.groupby('zipcode')['price'].mean().to_dict()

# Apply target encoding
df['zipcode_encoded'] = df['zipcode'].map(target_means)

print("Original data:")
print(df[['zipcode', 'price']])
print("
Target encoded data:")
print(df[['zipcode', 'zipcode_encoded', 'price']])

When to use: For high-cardinality features like ZIP codes, product IDs, or user IDs. Compute the category means on training data only to avoid target leakage.
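One caveat: the example above computes means over the full dataset, which leaks the target into the features. A minimal leakage-safe sketch, assuming a hypothetical split where the first five rows are training data and unseen categories fall back to the global training mean:

```python
import pandas as pd

df = pd.DataFrame({
    'zipcode': ['10001', '20001', '30001', '10001', '20001', '40001', '50001'],
    'price':   [100, 150, 200, 120, 160, 180, 220],
})
train, test = df.iloc[:5], df.iloc[5:]

# Category means computed on the training split only
means = train.groupby('zipcode')['price'].mean()
global_mean = train['price'].mean()  # 146.0

# Zipcodes unseen in training ('40001', '50001') fall back to the global mean
test_encoded = test['zipcode'].map(means).fillna(global_mean)
print(test_encoded.tolist())  # [146.0, 146.0]
```

For stronger regularization, libraries such as category_encoders add smoothing between the category mean and the global mean.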

Handling Unknown Categories

When new categories appear in test data that weren't in training data, encoders raise an error by default.

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Training data
train_data = {
  'color': ['red', 'green', 'blue']
}
train_df = pd.DataFrame(train_data)

# Test data with new category
test_data = {
  'color': ['red', 'yellow', 'green']  # 'yellow' is new
}
test_df = pd.DataFrame(test_data)

# Create encoder with handle_unknown='ignore'
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(train_df[['color']])

# Transform test data
test_encoded = encoder.transform(test_df[['color']])

print("Test data encoded:")
print(test_encoded)
print(f"Feature names: {encoder.get_feature_names_out(['color'])}")
print("Note: 'yellow' gets all zeros since it's unknown")

When to use: In production environments where new categories may appear over time.
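OrdinalEncoder offers a similar escape hatch: with handle_unknown='use_encoded_value', unseen categories are mapped to a sentinel value of your choice. A minimal sketch reusing the same illustrative train/test data:

```python
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

train_df = pd.DataFrame({'color': ['red', 'green', 'blue']})
test_df = pd.DataFrame({'color': ['red', 'yellow', 'green']})  # 'yellow' is new

# Categories unseen at fit time become the sentinel -1
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(train_df[['color']])
encoded = encoder.transform(test_df[['color']])
print(encoded.ravel())  # blue=0, green=1, red=2; 'yellow' -> -1
```

Pick a sentinel that cannot collide with a real encoded value, such as a negative number.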

Best Practices

  1. Choose the right encoding: Use ordinal for ordered categories and one-hot for nominal categories
  2. Handle high cardinality: Consider target encoding or feature hashing for features with many unique values
  3. Handle unknown values: Set handle_unknown='ignore' in OneHotEncoder to avoid errors
  4. Combine rare categories: Group infrequent categories into an "Other" category
  5. Use domain knowledge: Some categories might be better represented in specific ways based on domain expertise