Filter-Based Feature Selection

What is Filter-Based Feature Selection?

Filter methods help you pick the most useful features (columns) in your data before you train any machine learning model. They are fast and simple: each feature is scored with a basic statistic, such as its variance or its correlation with the target, and low-scoring features are dropped.

Example: If a column doesn't change much or isn't related to what you want to predict, you can remove it.

Why Use Filter Methods?

  • They make your model faster to train and easier to understand, because it has fewer inputs.
  • They can help reduce overfitting (when your model learns noise instead of patterns).

Comparison Table

Method   | How it Works                         | Speed  | Uses Model? | Example Techniques
Filter   | Uses stats to pick features          | Fast   | No          | Correlation, Variance
Wrapper  | Tries feature sets with a model      | Slow   | Yes         | RFE, Forward Selection
Embedded | Picks features during model training | Medium | Yes         | Lasso, Decision Trees

Common Filter Methods

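  • Correlation: For numeric features. Remove features that show little linear relationship with the target.
  • Variance Threshold: Remove features that are almost always the same value. This check does not look at the target at all.
  • Chi-Squared Test: For categorical features and a categorical target (not covered in the step-by-step examples).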
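
Since the chi-squared test is only named in the list above, here is a minimal sketch of how it could be applied with scikit-learn's chi2 function. The column names and values are hypothetical toy data made up for illustration, and the features are assumed to be already encoded as non-negative numbers (chi2 requires that).

```python
import pandas as pd
from sklearn.feature_selection import chi2

# Hypothetical toy data: categorical features encoded as 0/1
X = pd.DataFrame({
    'likes_sports': [1, 1, 1, 0, 0, 0],
    'owns_pet':     [1, 0, 1, 0, 1, 0],
})
y = [1, 1, 1, 0, 0, 0]  # class labels

# chi2 returns one score and one p-value per feature
scores, p_values = chi2(X, y)
chi2_scores = pd.Series(scores, index=X.columns)
print(chi2_scores)

# A higher score suggests a stronger dependence between feature and target
best_feature = chi2_scores.idxmax()
print("Highest-scoring feature:", best_feature)
```

In this toy data likes_sports matches the target exactly, so it gets the higher score. In practice you would keep the top-scoring features or those with a small p-value.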

How Filter Methods Work

Example 1: Correlation Filter (Step-by-Step)

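import pandas as pd

# Small example dataset
data = pd.DataFrame({
  'height': [150, 160, 170, 180, 190],
  'weight': [50, 60, 70, 80, 90],
  'shoe_size': [6, 7, 8, 9, 10],
  'random_noise': [1, 1, 1, 1, 1],  # constant column
  # Target that rises with the numeric features, so they correlate with it
  'target': [0, 0, 1, 1, 1]
})

# Step 1: Calculate each feature's Pearson correlation with the target
correlations = data.corr()['target'].drop('target')
print(correlations)  # random_noise is constant, so its correlation is NaN

# Step 2: Keep features with absolute correlation above 0.3
# (NaN > 0.3 is False, so the constant column is dropped)
selected = correlations[correlations.abs() > 0.3].index.tolist()
print("Selected features:", selected)

  • data: Our example table. The random_noise column never changes, so it cannot tell us anything about the target.
  • data.corr(): Computes the Pearson correlation between every pair of columns. Taking ['target'] picks out each column's correlation with the target.
  • We keep features whose absolute correlation is above 0.3 (you can change this number). Here height, weight, and shoe_size are kept and random_noise is dropped.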
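
The correlation filter can also be wrapped in a small reusable function. This is only a sketch: the function name select_by_correlation, the toy column names, and the choice to treat a NaN correlation (from a constant column) as 0 are all assumptions made for illustration.

```python
import pandas as pd

def select_by_correlation(df, target, threshold=0.3):
    """Return feature names whose absolute Pearson correlation
    with `target` is above `threshold` (hypothetical helper)."""
    correlations = df.corr()[target].drop(target)
    # Constant columns give NaN; treat them as "no correlation"
    correlations = correlations.fillna(0.0)
    return correlations[correlations.abs() > threshold].index.tolist()

# Hypothetical toy data
toy = pd.DataFrame({
    'hours_studied': [1, 2, 3, 4, 5],
    'constant': [7, 7, 7, 7, 7],
    'passed': [0, 0, 1, 1, 1],
})

selected = select_by_correlation(toy, 'passed')
print("Selected features:", selected)
```

Keeping the threshold as a parameter makes it easy to try a few values and see how many features survive each one.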

Example 2: Variance Threshold (Step-by-Step)

from sklearn.feature_selection import VarianceThreshold
import pandas as pd

# Same feature columns as in Example 1
data = pd.DataFrame({
  'height': [150, 160, 170, 180, 190],
  'weight': [50, 60, 70, 80, 90],
  'shoe_size': [6, 7, 8, 9, 10],
  'random_noise': [1, 1, 1, 1, 1],
  'target': [0, 1, 1, 1, 0]
})

# Separate the features from the target
X = data.drop(columns='target')

# Step 1: Remove features with variance at or below 0.1
selector = VarianceThreshold(threshold=0.1)
selected_data = selector.fit_transform(X)

# Step 2: Look up the names of the columns that were kept
selected_columns = X.columns[selector.get_support()].tolist()
print("Selected features:", selected_columns)

  • VarianceThreshold: Removes columns that don't change much. It looks only at the features, never at the target.
  • threshold=0.1: Only keep columns whose variance is greater than this. Variance depends on a column's units, so choose the threshold with your data's scale in mind.
  • get_support(): Returns a True/False mask that tells us which columns were kept. Here random_noise (variance 0) is removed and the other three columns are kept.
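
To see why a particular column was kept or dropped, it can help to look at the variances the selector computed. The sketch below uses the fitted selector's variances_ attribute. The column names are hypothetical, and the same heights are stored in two different units to show how the result depends on scale.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: the same heights in centimetres and in metres
X = pd.DataFrame({
    'height_cm': [150, 160, 170, 180, 190],
    'height_m': [1.5, 1.6, 1.7, 1.8, 1.9],
    'constant': [1, 1, 1, 1, 1],
})

selector = VarianceThreshold(threshold=0.1).fit(X)

# variances_ holds the variance computed for each input column
variances = pd.Series(selector.variances_, index=X.columns)
print(variances)

kept = X.columns[selector.get_support()].tolist()
print("Kept features:", kept)
```

Both height columns carry the same information, but only height_cm passes the threshold because its numbers are larger. If your features use very different units, consider checking the variances first or rescaling before choosing a threshold.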

When to Use Filter Methods

  • When you want a quick, simple way to reduce features before training a model.
  • When you have so many features that slower approaches, such as wrapper methods, would take too long.