Wrapper-Based Feature Selection
What is Wrapper-Based Feature Selection?
Wrapper methods help you find the best set of features by actually testing them with a machine learning model. They try different combinations and keep the ones that work best.
Example: Try using just 'height' and 'weight' to predict something, then try 'height' and 'shoe_size', and so on.
Why Use Wrapper Methods?
- They can find the best features for your specific model.
- They may take more time, but can give better results.
Comparison Table
| Method | How it Works | Speed | Uses Model? | Example Techniques |
|---|---|---|---|---|
| Filter | Uses stats to pick features | Fast | No | Correlation, Variance |
| Wrapper | Tries feature sets with a model | Slow | Yes | RFE, Forward Selection |
| Embedded | Picks features during model training | Medium | Yes | Lasso, Decision Trees |
Common Wrapper Methods
- Forward Selection: Start with no features, add one at a time.
- Backward Elimination: Start with all features, remove one at a time.
- Recursive Feature Elimination (RFE): Remove the least important features step by step.
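Of the three methods above, backward elimination is the one not shown in the worked examples below. As a minimal sketch, it can be done with scikit-learn's SequentialFeatureSelector by setting direction='backward' (the iris dataset here is just a stand-in with 4 features and enough rows for cross-validation):

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Toy data: the iris dataset (150 samples, 4 features, 3 classes)
X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000)

# Backward elimination: start with all 4 features, drop one at a time
selector = SequentialFeatureSelector(model, n_features_to_select=2,
                                     direction='backward')
selector = selector.fit(X, y)

print("Kept feature indices:", selector.get_support(indices=True))
```

The only change from forward selection is direction='backward'; everything else (the model, the stopping count) works the same way.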
How Wrapper Methods Work
Example 1: Recursive Feature Elimination (RFE)
```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
import pandas as pd

# Small example dataset
data = pd.DataFrame({
    'height': [150, 160, 170, 180, 190],
    'weight': [50, 60, 70, 80, 90],
    'shoe_size': [6, 7, 8, 9, 10],
    'target': [0, 1, 1, 1, 0]
})
X = data[['height', 'weight', 'shoe_size']]
y = data['target']

# Step 1: Create a model
model = LogisticRegression()

# Step 2: Use RFE to select the 2 best features
selector = RFE(model, n_features_to_select=2)
selector = selector.fit(X, y)

# Step 3: Get selected feature names
selected = X.columns[selector.support_].tolist()
print("Selected features:", selected)
```
- RFE: Repeatedly fits the model and removes the least important feature until the requested number remains.
- LogisticRegression(): The model used to test the features.
- selector.support_: A boolean mask showing which features were chosen.
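Besides support_, a fitted RFE selector also exposes a ranking_ attribute: selected features get rank 1, and larger numbers mark features that were eliminated earlier. A small sketch reusing the same toy data, this time keeping only 1 feature so every column gets a distinct rank:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
import pandas as pd

data = pd.DataFrame({
    'height': [150, 160, 170, 180, 190],
    'weight': [50, 60, 70, 80, 90],
    'shoe_size': [6, 7, 8, 9, 10],
    'target': [0, 1, 1, 1, 0]
})
X = data[['height', 'weight', 'shoe_size']]
y = data['target']

# Keep a single feature so the elimination order is fully visible
selector = RFE(LogisticRegression(), n_features_to_select=1)
selector.fit(X, y)

# ranking_: 1 = selected; larger numbers were eliminated earlier
for name, rank in zip(X.columns, selector.ranking_):
    print(f"{name}: rank {rank}")
```

This is handy when you want to see not just which features survived, but the order in which the others were dropped.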
Example 2: Forward Selection (Step-by-Step)
```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
import pandas as pd

# Same example data as above
data = pd.DataFrame({
    'height': [150, 160, 170, 180, 190],
    'weight': [50, 60, 70, 80, 90],
    'shoe_size': [6, 7, 8, 9, 10],
    'target': [0, 1, 1, 1, 0]
})
X = data[['height', 'weight', 'shoe_size']]
y = data['target']

# Step 1: Create a model
model = LogisticRegression()

# Step 2: Use forward selection to pick the 2 best features
# cv=2 because the smallest class in this tiny dataset has only 2 samples,
# so the default 5-fold cross-validation would fail
selector = SequentialFeatureSelector(model, n_features_to_select=2,
                                     direction='forward', cv=2)
selector = selector.fit(X, y)

# Step 3: Get selected feature names
selected = X.columns[selector.get_support()].tolist()
print("Selected features:", selected)
```
- SequentialFeatureSelector: Adds features one at a time, keeping the addition that most improves the cross-validated score.
- direction='forward': Start with no features and add them.
- get_support(): Returns a boolean mask of the features that were kept.
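After selecting a subset, you will usually want to check whether the smaller feature set actually holds up against the full one. A minimal sketch (using the iris dataset rather than the tiny table above, since a meaningful cross-validated comparison needs more rows):

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Pick 2 of the 4 iris features by forward selection
selector = SequentialFeatureSelector(model, n_features_to_select=2,
                                     direction='forward')
X_selected = selector.fit_transform(X, y)

# Compare cross-validated accuracy: all features vs. the selected subset
score_all = cross_val_score(model, X, y, cv=5).mean()
score_sel = cross_val_score(model, X_selected, y, cv=5).mean()
print(f"All features: {score_all:.3f}, selected: {score_sel:.3f}")
```

If the selected subset scores close to (or better than) the full set, the dropped features were carrying little useful signal for this model.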