Scikit-learn/Data Preprocessing
Outliers
What Are Outliers?
Outliers are numbers in your data that are very different from the rest. They are much bigger or smaller than most values.
| Example Data | 1 | 2 | 2 | 3 | 100 |
|---|---|---|---|---|---|
Here, 100 is an outlier because it is much larger than the other numbers.
Why Do Outliers Matter?
- They can pull averages (like the mean) away from typical values and make results misleading.
- They might be mistakes or rare events.
- Some machine learning models are sensitive to them.
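To see the first point in practice, here is a small sketch that compares the mean and the median of the example data with and without the value 100. The variable names are just for illustration.

```python
import pandas as pd

# Same example data as above, with and without the outlier
with_outlier = pd.Series([1, 2, 2, 3, 100])
without_outlier = pd.Series([1, 2, 2, 3])

# The mean is pulled far away from the typical values by the single outlier
mean_with = with_outlier.mean()        # 21.6
mean_without = without_outlier.mean()  # 2.0

# The median barely notices the outlier
median_with = with_outlier.median()        # 2.0
median_without = without_outlier.median()  # 2.0

print("Mean with / without outlier:", mean_with, mean_without)
print("Median with / without outlier:", median_with, median_without)
```

One extreme value moves the mean from 2.0 to 21.6, while the median stays at 2.0. This is why the median is often used as a "typical value" when outliers are present.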
What Can You Do With Outliers?
- Remove them if they are mistakes or not useful.
- Replace them with a typical value (like the median).
- Keep them if they are important.
How Can You Find Outliers?
A common way is the Interquartile Range (IQR) method. Here's how you can do it in Python:
import pandas as pd
import numpy as np
# Example data
data = pd.Series([1, 2, 2, 3, 100])
# Step 1: Find the 25th and 75th percentiles (Q1 and Q3)
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
# Step 2: Calculate the IQR (the range between Q1 and Q3)
IQR = Q3 - Q1
# Step 3: Find outliers (values more than 1.5 * IQR below Q1 or above Q3)
outliers = data[(data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)]
print("Outliers:", outliers.tolist())
Explanation:
- Q1 and Q3 are the 25th and 75th percentiles.
- IQR is the difference between Q3 and Q1.
- Outliers are values less than Q1 - 1.5 * IQR or greater than Q3 + 1.5 * IQR.
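If you need the same check in several places, the steps can be wrapped in small helper functions. This is only a sketch: the names iqr_bounds and find_outliers and the parameter k are made up for this example and are not part of pandas or scikit-learn.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) limits of the IQR rule for a pandas Series.

    k is the multiplier for the IQR; 1.5 is the usual choice.
    """
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, k=1.5):
    """Return only the values that fall outside the IQR limits."""
    lower, upper = iqr_bounds(values, k)
    return values[(values < lower) | (values > upper)]

data = pd.Series([1, 2, 2, 3, 100])
lower, upper = iqr_bounds(data)
print("Limits:", lower, upper)                    # Limits: 0.5 4.5
print("Outliers:", find_outliers(data).tolist())  # Outliers: [100]
```

For the example data, Q1 is 2 and Q3 is 3, so the IQR is 1 and the limits are 0.5 and 4.5. Only 100 falls outside them.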
How to Handle Outliers
1. Remove Outliers
# Keep only values that are NOT outliers
filtered_data = data[(data >= Q1 - 1.5 * IQR) & (data <= Q3 + 1.5 * IQR)]
print("Data without outliers:", filtered_data.tolist())
2. Replace Outliers
# Replace outliers with the median value
median = data.median()
data_replaced = data.copy()
data_replaced[(data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)] = median
print("Data with outliers replaced:", data_replaced.tolist())
When Should You Handle Outliers?
- If they are mistakes or not useful, remove or replace them.
- If they are important (like rare but real events), keep them.
- Always check why outliers exist before removing them.
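One way to check outliers before removing anything is to flag them first and look at the flagged rows. The sketch below assumes a small made-up table: the column names order_id and amount, and all the values, are hypothetical.

```python
import pandas as pd

# Hypothetical data: column names and values are invented for illustration
df = pd.DataFrame({
    "order_id": [101, 102, 103, 104, 105],
    "amount": [1, 2, 2, 3, 100],
})

# Same IQR rule as before, applied to one column
q1 = df["amount"].quantile(0.25)
q3 = df["amount"].quantile(0.75)
iqr = q3 - q1

# Mark suspicious rows instead of deleting them
df["is_outlier"] = (df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)

# Inspect the flagged rows before deciding what to do with them
flagged = df[df["is_outlier"]]
print(flagged)
```

Because no rows are deleted, you can still look up each flagged row (here, order 105) and decide whether it is a mistake or a rare but real event.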
Outlier Handling Process (Visual)
[Figure: flowchart for deciding what to do with outliers]