
Outliers

What Are Outliers?

Outliers are values in your data that differ sharply from the rest. They are much larger or much smaller than most other values.

Example data: 1, 2, 2, 3, 100

Here, 100 is an outlier because it is much larger than the other numbers.

Why Do Outliers Matter?

  • They can make averages and results misleading.
  • They might be mistakes or rare events.
  • Some machine learning models are sensitive to them.
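To make the first point concrete, here is a small sketch using the example data from above. It compares the mean and the median with and without the value 100. The variable names are chosen only for this illustration.

```python
import pandas as pd

# Same example data as above
data = pd.Series([1, 2, 2, 3, 100])

# The same data with the extreme value left out
without_outlier = data[data != 100]

mean_with = data.mean()
mean_without = without_outlier.mean()
median_with = data.median()
median_without = without_outlier.median()

print("Mean with outlier:     ", mean_with)       # 21.6
print("Mean without outlier:  ", mean_without)    # 2.0
print("Median with outlier:   ", median_with)     # 2.0
print("Median without outlier:", median_without)  # 2.0
```

The single value 100 pulls the mean from 2.0 up to 21.6, while the median stays at 2.0. This is why the median is often used as the "typical value" when outliers are present.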

What Can You Do With Outliers?

  • Remove them if they are mistakes or not useful.
  • Replace them with a typical value (like the median).
  • Keep them if they are important.

How Can You Find Outliers?

A common approach is the Interquartile Range (IQR) method, which flags values that lie far outside the middle 50% of the data. Here's how you can do it in Python:

import pandas as pd

# Example data
data = pd.Series([1, 2, 2, 3, 100])

# Step 1: Find the 25th and 75th percentiles (Q1 and Q3)
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)

# Step 2: Calculate the IQR (the range between Q1 and Q3)
IQR = Q3 - Q1

# Step 3: Find outliers (values more than 1.5 * IQR below Q1 or above Q3)
outliers = data[(data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)]
print("Outliers:", outliers.tolist())

Explanation:

  • Q1 and Q3 are the 25th and 75th percentiles.
  • IQR is the difference between Q3 and Q1.
  • Outliers are values less than Q1 - 1.5 * IQR or greater than Q3 + 1.5 * IQR.
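The rule above can also be wrapped in a small helper so the bounds are computed in one place. This is only a sketch: the function name `iqr_bounds` and its parameter `k` are made up for this example and are not part of pandas.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) limits of the IQR rule for a Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

data = pd.Series([1, 2, 2, 3, 100])
lower, upper = iqr_bounds(data)

# For this data: Q1 = 2, Q3 = 3, IQR = 1
# lower = 2 - 1.5 * 1 = 0.5 and upper = 3 + 1.5 * 1 = 4.5
print("Lower bound:", lower)  # 0.5
print("Upper bound:", upper)  # 4.5
```

Any value below 0.5 or above 4.5 counts as an outlier here, which is why only 100 is flagged.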

How to Handle Outliers

1. Remove Outliers

# Keep only values that are NOT outliers
filtered_data = data[(data >= Q1 - 1.5 * IQR) & (data <= Q3 + 1.5 * IQR)]
print("Data without outliers:", filtered_data.tolist())
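Real data usually sits in a table rather than a single Series. The sketch below applies the same filter to one column of a DataFrame and drops the whole row when that column holds an outlier. The table, its column names (`item`, `price`), and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical table: column names and values are made up for this example
df = pd.DataFrame({
    "item": ["a", "b", "c", "d", "e"],
    "price": [1, 2, 2, 3, 100],
})

q1 = df["price"].quantile(0.25)
q3 = df["price"].quantile(0.75)
iqr = q3 - q1

# between() includes both end points by default, matching >= and <=
in_range = df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Keep only the rows whose price is inside the IQR limits
df_filtered = df[in_range].reset_index(drop=True)
print(df_filtered)
```

Filtering with a boolean mask keeps the other columns of each remaining row together, so `item` and `price` stay aligned.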

2. Replace Outliers

# Replace outliers with the median value
median = data.median()
data_replaced = data.copy()
data_replaced[(data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)] = median
print("Data with outliers replaced:", data_replaced.tolist())
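As an alternative sketch, the same replacement can be written with the pandas method `Series.mask`, which returns a new Series and leaves the original untouched. The name `is_outlier` is chosen only for this example.

```python
import pandas as pd

data = pd.Series([1, 2, 2, 3, 100])

q1 = data.quantile(0.25)
q3 = data.quantile(0.75)
iqr = q3 - q1

# Build the outlier mask once so it can be reused
is_outlier = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)

# mask() replaces values where the condition is True
data_replaced = data.mask(is_outlier, data.median())

print("Original data:", data.tolist())
print("Data with outliers replaced:", data_replaced.tolist())
```

Storing the mask in a variable avoids repeating the long condition and makes it easy to count how many values were changed with `is_outlier.sum()`.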

When Should You Handle Outliers?

  • If they are mistakes or not useful, handle them.
  • If they are important (like rare but real events), keep them.
  • Always check why outliers exist before removing them.
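One way to follow the last point is to flag outliers first and look at them before deleting anything. The sketch below adds a True/False column instead of removing rows. The table, the column names (`day`, `visitors`, `is_outlier`), and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: daily visitor counts, made up for this example
df = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Thu", "Fri"],
    "visitors": [1, 2, 2, 3, 100],
})

q1 = df["visitors"].quantile(0.25)
q3 = df["visitors"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Mark outliers without removing any rows
df["is_outlier"] = ~df["visitors"].between(lower, upper)

# Look at the flagged rows and decide case by case
flagged = df[df["is_outlier"]]
print(flagged)
```

With the flag in place you can ask why Friday looks different (a typing mistake, or a real event such as a sale) before choosing to remove, replace, or keep it.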

Outlier Handling Process (Visual)

[Flowchart: deciding what to do with outliers]