Data Preprocessing: Step-by-Step Summary
Welcome! This guide walks you through the main steps of data preprocessing in machine learning. Each step is explained simply, with code and visuals to help you understand the process and why it matters.
1. Extract Data
Start by loading your data into Python, usually with pandas. This is the first step because you need your data in a format you can work with:
```python
import pandas as pd

data = pd.read_csv('data.csv')
```

- `import pandas as pd`: Loads the pandas library for working with data tables.
- `pd.read_csv('data.csv')`: Reads your CSV file into a DataFrame.
Why?
You need to get your data into Python so you can explore, clean, and prepare it for machine learning.
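Once the file is loaded, it helps to take a quick first look before doing anything else. Here is a minimal sketch, assuming the same `data.csv` file sits in your working directory; swap in your own path.

```python
import pandas as pd

# Assumes 'data.csv' exists in the current directory
data = pd.read_csv('data.csv')

# Peek at the first few rows to confirm the columns look right
print(data.head())

# Column names, data types, and non-null counts in one view
data.info()

# How many missing values does each column have?
print(data.isna().sum())
```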
2. Fill Missing Values
Missing data is common. Fill or drop missing values to keep your data clean, since most models can't handle missing values directly:
```python
data.fillna(0, inplace=True)
```

- `fillna(0)`: Replaces missing values with 0.
- `inplace=True`: Changes the data directly.
Why?
Machine learning models need complete data. Filling or removing missing values prevents errors and improves model quality.
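Filling with 0 is not the only option. The sketch below builds its own tiny DataFrame purely for illustration and shows two other common choices: filling each numeric column with its mean, and dropping rows that contain missing values.

```python
import pandas as pd
import numpy as np

# Small illustrative DataFrame with some gaps
data = pd.DataFrame({
    'age': [25, np.nan, 40, 31],
    'income': [50000, 62000, np.nan, 58000],
})

# Option 1: fill each numeric column with its own mean
filled = data.fillna(data.mean(numeric_only=True))
print(filled)

# Option 2: drop any row that still has a missing value
dropped = data.dropna()
print(dropped)
```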
3. Scale Numeric Features
Scaling puts all numeric features on a similar scale. This helps many algorithms work better and faster:
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
```

- `StandardScaler()`: Scales features to have mean 0 and variance 1.
- `fit_transform(data)`: Learns the scaling from the data and applies it.
Why?
Features with different scales can confuse models. Scaling ensures each feature contributes equally.
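To see what scaling actually does, the sketch below builds a small numeric DataFrame (the column names and values are made up) and checks that each scaled column ends up with mean close to 0 and standard deviation close to 1.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative data with very different scales
data = pd.DataFrame({
    'age': [22, 35, 58, 41, 30],
    'income': [28000, 52000, 90000, 61000, 45000],
})

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Each column now has mean ~0 and standard deviation ~1
scaled_df = pd.DataFrame(data_scaled, columns=data.columns)
print(scaled_df.mean().round(2))
print(scaled_df.std(ddof=0).round(2))  # StandardScaler uses the population std (ddof=0)
```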
4. Visualize Feature Distribution
Visualizing helps you understand your data. You can spot outliers, skewed distributions, or errors:
```python
import matplotlib.pyplot as plt

data.hist(figsize=(8, 6))
plt.show()
```

- `data.hist()`: Plots histograms for each feature.
- `plt.show()`: Displays the plot.
Why?
Seeing your data helps you catch problems early and decide on the right transformations.
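Histograms are one view; a quick numeric skewness check and a boxplot can also flag problems. This is a small sketch on synthetic, made-up data, not your real file.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = pd.DataFrame({
    'income': rng.lognormal(mean=10, sigma=0.8, size=500),  # right-skewed on purpose
    'age': rng.normal(loc=40, scale=10, size=500),          # roughly symmetric
})

# Skewness near 0 means roughly symmetric; large positive values mean a long right tail
print(data.skew())

# Boxplots make outliers and skew easy to spot
data.plot(kind='box', subplots=True, figsize=(8, 4))
plt.show()
```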
5. Data Transformation
Transform features to improve learning. Some models work best when data is more "normal" (bell-shaped):
```python
from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer()
data_trans = pt.fit_transform(data)
```

- `PowerTransformer()`: Transforms features so their distribution is closer to normal (bell-shaped).
Why?
Normalizing data can improve model accuracy and make training more stable.
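Here is a small sketch of the effect on skewed data, using a synthetic right-skewed feature (the column name and numbers are illustrative only): the skewness drops toward 0 after the transform.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
data = pd.DataFrame({'income': rng.lognormal(mean=10, sigma=0.8, size=1000)})

print('skew before:', round(data['income'].skew(), 2))

pt = PowerTransformer()  # Yeo-Johnson transform by default
data_trans = pt.fit_transform(data)

print('skew after:', round(pd.Series(data_trans[:, 0]).skew(), 2))
```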
6. Create Composite Transformers
Combine steps using a ColumnTransformer. This lets you apply different preprocessing to different columns in one go:
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), [0, 1]),
    ("cat", OneHotEncoder(), [2])
])
```

- `ColumnTransformer`: Applies different transforms to different columns.
- `StandardScaler()`: Scales the numeric columns.
- `OneHotEncoder()`: Encodes the categorical columns.
Why?
Real datasets have both numbers and categories. Composite transformers let you handle both at once, keeping your code tidy.
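To see the composite transformer in action, the sketch below applies it to a tiny made-up DataFrame with two numeric columns and one categorical column. The column positions 0, 1, and 2 match the indices in the example above; with a DataFrame you could also pass column names instead.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Illustrative mixed-type data: two numeric columns, one categorical column
data = pd.DataFrame({
    'age': [25, 40, 31, 58],
    'income': [50000, 90000, 58000, 75000],
    'city': ['Paris', 'London', 'Paris', 'Berlin'],
})

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), [0, 1]),   # scale the numeric columns
    ("cat", OneHotEncoder(), [2]),       # one-hot encode the categorical column
])

transformed = preprocessor.fit_transform(data)
print(transformed.shape)   # 4 rows, 2 scaled columns + 3 one-hot columns = (4, 5)
print(preprocessor.get_feature_names_out())
```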
7. Feature Selection Demonstration
Keep only the most important features. This reduces noise and can make your model faster and more accurate:
```python
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=5)
data_selected = selector.fit_transform(data, target)
```

- `SelectKBest`: Selects the top-scoring features.
- `f_classif`: Statistical test for classification.
Why?
Not all features help your model. Selecting the best ones can boost performance and reduce overfitting.
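After fitting, you can check which features were kept and how each one scored. The sketch below uses a synthetic classification dataset from scikit-learn, purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 features, only some of them informative
data, target = make_classification(n_samples=200, n_features=10,
                                   n_informative=4, random_state=0)

selector = SelectKBest(f_classif, k=5)
data_selected = selector.fit_transform(data, target)

print(data_selected.shape)        # (200, 5): only 5 columns remain
print(selector.get_support())     # boolean mask of which features were kept
print(selector.scores_.round(1))  # score for every original feature
```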
8. Dimensionality Reduction with PCA
Reduce the number of features while keeping information. This helps with visualization and can speed up training:
```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
data_pca = pca.fit_transform(data)
```

- `PCA`: Principal Component Analysis, reduces the number of dimensions.
- `n_components=2`: Keeps the 2 components that capture the most variance.
Why?
Too many features can slow down learning and cause overfitting. PCA keeps the most important information while reducing size.
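A useful follow-up check is how much of the original variance the kept components explain. This sketch uses the classic Iris dataset just as an example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Example data: Iris (150 rows, 4 numeric features)
data = load_iris().data
data = StandardScaler().fit_transform(data)  # PCA is sensitive to feature scale

pca = PCA(n_components=2)
data_pca = pca.fit_transform(data)

print(data_pca.shape)                        # (150, 2)
print(pca.explained_variance_ratio_)         # share of variance kept by each component
print(pca.explained_variance_ratio_.sum())   # total variance retained
```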
9. Create Pipelines
Pipelines chain steps together for cleaner code. This makes your workflow repeatable and less error-prone:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2))
])
data_pipe = pipe.fit_transform(data)
```

- `Pipeline`: Chains steps together.
- `fit_transform`: Runs all the steps in order.
Why?
Pipelines keep your code organized and make it easy to apply the same steps to new data.
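Pipelines become even more useful when the last step is a model: one `fit` runs the preprocessing and the training, and one `predict` or `score` applies the same preprocessing to new data. A small sketch on the Iris dataset, with logistic regression chosen purely as an example:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("model", LogisticRegression()),
])

pipe.fit(X_train, y_train)         # scaling and PCA are learned on the training data only
print(pipe.score(X_test, y_test))  # the same preprocessing is applied automatically to the test data
```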
10. Handling Class Imbalance
If one class is much more common, balance the data. This helps your model learn to predict all classes, not just the majority:
```python
from imblearn.over_sampling import SMOTE

smote = SMOTE()
data_bal, target_bal = smote.fit_resample(data, target)
```

- `SMOTE`: Creates synthetic samples for the minority class.
Why?
If your data is imbalanced, your model might ignore rare classes. Balancing helps it learn from all classes.
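To see the effect, count the classes before and after resampling. This sketch builds an imbalanced synthetic dataset and assumes the `imbalanced-learn` package is installed (`pip install imbalanced-learn`).

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic data where class 1 makes up only about 10% of the samples
data, target = make_classification(n_samples=1000, weights=[0.9, 0.1],
                                   random_state=0)
print('before:', Counter(target))     # roughly 900 vs 100

smote = SMOTE()
data_bal, target_bal = smote.fit_resample(data, target)
print('after:', Counter(target_bal))  # both classes now the same size
```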
Summary Table
| Step | Purpose | Why It Matters |
|---|---|---|
| Extract Data | Load data into Python | Needed to start any analysis |
| Fill Missing Values | Handle missing data | Prevents errors, improves model quality |
| Scale Numeric Features | Standardize numeric columns | Ensures fair treatment of all features |
| Visualize Feature Distribution | Understand feature shapes | Spot problems and guide preprocessing |
| Data Transformation | Make data more normal | Improves model accuracy and stability |
| Composite Transformers | Apply transforms to columns | Handles mixed data types efficiently |
| Feature Selection | Keep important features | Reduces noise, speeds up learning |
| Dimensionality Reduction | Reduce number of features | Prevents overfitting, aids visualization |
| Pipelines | Chain steps together | Keeps workflow organized and repeatable |
| Class Imbalance | Balance target classes | Ensures all classes are learned equally |
You now have a step-by-step overview of data preprocessing! Each step helps prepare your data for better machine learning results.