Data Preprocessing: Step-by-Step Summary
Welcome! This guide walks you through the main steps of data preprocessing in machine learning. Each step is explained simply, with code and visuals to help you understand the process and why it matters.
1. Extract Data
Start by loading your data into Python, usually with pandas. This is the first step because you need your data in a format you can work with:
```python
import pandas as pd

data = pd.read_csv('data.csv')
```

- `import pandas as pd`: Loads the pandas library for working with data tables.
- `pd.read_csv('data.csv')`: Reads your CSV file into a DataFrame.
Why?
You need to get your data into Python so you can explore, clean, and prepare it for machine learning.
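Once the file is loaded, it helps to take a quick first look before doing anything else. Here is a minimal sketch, assuming the same `data.csv` file sits in your working directory; swap in your own path.

```python
import pandas as pd

# Assumes 'data.csv' exists in the current directory
data = pd.read_csv('data.csv')

# Peek at the first few rows to confirm the columns look right
print(data.head())

# Column names, data types, and non-null counts in one view
data.info()

# How many missing values does each column have?
print(data.isna().sum())
```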
2. Fill Missing Values
Missing data is common. Fill or drop missing values to keep your data clean, since most models can't handle missing values directly:
```python
data.fillna(0, inplace=True)
```

- `fillna(0)`: Replaces missing values with 0.
- `inplace=True`: Changes the data directly.
Why?
Machine learning models need complete data. Filling or removing missing values prevents errors and improves model quality.
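Filling with 0 is not the only option. The sketch below builds its own tiny DataFrame purely for illustration and shows two other common choices: filling each numeric column with its mean, and dropping rows that contain missing values.

```python
import pandas as pd
import numpy as np

# Small illustrative DataFrame with some gaps
data = pd.DataFrame({
    'age': [25, np.nan, 40, 31],
    'income': [50000, 62000, np.nan, 58000],
})

# Option 1: fill each numeric column with its own mean
filled = data.fillna(data.mean(numeric_only=True))
print(filled)

# Option 2: drop any row that still has a missing value
dropped = data.dropna()
print(dropped)
```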
3. Scale Numeric Features
Scaling puts all numeric features on a similar scale. This helps many algorithms work better and faster:
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
```

- `StandardScaler()`: Scales features to have mean 0 and variance 1.
- `fit_transform(data)`: Learns the scaling from the data and applies it.
Why?
Features with different scales can confuse models. Scaling ensures each feature contributes equally.
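To see what scaling actually does, the sketch below builds a small numeric DataFrame (the column names and values are made up) and checks that each scaled column ends up with mean close to 0 and standard deviation close to 1.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative data with very different scales
data = pd.DataFrame({
    'age': [22, 35, 58, 41, 30],
    'income': [28000, 52000, 90000, 61000, 45000],
})

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Each column now has mean ~0 and standard deviation ~1
scaled_df = pd.DataFrame(data_scaled, columns=data.columns)
print(scaled_df.mean().round(2))
print(scaled_df.std(ddof=0).round(2))  # StandardScaler uses the population std (ddof=0)
```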
4. Visualize Feature Distribution
Visualizing helps you understand your data. You can spot outliers, skewed distributions, or errors:
```python
import matplotlib.pyplot as plt

data.hist(figsize=(8, 6))
plt.show()
```

- `data.hist()`: Plots histograms for each feature.
- `plt.show()`: Displays the plot.
Why?
Seeing your data helps you catch problems early and decide on the right transformations.
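Histograms are one view; a quick numeric skewness check and a boxplot can also flag problems. This is a small sketch on synthetic, made-up data, not your real file.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = pd.DataFrame({
    'income': rng.lognormal(mean=10, sigma=0.8, size=500),  # right-skewed on purpose
    'age': rng.normal(loc=40, scale=10, size=500),          # roughly symmetric
})

# Skewness near 0 means roughly symmetric; large positive values mean a long right tail
print(data.skew())

# Boxplots make outliers and skew easy to spot
data.plot(kind='box', subplots=True, figsize=(8, 4))
plt.show()
```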
5. Data Transformation
Transform features to improve learning. Some models work best when data is more "normal" (bell-shaped):
```python
from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer()
data_trans = pt.fit_transform(data)
```

- `PowerTransformer()`: Transforms features so their distribution is closer to normal (bell-shaped).
Why?
Normalizing data can improve model accuracy and make training more stable.
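Here is a small sketch of the effect on skewed data, using a synthetic right-skewed feature (the column name and numbers are illustrative only): the skewness drops toward 0 after the transform.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
data = pd.DataFrame({'income': rng.lognormal(mean=10, sigma=0.8, size=1000)})

print('skew before:', round(data['income'].skew(), 2))

pt = PowerTransformer()  # Yeo-Johnson transform by default
data_trans = pt.fit_transform(data)

print('skew after:', round(pd.Series(data_trans[:, 0]).skew(), 2))
```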
6. Create Composite Transformers
Combine steps using a ColumnTransformer. This lets you apply different preprocessing to different columns in one go:
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), [0, 1]),
    ("cat", OneHotEncoder(), [2])
])
```

- `ColumnTransformer`: Applies different transforms to different columns.
- `StandardScaler()`: Scales the numeric columns.
- `OneHotEncoder()`: Encodes the categorical columns.
Why?
Real datasets have both numbers and categories. Composite transformers let you handle both at once, keeping your code tidy.
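To see the composite transformer in action, the sketch below applies it to a tiny made-up DataFrame with two numeric columns and one categorical column. The column positions 0, 1, and 2 match the indices in the example above; with a DataFrame you could also pass column names instead.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Illustrative mixed-type data: two numeric columns, one categorical column
data = pd.DataFrame({
    'age': [25, 40, 31, 58],
    'income': [50000, 90000, 58000, 75000],
    'city': ['Paris', 'London', 'Paris', 'Berlin'],
})

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), [0, 1]),   # scale the numeric columns
    ("cat", OneHotEncoder(), [2]),       # one-hot encode the categorical column
])

transformed = preprocessor.fit_transform(data)
print(transformed.shape)   # 4 rows, 2 scaled columns + 3 one-hot columns = (4, 5)
print(preprocessor.get_feature_names_out())
```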
7. Feature Selection Demonstration
Keep only the most important features. This reduces noise and can make your model faster and more accurate:
```python
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=5)
data_selected = selector.fit_transform(data, target)
```

- `SelectKBest`: Selects the top-scoring features.
- `f_classif`: Statistical test for classification.
Why?
Not all features help your model. Selecting the best ones can boost performance and reduce overfitting.
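After fitting, you can check which features were kept and how each one scored. The sketch below uses a synthetic classification dataset from scikit-learn, purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 features, only some of them informative
data, target = make_classification(n_samples=200, n_features=10,
                                   n_informative=4, random_state=0)

selector = SelectKBest(f_classif, k=5)
data_selected = selector.fit_transform(data, target)

print(data_selected.shape)        # (200, 5): only 5 columns remain
print(selector.get_support())     # boolean mask of which features were kept
print(selector.scores_.round(1))  # score for every original feature
```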
8. Dimensionality Reduction with PCA
Reduce the number of features while keeping information. This helps with visualization and can speed up training:
```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
data_pca = pca.fit_transform(data)
```

- `PCA`: Principal Component Analysis, reduces the number of dimensions.
- `n_components=2`: Keeps the 2 components that capture the most variance.
Why?
Too many features can slow down learning and cause overfitting. PCA keeps the most important information while reducing size.
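A useful follow-up check is how much of the original variance the kept components explain. This sketch uses the classic Iris dataset just as an example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Example data: Iris (150 rows, 4 numeric features)
data = load_iris().data
data = StandardScaler().fit_transform(data)  # PCA is sensitive to feature scale

pca = PCA(n_components=2)
data_pca = pca.fit_transform(data)

print(data_pca.shape)                        # (150, 2)
print(pca.explained_variance_ratio_)         # share of variance kept by each component
print(pca.explained_variance_ratio_.sum())   # total variance retained
```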
9. Create Pipelines
Pipelines chain steps together for cleaner code. This makes your workflow repeatable and less error-prone:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2))
])
data_pipe = pipe.fit_transform(data)
```

- `Pipeline`: Chains steps together.
- `fit_transform`: Runs all the steps in order.
Why?
Pipelines keep your code organized and make it easy to apply the same steps to new data.
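Pipelines become even more useful when the last step is a model: one `fit` runs the preprocessing and the training, and one `predict` or `score` applies the same preprocessing to new data. A small sketch on the Iris dataset, with logistic regression chosen purely as an example:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("model", LogisticRegression()),
])

pipe.fit(X_train, y_train)         # scaling and PCA are learned on the training data only
print(pipe.score(X_test, y_test))  # the same preprocessing is applied automatically to the test data
```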
10. Handling Class Imbalance
If one class is much more common, balance the data. This helps your model learn to predict all classes, not just the majority:
```python
from imblearn.over_sampling import SMOTE

smote = SMOTE()
data_bal, target_bal = smote.fit_resample(data, target)
```

- `SMOTE`: Creates synthetic samples for the minority class.
Why?
If your data is imbalanced, your model might ignore rare classes. Balancing helps it learn from all classes.
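To see the effect, count the classes before and after resampling. This sketch builds an imbalanced synthetic dataset and assumes the `imbalanced-learn` package is installed (`pip install imbalanced-learn`).

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic data where class 1 makes up only about 10% of the samples
data, target = make_classification(n_samples=1000, weights=[0.9, 0.1],
                                   random_state=0)
print('before:', Counter(target))     # roughly 900 vs 100

smote = SMOTE()
data_bal, target_bal = smote.fit_resample(data, target)
print('after:', Counter(target_bal))  # both classes now the same size
```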
Summary Table
| Step | Purpose | Why It Matters |
|---|---|---|
| Extract Data | Load data into Python | Needed to start any analysis |
| Fill Missing Values | Handle missing data | Prevents errors, improves model quality |
| Scale Numeric Features | Standardize numeric columns | Ensures fair treatment of all features |
| Visualize Feature Distribution | Understand feature shapes | Spot problems and guide preprocessing |
| Data Transformation | Make data more normal | Improves model accuracy and stability |
| Composite Transformers | Apply transforms to columns | Handles mixed data types efficiently |
| Feature Selection | Keep important features | Reduces noise, speeds up learning |
| Dimensionality Reduction | Reduce number of features | Prevents overfitting, aids visualization |
| Pipelines | Chain steps together | Keeps workflow organized and repeatable |
| Class Imbalance | Balance target classes | Ensures all classes are learned equally |
You now have a step-by-step overview of data preprocessing! Each step helps prepare your data for better machine learning results.