
Data Preprocessing: Step-by-Step Summary

Welcome! This guide walks you through the main steps of data preprocessing in machine learning. Each step is explained simply, with code and visuals to help you understand the process and why it matters.

1. Extract Data

Start by loading your data into Python, usually with pandas. This is the first step because you need your data in a format you can work with:

import pandas as pd
data = pd.read_csv('data.csv')
  • import pandas as pd: Loads the pandas library for data tables.
  • pd.read_csv('data.csv'): Reads your CSV file into a DataFrame.

Why?

You need to get your data into Python so you can explore, clean, and prepare it for machine learning.
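
A quick sanity check after loading is to look at the first few rows, the column types, and the size of the table. Here is a minimal sketch, assuming the same data.csv file as above:

import pandas as pd

data = pd.read_csv('data.csv')   # assumes data.csv is in your working directory
print(data.head())               # first 5 rows
data.info()                      # column names, data types, and non-null counts
print(data.shape)                # (number of rows, number of columns)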

2. Fill Missing Values

Missing data is common. Fill or drop missing values to keep your data clean. Models can't handle missing values directly:

data.fillna(0, inplace=True)
  • fillna(0): Replaces missing values with 0.
  • inplace=True: Modifies the DataFrame in place instead of returning a copy.

Why?

Machine learning models need complete data. Filling or removing missing values prevents errors and improves model quality.
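
Filling everything with 0 is the simplest option, but for numeric columns a statistic such as the mean is often a better fill value. Here is a hedged sketch using scikit-learn's SimpleImputer; it assumes data has already been loaded as above:

from sklearn.impute import SimpleImputer

num_cols = data.select_dtypes(include='number').columns   # numeric columns only

imputer = SimpleImputer(strategy='mean')                   # replace NaN with the column mean
data[num_cols] = imputer.fit_transform(data[num_cols])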

3. Scale Numeric Features

Scaling puts all numeric features on a similar scale. This helps many algorithms work better and faster:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
  • StandardScaler(): Scales features to have mean 0 and variance 1.
  • fit_transform(data): Learns scaling from data and applies it.

Why?

Features with different scales can confuse models. Scaling ensures each feature contributes equally.
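
Note that StandardScaler only works on numeric data, so in practice you usually scale just the numeric columns and then check the result. A small sketch, assuming data has been loaded and cleaned in the earlier steps:

from sklearn.preprocessing import StandardScaler

num_cols = data.select_dtypes(include='number').columns   # numeric columns only
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data[num_cols])

# after scaling, every column should have mean ~0 and standard deviation ~1
print(data_scaled.mean(axis=0).round(2))
print(data_scaled.std(axis=0).round(2))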

4. Visualize Feature Distribution

Visualizing helps you understand your data. You can spot outliers, skewed distributions, or errors:

import matplotlib.pyplot as plt
data.hist(figsize=(8,6))
plt.show()
  • data.hist(): Plots histograms for each feature.
  • plt.show(): Displays the plot.

Why?

Seeing your data helps you catch problems early and decide on the right transformations.
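
Histograms show the overall shape of each feature; box plots are a quick way to spot outliers in the same data. A minimal sketch, assuming data contains only numeric columns:

import matplotlib.pyplot as plt

data.plot(kind='box', figsize=(8, 6))   # one box per column; points outside the whiskers are potential outliers
plt.show()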

5. Data Transformation

Transform features to improve learning. Some models work best when data is more "normal" (bell-shaped):

from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()
data_trans = pt.fit_transform(data)
  • PowerTransformer(): Makes data more normal (bell-shaped).

Why?

Normalizing data can improve model accuracy and make training more stable.
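
By default PowerTransformer uses the Yeo-Johnson method, which also handles zero and negative values (the alternative 'box-cox' method needs strictly positive data). The sketch below compares skewness before and after the transform; data_num is just an example name for a numeric-only DataFrame:

from sklearn.preprocessing import PowerTransformer
import pandas as pd

pt = PowerTransformer(method='yeo-johnson')       # the default method
data_trans = pt.fit_transform(data_num)           # data_num: numeric features only

# skewness close to 0 means the distribution is roughly symmetric (bell-shaped)
print(pd.DataFrame(data_num).skew().round(2))     # before
print(pd.DataFrame(data_trans).skew().round(2))   # after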

6. Create Composite Transformers

Combine steps using a ColumnTransformer. This lets you apply different preprocessing to different columns in one go:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), [0, 1]),
    ("cat", OneHotEncoder(), [2])
])
  • ColumnTransformer: Applies different transforms to columns.
  • StandardScaler(): Scales numeric columns.
  • OneHotEncoder(): Encodes categorical columns.

Why?

Real datasets have both numbers and categories. Composite transformers let you handle both at once, keeping your code tidy.
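
When your data is a DataFrame, you can pass column names instead of positions, which is usually easier to read. The column names below ('age', 'income', 'city') are only examples:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),               # example numeric columns
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"])   # example categorical column
])

data_ready = preprocessor.fit_transform(data)   # assumes data has these columns
print(data_ready.shape)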

7. Feature Selection Demonstration

Keep only the most important features. This reduces noise and can make your model faster and more accurate:

from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(f_classif, k=5)
data_selected = selector.fit_transform(data, target)  # target: the column you want to predict
  • SelectKBest: Selects the top-scoring features.
  • f_classif: ANOVA F-test, a statistical test for classification targets.
  • k=5: Keeps the 5 best-scoring features.

Why?

Not all features help your model. Selecting the best ones can boost performance and reduce overfitting.
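
After fitting, the selector can tell you which features it kept, which helps you understand (and explain) the result. A minimal sketch, assuming data is a DataFrame and target is the label column from your dataset:

from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=5)
data_selected = selector.fit_transform(data, target)

mask = selector.get_support()                      # True for the features that were kept
print("Kept features:", list(data.columns[mask]))
print("Scores:", selector.scores_.round(2))        # higher score = stronger link to the target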

8. Dimensionality Reduction with PCA

Reduce the number of features while keeping information. This helps with visualization and can speed up training:

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
data_pca = pca.fit_transform(data)
  • PCA: Principal Component Analysis, reduces dimensions.
  • n_components=2: Keeps the 2 principal components (new features built as combinations of the originals).

Why?

Too many features can slow down learning and cause overfitting. PCA keeps the most important information while reducing size.
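
To see how much information the two components actually keep, check the explained variance ratio after fitting. A small sketch, assuming data_scaled is the scaled numeric data from step 3:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
data_pca = pca.fit_transform(data_scaled)

# fraction of the total variance each component keeps
print(pca.explained_variance_ratio_.round(3))
print("Total kept:", pca.explained_variance_ratio_.sum())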

9. Create Pipelines

Pipelines chain steps together for cleaner code. This makes your workflow repeatable and less error-prone:

from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2))
])
data_pipe = pipe.fit_transform(data)
  • Pipeline: Chains steps together.
  • fit_transform: Runs all steps in order.

Why?

Pipelines keep your code organized and make it easy to apply the same steps to new data.
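
A pipeline can also end with a model, so preprocessing and training happen in one call and new data automatically goes through the exact same steps. A hedged sketch using logistic regression as an example model, assuming data and target from the earlier steps:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("model", LogisticRegression())
])

pipe.fit(data, target)        # scales, reduces, then trains the model
preds = pipe.predict(data)    # the same preprocessing is applied automatically before predicting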

10. Handling Class Imbalance

If one class is much more common, balance the data. This helps your model learn to predict all classes, not just the majority:

from imblearn.over_sampling import SMOTE
smote = SMOTE()
data_bal, target_bal = smote.fit_resample(data, target)
  • SMOTE: Creates synthetic samples for the minority class.

Why?

If your data is imbalanced, your model might ignore rare classes. Balancing helps it learn from all classes.
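
SMOTE comes from the separate imbalanced-learn package (installed with pip install imbalanced-learn, imported as imblearn). A quick way to see its effect is to count the classes before and after resampling, assuming data and target from above:

from collections import Counter
from imblearn.over_sampling import SMOTE

print("Before:", Counter(target))        # e.g. one class much larger than the other

smote = SMOTE(random_state=42)           # random_state makes the resampling repeatable
data_bal, target_bal = smote.fit_resample(data, target)

print("After:", Counter(target_bal))     # classes are now the same size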


Summary Table

Step | Purpose | Why It Matters
Extract Data | Load data into Python | Needed to start any analysis
Fill Missing Values | Handle missing data | Prevents errors, improves model quality
Scale Numeric Features | Standardize numeric columns | Ensures fair treatment of all features
Visualize Feature Distribution | Understand feature shapes | Spot problems and guide preprocessing
Data Transformation | Make data more normal | Improves model accuracy and stability
Composite Transformers | Apply transforms to columns | Handles mixed data types efficiently
Feature Selection | Keep important features | Reduces noise, speeds up learning
Dimensionality Reduction | Reduce number of features | Prevents overfitting, aids visualization
Pipelines | Chain steps together | Keeps workflow organized and repeatable
Class Imbalance | Balance target classes | Ensures all classes are learned equally

You now have a step-by-step overview of data preprocessing! Each step helps prepare your data for better machine learning results.