
Linear Regression

Where to Use Linear Regression

Linear regression is best used when you want to predict a number (like price, temperature, or score) based on one or more features. It works well when the relationship between the input features and the output is roughly a straight line (linear).

Common use cases:

  • Predicting house prices from features like size and location
  • Estimating sales based on advertising spend
  • Forecasting temperature from weather data

Why Use Linear Regression?

  • Simplicity: Easy to understand and implement
  • Interpretability: You can see how each feature affects the prediction
  • Speed: Fast to train, even on large datasets
  • Baseline: Good starting point before trying more complex models

How to Use Linear Regression

  1. Prepare your data: Make sure your features (X) and target (y) are numbers. Handle missing values and scale features if needed.
  2. Split your data: Use train_test_split to separate training and test sets.
  3. Choose a model: Start with LinearRegression for small/medium data, or SGDRegressor for large data.
  4. Train the model: Call .fit(X_train, y_train).
  5. Make predictions: Use .predict(X_test).
  6. Evaluate: Check how well your model predicts using metrics like mean squared error.
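
The six steps above can be sketched end to end as follows. This is a minimal illustration, not a recommended pipeline: the synthetic data, the noise level, and the 25% test split are assumptions chosen only so the example runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Step 1: prepare numeric data (synthetic, assumed for illustration):
# y = 3*x1 - 2*x2 + 5 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 5 + rng.normal(scale=0.1, size=200)

# Step 2: split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Steps 3 and 4: choose a model and train it
model = LinearRegression()
model.fit(X_train, y_train)

# Step 5: predict on data the model has not seen
y_pred = model.predict(X_test)

# Step 6: evaluate with mean squared error (lower is better)
test_mse = mean_squared_error(y_test, y_pred)
print("Test MSE:", test_mse)
```

Because the synthetic targets really are linear in the features, the test error should be close to the variance of the added noise.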

What are the Inputs and Outputs?

  • Input (X): Table of numbers (features). Each row is a sample, each column is a feature (e.g., size, age, number of rooms).
  • Output (y): A single number for each sample (the value you want to predict).
  • Prediction: The model outputs a number for each input row, which is its guess for the target value.
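
To make the shapes concrete, here is a small sketch with made-up numbers (four hypothetical houses with two features each). The values are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: 4 samples (rows), 2 features (columns)
# Columns: size in square meters, age in years
X = np.array([[50.0, 30.0], [80.0, 10.0], [120.0, 5.0], [65.0, 20.0]])
y = np.array([150.0, 260.0, 400.0, 200.0])  # one target value per row

print(X.shape)  # (4, 2)
print(y.shape)  # (4,)

# The model returns one predicted number per input row
model = LinearRegression().fit(X, y)
preds = model.predict(X)
print(preds.shape)  # (4,)
```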

How Does Linear Regression Work?

Linear regression finds the best straight line (or hyperplane for many features) that fits your data. "Best" here means least squares: the weights (coefficients) and bias are chosen to minimize the sum of squared differences between the predicted and actual target values.

  • For one feature, it's a line: y = weight * x + bias
  • For many features: y = w1*x1 + w2*x2 + ... + bias

The model learns the weights and bias during training. After training, you can use these to understand which features matter most.
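
For a single feature, the least-squares weight and bias have a simple closed form, which the sketch below computes by hand with NumPy. The data is an assumption chosen to lie exactly on the line y = 1*x + 1, so the recovered weight and bias should both be 1.

```python
import numpy as np

# Toy data lying exactly on y = 1*x + 1 (assumed for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Least-squares solution for one feature:
#   weight = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
#   bias   = mean(y) - weight * mean(x)
x_centered = x - x.mean()
y_centered = y - y.mean()
weight = np.sum(x_centered * y_centered) / np.sum(x_centered ** 2)
bias = y.mean() - weight * x.mean()

print("weight:", weight)  # 1.0
print("bias:", bias)      # 1.0
```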

What is Linear Regression?

Linear regression is a simple and widely used method for predicting a continuous value based on one or more input features.


Step 1: Create a Baseline with Dummy Regressor

Before building a real model, it's helpful to create a baseline. scikit-learn's DummyRegressor is a simple model that ignores the input features and always predicts the average target value from the training data. If your real model cannot beat this baseline on the test set, it is not learning anything useful from the features.

import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# 1. Create some example data
y = np.array([1, 2, 3, 4, 5])  # Target values
X = np.arange(5).reshape(-1, 1)  # Features: [[0], [1], [2], [3], [4]]

# 2. Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 3. Create and fit the dummy regressor
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train, y_train)

# 4. Predict and evaluate
preds = baseline.predict(X_test)
mse = mean_squared_error(y_test, preds)
print('Baseline predictions:', preds)
print('Baseline MSE:', mse)

Explanation:

  • DummyRegressor(strategy='mean'): Always predicts the average of the training targets.
  • fit(X_train, y_train): Learns the mean from the training data.
  • predict(X_test): Predicts the mean for all test samples.
  • mean_squared_error(y_test, preds): Measures how far off the predictions are from the real values.
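
As a quick check of the behavior described above, the following sketch confirms that every baseline prediction equals the mean of the training targets, regardless of the test features. The small arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Tiny hand-made split (assumed for illustration)
X_train = np.array([[0], [1], [3]])
y_train = np.array([1.0, 2.0, 4.0])
X_test = np.array([[2], [4]])

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
preds = baseline.predict(X_test)

# Every prediction is the training mean, whatever X_test contains
print("Training mean:", y_train.mean())
print("Baseline predictions:", preds)
```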

Step 2: Train a Real Linear Regression Model

Now let's train a real model that tries to find the best line through the data.

from sklearn.linear_model import LinearRegression

# 1. Create the model
model = LinearRegression()

# 2. Train the model on the training data
model.fit(X_train, y_train)

# 3. Make predictions on the test data
preds = model.predict(X_test)
print('Predictions:', preds)

Explanation:

  • LinearRegression(): Creates a model that fits a straight line using ordinary least squares.
  • fit(X_train, y_train): Finds the best line using the training data.
  • predict(X_test): Uses the line to predict values for the test data.
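
To see whether the model beats the baseline from Step 1, one option is to compare the test mean squared error of both. The sketch below rebuilds the same toy data so it runs on its own; the variable names baseline_mse and model_mse are introduced here for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same toy data as in Step 1: y = x + 1
X = np.arange(5).reshape(-1, 1)
y = np.array([1, 2, 3, 4, 5])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit both models on the same training split
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

# Compare errors on the same test split (lower is better)
baseline_mse = mean_squared_error(y_test, baseline.predict(X_test))
model_mse = mean_squared_error(y_test, model.predict(X_test))
print("Baseline MSE:", baseline_mse)
print("LinearRegression MSE:", model_mse)
```

Since this toy data is perfectly linear, the linear model's error should be essentially zero. On real data the gap is usually much smaller.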

Step 3: Use SGDRegressor for Large Datasets

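SGDRegressor is another way to fit a linear model. Instead of solving for the weights in one step, it uses stochastic gradient descent: the weights are updated a little at a time, one sample at a time, which scales well to large datasets. It is sensitive to feature scale, so standardizing the features first usually helps.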

from sklearn.linear_model import SGDRegressor

# 1. Create the SGDRegressor model
sgd = SGDRegressor(max_iter=1000, tol=1e-3, random_state=0)

# 2. Train the model
sgd.fit(X_train, y_train)

# 3. Make predictions
preds = sgd.predict(X_test)
print('SGD Predictions:', preds)

Explanation:

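  • SGDRegressor(max_iter=1000, tol=1e-3): Uses stochastic gradient descent to fit the model. max_iter is the maximum number of passes (epochs) over the training data. tol is the stopping tolerance: training stops early once the loss has stopped improving by at least tol for several consecutive epochs.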
  • fit(X_train, y_train): Trains the model.
  • predict(X_test): Makes predictions.

Key Parameters of SGDRegressor

Parameter      Purpose
max_iter       Maximum number of passes (epochs) over the training data
tol            Tolerance for the stopping criterion
learning_rate  Learning-rate schedule ('constant', 'optimal', 'invscaling', or 'adaptive')
penalty        Regularization term ('l2', 'l1', or 'elasticnet')
eta0           Initial learning rate
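
The parameters above can be set explicitly when creating the model. The sketch below also wraps SGDRegressor in a pipeline with StandardScaler, since gradient descent tends to behave poorly on unscaled features. The synthetic data and the specific parameter values are assumptions for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with deliberately large feature values (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1000, size=(500, 2))
y = 0.5 * X[:, 0] - 0.2 * X[:, 1] + 10 + rng.normal(scale=1.0, size=500)

# Standardize the features, then fit with SGD
sgd_pipeline = make_pipeline(
    StandardScaler(),
    SGDRegressor(
        max_iter=1000,               # at most 1000 passes over the data
        tol=1e-3,                    # stop early once the loss stops improving
        penalty="l2",                # L2 regularization
        learning_rate="invscaling",  # step size shrinks over time
        eta0=0.01,                   # initial step size
        random_state=0,
    ),
)
sgd_pipeline.fit(X, y)

# R^2 on the training data (1.0 would be a perfect fit)
r2 = sgd_pipeline.score(X, y)
print("Training R^2:", r2)
```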

Step 4: Accessing Model Weights

After training, you can look at the weights (also called coefficients) and the intercept (the bias or starting value) to see what the model learned.

print('LinearRegression weights:', model.coef_)
print('LinearRegression intercept:', model.intercept_)
print('SGDRegressor weights:', sgd.coef_)
print('SGDRegressor intercept:', sgd.intercept_)

Explanation:

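  • coef_: The weights for each feature. Each weight is the change in the prediction when that feature increases by one unit and the others stay fixed. A larger absolute value suggests a stronger effect, but weights are only comparable across features when the features are on similar scales.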
  • intercept_: The bias term. It's the value predicted when all features are zero.
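
To connect these attributes back to the formula y = w1*x1 + w2*x2 + ... + bias, the sketch below recomputes the predictions by hand from coef_ and intercept_ and checks that they match predict. The toy data is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on y = 1*x + 1 (assumed for illustration)
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
model = LinearRegression().fit(X, y)

# A prediction is the weighted sum of the features plus the intercept
manual = X @ model.coef_ + model.intercept_

print("coef_:", model.coef_)
print("intercept_:", model.intercept_)
print("Matches predict():", np.allclose(manual, model.predict(X)))
```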

Visualizing the Linear Regression Process