Building Linear Regression From Scratch

Why This Matters (And Why You Should Care)

Here’s the problem with most machine learning tutorials: they teach you how to use libraries. You import sklearn, call LinearRegression(), fit your data, and boom—you have a model. But what actually happened? No clue.

This is like learning to drive by being a passenger. You’ll get somewhere, but you won’t understand what’s happening under the hood.

Building linear regression from scratch is the best way to understand why it works, when it fails, and how to fix it. You’ll understand:

The intuition behind fitting a line to data
The math that makes it all work (don’t worry, it’s simpler than you think)
Why gradient descent is the key to learning
How to evaluate if your model is actually good

This isn’t just “for fun.” Understanding fundamentals changes how you approach ML. You’ll spot bugs in your models faster. You’ll know when linear regression is the right choice and when to move to something fancier. You’ll be the person who actually understands what’s happening, instead of just copy-pasting code.

By the end of this post, you’ll have:

A working linear regression implementation in pure Python
Deep understanding of loss functions and gradient descent
Multiple ways to evaluate your model
Practical knowledge of when linear regression works (and when it doesn’t)

Let’s build it.

Part 1: The Intuition (Before Any Math)

The Simplest Idea: Fitting a Line to Points

Imagine you have data about how many hours students studied and their exam scores:

Hours Studied (x) | Exam Score (y)
      2           |      45
      4           |      60
      5           |      72
      6           |      85
      7           |      89

You look at this data and think: “There’s a pattern here. More hours → higher scores.” You grab a pencil and draw a straight line that fits through these points as well as possible.

That line is your linear regression model.

The line has a simple equation:

$y = mx + b$

Where:

y = the predicted exam score (output)
x = hours studied (input)
m = slope (how much does score increase per hour?)
b = intercept (what’s the baseline score if you study 0 hours?)

Why a Straight Line?

Because linear relationships are simple and interpretable. A line captures the main trend without overfitting to noise.

Real-world examples where linear regression works:

Price prediction: More square feet → higher house price (usually linear-ish)
Sales forecasting: More marketing spend → more sales
Grade prediction: More study hours → higher grades
Temperature vs. Ice cream sales: Hotter days → more ice cream sold

The pattern is clear: one thing increases, the other tends to increase (or decrease) in a predictable way.

The Key Insight: We’re Searching for m and b

Think of linear regression as a search problem:

There are infinite possible lines (infinite combinations of m and b)
We need to find the one line that fits our data best
“Best” means closest to all the points

How do we measure “closest”? That’s where the loss function comes in.

Part 2: Understanding Loss Functions

The Problem: Predictions Are Never Perfect

You draw a line. But do all the data points sit exactly on the line? Almost never.

Real exam score: 72
Predicted by our line: 70
Error: 72 - 70 = 2

This gap between prediction and reality is error. We want to minimize it.

What Is a Loss Function?

A loss function is a number that tells you “how wrong your model is.” It’s your feedback signal.

Think of it like a compass:

High loss = “Your model sucks, go back”
Low loss = “You’re on the right track, keep going”

The goal of training is simple: minimize the loss.

Why Not Just Sum Absolute Errors?

The simplest error would be:

$\text{Total Error} = \sum_{i=1}^{n} |y_i - \hat{y}_i|$

Where $y_i$ is the real value and $\hat{y}_i$ is the predicted value.

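This works, but it has a problem: it’s not smooth. The absolute value has a sharp corner at zero, where its derivative is undefined, and everywhere else its slope is a constant ±1 no matter how large the error is. That makes gradient-based optimization awkward.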

We need a function that’s smooth and differentiable so we can use calculus to find the minimum. Enter: Mean Squared Error (MSE).

Note

Why this matters: The choice of loss function shapes how your model learns. A smooth, differentiable loss function lets us use powerful optimization techniques like gradient descent. Without it, optimization becomes much harder.
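To make the "sharp corner" concrete, here is a small sketch that estimates the slope of |e| and e² just to the left and right of zero using a finite difference. The helper names (`slope`, `absolute`, `squared`) and the step size `h` are illustrative choices, not part of the implementation built later in this post.

```python
def slope(f, e, h=1e-6):
    """Central finite-difference estimate of the derivative of f at e."""
    return (f(e + h) - f(e - h)) / (2 * h)

def absolute(e):
    return abs(e)

def squared(e):
    return e ** 2

for e in (-0.01, 0.01):
    print(f"e = {e:+.2f}: slope of |e| = {slope(absolute, e):+.2f}, "
          f"slope of e² = {slope(squared, e):+.2f}")
```

The slope of |e| jumps from -1 to +1 as the error crosses zero, while the slope of e² passes smoothly through zero and shrinks as the error shrinks.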

Part 3: Mean Squared Error (MSE)

The Formula

$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

What It Means

For each prediction:

Calculate the error: How far off was the prediction?
Square it: $(y_i - \hat{y}_i)^2$
Average all squared errors: Divide by n

Why Square the Error?

Three reasons:

1. Penalizes large errors heavily

Small error: (1)² = 1
Medium error: (3)² = 9
Large error: (10)² = 100

A prediction that’s off by 10 contributes 100× as much to the loss as one that’s off by 1. This forces the model to prioritize fixing big mistakes.

2. Always positive

Squaring turns negative errors into positive ones:

Error of -5: (-5)² = 25 ✓ (positive)
Error of +5: (+5)² = 25 ✓ (positive)

Without squaring, errors could cancel out (e.g., -5 + 5 = 0, but the model is still wrong).

3. Smooth and differentiable

The squared function is smooth—no sharp corners. This makes optimization much easier with calculus.

Example: Calculating MSE

Say you have 3 test points and your model predicts:

Real: [10, 20, 30]
Predicted: [9, 22, 28]
Errors (real − predicted): [1, -2, 2]
Squared errors: [1, 4, 4]
MSE = (1 + 4 + 4) / 3 = 3

MSE = 3. This is your loss for these predictions.
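The same arithmetic in a few lines of plain Python, as a quick sanity check. The function name `mse` is just an illustrative helper; the NumPy version used for training comes later in the post.

```python
def mse(y_real, y_pred):
    """Mean of the squared differences between real and predicted values."""
    n = len(y_real)
    return sum((real - pred) ** 2 for real, pred in zip(y_real, y_pred)) / n

print(mse([10, 20, 30], [9, 22, 28]))  # 3.0
```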

Part 4: Gradient Descent (The Magic Algorithm)

The Problem: We Can’t Just Guess m and b

There are infinite combinations of m and b. How do we find the best one?

You can’t try them all. You need a systematic way to search.

The Idea: Walk Downhill

Imagine you’re blindfolded on a mountain. You want to reach the lowest point (minimum loss). What do you do?

You feel the ground beneath your feet and walk in the direction that goes downward.

Gradient descent does exactly this:

Start with random m and b
Calculate the loss
Figure out which direction to move m and b to reduce the loss
Take a small step in that direction
Repeat until you reach a minimum

The Math: Gradients

The gradient is the slope of the loss function with respect to each parameter. It tells you:

Which direction is uphill (so the opposite direction is downhill)
How steep the slope is

For our loss function (MSE), the gradient with respect to m is:

$\frac{\partial L}{\partial m} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i) \cdot x_i$

And for b:

$\frac{\partial L}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)$

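Don’t memorize these. Just know:

They tell us how to adjust m and b
The leading negative sign comes from differentiating $(y_i - \hat{y}_i)^2$ with $\hat{y}_i = m x_i + b$; the downhill move comes from subtracting the gradient in the update rule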
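One way to build trust in these formulas is to compare them against a numerical estimate. The sketch below does that on the study-hours data at an arbitrary point (m = 1, b = 2); the function names and the step size `h` are assumptions made for this check, not part of the final implementation.

```python
import numpy as np

X = np.array([2, 4, 5, 6, 7], dtype=float)       # hours studied
y = np.array([45, 60, 72, 85, 89], dtype=float)  # exam scores

def mse(m, b):
    return np.mean((y - (m * X + b)) ** 2)

def analytic_gradients(m, b):
    """Gradients from the formulas above."""
    residual = y - (m * X + b)
    dm = -2 * np.mean(residual * X)
    db = -2 * np.mean(residual)
    return dm, db

def numeric_gradients(m, b, h=1e-6):
    """Central finite differences: nudge one parameter and watch the loss."""
    dm = (mse(m + h, b) - mse(m - h, b)) / (2 * h)
    db = (mse(m, b + h) - mse(m, b - h)) / (2 * h)
    return dm, db

m, b = 1.0, 2.0  # arbitrary test point
print("analytic:", analytic_gradients(m, b))
print("numeric: ", numeric_gradients(m, b))
```

If the two lines agree to several decimal places, the derivation is right. Both gradients are negative here, meaning the loss falls as m and b increase, which is why the update rule subtracts them.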

The Update Rule

After calculating the gradients, we update our parameters:

$m := m - \alpha \cdot \frac{\partial L}{\partial m}$

$b := b - \alpha \cdot \frac{\partial L}{\partial b}$

Where α (alpha) is the learning rate—how big a step we take.

Learning Rate: Goldilocks Zone

Too small:

α = 0.0001

Learning is SLOW. You'll reach the minimum, but it may take 100,000 iterations.

Too big:

α = 10

You overshoot the minimum and diverge. Loss gets worse instead of better.

Just right:

α = 0.01

Fast convergence. On well-scaled data you can get close to the minimum in ~100 iterations.

Finding the right learning rate is an art. You typically try a few values and see what works.
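"Try a few values" can look like the sketch below, which runs the same 100 steps of gradient descent on the study-hours data with three learning rates. The helper `final_loss` is a hypothetical convenience for this comparison, and 0.05 stands in for "too big" because much larger values overflow floating point on this unscaled data.

```python
import numpy as np

X = np.array([2, 4, 5, 6, 7], dtype=float)
y = np.array([45, 60, 72, 85, 89], dtype=float)

def final_loss(learning_rate, epochs=100):
    """Run plain gradient descent from m = b = 0 and return the final MSE."""
    m, b = 0.0, 0.0
    for _ in range(epochs):
        residual = y - (m * X + b)
        dm = -2 * np.mean(residual * X)
        db = -2 * np.mean(residual)
        m -= learning_rate * dm
        b -= learning_rate * db
    return np.mean((y - (m * X + b)) ** 2)

for lr in (0.0001, 0.01, 0.05):
    print(f"learning_rate = {lr}: MSE after 100 epochs = {final_loss(lr):.4g}")
```

On this data the middle rate should finish far lower than the small one, and the largest should blow up to an enormous loss. The right range always depends on the scale of your inputs.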

Note

Why this matters: Gradient descent is the backbone of modern machine learning. Every neural network, every deep learning model—they all use some variant of gradient descent. Understanding how it works here (on the simplest problem) makes you understand how it works everywhere.

Part 5: Building the Model (Code Time!)

Let’s implement linear regression from scratch. Pure Python. No sklearn.

Step 1: Dataset Representation

# Training data
X = [2, 4, 5, 6, 7]  # Hours studied
y = [45, 60, 72, 85, 89]  # Exam scores

The rest of this post uses NumPy arrays, which support the element-wise arithmetic the later steps rely on:

import numpy as np

X = np.array([2, 4, 5, 6, 7])
y = np.array([45, 60, 72, 85, 89])

Step 2: Initialize Parameters

def initialize_parameters():
    m = 0.0  # slope
    b = 0.0  # intercept
    return m, b

m, b = initialize_parameters()

Start with zero. Gradient descent will adjust them.

Step 3: Prediction Function

def predict(X, m, b):
    """
    Predict using y = mx + b
    """
    return m * X + b

# Example
y_pred = predict(np.array([3]), m, b)  # Predict for 3 hours
print(y_pred)  # Output: [0.] (since m and b are both 0)

Step 4: Loss Function (MSE)

def calculate_loss(y_real, y_pred):
    """
    Mean Squared Error
    """
    n = len(y_real)
    squared_errors = (y_real - y_pred) ** 2
    mse = np.sum(squared_errors) / n
    return mse

# Example
y_pred = predict(X, m, b)
loss = calculate_loss(y, y_pred)
print(f"Initial loss: {loss}")  # Initial loss: 5191.0 (high, since m=0, b=0)

Step 5: Gradient Computation

def compute_gradients(X, y, y_pred, n):
    """
    Compute gradients for m and b
    """
    # Gradient for m
    dm = (-2/n) * np.sum((y - y_pred) * X)

    # Gradient for b
    db = (-2/n) * np.sum(y - y_pred)

    return dm, db

# Example
y_pred = predict(X, m, b)
dm, db = compute_gradients(X, y, y_pred, len(y))
print(f"Gradient for m: {dm}, Gradient for b: {db}")

Step 6: Parameter Update

def update_parameters(m, b, dm, db, learning_rate):
    """
    Update m and b using gradient descent
    """
    m = m - learning_rate * dm
    b = b - learning_rate * db
    return m, b

# Example
learning_rate = 0.01
m, b = update_parameters(m, b, dm, db, learning_rate)
print(f"Updated m: {m}, Updated b: {b}")

Step 7: Training Loop

def train(X, y, learning_rate=0.01, epochs=100):
    """
    Train linear regression model
    """
    m, b = initialize_parameters()
    n = len(y)

    losses = []  # Track loss over time

    for epoch in range(epochs):
        # Step 1: Predict
        y_pred = predict(X, m, b)

        # Step 2: Calculate loss
        loss = calculate_loss(y, y_pred)
        losses.append(loss)

        # Step 3: Compute gradients
        dm, db = compute_gradients(X, y, y_pred, n)

        # Step 4: Update parameters
        m, b = update_parameters(m, b, dm, db, learning_rate)

        # Print progress every 10 epochs
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch + 1}: Loss = {loss:.4f}")

    return m, b, losses

# Train the model
m_final, b_final, losses = train(X, y, learning_rate=0.01, epochs=100)
print(f"\nFinal parameters: m = {m_final:.4f}, b = {b_final:.4f}")

Full Code: Putting It All Together

import numpy as np

class LinearRegression:
    def __init__(self, learning_rate=0.01, epochs=100):
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.m = 0.0
        self.b = 0.0
        self.losses = []

    def predict(self, X):
        """Predict using y = mx + b"""
        return self.m * X + self.b

    def calculate_loss(self, y_real, y_pred):
        """Mean Squared Error"""
        n = len(y_real)
        mse = np.sum((y_real - y_pred) ** 2) / n
        return mse

    def compute_gradients(self, X, y, y_pred):
        """Compute gradients for m and b"""
        n = len(y)
        dm = (-2/n) * np.sum((y - y_pred) * X)
        db = (-2/n) * np.sum(y - y_pred)
        return dm, db

    def train(self, X, y):
        """Train the model"""
        for epoch in range(self.epochs):
            # Predict
            y_pred = self.predict(X)

            # Calculate loss
            loss = self.calculate_loss(y, y_pred)
            self.losses.append(loss)

            # Compute gradients
            dm, db = self.compute_gradients(X, y, y_pred)

            # Update parameters
            self.m = self.m - self.learning_rate * dm
            self.b = self.b - self.learning_rate * db

            if (epoch + 1) % 10 == 0:
                print(f"Epoch {epoch + 1}: Loss = {loss:.4f}, m = {self.m:.4f}, b = {self.b:.4f}")

    def get_params(self):
        """Return the learned parameters"""
        return self.m, self.b

# Usage
X = np.array([2, 4, 5, 6, 7])
y = np.array([45, 60, 72, 85, 89])

model = LinearRegression(learning_rate=0.01, epochs=100)
model.train(X, y)

m, b = model.get_params()
print(f"\nFinal equation: y = {m:.4f}x + {b:.4f}")

# Make predictions
X_test = np.array([3, 8, 9])
y_test_pred = model.predict(X_test)
print(f"Predictions for {X_test}: {y_test_pred}")

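Output (values rounded)

Epoch 10: Loss ≈ 62.06, m ≈ 13.48, b ≈ 3.01
Epoch 20: Loss ≈ 59.59, m ≈ 13.39, b ≈ 3.50
Epoch 30: Loss ≈ 57.23, m ≈ 13.31, b ≈ 3.98
...
Epoch 100: Loss ≈ 43.33, m ≈ 12.74, b ≈ 7.04

Final equation: y ≈ 12.74x + 7.04

See the loss decreasing? That’s gradient descent working. The model is learning, but it hasn’t finished: the loss drops quickly at first and then slowly, because the loss surface is nearly flat along one direction (mostly the intercept). The least-squares line for this data is roughly y = 9.34x + 25.38 (MSE ≈ 4.86), and at this learning rate it takes a few thousand epochs to get there.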
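For a single feature there is a closed-form answer to compare against, which is handy for checking whether gradient descent has converged. The sketch below uses the standard least-squares formulas (slope = covariance / variance) rather than gradient descent; the variable names `m_best`, `b_best`, and `mse_best` are illustrative.

```python
import numpy as np

X = np.array([2, 4, 5, 6, 7], dtype=float)
y = np.array([45, 60, 72, 85, 89], dtype=float)

# Closed-form simple linear regression (not gradient descent):
#   m = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   b = y_mean - m * x_mean
x_mean, y_mean = X.mean(), y.mean()
m_best = np.sum((X - x_mean) * (y - y_mean)) / np.sum((X - x_mean) ** 2)
b_best = y_mean - m_best * x_mean
mse_best = np.mean((y - (m_best * X + b_best)) ** 2)

print(f"m = {m_best:.4f}, b = {b_best:.4f}, MSE = {mse_best:.4f}")
# m = 9.3378, b = 25.3784, MSE = 4.8622
```

No straight line can score a lower MSE on this data, so this is the target that gradient descent approaches as you add epochs.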

Part 6: Visualizing the Learning Process

Understanding what’s happening visually is crucial. Let’s visualize:

The regression line improving over iterations
Loss decreasing over time

import numpy as np
import matplotlib.pyplot as plt

# Training data
X = np.array([2, 4, 5, 6, 7])
y = np.array([45, 60, 72, 85, 89])

# Train model
model = LinearRegression(learning_rate=0.01, epochs=100)
model.train(X, y)
m, b = model.get_params()

# Plot 1: Data points and fitted line
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Plot data and line
ax1.scatter(X, y, color='blue', s=100, label='Actual data')
X_line = np.array([0, 10])
y_line = model.predict(X_line)
ax1.plot(X_line, y_line, color='red', linewidth=2, label=f'Fitted line: y = {m:.2f}x + {b:.2f}')
ax1.set_xlabel('Hours Studied')
ax1.set_ylabel('Exam Score')
ax1.set_title('Linear Regression: Hours vs Scores')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Loss decreasing over epochs
ax2.plot(model.losses, color='green', linewidth=2)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Loss (MSE)')
ax2.set_title('Training Loss Over Time')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Output:

The first plot shows:

Blue dots: actual data
Red line: the fitted regression line
The line runs through the cloud of points; training pushes it toward the line that minimizes the squared vertical distances to them

The second plot shows:

Loss starts high
Decreases steeply as the model learns
Flattens out as it approaches minimum

This is exactly what we want to see.

Part 7: Evaluating the Model

The Problem: Loss Alone Isn’t Enough

Say your training loss is 5.23. Is that good? Bad? Who knows?

Raw MSE is hard to interpret because:

It depends on the scale of your target variable
- If y is in range [0, 100], MSE = 5 is great
- If y is in range [0, 1], MSE = 5 is terrible
It’s not in the original units
- MSE = 5 doesn’t mean “off by 5 points on average”
You can’t compare across different datasets

We need interpretable metrics. Let’s build them.

Metric 1: Root Mean Squared Error (RMSE)

Formula:

$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$

Why take the square root?

The square root “undoes” the squaring. Now the error is in the original units.

MSE = 25
RMSE = 5

Interpretation: A typical prediction is off by roughly 5 exam points.
Much more interpretable!

Implementation:

def calculate_rmse(y_real, y_pred):
    """Root Mean Squared Error"""
    mse = np.mean((y_real - y_pred) ** 2)
    rmse = np.sqrt(mse)
    return rmse

# Example
y_pred = model.predict(X)
rmse = calculate_rmse(y, y_pred)
print(f"RMSE: {rmse:.2f}")  # about 6.57 for the model trained for 100 epochs at learning_rate=0.01

When to use RMSE:

You want an error metric in the original units
You care about penalizing large errors heavily
You have outliers and want to notice them

When NOT to use:

Comparing across datasets with different scales
You have extreme outliers that you don’t want to overweight

Note

Why this matters: RMSE is more interpretable than MSE, which is why it’s widely used in practice. But it’s still sensitive to outliers because of the squaring step. If one prediction is wildly off, it dominates the metric. For robust evaluation, pair RMSE with other metrics.

Metric 2: Mean Absolute Error (MAE)

Sometimes we don’t want to penalize large errors as heavily as RMSE does.

Formula:

$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$

Just the average absolute error. No squaring.

Comparison:

Predictions: [10, 20, 30]
Actual: [9, 22, 28]
Errors (actual − predicted): [-1, 2, -2]

MSE = (1 + 4 + 4) / 3 = 3
RMSE = √3 = 1.73
MAE = (1 + 2 + 2) / 3 = 1.67

RMSE and MAE are similar here. But with outliers, they differ:

Predictions: [10, 20, 100]
Actual: [9, 22, 28]
Errors (actual − predicted): [-1, 2, -72]

RMSE = √((1 + 4 + 5184) / 3) = √1729.67 ≈ 41.6  (heavily penalizes the outlier)
MAE = (1 + 2 + 72) / 3 = 25  (more balanced)

When to use:

You have outliers you don’t want to overweight
You want a symmetric metric (doesn’t matter if you overpredict or underpredict)

When NOT to use:

You specifically want to penalize large errors (use RMSE)
You need a smooth, differentiable metric for optimization
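The two comparisons above can be reproduced in a few lines. This is a minimal sketch; the helper names `rmse` and `mae` and the variable names are illustrative, and the numbers are the ones from the worked examples.

```python
import numpy as np

def rmse(y_real, y_pred):
    return np.sqrt(np.mean((np.asarray(y_real) - np.asarray(y_pred)) ** 2))

def mae(y_real, y_pred):
    return np.mean(np.abs(np.asarray(y_real) - np.asarray(y_pred)))

actual = [9, 22, 28]
clean = [10, 20, 30]     # every prediction is close
outlier = [10, 20, 100]  # the last prediction is wildly off

print(f"clean:   RMSE = {rmse(actual, clean):.2f}, MAE = {mae(actual, clean):.2f}")
print(f"outlier: RMSE = {rmse(actual, outlier):.2f}, MAE = {mae(actual, outlier):.2f}")
# clean:   RMSE = 1.73, MAE = 1.67
# outlier: RMSE = 41.59, MAE = 25.00
```

A large gap between RMSE and MAE is itself a useful signal: it usually means a few large errors are dominating.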

Metric 3: R² Score (Coefficient of Determination)

This is the most commonly reported summary metric for regression.

Formula:

$R^2 = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}$

Where:

SS_res (residual sum of squares) = $\sum (y_i - \hat{y}_i)^2$ (how wrong predictions are)
SS_tot (total sum of squares) = $\sum (y_i - \bar{y})^2$ (how varied the data is)

What R² Means

R² measures how much of the variance in y your model explains.

Examples:

R² = 0.95  → "My model explains 95% of the variance. Excellent."
R² = 0.50  → "My model explains 50% of the variance. Okay."
R² = 0.00  → "My model explains 0% of the variance. It's as good as just using the mean."
R² < 0.00  → "Your model is worse than just using the mean. Delete it."

Range: R² goes from $-\infty$ to 1

1 = perfect fit (all points on the line)
0.5 = moderate fit (explains half the variation)
0 = terrible fit (no better than predicting the mean)
Negative = worse than useless

Implementation:

def calculate_r_squared(y_real, y_pred):
    """R² Score"""
    ss_res = np.sum((y_real - y_pred) ** 2)
    ss_tot = np.sum((y_real - np.mean(y_real)) ** 2)
    r_squared = 1 - (ss_res / ss_tot)
    return r_squared

# Example
y_pred = model.predict(X)
r2 = calculate_r_squared(y, y_pred)
print(f"R² Score: {r2:.4f}")  # about 0.84 for the model trained for 100 epochs at learning_rate=0.01

Interpretation:

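An R² of about 0.84 means the model explains roughly 84% of the variance in exam scores. Training for more epochs moves it toward the ceiling for any straight line on this data: the least-squares fit, with R² ≈ 0.98.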
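The reference points on the R² scale are easy to verify directly. The sketch below scores three deliberately simple "models" on the exam scores; `r_squared` mirrors the function above, while `mean_baseline` and `constant_zero` are hypothetical predictors introduced only for this check.

```python
import numpy as np

def r_squared(y_real, y_pred):
    ss_res = np.sum((y_real - y_pred) ** 2)
    ss_tot = np.sum((y_real - np.mean(y_real)) ** 2)
    return 1 - ss_res / ss_tot

y = np.array([45, 60, 72, 85, 89], dtype=float)

mean_baseline = np.full_like(y, y.mean())  # always predict the average score
constant_zero = np.zeros_like(y)           # always predict 0

print(r_squared(y, y))              # perfect predictions: 1.0
print(r_squared(y, mean_baseline))  # no better than the mean: 0.0
print(r_squared(y, constant_zero))  # worse than the mean: negative
```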

R² vs MSE vs RMSE vs MAE: Clear Comparison

Metric	Formula	Range	Units	Interpretation
MSE	$\frac{1}{n}\sum(y_i - \hat{y}_i)^2$	[0, ∞)	Squared units	Hard to interpret; data-dependent
RMSE	$\sqrt{\text{MSE}}$	[0, ∞)	Original units	Typical error in original units; penalizes outliers
MAE	$\frac{1}{n}\sum|y_i - \hat{y}_i|$	[0, ∞)	Original units	Avg error; robust to outliers
R²	$1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}$	(-∞, 1]	Unitless	Share of variance explained (negative if worse than the mean); scale-independent

When to Use Each Metric

Use MSE if:

You’re optimizing the model (it’s differentiable and smooth)
You specifically want to penalize large errors

Use RMSE if:

You want to communicate error in original units
You have a sense of what’s acceptable error

Use MAE if:

You have outliers and don’t want to overweight them
You want a robust, interpretable metric

Use R² if:

You want a single metric that’s scale-independent
You’re comparing models across different datasets
You want to know “how much does this model explain?”

Best practice: Report all metrics. Different stakeholders care about different things.

# Complete evaluation
y_pred = model.predict(X)

mse = np.mean((y - y_pred) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y - y_pred))
r2 = calculate_r_squared(y, y_pred)

print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R²: {r2:.4f}")

Note

Why this matters: Choosing the right evaluation metric is as important as choosing the right model. A high R² on training data doesn’t mean your model generalizes to new data. MSE can be misleading with outliers. You need to understand what each metric measures and pick the one(s) that align with your problem.

Part 8: The Generalized Case (Multiple Features)

So far we’ve fit a line: $y = mx + b$.

But what if you have multiple features?

Example: Predicting house prices with multiple inputs:

x₁ = square feet
x₂ = number of bedrooms
x₃ = age of house
y = price

The equation becomes:

$y = m_1 x_1 + m_2 x_2 + m_3 x_3 + b$

Or in vector form:

$y = w^T x + b$

Where:

w = [m₁, m₂, m₃] (weights for each feature)
x = [x₁, x₂, x₃] (features)

Good news: The algorithm stays the same. Compute loss, compute gradients, update weights.

Implementation: Replace scalars with vectors

import numpy as np

class MultipleLinearRegression:
    def __init__(self, learning_rate=0.01, epochs=100):
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.weights = None
        self.bias = 0.0
        self.losses = []

    def predict(self, X):
        """Predict using multiple features"""
        # X shape: (n_samples, n_features)
        # weights shape: (n_features,)
        return np.dot(X, self.weights) + self.bias

    def compute_gradients(self, X, y, y_pred):
        """Compute gradients for all weights"""
        n = len(y)
        errors = y - y_pred

        # Gradient for weights
        dw = (-2/n) * np.dot(X.T, errors)

        # Gradient for bias
        db = (-2/n) * np.sum(errors)

        return dw, db

    def train(self, X, y):
        """Train the model"""
        # Initialize weights
        n_features = X.shape[1]
        self.weights = np.zeros(n_features)

        for epoch in range(self.epochs):
            y_pred = self.predict(X)
            loss = np.mean((y - y_pred) ** 2)
            self.losses.append(loss)

            dw, db = self.compute_gradients(X, y, y_pred)

            self.weights = self.weights - self.learning_rate * dw
            self.bias = self.bias - self.learning_rate * db

            if (epoch + 1) % 10 == 0:
                print(f"Epoch {epoch + 1}: Loss = {loss:.4f}")

# Usage
X = np.array([
    [1000, 3, 10],  # 1000 sqft, 3 bedrooms, 10 years old
    [1500, 4, 5],
    [2000, 4, 2],
    [2500, 5, 1]
])
y = np.array([250000, 350000, 400000, 500000])  # prices

# Standardize the features first. On the raw values (square footage in the
# thousands), even learning_rate=0.00001 is too large and the loss overflows.
mean, std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / std

model = MultipleLinearRegression(learning_rate=0.1, epochs=100)
model.train(X_scaled, y)

# Predict (scale new inputs with the same training mean and std)
X_new = np.array([[1200, 3, 8]])
price = model.predict((X_new - mean) / std)
print(f"Predicted price: ${price[0]:.2f}")

The algorithm scales to any number of features. Linear algebra makes it elegant.
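To see that the vectorized gradient is nothing more than the single-feature formula applied to every column at once, the sketch below computes it both ways on small synthetic data. The random inputs, weights, and variable names are made up purely for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))  # 6 samples, 3 features (synthetic)
y = rng.normal(size=6)
w = np.array([0.5, -1.0, 2.0])
bias = 0.1

errors = y - (X @ w + bias)
n = len(y)

# Vectorized gradient, as in compute_gradients
dw_vectorized = (-2 / n) * X.T @ errors

# The same gradient, one feature at a time (the scalar formula for m, repeated)
dw_loop = np.array([(-2 / n) * np.sum(errors * X[:, j]) for j in range(X.shape[1])])

print(np.allclose(dw_vectorized, dw_loop))  # True
```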

Part 9: Common Mistakes & Pitfalls

Mistake 1: Learning Rate Too High

You take big steps down the mountain and overshoot the valley.

model = LinearRegression(learning_rate=1.0, epochs=100)
model.train(X, y)

Result: Loss explodes (diverges) instead of decreasing.

Fix: Start with a small learning rate (0.01) and increase if needed.

Mistake 2: Learning Rate Too Low

You take tiny steps. It takes forever to reach the minimum.

model = LinearRegression(learning_rate=0.0001, epochs=100)
model.train(X, y)

Result: Very slow convergence. After 100 epochs, barely any improvement.

Fix: Use a moderate learning rate. 0.01 to 0.1 is a good starting point for standardized features; on the raw study-hours data above, anything above roughly 0.037 diverges.
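Where does the boundary between "converges" and "diverges" sit? For a quadratic loss like MSE, a standard result is that fixed-step gradient descent is stable only when the learning rate is below 2 divided by the largest eigenvalue of the loss's Hessian (its curvature matrix). The sketch below computes that bound for the study-hours data; the variable names are illustrative.

```python
import numpy as np

X = np.array([2, 4, 5, 6, 7], dtype=float)
n = len(X)

# Hessian of the MSE with respect to (m, b); it depends only on X
A = np.column_stack([X, np.ones(n)])
hessian = (2 / n) * A.T @ A

largest = np.linalg.eigvalsh(hessian).max()
print(f"largest curvature: {largest:.2f}")
print(f"gradient descent diverges for learning rates above {2 / largest:.4f}")
# largest curvature: 53.78
# gradient descent diverges for learning rates above 0.0372
```

This is why 0.01 works on this dataset and 1.0 explodes. Rescaling the inputs changes the curvature, and with it the usable range of learning rates.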

Mistake 3: Not Normalizing Input Features

If features have different scales, gradient descent behaves weirdly.

Feature 1 (age): ranges from 1 to 100
Feature 2 (income): ranges from 10,000 to 1,000,000

The income feature dominates. Gradients are unbalanced.

Fix: Normalize (scale) your features

def normalize(X):
    """Normalize features to have mean=0, std=1"""
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    return (X - mean) / std

X_normalized = normalize(X)

Now all features are on the same scale.
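One practical detail: new data has to be scaled with the mean and standard deviation computed from the training data, not its own. The sketch below splits normalization into two hypothetical helpers, `fit_scaler` and `apply_scaler`, and adds a guard for constant columns; the age and income values are made up for illustration.

```python
import numpy as np

def fit_scaler(X_train):
    """Return the per-feature mean and std computed on the training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # constant columns: avoid dividing by zero
    return mean, std

def apply_scaler(X, mean, std):
    return (X - mean) / std

# Illustrative data: columns are (age, income)
X_train = np.array([[25, 40_000], [40, 85_000], [58, 120_000], [33, 62_000]], dtype=float)
X_new = np.array([[45, 90_000]], dtype=float)

mean, std = fit_scaler(X_train)
X_train_scaled = apply_scaler(X_train, mean, std)
X_new_scaled = apply_scaler(X_new, mean, std)  # reuse the training statistics

print(X_train_scaled.mean(axis=0))  # approximately [0, 0]
print(X_train_scaled.std(axis=0))   # approximately [1, 1]
print(X_new_scaled)
```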

Mistake 4: Misinterpreting R²

R² = 0.85

Student 1 thinks: "My model is 85% correct!"
Student 2 thinks: "My model explains 85% of the variance."

Only Student 2 is right. R² ≠ accuracy. It’s a measure of variance explained, not correctness.

Also, high R² on training data doesn’t mean good generalization. You might be overfitting.

Mistake 5: Using Linear Regression When Data Isn’t Linear

Linear regression assumes a linear relationship. If your data is curved, a straight line won’t fit well.

Example: Population growth (exponential)
Linear model: R² = 0.60
Exponential model: R² = 0.98

Fix: Check if your data is actually linear first (scatter plot). If not, use polynomial regression or other models.
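A small sketch of the effect, using deliberately curved synthetic data (y = x², a quadratic rather than the exponential example above) and NumPy's `polyfit` for the least-squares fits instead of the gradient descent code from this post. The helper `r_squared` mirrors the R² function defined earlier.

```python
import numpy as np

def r_squared(y_real, y_pred):
    ss_res = np.sum((y_real - y_pred) ** 2)
    ss_tot = np.sum((y_real - np.mean(y_real)) ** 2)
    return 1 - ss_res / ss_tot

# Synthetic data, made up for illustration: y = x^2 exactly
x = np.arange(-3, 4, dtype=float)  # -3, -2, ..., 3
y = x ** 2

line = np.polyfit(x, y, deg=1)      # best straight line
parabola = np.polyfit(x, y, deg=2)  # best quadratic

r2_line = r_squared(y, np.polyval(line, x))
r2_parabola = r_squared(y, np.polyval(parabola, x))
print(f"linear R²:    {r2_line:.3f}")
print(f"quadratic R²: {r2_parabola:.3f}")
```

The straight line explains essentially none of the variance here, while the quadratic fits perfectly. A scatter plot would have revealed the curve before any model was trained.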

Mistake 6: Ignoring Outliers

One data point far away from the trend can pull the line.

X = [1, 2, 3, 4, 100]  # 100 is an outlier
y = [10, 20, 30, 40, 5000]

Fitted line: heavily influenced by the outlier

Fix:

Investigate outliers. Are they errors or real data?
If errors, remove them.
If real but extreme, use robust metrics (MAE instead of RMSE).
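To see how much a single point can move the line, the sketch below fits the example data with and without the outlier. It uses NumPy's `polyfit` as a stand-in for a fully converged least-squares fit; the variable names are illustrative.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 100], dtype=float)
y = np.array([10, 20, 30, 40, 5000], dtype=float)

slope_all, intercept_all = np.polyfit(X, y, deg=1)
slope_clean, intercept_clean = np.polyfit(X[:-1], y[:-1], deg=1)  # drop the last point

print(f"with outlier:    y = {slope_all:.2f}x + {intercept_all:.2f}")
print(f"without outlier: y = {slope_clean:.2f}x + {intercept_clean:.2f}")
```

The first four points lie exactly on y = 10x, yet one extreme point is enough to multiply the fitted slope several times over.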

Part 10: When Linear Regression Works (And When It Doesn’t)

When Linear Regression Is the Right Choice

Relationship is roughly linear
- House price vs. square footage ✓
- Sales vs. marketing spend ✓
You need interpretability
- “Each additional bedroom adds $50,000 to house price”
- Business stakeholders like this
Data is relatively clean
- Few outliers
- No missing values (or easily handled)
Speed matters
- Linear regression trains instantly
- Great for real-time predictions

When Linear Regression Fails

Relationship is non-linear
- Energy use vs. outdoor temperature (U-shaped: heating when it’s cold, cooling when it’s hot)
- Tumor size vs. cancer stage (steps, not continuous)
Complex interactions between features
- The effect of Feature A depends on the value of Feature B
- A plain linear model just adds the two effects together
- It can’t capture this unless you supply the product A × B as its own feature
Too many features, too little data
- 1000 features, 50 samples
- Model overfits
Categorical variables without encoding
- Feature: “Color” (red, blue, green)
- Linear regression doesn’t know how to handle this

How to Extend Linear Regression

If linear regression isn’t enough:

Polynomial Regression
- Add polynomial terms: $y = m_1 x + m_2 x^2 + m_3 x^3 + b$
- Fits curved relationships
Feature Engineering
- Create new features from existing ones
- Example: Instead of (age), use (age²) and (age × income)
Regularization (L1/L2)
- Prevent overfitting by penalizing large weights
- Makes the model more generalizable
Ridge/Lasso Regression
- Variants of linear regression with built-in regularization
Switch to non-linear models
- Decision trees, neural networks, SVMs

Part 11: Why Build From Scratch?

At this point, you might be thinking: “Why not just use sklearn?”

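import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([2, 4, 5, 6, 7]).reshape(-1, 1)  # sklearn expects 2-D X: (n_samples, n_features)
y = np.array([45, 60, 72, 85, 89])

model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)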

This works, but you miss everything. You don’t know:

Why MSE is easier to optimize than MAE
How gradient descent finds the optimal parameters
Why learning rate matters
How to debug when things go wrong

Building from scratch teaches:

Intuition – You see the algorithm step by step
Debugging skills – You can spot where things break
Customization – You can tweak the algorithm for your problem
Respect for complexity – You understand why neural networks need GPUs

Once you understand linear regression deeply, moving to complex models becomes much easier. You already know gradients, loss functions, and optimization. Those concepts scale to everything.

Part 12: Key Takeaways

Linear regression finds the best-fit line for a relationship between x and y
MSE is the loss function – It measures how wrong the model is in a smooth, differentiable way
Gradient descent is the learning mechanism – It takes small steps downhill to minimize loss
RMSE and R² are evaluation metrics – They tell you how good the model actually is
Normalize your features before training – Different scales cause problems
Low training loss doesn’t mean low test loss – Always validate on unseen data
Linear regression works when the relationship is actually linear – Check with a scatter plot first
Understanding from scratch scales to everything – All of modern ML builds on these foundations