2025-10-04 11:29 Tags:

Linear Regression — Gradients and Gradient Descent

1. Cost Function Reminder

We use Mean Squared Error (MSE):

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$

where:

  • $m$ = number of samples
  • $y^{(i)}$ = true value for sample $i$
  • $\hat{y}^{(i)} = \theta^{T}x^{(i)}$ (prediction from our model)
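A minimal NumPy sketch of this cost, assuming a toy dataset (the names X, y, theta and the example values are illustrative, not from the note):

```python
import numpy as np

def mse_cost(X, y, theta):
    """MSE: J(theta) = (1/m) * sum((y_hat - y)^2) with y_hat = X @ theta."""
    m = len(y)
    errors = X @ theta - y              # prediction minus true value, shape (m,)
    return (errors ** 2).sum() / m

# Toy example: 3 samples, a column of 1's for the intercept plus one feature
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.array([0.0, 2.0])            # intercept 0, slope 2 -> perfect fit
print(mse_cost(X, y, theta))            # 0.0
```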

2. Take Derivative (Gradient)

We want to minimize $J(\theta)$, so we take derivatives.

For one parameter $\theta_j$:

$$\frac{\partial J}{\partial \theta_j} = \frac{2}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)x_j^{(i)}$$

  • This tells us how sensitive the cost is to changing $\theta_j$.
  • If the derivative is positive → cost increases as $\theta_j$ grows.
  • If the derivative is negative → cost decreases as $\theta_j$ grows.
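
One way to see where this formula comes from (a short derivation, assuming the MSE above and $\hat{y}^{(i)} = \theta^{T}x^{(i)}$, so $\partial \hat{y}^{(i)} / \partial \theta_j = x_j^{(i)}$):

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m} 2\left(\hat{y}^{(i)} - y^{(i)}\right)\frac{\partial \hat{y}^{(i)}}{\partial \theta_j} = \frac{2}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)x_j^{(i)}$$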

3. Gradient Vector

Instead of doing one parameter at a time, we collect all the derivatives into a single vector:

$$\nabla_{\theta} J(\theta) = \begin{bmatrix} \dfrac{\partial J}{\partial \theta_0} \\[4pt] \dfrac{\partial J}{\partial \theta_1} \\[4pt] \vdots \\[4pt] \dfrac{\partial J}{\partial \theta_n} \end{bmatrix}$$

This is called the gradient.
It points in the direction of steepest increase of the cost.
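
A loop-based sketch that builds this vector one partial derivative at a time (reusing the illustrative X, y, theta from the earlier mse_cost snippet; a sketch, not a definitive implementation):

```python
import numpy as np

def gradient_loop(X, y, theta):
    """Gradient of MSE, one partial derivative per parameter."""
    m, n = X.shape
    errors = X @ theta - y
    grad = np.zeros(n)
    for j in range(n):
        # dJ/dtheta_j = (2/m) * sum_i (error_i * x_ij)
        grad[j] = (2.0 / m) * (errors * X[:, j]).sum()
    return grad
```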


![[Pasted image 20251004114943.png]]
![[Pasted image 20251004114954.png]]
![[Pasted image 20251004115001.png]]
![[Pasted image 20251004115042.png]]
![[Pasted image 20251004115050.png]]
![[Pasted image 20251004115101.png]]

4. Matrix Form (Vectorization)

We can rewrite everything using matrices:

  • $X$ = feature matrix (size $m \times (n+1)$, with a column of 1's for the intercept)
  • $y$ = vector of actual outputs (size $m \times 1$)
  • $\theta$ = vector of parameters (size $(n+1) \times 1$)

Prediction:

$$\hat{y} = X\theta$$

Gradient:

$$\nabla_{\theta} J(\theta) = \frac{2}{m} X^{T}\left(X\theta - y\right)$$

This is a compact and efficient way to compute all the partial derivatives at once.
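
In NumPy the matrix form is essentially a one-liner (a sketch under the same assumptions as the earlier snippets):

```python
def gradient_vectorized(X, y, theta):
    """Gradient of MSE in matrix form: (2/m) * X^T (X theta - y)."""
    m = len(y)
    return (2.0 / m) * X.T @ (X @ theta - y)
```

For the toy data above this returns the same values as gradient_loop, without any explicit Python loop.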


5. Gradient Descent Update Rule

We iteratively update the parameters:

$$\theta := \theta - \eta \, \nabla_{\theta} J(\theta)$$

  • $\eta$ = learning rate (step size).
  • We keep moving in the opposite direction of the gradient until convergence.
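
A single update step as a sketch (eta = 0.1 is an illustrative choice; gradient_vectorized, X, y, theta come from the earlier snippets):

```python
eta = 0.1                                                  # learning rate (step size)
theta = theta - eta * gradient_vectorized(X, y, theta)     # move against the gradient
```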

6. Intuition (Why Steps Change Size)

  • At the start:
    Gradient is large → big steps downhill.

  • Near the minimum:
    Gradient is small → smaller, finer steps.

So the algorithm naturally slows down as it gets closer to the best solution.


7. Algorithm Process

  1. Initialize $\theta$ randomly (or with zeros).
  2. Compute the gradient $\nabla_{\theta} J(\theta)$.
  3. Update $\theta := \theta - \eta \, \nabla_{\theta} J(\theta)$.
  4. Repeat until convergence (the cost stops decreasing).
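
Putting the four steps together, a minimal end-to-end sketch (toy data; eta, max_iters, and tol are illustrative choices, not values from the note):

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, max_iters=10_000, tol=1e-10):
    """Batch gradient descent for linear regression with the MSE cost."""
    m, n = X.shape
    theta = np.zeros(n)                       # step 1: initialize (zeros here)
    prev_cost = np.inf
    for _ in range(max_iters):
        errors = X @ theta - y
        grad = (2.0 / m) * X.T @ errors       # step 2: compute the gradient
        theta -= eta * grad                   # step 3: update theta
        cost = (errors ** 2).sum() / m        # cost at the theta before this update
        if prev_cost - cost < tol:            # step 4: stop once cost stops decreasing
            break
        prev_cost = cost
    return theta

# Toy data generated from y = 1 + 2x (column of 1's included for the intercept)
X = np.c_[np.ones(5), np.arange(5.0)]
y = 1.0 + 2.0 * np.arange(5.0)
print(gradient_descent(X, y))                 # approximately [1., 2.]
```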