2025-10-04 11:29 Tags:
Linear Regression — Gradients and Gradient Descent
1. Cost Function Reminder
We use Mean Squared Error (MSE), with a conventional factor of $\frac{1}{2}$ that cancels when we take the derivative:
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2$$
where:
- $m$ = number of samples
- $y^{(i)}$ = true value
- $\hat{y}^{(i)} = \theta^T x^{(i)}$ (prediction from our model)
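As a minimal sketch (assuming NumPy, a design matrix `X` that already contains the intercept column, and the $\frac{1}{2m}$ convention above), the cost can be computed like this:
```python
import numpy as np

def cost(theta, X, y):
    """Half-MSE cost: J(theta) = 1/(2m) * sum((X @ theta - y)^2)."""
    m = len(y)
    residuals = X @ theta - y      # predictions minus true values
    return (residuals @ residuals) / (2 * m)
```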
2. Take Derivative (Gradient)
We want to minimize $J(\theta)$, so we take derivatives.
For one parameter $\theta_j$:
$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)}$$
- This tells us how sensitive the cost is to changing $\theta_j$.
- If the derivative is positive → cost increases as $\theta_j$ grows.
- If the derivative is negative → cost decreases as $\theta_j$ grows.
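A sketch of this single-parameter derivative (assuming the same NumPy setup as above, with column `j` of `X` holding feature $x_j$):
```python
def partial_derivative(theta, X, y, j):
    """dJ/d(theta_j) = 1/m * sum((y_hat - y) * x_j)."""
    m = len(y)
    residuals = X @ theta - y
    return (residuals @ X[:, j]) / m
```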
3. Gradient Vector
Instead of doing one parameter at a time, we collect all derivatives:
$$\nabla_\theta J(\theta) = \begin{bmatrix} \dfrac{\partial J}{\partial \theta_0} \\[4pt] \dfrac{\partial J}{\partial \theta_1} \\[4pt] \vdots \\[4pt] \dfrac{\partial J}{\partial \theta_n} \end{bmatrix}$$
This is called the gradient.
It points in the direction of steepest increase of the cost.
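A direct way to build this vector (still assuming the NumPy setup above) is to loop over the parameters and stack each partial derivative:
```python
def gradient_loop(theta, X, y):
    """Stack dJ/d(theta_j) for every parameter j into one vector."""
    m, n = X.shape
    residuals = X @ theta - y
    grad = np.zeros(n)
    for j in range(n):
        grad[j] = (residuals @ X[:, j]) / m
    return grad
```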
4. Matrix Form (Vectorization)
We can rewrite $J(\theta)$ and its gradient using matrices:
- $X$ = feature matrix (size $m \times (n+1)$, with a column of 1's for the intercept)
- $y$ = vector of actual outputs (size $m \times 1$)
- $\theta$ = vector of parameters (size $(n+1) \times 1$)
Prediction:
$$\hat{y} = X\theta$$
Gradient:
$$\nabla_\theta J(\theta) = \frac{1}{m} X^T (X\theta - y)$$
This is a compact and efficient way to compute the gradient.
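The same quantity as the loop above, as a one-line vectorized sketch:
```python
def gradient(theta, X, y):
    """Vectorized gradient: (1/m) * X^T (X @ theta - y)."""
    m = len(y)
    return X.T @ (X @ theta - y) / m
```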
5. Gradient Descent Update Rule
We iteratively update:
$$\theta := \theta - \alpha \, \nabla_\theta J(\theta)$$
- $\alpha$ = learning rate (step size).
- We keep moving in the opposite direction of the gradient until convergence.
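In code, one update step looks like this (a sketch using the `gradient` helper above; the value of `alpha` is an assumption):
```python
alpha = 0.01                                   # assumed learning rate
theta = theta - alpha * gradient(theta, X, y)  # move against the gradient
```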
6. Intuition (Why Steps Change Size)
- At the start: gradient is large → big steps downhill.
- Near the minimum: gradient is small → smaller, finer steps.
So the algorithm naturally slows down as it gets closer to the best solution.
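A tiny worked example (an assumed one-dimensional cost, not from the sections above): take $J(\theta) = \theta^2$, so $\frac{dJ}{d\theta} = 2\theta$, and let $\alpha = 0.1$. Starting at $\theta = 10$, the updates are $10 \to 8 \to 6.4 \to 5.12 \to \dots$, with step sizes $2,\ 1.6,\ 1.28,\ \dots$ shrinking automatically as the gradient shrinks.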
7. Algorithm Process
- Initialize $\theta$ randomly (or with zeros).
- Compute the gradient $\nabla_\theta J(\theta)$.
- Update $\theta := \theta - \alpha \, \nabla_\theta J(\theta)$.
- Repeat until convergence (cost stops decreasing); a full loop is sketched below.
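Putting the pieces together, a self-contained sketch of the whole loop (the synthetic data, learning rate, iteration count, and tolerance are all assumptions for illustration):
```python
import numpy as np

def gradient_descent(X, y, alpha=0.5, num_iters=1000, tol=1e-8):
    """Minimize the half-MSE cost of linear regression by gradient descent."""
    m, n = X.shape
    theta = np.zeros(n)                      # start from zeros
    prev_cost = np.inf
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) / m     # (1/m) * X^T (X theta - y)
        theta -= alpha * grad                # step against the gradient
        cost = np.sum((X @ theta - y) ** 2) / (2 * m)
        if prev_cost - cost < tol:           # cost stopped decreasing
            break
        prev_cost = cost
    return theta

# Example with synthetic data: y ≈ 4 + 3x plus noise (assumed values).
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
y = 4 + 3 * x + rng.normal(scale=0.1, size=100)
X = np.column_stack([np.ones_like(x), x])    # add the 1's intercept column
print(gradient_descent(X, y))                # roughly [4.0, 3.0]
```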