2025-10-04 11:20 Tags:

Linear Regression and the Cost Function

1. Prediction Function (Hypothesis)

We assume the target $y$ can be predicted as a linear combination of the inputs:

$$\hat{y} = h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n = \theta^T x$$

  • $\hat{y}$ = predicted value
  • $x_j$ = input features (with $x_0 = 1$ for the intercept term)
  • $\theta_j$ = parameters (weights) we want to learn
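
A minimal NumPy sketch of this prediction step (the names `predict`, `X`, and `theta` are mine, added for illustration):

```python
import numpy as np

def predict(X, theta):
    """Hypothesis h_theta(x): linear combination of features and parameters.

    X     : (m, n+1) matrix of inputs, first column all ones (intercept term x0 = 1)
    theta : (n+1,) parameter vector
    """
    return X @ theta  # predictions y_hat, shape (m,)

# Tiny example: two features plus the intercept column x0 = 1
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0]])
theta = np.array([0.5, 1.0, -1.0])
y_hat = predict(X, theta)   # array([-0.5, -0.5])
```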

2. Error (Residual)

For each data point $i$, the error is:

$$e^{(i)} = y^{(i)} - \hat{y}^{(i)}$$

  • $y^{(i)}$ = actual value
  • $\hat{y}^{(i)}$ = predicted value
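
A quick sketch of the residuals in NumPy, reusing the hypothetical `y_hat` from the previous snippet (the actual targets `y` are made-up example values):

```python
import numpy as np

y = np.array([1.0, 0.0])         # actual values (illustrative)
y_hat = np.array([-0.5, -0.5])   # predictions, e.g. output of predict(X, theta)

errors = y - y_hat               # residuals e_i = y_i - y_hat_i
# errors == array([1.5, 0.5])
```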

3. Cost Function (Squared Error)

We want to measure how “bad” our predictions are.
So we square the errors and average them:

Mean Squared Error (MSE):

$$\text{MSE} = \frac{1}{m} \sum_{i=1}^{m} \big(e^{(i)}\big)^2 = \frac{1}{m} \sum_{i=1}^{m} \big(y^{(i)} - \hat{y}^{(i)}\big)^2$$

where $m$ = number of rows (data points).

To make the derivative math cleaner, we add a factor of $\frac{1}{2}$:

Cost function:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big(\hat{y}^{(i)} - y^{(i)}\big)^2$$
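
A minimal sketch of computing $J(\theta)$ in NumPy (the `cost` helper and the example data are hypothetical, chosen for illustration):

```python
import numpy as np

def cost(X, y, theta):
    """Squared-error cost J(theta) = (1/2m) * sum((y_hat_i - y_i)^2)."""
    m = len(y)
    errors = X @ theta - y          # y_hat_i - y_i for every row
    return (errors @ errors) / (2 * m)

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(cost(X, y, np.array([1.0, 1.0])))  # 0.0  (theta fits y = 1 + x exactly)
print(cost(X, y, np.array([0.0, 1.0])))  # 0.5  (off by 1 everywhere: 3 * 1 / (2 * 3))
```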


4. Why the 1/2m Factor?

  • Dividing by $m$ gives us the average error.
  • The $\frac{1}{2}$ is just for convenience:
    when we differentiate, the “2” from squaring cancels out (worked step below).
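
Worked step for a single term, using the chain rule (with $\hat{y} = \theta^T x$):

$$\frac{\partial}{\partial \theta_j}\left[\tfrac{1}{2}\big(\hat{y} - y\big)^2\right] = \tfrac{2}{2}\big(\hat{y} - y\big)\,\frac{\partial \hat{y}}{\partial \theta_j} = \big(\hat{y} - y\big)\,x_j$$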

5. Minimization via Calculus

We want to minimize $J(\theta)$.
From calculus: take the derivative, set it to 0.

Gradient for parameter $\theta_j$:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \big(\hat{y}^{(i)} - y^{(i)}\big)\, x_j^{(i)}$$
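
A minimal NumPy sketch of this gradient in vectorized form (the `gradient` helper and the example data are hypothetical):

```python
import numpy as np

def gradient(X, y, theta):
    """Gradient of J(theta): (1/m) * X^T (X theta - y)."""
    m = len(y)
    return X.T @ (X @ theta - y) / m

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(gradient(X, y, np.array([1.0, 1.0])))  # [0. 0.] -- already at the minimum
```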


Intuition

  1. Prediction: draw a line $\hat{y} = h_\theta(x)$.
  2. Error: check how far actual points are from the line.
  3. Cost function: square errors, average them → get a “badness score”.
  4. Gradient descent: follow the slope downhill to find the best line (full sketch below).
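
Putting the pieces together, a minimal gradient descent sketch (the learning rate `alpha`, iteration count, and example data are arbitrary illustrative choices):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=5000):
    """Repeatedly step theta downhill along the gradient of J(theta)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        # theta_j := theta_j - alpha * dJ/dtheta_j
        theta -= alpha * (X.T @ (X @ theta - y) / m)
    return theta

# Points that lie exactly on y = 1 + x, so we expect theta close to [1, 1]
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(gradient_descent(X, y))   # approximately [1., 1.]
```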