Linear and polynomial regression are supervised machine learning algorithms used, as their names indicate, to perform regression tasks, i.e., to estimate the values of a continuous response/target variable $y$.

Linear Regression

To obtain the estimate $\hat{y}$ of $y$ for a sample $x$ with $n$ features, the model simply combines the $n$ components of $x$ linearly:

\[\hat{y} = w^\intercal x + b = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n + b\]


where
  • $w = (w_1, \ldots, w_n)$ is a weight vector of $n$ parameters, and
  • $b$ is called a bias or intercept.
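As a quick sketch, the prediction above can be computed with NumPy; the weights, bias, and sample values below are made-up numbers for illustration:

```python
import numpy as np

# Hypothetical linear model with n = 3 features.
w = np.array([0.5, -1.0, 2.0])  # weight vector (w_1, w_2, w_3)
b = 0.25                        # bias / intercept

x = np.array([2.0, 1.0, 3.0])   # one sample with 3 features

# y_hat = w^T x + b
y_hat = w @ x + b
print(y_hat)  # 0.5*2 - 1.0*1 + 2.0*3 + 0.25 = 6.25
```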

As a result, Linear Regression produces a model that assumes an additive and linear relationship between the features and the target. Indeed, if a feature's value changes, the resulting change in the estimate is always the same, regardless of the values of the other features (additive) and of the feature itself (linear).

In particular, note that $w^\intercal x + b - \hat{y} = 0$, so the values that $\hat{y}$ takes lie on a hyperplane in the space spanned by $(x_1, \ldots, x_n, y)$. Moreover, $b$ is the quantity by which this hyperplane is shifted from the origin.

Given a training set of $m$ samples, the components of $w$ and $b$ are estimated so that, on average over the $m$ samples, the distance between $\hat{y}$ and $y$, i.e., between the hyperplane and the response of each sample, is minimized. In other words, with Linear Regression the cost function $J(w, b)$ that is minimized is the Root Mean Squared Error (RMSE) over the training set:

\[\mathit{RMSE}(w, b) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2}\]

Since minimizing $\mathit{RMSE}(w, b)$ also minimizes $\mathit{MSE}(w, b)$ (the square root is monotonically increasing), in general we define the cost function as

\[J(w, b) = \mathit{MSE}(w, b) = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2\]
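A minimal sketch of evaluating this cost over a training set; the target and prediction values below are hypothetical:

```python
import numpy as np

# Hypothetical targets y and predictions y_hat over m = 4 training samples.
y     = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5,  0.0, 2.0, 8.0])

mse = np.mean((y - y_hat) ** 2)  # J(w, b)
rmse = np.sqrt(mse)              # has the same minimizer as the MSE
print(mse)  # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
```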

The optimal values of $w$ and $b$ are the least squares regression coefficients. In general, they can be computed with the normal equations or with the gradient descent algorithm.
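As an illustration, a least-squares fit via the normal equations might look as follows; the synthetic data, true coefficients, and noise level are assumptions made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: y = 3*x1 - 2*x2 + 1 + small noise,
# with m = 100 samples and n = 2 features.
m, n = 100, 2
X = rng.normal(size=(m, n))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + 0.01 * rng.normal(size=m)

# Augment X with a column of ones so b is estimated as an extra coefficient.
X_aug = np.hstack([X, np.ones((m, 1))])

# Normal equations: theta = (X^T X)^{-1} X^T y.
# Solving the linear system is numerically safer than forming the inverse.
theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
w, b = theta[:n], theta[n]
print(w, b)  # w close to [3, -2], b close to 1
```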

Polynomial Regression

This model is an extension of Linear Regression that makes it possible to obtain estimates $\hat{y}$ other than hyperplanes in the space spanned by $(x_1, \ldots, x_n)$. To do so, non-linear relationships between $x$ and $y$ need to be incorporated into the model. In other words, $x$ needs to be extended with polynomial functions of each feature, i.e., adding terms of higher degree.

\[\hat{y} = b + \sum_{i=1}^n \sum_{j=1}^d w_{ij} x_i^j\]

where $d$ is the highest polynomial degree we want to consider. For example, the terms related to feature $x_i$ in the expression above are $w_{i1} x_i + w_{i2} x_i^2 + \ldots + w_{id} x_i^d$.

In addition, interaction terms between the features, $x_i x_j$ for $i \neq j$, are usually also added to the model to account for possible synergy effects, i.e., cases where the impact of a change in the value of one feature on the response depends on the values that the other features have at that moment. Moreover, if $d > 1$, interaction terms between the polynomial functions are additionally considered.

Finally, despite the addition of interaction and higher-order terms, the model is still linear, since all terms are linearly combined (e.g., think of $x_i^j$ and $x_i x_j$ as new features $x_{ij}$ and $x_{ij}^\prime$, respectively). As a consequence, the optimal values of $w$ and $b$ are found in the same way as in Linear Regression, i.e., by minimizing $J(w, b) = \mathit{MSE}(w, b)$ via the normal equations or gradient descent.
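To illustrate that a polynomial fit reduces to ordinary least squares on expanded features, here is a sketch with a hypothetical one-dimensional dataset (the true coefficients and noise level are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D example: y = 2*x^2 - x + 0.5 + small noise, fitted with d = 2.
m = 200
x = rng.uniform(-1, 1, size=m)
y = 2.0 * x**2 - 1.0 * x + 0.5 + 0.01 * rng.normal(size=m)

# Expand x into polynomial features [x, x^2] and add a ones column for b,
# then reuse the same least-squares machinery as in Linear Regression.
X_poly = np.column_stack([x, x**2, np.ones(m)])
theta = np.linalg.solve(X_poly.T @ X_poly, X_poly.T @ y)
print(theta)  # close to [-1, 2, 0.5]: the coefficients of x, x^2, and the bias
```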