Understanding overfitting (the right, mathematical way)


Overfitting is a common problem in (not only) machine learning where a model is fit almost perfectly (sometimes perfectly) to the training data, to the extent that it negatively impacts the model's performance on new / test data. Oftentimes it's said that the model is "learning" the noise, although that's a hand-wavy way of saying it fits the training data too closely. In this tutorial, we'll explore the concept of overfitting using polynomial regression (which is linear in its coefficients) and provide a mathematical explanation of why choosing a higher-order polynomial decreases the error on the training set.

Introduction

Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth degree polynomial. While polynomial regression can fit a wide range of curvatures, it's prone to overfitting, especially with higher degree polynomials.

Let's demonstrate this with an example and delve into the mathematical reasoning behind it.

Python Code

# %% Overfitting Example
# This script demonstrates the concept of overfitting using polynomial regression.
# It generates a quadratic dataset with added noise and fits linear, quadratic, and cubic polynomials to a subset of the data.
# The fitted curves are then evaluated on the entire dataset to illustrate overfitting.

import numpy as np
import matplotlib.pyplot as plt

# %% Generate Dataset
# Define the quadratic function coefficients
a = 10
b = 2
c = 1
# Generate random x values and calculate the corresponding y values with added noise
x = np.sort(np.random.rand(100))
y = a * x**2 + b * x + c + 0.2 * np.random.randn(x.shape[0])
# Plot the generated dataset
plt.figure()
plt.plot(x, y, 'bo')
plt.grid(True)
plt.show()

# %% Select Training Points
# Select a sparse subset of the data (every 25th point, 4 points in total).
# Note: the variables are named x_test / y_test, but these are the points the
# polynomials are fitted to, i.e. they act as the training set; the full
# dataset is used later as the test set.
x_test = x[::25]
y_test = y[::25]

# %% Fit Polynomials
# Fit linear, quadratic, and cubic polynomials to the training points (x_test, y_test)
p1 = np.polyfit(x_test, y_test, 1)
p2 = np.polyfit(x_test, y_test, 2)
p3 = np.polyfit(x_test, y_test, 3)

# %% Generate Plotting Points
# Generate evenly spaced x values for plotting the fitted curves
x_plot = np.linspace(x.min(), x.max(), 100)
# Evaluate the fitted curves at the x_plot values
y_plot_p1 = np.polyval(p1, x_plot)
y_plot_p2 = np.polyval(p2, x_plot)
y_plot_p3 = np.polyval(p3, x_plot)

# %% Plot Fitted Curves and Training Points
# Plot the full dataset, the fitted curves, and the training points
plt.figure()
plt.plot(x, y, 'bo', markersize=6, label='Data Points')
plt.plot(x_plot, y_plot_p1, 'r-', linewidth=2, label='Linear Fit')
plt.plot(x_plot, y_plot_p2, 'g-', linewidth=2, label='Quadratic Fit')
plt.plot(x_plot, y_plot_p3, 'k-', linewidth=2, label='Cubic Fit')
plt.plot(x_test, y_test, 'ro', markersize=8, linewidth=2, label='Training Points')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Overfitting Example')
plt.grid(True)
plt.legend()
plt.show()

# %% Calculate Mean Squared Error (MSE)
# Calculate the MSE of each fitted curve on the training points (the 4 fitted points)
mse_p1_train = np.mean((y_test - np.polyval(p1, x_test))**2)
mse_p2_train = np.mean((y_test - np.polyval(p2, x_test))**2)
mse_p3_train = np.mean((y_test - np.polyval(p3, x_test))**2)
# Calculate the MSE of each fitted curve on the entire dataset, which serves as the test set
mse_p1_test = np.mean((y - np.polyval(p1, x))**2)
mse_p2_test = np.mean((y - np.polyval(p2, x))**2)
mse_p3_test = np.mean((y - np.polyval(p3, x))**2)

# %% Display MSE Values
# Display the MSE values for the training points and the test set
print('Mean Squared Error (MSE) on Training Points:')
print(f'Linear Fit: {mse_p1_train:.4f}')
print(f'Quadratic Fit: {mse_p2_train:.4f}')
print(f'Cubic Fit: {mse_p3_train:.4f}')
print('Mean Squared Error (MSE) on Test Set:')
print(f'Linear Fit: {mse_p1_test:.4f}')
print(f'Quadratic Fit: {mse_p2_test:.4f}')
print(f'Cubic Fit: {mse_p3_test:.4f}')

Explanation

  1. We start by generating a quadratic dataset with added noise. The true underlying function is $y = 10x^2 + 2x + 1$, and we add zero-mean Gaussian noise (standard deviation 0.2) to the y values.
  2. We select a sparse subset of the data points (every 25th point, 4 in total). Despite the x_test / y_test variable names, these are the training points: they are the only points used to fit the polynomials, while the full dataset later serves as the test set.
  3. We fit linear, quadratic, and cubic polynomials to these training points using numpy.polyfit(). This function returns the coefficients of the fitted polynomials.
  4. We generate evenly spaced x values for plotting the fitted curves using numpy.linspace(). We then evaluate the fitted curves at these x values using numpy.polyval().
  5. We plot the full dataset, the fitted curves, and the training points using matplotlib. This allows us to visually observe how well each polynomial fits the data.
  6. We calculate the mean squared error (MSE) for each fitted curve, both on the training points and on the entire dataset. The MSE measures the average squared difference between the predicted values and the actual values; a lower MSE indicates a better fit. (A short sketch generalizing this train-vs-test comparison to several polynomial degrees follows after this list.)
  7. Finally, we display the MSE values for the training points and the test set.
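As referenced in step 6, the following minimal sketch repeats the train-vs-test MSE comparison for a range of polynomial degrees. It reuses the x, y, x_test, and y_test arrays defined in the script above; the degree range 1-3 simply mirrors the three fits already computed.

import numpy as np

# Assumes x, y, x_test, y_test exist as defined in the script above.
# For each degree, fit on the 4 training points and evaluate on the full dataset.
for degree in range(1, 4):
    coeffs = np.polyfit(x_test, y_test, degree)
    mse_train = np.mean((y_test - np.polyval(coeffs, x_test))**2)
    mse_full = np.mean((y - np.polyval(coeffs, x))**2)
    print(f'Degree {degree}: train MSE = {mse_train:.4f}, full-data MSE = {mse_full:.4f}')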

Output

Overfitting example. The red circles are the 4 training data points, which the cubic polynomial describes perfectly (zero training error). Note that the exact MSE values below will vary from run to run, since the script does not fix a random seed.
Mean Squared Error (MSE) on Training Points:
Linear Fit: 0.3749
Quadratic Fit: 0.0102
Cubic Fit: 0.0000
Mean Squared Error (MSE) on Test Set:
Linear Fit: 0.8831
Quadratic Fit: 0.0384
Cubic Fit: 0.1675

Observations

From the plot, we can observe that:

  • The linear fit (red line) doesn't capture the curvature of the data points well.
  • The quadratic fit (green line) closely follows the underlying quadratic function.
  • The cubic fit (black line) passes exactly through the 4 training points but deviates from the true function in the regions between them.

Looking at the MSE values, we can see that:

  • On the training points, the cubic fit achieves the lowest MSE (essentially zero, since its 4 coefficients can interpolate the 4 points exactly), followed by the quadratic fit and then the linear fit.
  • However, on the entire dataset (test set), the quadratic fit has the lowest MSE, indicating that it generalizes well to unseen data. The cubic fit, despite having a low MSE on the training points, has a higher MSE on the test set due to overfitting.

This example demonstrates that while a higher degree polynomial can fit the training data better, it may overfit and perform poorly on new, unseen data. The quadratic fit, which matches the true underlying function, generalizes well to the test set.

Mathematical Reasoning

Let's explore why choosing a higher-order polynomial decreases the error on the training set from a mathematical perspective.

Consider a polynomial regression model of degree $n$:

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n + \epsilon$

where $\beta_0, \beta_1, \dots, \beta_n$ are the coefficients and $\epsilon$ is the error term.

The objective of polynomial regression is to find the coefficients that minimize the sum of squared errors (SSE) between the predicted values and the actual values:

$\mathrm{SSE} = \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2$

where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value for the i-th data point, and $m$ is the number of training points.
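To make the least-squares objective concrete, here is a minimal sketch of solving it directly from the Vandermonde (design) matrix. np.polyfit performs essentially this computation internally (with extra scaling for numerical conditioning), so treat this as an illustration rather than a description of its exact implementation; the function name fit_poly_least_squares is just a placeholder.

import numpy as np

def fit_poly_least_squares(x, y, degree):
    # Design (Vandermonde) matrix whose columns are x^degree, ..., x, 1
    X = np.vander(x, degree + 1)
    # Minimize ||y - X @ beta||^2; lstsq solves this via the Moore-Penrose pseudoinverse
    beta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
    return beta  # highest-order coefficient first, matching np.polyfit's ordering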

As we increase the degree $n$ of the polynomial, the model becomes more flexible and can fit the training data more closely. In fact, a polynomial of degree $n$ can perfectly fit $n + 1$ data points (with distinct x values).

Mathematically, this can be explained using the concept of degrees of freedom. A polynomial of degree $n$ has $n + 1$ coefficients (degrees of freedom). With exactly $n + 1$ data points, we can set up a system of $n + 1$ linear equations in the $n + 1$ coefficients and solve it exactly: the Vandermonde system matrix is square and full rank, so the solution comes from an ordinary matrix inverse rather than the Moore-Penrose pseudoinverse used in least squares, and it is unique, giving zero error on the training points.

For example, with 4 data points, a cubic polynomial (degree 3) with 4 coefficients can perfectly fit the data. The system of linear equations can be solved to find the unique set of coefficients that result in zero error on the training set.
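A minimal numerical sketch of this exact-fit case (the four x values below are arbitrary, and the noiseless quadratic y = 10x^2 + 2x + 1 from the example is used so that the expected coefficients are obvious):

import numpy as np

# Four distinct x values (arbitrary) and noiseless quadratic y values
x_pts = np.array([0.1, 0.35, 0.6, 0.9])
y_pts = 10 * x_pts**2 + 2 * x_pts + 1
# Square 4x4 Vandermonde matrix; full rank because the x values are distinct
V = np.vander(x_pts, 4)
# Unique solution of the square linear system -- an ordinary inverse, no least squares
coeffs = np.linalg.solve(V, y_pts)
print(coeffs)              # approximately [0, 10, 2, 1]: the cubic term is ~0
print(y_pts - V @ coeffs)  # approximately [0, 0, 0, 0]: zero error on these points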

However, this comes at the cost of overfitting. While a higher-degree polynomial can fit the training data perfectly, it may capture the noise in the data and fail to generalize well to unseen data. The model becomes overly complex and starts to fit the random fluctuations in the training set.

Conclusion

Overfitting is a common pitfall in data science and machine learning, where a model is fitted too closely (sometimes perfectly) to the training data, either because the model order is too high or because there are too few training points, and consequently fails to generalize well to new data. It's important to strike a balance between model complexity and generalization performance. Techniques like cross-validation and regularization can help mitigate overfitting.
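As one concrete illustration of the regularization idea just mentioned, here is a ridge-style penalized fit of the polynomial coefficients. This is a sketch, not a feature of np.polyfit (which has no regularization option); the penalty strength lam and the helper name fit_poly_ridge are arbitrary choices for illustration.

import numpy as np

def fit_poly_ridge(x, y, degree, lam=1e-3):
    # Vandermonde design matrix, as in ordinary polynomial least squares
    X = np.vander(x, degree + 1)
    # Ridge solution: minimize ||y - X @ beta||^2 + lam * ||beta||^2
    beta = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)
    return beta  # coefficients shrunk toward zero; larger lam gives a smoother fit

Penalizing the coefficient magnitudes discourages the wild swings a high-degree polynomial needs in order to pass through every noisy training point, which is precisely the behavior seen in the cubic fit above.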

Mathematically, choosing a higher-order polynomial decreases the error on the training set because it has more degrees of freedom and can fit the training data more closely. However, this comes at the cost of overfitting and poor generalization to unseen data.

This tutorial provided a practical example of overfitting using polynomial regression in Python and explained the mathematical reasoning behind why higher-order polynomials can lead to overfitting. By understanding and recognizing overfitting, you can make informed decisions when selecting and tuning models for your machine learning tasks.