Tutorial: Understanding R.A. Fisher's Foundations of Theoretical Statistics

R.A. Fisher:

For my own part I should gladly have withheld publication until a rigorously complete proof could have been formulated; but the number and variety of the new results which the method discloses press for publication, and at the same time I am not insensible of the advantage which accrues to Applied Mathematics from the co-operation of the Pure Mathematician, and this co-operation is not infrequently called forth by the very imperfections of writers on Applied Mathematics.

1. Introduction to Fisher's Theoretical Statistics

R.A. Fisher's 1922 paper, "On the Mathematical Foundations of Theoretical Statistics," laid the groundwork for modern statistical methods. This tutorial will explore the key concepts introduced by Fisher and their relevance to contemporary data science and machine learning.

2. The Three Main Problems in Statistics

Fisher identified three fundamental problems in statistics:

a) Specification
b) Estimation
c) Distribution

Let's examine each of these in detail.

2.1. Specification

Specification involves choosing the appropriate mathematical form of the population distribution. This is a crucial step in statistical analysis, as it determines the underlying assumptions about the data-generating process.

Example: Let's consider a simple case where we want to model the heights of adults in a population. We might specify a normal distribution as our population model:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Generate a sample of heights (in cm) from a normal distribution
mu, sigma = 170, 10
heights = np.random.normal(mu, sigma, 1000)

# Plot the histogram and the specified normal distribution
plt.hist(heights, bins=30, density=True, alpha=0.7)
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
plt.plot(x, norm.pdf(x, mu, sigma), 'r-', lw=2)
plt.title('Height Distribution: Specification Example')
plt.xlabel('Height (cm)')
plt.ylabel('Density')
plt.show()

This code generates a sample of heights from a normal distribution and plots both the histogram of the sample and the specified normal distribution.

2.2. Estimation

Once we've specified a model, we need to estimate its parameters from the sample data. Fisher introduced the concept of maximum likelihood estimation (MLE) as an efficient method for parameter estimation.

The likelihood function $f(x\mid\theta)$ is the probability (or density) of observing the data $x$ given the parameters $\theta$, regarded as a function of $\theta$. The maximum likelihood estimate is the value of $\theta$ that maximizes this function:

$$\hat{\theta}_{MLE} = \arg\max_{\theta} f(x\mid\theta).$$

For our height example, let's estimate the parameters using MLE:

from scipy.optimize import minimize

# Log-likelihood of a normal model with parameters (mu, sigma)
def log_likelihood(params, data):
    mu, sigma = params
    return np.sum(norm.logpdf(data, mu, sigma))

# Estimate parameters by minimizing the negative log-likelihood
initial_guess = [100, 5]
result = minimize(lambda params, data: -log_likelihood(params, data),
                  initial_guess, args=(heights,), method='Nelder-Mead')

mu_mle, sigma_mle = result.x
print(f"MLE estimates: mu = {mu_mle:.2f}, sigma = {sigma_mle:.2f}")
MLE estimates: mu = 169.81, sigma = 10.27

This is a naive example: we minimize the negative log-likelihood, which is equivalent to maximizing the likelihood, using the Nelder-Mead algorithm starting from an initial guess. However, in the Gaussian case this numerical optimization is not needed, because the MLEs for the mean and standard deviation are simply the sample mean and the sample standard deviation (the square root of the biased sample variance).
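
As a quick check of that claim, setting the derivative of the Gaussian log-likelihood with respect to $\mu$ to zero gives the closed form directly:

$$\frac{\partial}{\partial \mu} \log f(x\mid\mu,\sigma) = \sum_{i=1}^n \frac{x_i - \mu}{\sigma^2} = 0 \;\;\Rightarrow\;\; \hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i,$$

and the analogous derivative with respect to $\sigma^2$ yields $\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu}_{MLE})^2$, the biased sample variance.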

# np.std defaults to ddof=0, i.e. the biased (MLE) estimator
mu_mle_via_sample_mean = np.mean(heights)
sigma_mle_via_sample_std = np.std(heights)
print(f"MLE estimates using the sample mean and sample std: mu = {mu_mle_via_sample_mean:.2f}, sigma = {sigma_mle_via_sample_std:.2f}")
MLE estimates using the sample mean and sample std: mu = 169.53, sigma = 10.07

2.3. Distribution

Understanding the distribution of sample statistics is crucial for making inferences about the population. Fisher's work laid the foundation for sampling distributions and hypothesis testing.

Let's focus on the sample mean as a sufficient statistic for the true mean of a normal distribution.

Given independent $X_1, X_2, \ldots, X_n \sim N(\mu, \sigma^2)$, the maximum likelihood estimator for $\mu$ is the sample mean:

$$\hat{\mu}_{ML} = \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i.$$

Let's calculate the expected value and variance of this estimator. The expected value of $\hat{\mu}_{ML}$ is

$$E[\hat{\mu}_{ML}] = E[\bar{X}] = \frac{1}{n} \sum_{i=1}^n E[X_i] = \frac{1}{n} \cdot n \cdot \mu = \mu.$$

This shows that $\hat{\mu}_{ML}$ is an unbiased estimator of $\mu$.

The variance of $\hat{\mu}_{ML}$ is

$$\mathrm{Var}(\hat{\mu}_{ML}) = \mathrm{Var}(\bar{X}) = \frac{1}{n^2} \sum_{i=1}^n \mathrm{Var}(X_i) = \frac{1}{n^2} \cdot n \cdot \sigma^2 = \frac{\sigma^2}{n}.$$

Finally, the distribution of $\hat{\mu}_{ML}$: since $\hat{\mu}_{ML}$ is a weighted sum (the mean) of independent Gaussian random variables, it is itself Gaussian, so $\hat{\mu}_{ML} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right)$.

Let’s see that in Python.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# True parameters
mu_true = 170
sigma_true = 10
n_samples = 100
n_simulations = 10000

# Function to generate sample and compute MLE
def generate_and_estimate(n, mu, sigma):
    sample = np.random.normal(mu, sigma, n)
    return np.mean(sample)

# Run Monte Carlo simulation
mle_estimates = [generate_and_estimate(n_samples, mu_true, sigma_true) for _ in range(n_simulations)]

# Compute mean and standard deviation of MLE estimates
mle_mean = np.mean(mle_estimates)
mle_std = np.std(mle_estimates)

# Theoretical standard deviation
theoretical_std = sigma_true / np.sqrt(n_samples)

# Plot histogram of MLE estimates
plt.figure(figsize=(10, 6))
plt.hist(mle_estimates, bins=50, density=True, alpha=0.7, color='skyblue')

# Plot theoretical distribution
x = np.linspace(mu_true - 4*theoretical_std, mu_true + 4*theoretical_std, 100)
plt.plot(x, norm.pdf(x, mu_true, theoretical_std), 'r-', lw=2, label='Theoretical N(μ, σ²/n)')


plt.title(f'Distribution of MLE Estimates (n={n_samples})')
plt.xlabel('μ̂_ML')
plt.ylabel('Density')
plt.legend()

# Add text box with results
textstr = '\n'.join((
    f'True μ = {mu_true:.2f}',
    f'Mean of μ̂_ML = {mle_mean:.2f}',
    f'Std of μ̂_ML (MC) = {mle_std:.2f}',
    f'Std of μ̂_ML (Theory) = {theoretical_std:.2f}'
))
plt.text(0.05, 0.95, textstr, transform=plt.gca().transAxes, fontsize=10,
         verticalalignment='top', bbox=dict(boxstyle='round', facecolor='white', alpha=0.5))

plt.show()
[Figure: histogram of the MLE estimates (n=100) with the theoretical N(μ, σ²/n) density overlaid.]

This code performs a Monte Carlo simulation to generate multiple samples and compute the MLE (sample mean) for each. It then plots the histogram of the MLE results along with the theoretical normal distribution. The results demonstrate that the sampling distribution of the mean follows a normal distribution centered around the true mean, with a standard deviation that closely matches the theoretical value $\sigma/\sqrt{n}$.

3. Criteria for Good Estimators

Fisher proposed three criteria for evaluating the quality of estimators:

3.1 Consistency

An estimator is consistent if it converges in probability to the true parameter value as the sample size increases. Mathematically, for an estimator $\hat{\theta}_n$ of parameter $\theta$:

$$\lim_{n \to \infty} P\left(|\hat{\theta}_n - \theta| > \varepsilon\right) = 0 \quad \text{for any } \varepsilon > 0.$$
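
As a quick empirical illustration, here is a minimal sketch (reusing the height model with assumed true values $\mu = 170$ and $\sigma = 10$) that shows the sample mean settling toward the true mean as the sample size grows:

import numpy as np

mu_true, sigma_true = 170, 10

# The sample mean approaches the true mean as n grows (consistency)
for n in [10, 100, 1_000, 10_000, 100_000]:
    sample = np.random.normal(mu_true, sigma_true, n)
    print(f"n = {n:>6}: sample mean = {sample.mean():.3f}, |error| = {abs(sample.mean() - mu_true):.3f}")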

3.2 Efficiency

An efficient estimator has the smallest variance among all unbiased estimators. The Cramér-Rao lower bound provides a theoretical minimum for the variance of an unbiased estimator: $\mathrm{Var}(\hat{\theta}_n) \geq \frac{1}{I(\theta)}$, where $I(\theta)$ is the Fisher information:

$$I(\theta) = E\left[\left(\frac{\partial}{\partial\theta} \log f(X\mid\theta)\right)^2\right].$$
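
For the normal mean with known $\sigma$, the Fisher information per observation is $I(\mu) = 1/\sigma^2$, so with $n$ i.i.d. observations the Cramér-Rao bound is $\sigma^2/n$, which the sample mean attains. Here is a minimal sketch (using the same assumed true values as before) that compares the Monte Carlo variance of the sample mean against this bound:

import numpy as np

mu_true, sigma_true, n = 170, 10, 100

# Monte Carlo variance of the sample mean over many repeated samples
estimates = [np.random.normal(mu_true, sigma_true, n).mean() for _ in range(20_000)]
empirical_var = np.var(estimates)

# Cramér-Rao lower bound for the mean of a Gaussian with known variance: sigma^2 / n
crlb = sigma_true**2 / n

print(f"Empirical Var(sample mean) = {empirical_var:.4f}")
print(f"Cramér-Rao lower bound     = {crlb:.4f}")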

3.3 Sufficiency

A sufficient statistic contains all the information in the sample about the parameter of interest. Mathematically, a statistic $T(X)$ is sufficient for $\theta$ if and only if the density factorizes as

$$f(X\mid\theta) = h(X)\, g(\theta; T(X)).$$

This is known as the Neyman-Fisher factorization theorem.
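
As a concrete instance, for i.i.d. $N(\mu, \sigma^2)$ data the joint density can be rearranged so that the parameters interact with the data only through $T(X) = \left(\sum_i x_i, \sum_i x_i^2\right)$:

$$f(X\mid\mu,\sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2 + \frac{\mu}{\sigma^2}\sum_{i=1}^n x_i - \frac{n\mu^2}{2\sigma^2}\right),$$

which matches the factorization with $h(X) = 1$; equivalently, the pair of sample mean and sample variance $(\bar{x}, s^2)$ is sufficient for $(\mu, \sigma^2)$.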

Let's illustrate sufficiency with an example:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def sufficient_statistic(data):
    return np.mean(data), np.var(data)

def log_likelihood(params, data):
    mu, sigma = params
    return np.sum(norm.logpdf(data, mu, sigma))


# Generate two independent datasets from the same distribution
data1 = np.random.normal(170, 10, 100)
data2 = np.random.normal(170, 10, 100)

ss1 = sufficient_statistic(data1)
ss2 = sufficient_statistic(data2)

print(f"Sufficient statistics for data1: {ss1}")
print(f"Sufficient statistics for data2: {ss2}")

# Estimate parameters using MLE for both datasets
initial_guess = [100, 5]

result1 = minimize(lambda params, data: -log_likelihood(params, data),
                   initial_guess, args=(data1,), method='Nelder-Mead')
result2 = minimize(lambda params, data: -log_likelihood(params, data),
                   initial_guess, args=(data2,), method='Nelder-Mead')

print(f"MLE estimates for data1: mu = {result1.x[0]:.2f}, sigma = {result1.x[1]:.2f}")
print(f"MLE estimates for data2: mu = {result2.x[0]:.2f}, sigma = {result2.x[1]:.2f}")
Sufficient statistics for data1: (169.20479442577823, 77.45232191761514)
Sufficient statistics for data2: (169.59730476641695, 118.69144201539567)
MLE estimates for data1: mu = 169.20, sigma = 8.80
MLE estimates for data2: mu = 169.60, sigma = 10.89

Because the two datasets are random, their sufficient statistics differ, and so do the resulting estimates: in each case the MLE depends on the data only through the sufficient statistics, with $\hat{\mu}$ equal to the sample mean and $\hat{\sigma}$ equal to the square root of the (biased) sample variance.
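
To see the point even more directly, here is a minimal sketch (reusing data1, sufficient_statistic, log_likelihood, minimize, and initial_guess from above; data3 is introduced just for this illustration) that constructs a different dataset with exactly the same sufficient statistics and checks that the MLE comes out the same:

# Build a new dataset sharing data1's sufficient statistics exactly:
# standardize a fresh sample, then rescale to data1's (biased) std and shift to its mean
raw = np.random.normal(0, 1, 100)
data3 = (raw - raw.mean()) / raw.std() * np.sqrt(np.var(data1)) + np.mean(data1)

print(f"Sufficient statistics for data1: {sufficient_statistic(data1)}")
print(f"Sufficient statistics for data3: {sufficient_statistic(data3)}")

result3 = minimize(lambda params, data: -log_likelihood(params, data),
                   initial_guess, args=(data3,), method='Nelder-Mead')
print(f"MLE estimates for data3: mu = {result3.x[0]:.2f}, sigma = {result3.x[1]:.2f}")

Since data3 has the same sufficient statistics as data1, its MLE matches the one obtained for data1, even though the individual observations are completely different.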

4. Fisher's Critique of Bayesian Methods

Fisher was critical of Bayesian methods, particularly the use of prior distributions for parameters. He argued that parameters should be treated as fixed, unknown constants rather than random variables.

To illustrate the difference between frequentist (Fisher's approach) and Bayesian methods, let's compare MLE with a Bayesian estimation:

import pymc3 as pm

# Frequentist MLE (already computed)
print(f"MLE estimates: mu = {mu_mle:.2f}, sigma = {sigma_mle:.2f}")

# Bayesian estimation
with pm.Model() as model:
    mu = pm.Normal('mu', mu=170, sd=20)
    sigma = pm.HalfNormal('sigma', sd=20)
    y = pm.Normal('y', mu=mu, sd=sigma, observed=heights)

    trace = pm.sample(2000, tune=1000)

pm.plot_posterior(trace, var_names=['mu', 'sigma'])
plt.show()

print(f"Bayesian estimates:")
print(f"mu: mean = {trace['mu'].mean():.2f}, std = {trace['mu'].std():.2f}")
print(f"sigma: mean = {trace['sigma'].mean():.2f}, std = {trace['sigma'].std():.2f}")
MLE estimates: mu = 170.26, sigma = 10.59
[Figure: posterior distributions of mu and sigma from the Bayesian model.]

This code compares the frequentist MLE approach with a Bayesian approach using PyMC3. The Bayesian method provides a distribution for each parameter, reflecting uncertainty, while the MLE gives point estimates. Note that we cheated here: we assumed a prior distribution for $\mu$ centered around 170, which we could not have known a priori.

5. Relevance to Modern Machine Learning

Fisher's concepts have profound implications for modern machine learning:

5.1 Model Specification: In machine learning, model architecture design is analogous to Fisher's specification problem.

5.2 Parameter Estimation: Training algorithms in machine learning, such as gradient descent, can be viewed as solving the estimation problem. What are we estimating? The weights of the neural network, or in other words, $w$ such that $y \approx f(w, x)$, where $w$ represents all the weights in the neural network, $x$ the inputs, and $y$ the desired output.

5.3 Distribution Understanding: Comparing the distribution of model predictions to the true distribution of the target variable.

Let's illustrate these concepts with a simple neural network example:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data
X = np.random.normal(0, 1, (1000, 5))
y = 2*X[:, 0] + 3*X[:, 1] - X[:, 2] + 0.5*X[:, 3] + np.random.normal(0, 0.1, 1000)

# Split data into train and test sets
train_size = 800
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]

# Model specification
model = Sequential([
    Dense(10, activation='relu', input_shape=(5,)),
    Dense(1)
])

# Parameter estimation (training)
model.compile(optimizer='adam', loss='mse')
history = model.fit(X_train, y_train, epochs=100, validation_split=0.2, verbose=0)

# Plot training history
plt.figure(figsize=(10, 6))
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()

# Generate predictions for the test set
y_pred = model.predict(X_test).flatten()

# Plot the distribution of predictions and true values
plt.figure(figsize=(12, 6))

plt.hist(y_test, bins=30, density=True, alpha=0.7, label='True y')
plt.hist(y_pred, bins=30, density=True, alpha=0.7, label='Predicted y')
plt.title('Distribution of True y vs Predicted y')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()

plt.show()

# Calculate and print some metrics
mse = np.mean((y_test - y_pred)**2)
mae = np.mean(np.abs(y_test - y_pred))
print(f"Mean Squared Error: {mse:.4f}")
print(f"Mean Absolute Error: {mae:.4f}")
[Figure: training and validation loss per epoch.]
[Figure: histograms of true y vs. predicted y on the test set.]

This example demonstrates how Fisher's concepts apply to modern machine learning:

  • Model specification is handled by defining the neural network architecture.
  • Parameter estimation is performed through the training process.
  • Distribution understanding is achieved by comparing the distribution of the model's predictions with the distribution of the true target values.

6. Conclusion

R.A. Fisher's 1922 paper laid the foundation for modern statistical thinking and continues to influence contemporary data science and machine learning practices. By understanding these fundamental concepts, we can better appreciate the theoretical underpinnings of the tools we use today and develop more robust analytical approaches.

This tutorial has covered the main points of Fisher's paper, including the three main problems in statistics, criteria for good estimators, and the relevance of these concepts to modern machine learning. By exploring these ideas with mathematical rigor and practical Python examples, we've bridged the gap between historical statistical methods and current data science practices.