Tutorial: Maximum Likelihood Estimation and Kullback-Leibler Divergence Minimization - are they equivalent?


1 Introduction

Maximum Likelihood estimation is a widely used classical estimation method in which the parameter estimate is obtained by maximizing the likelihood function. The likelihood function is the probability density function (PDF) $f(x;\theta)$ viewed as a function of the parameter $\theta$, with the observed data held fixed. In this paper, we explore the intriguing relationship between Maximum Likelihood Estimation and Information Theory, specifically the Kullback–Leibler divergence.
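
For concreteness, given $N$ i.i.d. observations $x_1, \dots, x_N$, the MLE maximizes the joint likelihood, or equivalently its logarithm (this is just the standard definition written out):

$$\hat{\theta}_{ML} = \arg\max_{\theta} \prod_{i=1}^{N} f(x_i;\theta) = \arg\max_{\theta} \sum_{i=1}^{N} \log f(x_i;\theta).$$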

2 The Kullback–Leibler Divergence

The Kullback–Leibler divergence (KLD) is a measure of the difference between two probability distributions. Given two probability distributions $P$ and $Q$ with densities $p(x)$ and $q(x)$, the KLD from $P$ to $Q$, denoted $D_{KL}(P \,\|\, Q)$, is defined as

$$D_{KL}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx,$$

where the integral is taken over the support of the distributions.
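
As a quick numeric illustration (a small sketch of ours, not part of the derivation below, with arbitrary toy distributions), the KLD between two discrete distributions can be evaluated directly, or with scipy.stats.entropy, which computes the same sum:

import numpy as np
from scipy.stats import entropy

# Two toy discrete distributions on the same support (arbitrary values)
p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])

# Direct evaluation of D_KL(P || Q) = sum p * log(p / q)
kl_manual = np.sum(p * np.log(p / q))

# scipy.stats.entropy(p, q) computes the same quantity (natural log)
kl_scipy = entropy(p, q)

print(kl_manual, kl_scipy)  # both ~0.046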

3 Minimizing the KLD and Maximum Likelihood Estimation

Now, let's consider the problem of estimating the parameter $\theta$ based on a set of observed data points. We denote the empirical probability density function estimate of the data as $\hat{f}(x)$. The goal is to find the value of $\theta$ that maximizes the likelihood function $f(x;\theta)$.

Interestingly, it can be shown that minimizing the Kullback–Leibler divergence between the empirical PDF estimate $\hat{f}(x)$ and the model PDF $f(x;\theta)$ with respect to $\theta$ leads to the Maximum Likelihood Estimator (MLE) for $\theta$. Mathematically, we have:

$$\hat{\theta}_{ML} = \arg\min_{\theta} D_{KL}\big(\hat{f}(x) \,\|\, f(x;\theta)\big).$$

4 Proof

To prove this relationship, we start by expanding the KLD as follows:

$$D_{KL}\big(\hat{f}(x) \,\|\, f(x;\theta)\big) = \int \hat{f}(x) \log \hat{f}(x)\, dx - \int \hat{f}(x) \log f(x;\theta)\, dx.$$

We observe that minimizing the KLD over $\theta$ is equivalent to maximizing the second integral (it enters with a minus sign), since the first term is independent of $\theta$. Hence, we define the cost function

$$J(\theta) = \int \hat{f}(x) \log f(x;\theta)\, dx,$$

so that the KL minimizer is $\hat{\theta}_{KL} = \arg\max_{\theta} J(\theta)$.

Now, we use the definition of the empirical PDF,

$$\hat{f}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i),$$

where $\delta(x)$ is the Dirac delta function and the $x_i$'s are the observed data points. Substituting into $J(\theta)$ and using the sifting property of the delta function, the cost function takes the following form,

$$J(\theta) = \int \frac{1}{N}\sum_{i=1}^{N}\delta(x - x_i)\,\log f(x;\theta)\, dx = \frac{1}{N}\sum_{i=1}^{N}\log f(x_i;\theta) = \frac{1}{N}\log\prod_{i=1}^{N} f(x_i;\theta) = \frac{1}{N}\log f(x_1,\dots,x_N;\theta),$$

where the last equality holds because the $x_i$'s were assumed i.i.d., so the joint PDF factors into the product of the marginal PDFs. Up to the constant factor $1/N$, $J(\theta)$ is therefore the log-likelihood of the observed data, and maximizing it is the same as maximizing the likelihood.

Hence, minimizing the KLD between the empirical PDF estimate $\hat{f}(x)$ and the model PDF $f(x;\theta)$ with respect to $\theta$ yields the MLE of $\theta$.
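
As a quick numerical sanity check of this identity (our own sketch, not part of the original derivation), the snippet below approximates $J(\theta)$ with a histogram-based $\hat{f}(x)$ and compares it to the average log-likelihood $\frac{1}{N}\sum_{i}\log f(x_i;\theta)$ for Gaussian data; the two agree up to the histogram's discretization error:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(2.5, 1.0, 100_000)   # i.i.d. Gaussian samples
theta = 2.0                            # candidate mean, sigma fixed at 1

def log_f(x, mu, sigma=1.0):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

# Average log-likelihood: (1/N) * sum_i log f(x_i; theta)
avg_loglik = np.mean(log_f(data, theta))

# J(theta) with a histogram-based empirical PDF:
# integral of f_hat(x) * log f(x; theta) dx ≈ sum over bins of height * log f(center) * width
hist, edges = np.histogram(data, bins=200, density=True)
centers = (edges[:-1] + edges[1:]) / 2
width = edges[1] - edges[0]
J_theta = np.sum(hist * log_f(centers, theta) * width)

print(avg_loglik, J_theta)  # nearly identical, up to binning error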

5 Conclusion

In conclusion, the relationship between Maximum Likelihood Estimation and Information Theory, particularly the Kullback–Leibler divergence, provides valuable insight into the estimation of parameters from observed data. Taking the model PDF $f(x;\theta)$ and seeking the $\theta$ that brings it closest to the empirical PDF in the Kullback–Leibler sense is equivalent to searching for the MLE, a widely used and powerful estimation method.

6 Python-Based Tutorial

Part 6.1: Estimating Only the Mean

First, let's look at the case where we estimate only the mean (μ) of a Gaussian distribution, keeping the standard deviation (σ) fixed at 1.

6.1.1 Importing Libraries and Defining Helper Functions

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize_scalar

def gaussian_pdf(x, mu, sigma=1):
    """
    Compute the Gaussian probability density function.

    Args:
    x (array-like): Points at which to evaluate the distribution
    mu (float): Mean of the Gaussian distribution
    sigma (float): Standard deviation of the Gaussian distribution (default: 1)

    Returns:
    array-like: Probability density at each point in x
    """
    return 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu)**2 / (2 * sigma**2))

def kl_divergence(p, q):
    """
    Compute the Kullback-Leibler divergence between two discrete probability distributions.

    Args:
    p, q (array-like): Probability distributions

    Returns:
    float: KL divergence D_KL(p || q)
    """
    epsilon = 1e-10  # Small constant to avoid numerical issues
    p = np.asarray(p)
    q = np.asarray(q)
    # Clamp both densities away from zero so the log of the ratio stays finite
    return np.sum(np.where(p > 0, p * np.log(np.maximum(p, epsilon) / np.maximum(q, epsilon)), 0.0))

def negative_log_likelihood(mu, data):
    """
    Compute the negative log-likelihood for a Gaussian distribution.

    Args:
    mu (float): Mean of the Gaussian distribution
    data (array-like): Observed data points

    Returns:
    float: Negative log-likelihood
    """
    epsilon = 1e-10  # Small constant to avoid numerical issues
    return -np.sum(np.log(np.maximum(gaussian_pdf(data, mu), epsilon)))

def kl_cost_function(mu, data, bin_centers, hist):
    """
    Compute the KL divergence between the empirical PDF and the Gaussian PDF.

    Args:
    mu (float): Mean of the Gaussian distribution
    data (array-like): Observed data points
    bin_centers (array-like): Centers of histogram bins
    hist (array-like): Heights of histogram bars

    Returns:
    float: KL divergence between empirical and model distributions
    """
    epsilon = 1e-10  # Small constant to avoid numerical issues
    model_pdf = gaussian_pdf(bin_centers, mu)
    # Normalize to ensure both distributions sum to 1
    hist_norm = hist / (np.sum(hist) + epsilon)
    model_pdf_norm = model_pdf / (np.sum(model_pdf) + epsilon)
    return kl_divergence(hist_norm, model_pdf_norm)

These functions form the core of our analysis. The gaussian_pdf function computes the probability density function of a Gaussian distribution. The kl_divergence function calculates the Kullback-Leibler divergence between two probability distributions. The negative_log_likelihood function computes the negative log-likelihood of the data given a Gaussian distribution, and the kl_cost_function computes the KL divergence between the empirical distribution (histogram) and a Gaussian distribution.
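
Before running the full experiment, here is a quick smoke test of the helpers on a handful of toy points (the values below are arbitrary, chosen only for illustration):

# Quick smoke test of the helpers on a few toy points (arbitrary values)
toy_data = np.array([1.8, 2.2, 2.9, 3.1])

print(gaussian_pdf(toy_data, mu=2.5))            # densities at each point
print(negative_log_likelihood(2.5, toy_data))    # scalar NLL at mu = 2.5

toy_hist, toy_edges = np.histogram(toy_data, bins=4, density=True)
toy_centers = (toy_edges[:-1] + toy_edges[1:]) / 2
print(kl_cost_function(2.5, toy_data, toy_centers, toy_hist))  # KL cost at mu = 2.5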

6.1.2 Generating Data and Estimating the Mean

Now, let's generate some data and estimate the mean using both Maximum Likelihood Estimation (MLE) and KL Divergence minimization:

# Generate sample data
np.random.seed(42)  # For reproducibility
true_mu = 2.5  # True mean of the distribution
n_samples = 1000  # Number of data points
data = np.random.normal(true_mu, 1, n_samples)  # Generate Gaussian data

# Compute empirical PDF (histogram)
hist, bin_edges = np.histogram(data, bins=50, density=True)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

# Find Maximum Likelihood Estimate (MLE)
mle_result = minimize_scalar(negative_log_likelihood, args=(data,))
mle_mu = mle_result.x

# Find KL Divergence minimizer
kld_result = minimize_scalar(lambda x: kl_cost_function(x, data, bin_centers, hist))
kld_mu = kld_result.x

# Print results
print(f"True mu: {true_mu}")
print(f"MLE estimate: {mle_mu}")
print(f"KLD minimizer: {kld_mu}")

# Visualize results
x = np.linspace(min(data), max(data), 1000)
plt.figure(figsize=(10, 6))
plt.hist(data, bins=50, density=True, alpha=0.7, label='Empirical data')
plt.plot(x, gaussian_pdf(x, mle_mu), 'r-', label='MLE estimate')
plt.plot(x, gaussian_pdf(x, kld_mu), 'g--', label='KLD minimizer')
plt.plot(x, gaussian_pdf(x, true_mu), 'k:', label='True distribution')
plt.legend()
plt.title('Comparison of MLE and KLD minimization (Mean Only)')
plt.xlabel('x')
plt.ylabel('Probability density')
plt.show()

Output:

True mu: 2.5
MLE estimate: 2.5193320558223258
KLD minimizer: 2.5201447899898883

[Figure: Comparison of MLE and KLD minimization (Mean Only)]

In this code, we first generate synthetic data from a Gaussian distribution with a known mean. We then create a histogram of this data to serve as our empirical probability density function (PDF).
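
One small check worth adding here (our addition): with density=True, the histogram bar heights times the bin widths sum to one, so the histogram really behaves like a PDF:

# density=True normalizes the histogram so that it integrates to 1
print(np.sum(hist * np.diff(bin_edges)))  # ~1.0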

We use scipy.optimize.minimize_scalar to find the value of μ that minimizes the negative log-likelihood function (which is equivalent to maximizing the likelihood function). We do the same to find the value of μ that minimizes the KL divergence between our empirical PDF and the Gaussian PDF.

Finally, we visualize the results, plotting the histogram of our data along with the true Gaussian distribution and the distributions estimated by MLE and KLD minimization.

The key observation here is that both the MLE and KLD minimization approaches should produce very similar estimates of μ, both close to the true value of 2.5. This demonstrates the theoretical relationship between Maximum Likelihood Estimation and Kullback-Leibler Divergence minimization, as described in the paper.
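
As a cross-check (not in the original code), recall that when σ is known the Gaussian MLE of μ has a closed form: it is simply the sample mean, so the optimizer's output can be compared against it directly:

# Closed-form MLE of the mean for a Gaussian with known sigma: the sample mean
print(f"Sample mean (closed-form MLE): {np.mean(data)}")
# mle_mu should match this value up to optimizer tolerance, while kld_mu also
# carries the discretization error of the 50-bin histogram.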

Part 6.2: Estimating Both Mean and Standard Deviation

Now, let's extend our analysis to estimate both the mean (μ) and standard deviation (σ) of the Gaussian distribution.

6.2.1 Updating Helper Functions

We need to modify some of our functions to handle two parameters:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize

def gaussian_pdf(x, mu, sigma):
    """
    Compute the Gaussian probability density function.

    Args:
    x (array-like): Points at which to evaluate the distribution
    mu (float): Mean of the Gaussian distribution
    sigma (float): Standard deviation of the Gaussian distribution

    Returns:
    array-like: Probability density at each point in x
    """
    return 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu)**2 / (2 * sigma**2))

def kl_divergence(p, q):
    """
    Compute the Kullback-Leibler divergence between two discrete probability distributions.

    Args:
    p, q (array-like): Probability distributions

    Returns:
    float: KL divergence D_KL(p || q)
    """
    epsilon = 1e-10  # Small constant to avoid numerical issues
    p = np.asarray(p)
    q = np.asarray(q)
    # Clamp both densities away from zero so the log of the ratio stays finite
    return np.sum(np.where(p > 0, p * np.log(np.maximum(p, epsilon) / np.maximum(q, epsilon)), 0.0))
    
def negative_log_likelihood(params, data):
    """
    Compute the negative log-likelihood for a Gaussian distribution.

    Args:
    params (array-like): [mu, sigma] - Mean and standard deviation of the Gaussian distribution
    data (array-like): Observed data points

    Returns:
    float: Negative log-likelihood
    """
    mu, sigma = params
    epsilon = 1e-10  # Small constant to avoid numerical issues
    return -np.sum(np.log(np.maximum(gaussian_pdf(data, mu, sigma), epsilon)))

def kl_cost_function(params, data, bin_centers, hist):
    """
    Compute the KL divergence between the empirical PDF and the Gaussian PDF.

    Args:
    params (array-like): [mu, sigma] - Mean and standard deviation of the Gaussian distribution
    data (array-like): Observed data points
    bin_centers (array-like): Centers of histogram bins
    hist (array-like): Heights of histogram bars

    Returns:
    float: KL divergence between empirical and model distributions
    """
    mu, sigma = params
    epsilon = 1e-10  # Small constant to avoid numerical issues
    model_pdf = gaussian_pdf(bin_centers, mu, sigma)
    # Normalize to ensure both distributions sum to 1
    hist_norm = hist / (np.sum(hist) + epsilon)
    model_pdf_norm = model_pdf / (np.sum(model_pdf) + epsilon)
    return kl_divergence(hist_norm, model_pdf_norm)

The main changes here are in the negative_log_likelihood and kl_cost_function. They now take a params argument which includes both μ and σ.
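
For example (with toy values and toy data, purely for illustration), the updated cost functions are now called with a two-element parameter vector:

# The updated cost functions take a [mu, sigma] parameter vector
toy_params = [2.5, 1.2]
toy_data = np.array([1.9, 2.4, 3.0, 2.7])
print(negative_log_likelihood(toy_params, toy_data))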

6.2.2 Generating Data and Estimating Parameters

Now let's generate data and estimate both parameters:

# Generate sample data
np.random.seed(42)
true_mu, true_sigma = 2.5, 1.2
n_samples = 1000
data = np.random.normal(true_mu, true_sigma, n_samples)

# Compute empirical PDF
hist, bin_edges = np.histogram(data, bins=50, density=True)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

# Initial guess for parameters
initial_guess = [np.mean(data), np.std(data)]

# Find MLE
mle_result = minimize(negative_log_likelihood, initial_guess, args=(data,), method='Nelder-Mead')
mle_mu, mle_sigma = mle_result.x

# Find KLD minimizer
kld_result = minimize(kl_cost_function, initial_guess, args=(data, bin_centers, hist), method='Nelder-Mead')
kld_mu, kld_sigma = kld_result.x

print(f"True parameters: mu = {true_mu}, sigma = {true_sigma}")
print(f"MLE estimates: mu = {mle_mu:.4f}, sigma = {mle_sigma:.4f}")
print(f"KLD minimizer: mu = {kld_mu:.4f}, sigma = {kld_sigma:.4f}")

# Visualization
x = np.linspace(min(data), max(data), 1000)
plt.figure(figsize=(10, 6))
plt.hist(data, bins=50, density=True, alpha=0.7, label='Empirical data')
plt.plot(x, gaussian_pdf(x, mle_mu, mle_sigma), 'r-', label='MLE estimate')
plt.plot(x, gaussian_pdf(x, kld_mu, kld_sigma), 'g--', label='KLD minimizer')
plt.plot(x, gaussian_pdf(x, true_mu, true_sigma), 'k:', label='True distribution')
plt.legend()
plt.title('Comparison of MLE and KLD minimization (Mean and Std Dev)')
plt.xlabel('x')
plt.ylabel('Probability density')
plt.show()

Output:

True parameters: mu = 2.5, sigma = 1.2
MLE estimates: mu = 2.5232, sigma = 1.1745
KLD minimizer: mu = 2.5246, sigma = 1.1799

[Figure: Comparison of MLE and KLD minimization (Mean and Std Dev)]

In this version, we're generating data from a Gaussian distribution with both μ and σ specified. We use scipy.optimize.minimize instead of minimize_scalar because we're now optimizing over two parameters.

We provide initial guesses for both parameters based on the sample mean and standard deviation. The optimization process then finds the values of μ and σ that minimize the negative log-likelihood (for MLE) or the KL divergence (for KLD minimization).

The key observation here is that both MLE and KLD minimization produce very similar estimates for both μ and σ, and these estimates should be close to the true values. This demonstrates that the equivalence between these two approaches holds when estimating multiple parameters of a distribution.
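
Again as a cross-check (our addition), the Gaussian MLEs of μ and σ have closed forms: the sample mean and the square root of the average squared deviation (the biased standard deviation, which is what np.std computes by default):

# Closed-form Gaussian MLEs: sample mean and biased standard deviation (ddof=0)
mu_closed = np.mean(data)
sigma_closed = np.std(data)  # ddof=0 by default, i.e. the MLE of sigma
print(f"Closed-form MLE: mu = {mu_closed:.4f}, sigma = {sigma_closed:.4f}")
# The Nelder-Mead MLE above should agree with these values up to optimizer tolerance.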

Python Tutorial Conclusion

This tutorial demonstrates the relationship between Maximum Likelihood Estimation and Kullback-Leibler Divergence minimization in two scenarios: estimating only the mean of a Gaussian distribution, and estimating both the mean and standard deviation.

In both cases, we see that MLE and KLD minimization produce very similar results, confirming the theoretical relationship described in the paper. This relationship holds whether we're estimating a single parameter or multiple parameters of a distribution.

The visualization helps to illustrate how well both estimates match the true distribution and the empirical data. The close overlap of the MLE and KLD minimizer lines visually confirms their theoretical equivalence.

This practical demonstration supports the main theoretical result: minimizing the Kullback-Leibler divergence between the empirical PDF estimate and the model PDF with respect to the parameters yields the Maximum Likelihood Estimator for those parameters.
