Insights into Maximum Likelihood Estimation and Information Theory [1]

📌
For more content, check out the Circuit of Knowledge.

1 Introduction

Maximum Likelihood estimation is a widely used classical estimation method, in which the parameter estimate is obtained by maximizing the likelihood function. The likelihood function is the probability density function (PDF) $f(x; \theta)$, viewed as a function of the parameter $\theta$ for the observed data. In this paper, we explore the intriguing relationship between Maximum Likelihood Estimation and Information Theory, specifically the Kullback–Leibler divergence.

2 The Kullback–Leibler Divergence

The Kullback–Leibler divergence (KLD) is a measure of the difference between two probability distributions. Given two probability distributions $P$ and $Q$, the KLD from $P$ to $Q$ is defined as:

$$D_{KL}(P \| Q) = \int P(x) \log \left(\frac{P(x)}{Q(x)}\right) dx,$$

where the integral is taken over the support of the distributions.
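
As a quick numerical illustration (a minimal sketch, not taken from [1]; the two Gaussian distributions below are arbitrary choices), the integral can be approximated on a grid and checked against the closed-form KLD between two univariate Gaussians:

```python
import numpy as np
from scipy.stats import norm

# Two example distributions: P = N(0, 1) and Q = N(1, 1.5^2) -- arbitrary choices.
mu_p, sigma_p = 0.0, 1.0
mu_q, sigma_q = 1.0, 1.5

# Grid approximation of D_KL(P || Q) = integral of P(x) * log(P(x)/Q(x)) dx.
x = np.linspace(-10.0, 12.0, 200_001)
dx = x[1] - x[0]
p = norm.pdf(x, mu_p, sigma_p)
q = norm.pdf(x, mu_q, sigma_q)
kld_grid = np.sum(p * np.log(p / q)) * dx

# Closed-form KLD between two univariate Gaussians, for comparison.
kld_closed = (np.log(sigma_q / sigma_p)
              + (sigma_p**2 + (mu_p - mu_q)**2) / (2.0 * sigma_q**2) - 0.5)

print(f"grid: {kld_grid:.6f}   closed form: {kld_closed:.6f}")
```

The two values agree to within the grid resolution.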

3 Minimizing the KLD and Maximum Likelihood Estimation

Now, let's consider the problem of estimating the parameter $\theta$ based on a set of observed data points. We denote the empirical probability density function estimate of the data as $\hat{f}(x)$. The goal is to find the value of $\theta$ that maximizes the likelihood function $f(x; \theta)$.

Interestingly, it can be shown that minimizing the Kullback–Leibler divergence between the empirical PDF estimate $\hat{f}(x)$ and the model PDF $f(x; \theta)$ with respect to $\theta$ leads to the Maximum Likelihood Estimator (MLE) of $\theta$. Mathematically, we have:

$$\hat{\theta}_{KL} = \arg \min_{\theta} D_{KL}\big(\hat{f}(x) \,\|\, f(x; \theta)\big).$$
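
The following sketch illustrates this claim numerically (it is not from [1]; the Gaussian model with known unit variance, the true mean of 2, the kernel-density stand-in for $\hat{f}(x)$, and the grid search are all illustrative assumptions): minimizing a grid approximation of $D_{KL}(\hat{f}(x) \| f(x; \theta))$ over $\theta$ lands very close to the sample mean, which is the known MLE for a Gaussian mean.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(1)

# i.i.d. data from N(theta_true, 1); only the mean theta is treated as unknown.
theta_true = 2.0
data = rng.normal(loc=theta_true, scale=1.0, size=5_000)

# Smooth empirical PDF estimate f_hat (a KDE stands in for the delta-sum form).
f_hat = gaussian_kde(data)
x = np.linspace(data.min() - 2.0, data.max() + 2.0, 4_001)
dx = x[1] - x[0]
p_hat = f_hat(x)
support = p_hat > 0  # avoid 0 * log(0) issues far in the tails

def kld_to_model(theta):
    """Grid approximation of D_KL(f_hat || f(.; theta)) for the N(theta, 1) model."""
    q = norm.pdf(x, loc=theta, scale=1.0)
    return np.sum(p_hat[support] * np.log(p_hat[support] / q[support])) * dx

thetas = np.linspace(0.0, 4.0, 801)
theta_kl = thetas[np.argmin([kld_to_model(t) for t in thetas])]

print(f"argmin-KLD estimate: {theta_kl:.3f}   MLE (sample mean): {data.mean():.3f}")
```

The small discrepancy between the two estimates comes only from the grid resolution and the kernel smoothing used in place of the delta-sum empirical PDF.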

4 Proof

To prove this relationship, we start by expanding the KLD as follows:

$$D_{KL}(\hat{f}(x) \| f(x; \theta)) = \int \hat{f}(x) \log \left(\frac{\hat{f}(x)}{f(x; \theta)}\right) dx$$

$$= \int \hat{f}(x) \log \hat{f}(x)\, dx - \int \hat{f}(x) \log f(x; \theta)\, dx.$$

We observe that minimizing the KLD over $\theta$ is equivalent to maximizing the second integral, since the first term is independent of $\theta$. Hence, defining the cost function

$$J(\theta) = \int \hat{f}(x) \log f(x; \theta)\, dx,$$

the estimator becomes $\hat{\theta}_{KL} = \arg \max_{\theta} J(\theta)$.

Now, we use the definition of the empirical PDF

$$\hat{f}(x) = \sum_{i=1}^{N} \frac{1}{N} \delta(x - x_i),$$

where $\delta(x)$ is the Dirac delta function, and the $x_i$'s are the observed data points, so the cost function $J(\theta)$ takes the following form,

$$J(\theta) = \int \hat{f}(x) \log f(x; \theta)\, dx = \int \sum_{i=1}^{N} \frac{1}{N} \delta(x - x_i) \log f(x; \theta)\, dx$$

$$= \frac{1}{N} \sum_{i=1}^{N} \log f(x_i; \theta) = \frac{1}{N} \log f(\mathbf{x}; \theta),$$

where the first equality follows from the sifting property of the delta function, and the second holds because the $x_i$'s are assumed i.i.d., so the joint PDF of the data vector $\mathbf{x} = [x_1, \ldots, x_N]^T$ factors as $f(\mathbf{x}; \theta) = \prod_{i=1}^{N} f(x_i; \theta)$.

Maximizing $J(\theta)$ is therefore the same as maximizing the log-likelihood $\log f(\mathbf{x}; \theta)$. Hence, minimizing the KLD between the empirical PDF estimate $\hat{f}(x)$ and the model PDF $f(x; \theta)$ with respect to $\theta$ yields the MLE of $\theta$.
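
As a concrete check of this equivalence (again a sketch, not from [1]; the exponential model $f(x; \theta) = \theta e^{-\theta x}$ and the true rate of 1.5 are illustrative assumptions), maximizing $J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log f(x_i; \theta)$ numerically recovers the closed-form MLE $\hat{\theta} = 1/\bar{x}$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)

# i.i.d. draws from an exponential model f(x; theta) = theta * exp(-theta * x).
theta_true = 1.5
data = rng.exponential(scale=1.0 / theta_true, size=10_000)

def neg_J(theta):
    """Negative of J(theta) = (1/N) * sum_i log f(x_i; theta)."""
    return -np.mean(np.log(theta) - theta * data)

res = minimize_scalar(neg_J, bounds=(1e-6, 50.0), method="bounded")
print(f"argmax of J(theta): {res.x:.4f}   closed-form MLE 1/x_bar: {1.0 / data.mean():.4f}")
```

The numerical maximizer of $J(\theta)$ and the analytic MLE agree to within the optimizer's tolerance.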

5 Conclusion

In conclusion, the relationship between Maximum Likelihood Estimation and Information Theory, particularly the Kullback–Leibler divergence, provides valuable insight into the estimation of parameters from observed data. Taking the model PDF and seeking the $\theta$ that brings it closest to the empirical PDF in a Kullback–Leibler sense is equivalent to searching for the MLE, a widely used and powerful estimation method.

References

[1] S. M. Kay, "Information-Theoretic Signal Processing and Its Applications", 2020.