Insights into Maximum Likelihood Estimation and Information Theory [1]

📌
For more content, check out the Circuit of Knowledge.

1 Introduction

Maximum Likelihood estimation is a widely used classical estimation method, in which the parameter estimate is obtained by maximizing the likelihood function. The likelihood function is the probability density function (PDF) $f(x; \theta)$, viewed as a function of the parameter $\theta$ for the observed data. In this paper, we explore the intriguing relationship between Maximum Likelihood Estimation and Information Theory, specifically the Kullback–Leibler divergence.

2 The Kullback–Leibler Divergence

The Kullback–Leibler divergence (KLD) is a measure of the difference between two probability distributions. Given two probability distributions $P$ and $Q$, the KLD from $P$ to $Q$ is defined as:

$$D_{KL}(P \| Q) = \int P(x) \log \left(\frac{P(x)}{Q(x)}\right) dx,$$

where the integral is taken over the support of the distributions.
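
As a quick numerical illustration (a minimal sketch, not taken from [1]; the two Gaussian distributions below are arbitrary choices), the integral can be approximated on a grid and checked against the closed-form KLD between two univariate Gaussians:

```python
import numpy as np
from scipy.stats import norm

# Two example distributions: P = N(0, 1) and Q = N(1, 1.5^2) -- arbitrary choices.
mu_p, sigma_p = 0.0, 1.0
mu_q, sigma_q = 1.0, 1.5

# Grid approximation of D_KL(P || Q) = integral of P(x) * log(P(x)/Q(x)) dx.
x = np.linspace(-10.0, 12.0, 200_001)
dx = x[1] - x[0]
p = norm.pdf(x, mu_p, sigma_p)
q = norm.pdf(x, mu_q, sigma_q)
kld_grid = np.sum(p * np.log(p / q)) * dx

# Closed-form KLD between two univariate Gaussians, for comparison.
kld_closed = (np.log(sigma_q / sigma_p)
              + (sigma_p**2 + (mu_p - mu_q)**2) / (2.0 * sigma_q**2) - 0.5)

print(f"grid: {kld_grid:.6f}   closed form: {kld_closed:.6f}")
```

The two values agree to within the grid resolution.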

3 Minimizing the KLD and Maximum Likelihood Estimation

Now, let's consider the problem of estimating the parameter $\theta$ based on a set of observed data points. We denote the empirical probability density function estimate of the data as $\hat{f}(x)$. The goal is to find the value of $\theta$ that maximizes the likelihood function $f(x; \theta)$.

Interestingly, it can be shown that minimizing the Kullback–Leibler divergence between the empirical PDF estimate $\hat{f}(x)$ and the model PDF $f(x; \theta)$ with respect to $\theta$ leads to the Maximum Likelihood Estimator (MLE) of $\theta$. Mathematically, we have:

$$\hat{\theta}_{KL} = \arg \min_{\theta} D_{KL}\big(\hat{f}(x) \,\|\, f(x; \theta)\big).$$
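
The following sketch illustrates this claim numerically (it is not from [1]; the Gaussian model with known unit variance, the true mean of 2, the kernel-density stand-in for $\hat{f}(x)$, and the grid search are all illustrative assumptions): minimizing a grid approximation of $D_{KL}(\hat{f}(x) \| f(x; \theta))$ over $\theta$ lands very close to the sample mean, which is the known MLE for a Gaussian mean.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(1)

# i.i.d. data from N(theta_true, 1); only the mean theta is treated as unknown.
theta_true = 2.0
data = rng.normal(loc=theta_true, scale=1.0, size=5_000)

# Smooth empirical PDF estimate f_hat (a KDE stands in for the delta-sum form).
f_hat = gaussian_kde(data)
x = np.linspace(data.min() - 2.0, data.max() + 2.0, 4_001)
dx = x[1] - x[0]
p_hat = f_hat(x)
support = p_hat > 0  # avoid 0 * log(0) issues far in the tails

def kld_to_model(theta):
    """Grid approximation of D_KL(f_hat || f(.; theta)) for the N(theta, 1) model."""
    q = norm.pdf(x, loc=theta, scale=1.0)
    return np.sum(p_hat[support] * np.log(p_hat[support] / q[support])) * dx

thetas = np.linspace(0.0, 4.0, 801)
theta_kl = thetas[np.argmin([kld_to_model(t) for t in thetas])]

print(f"argmin-KLD estimate: {theta_kl:.3f}   MLE (sample mean): {data.mean():.3f}")
```

The small discrepancy between the two estimates comes only from the grid resolution and the kernel smoothing used in place of the delta-sum empirical PDF.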

4 Proof

To prove this relationship, we start by expanding the KLD as follows:

$$D_{KL}(\hat{f}(x) \| f(x; \theta)) = \int \hat{f}(x) \log \left(\frac{\hat{f}(x)}{f(x; \theta)}\right) dx$$

$$= \int \hat{f}(x) \log \hat{f}(x)\, dx - \int \hat{f}(x) \log f(x; \theta)\, dx.$$

We observe that minimizing the KLD over $\theta$ is equivalent to maximizing the second integral, since the first term is independent of $\theta$. Hence, defining the cost function

$$J(\theta) = \int \hat{f}(x) \log f(x; \theta)\, dx,$$

the estimator becomes $\hat{\theta}_{KL} = \arg \max_{\theta} J(\theta)$.

Now, we use the definition of the empirical PDF

$$\hat{f}(x) = \sum_{i=1}^{N} \frac{1}{N} \delta(x - x_i),$$

where $\delta(x)$ is the Dirac delta function, and the $x_i$'s are the observed data points, so the cost function $J(\theta)$ takes the following form,

$$J(\theta) = \int \hat{f}(x) \log f(x; \theta)\, dx = \int \sum_{i=1}^{N} \frac{1}{N} \delta(x - x_i) \log f(x; \theta)\, dx$$

$$= \frac{1}{N} \sum_{i=1}^{N} \log f(x_i; \theta) = \frac{1}{N} \log f(\mathbf{x}; \theta),$$

where the first equality follows from the sifting property of the delta function, and the second holds because the $x_i$'s are assumed i.i.d., so the joint PDF of the data vector $\mathbf{x} = [x_1, \ldots, x_N]^T$ factors as $f(\mathbf{x}; \theta) = \prod_{i=1}^{N} f(x_i; \theta)$.

Maximizing $J(\theta)$ is therefore the same as maximizing the log-likelihood $\log f(\mathbf{x}; \theta)$. Hence, minimizing the KLD between the empirical PDF estimate $\hat{f}(x)$ and the model PDF $f(x; \theta)$ with respect to $\theta$ yields the MLE of $\theta$.
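
As a concrete check of this equivalence (again a sketch, not from [1]; the exponential model $f(x; \theta) = \theta e^{-\theta x}$ and the true rate of 1.5 are illustrative assumptions), maximizing $J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log f(x_i; \theta)$ numerically recovers the closed-form MLE $\hat{\theta} = 1/\bar{x}$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)

# i.i.d. draws from an exponential model f(x; theta) = theta * exp(-theta * x).
theta_true = 1.5
data = rng.exponential(scale=1.0 / theta_true, size=10_000)

def neg_J(theta):
    """Negative of J(theta) = (1/N) * sum_i log f(x_i; theta)."""
    return -np.mean(np.log(theta) - theta * data)

res = minimize_scalar(neg_J, bounds=(1e-6, 50.0), method="bounded")
print(f"argmax of J(theta): {res.x:.4f}   closed-form MLE 1/x_bar: {1.0 / data.mean():.4f}")
```

The numerical maximizer of $J(\theta)$ and the analytic MLE agree to within the optimizer's tolerance.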

5 Conclusion

In conclusion, the relationship between Maximum Likelihood Estimation and Information Theory, particularly the Kullback–Leibler divergence, provides valuable insight into the estimation of parameters from observed data. Taking the model PDF and seeking the $\theta$ that brings it closest to the empirical PDF in a Kullback–Leibler sense is equivalent to searching for the MLE, a widely used and powerful estimation method.

References

[1] S. M. Kay, "Information-Theoretic Signal Processing and Its Applications", 2020.