Tutorial: The Connection Between Cross Entropy and Kullback-Leibler Divergence

In the field of information theory and machine learning, cross entropy and Kullback-Leibler divergence (KLD) are fundamental concepts that play seminal roles in the evaluation of probabilistic models. Though they are closely related, their specific purposes and interpretations can sometimes lead to confusion. This blog post aims to demystify these concepts and elucidate the relationship between them.

What is Cross Entropy?

Cross entropy is defined as follows: for two distributions $P$ (the true distribution) and $Q$ (the approximate distribution), the cross entropy is

$$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$

Here, $P(x)$ represents the true probability (PMF or PDF) of event $x$, and $Q(x)$ represents the predicted probability of event $x$. Cross entropy quantifies the average number of bits (when using $\log_2$, or nats, as is common in machine learning, when using the natural logarithm $\ln$) needed to encode data from the distribution $P$ when using the distribution $Q$.
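As a quick worked example in nats (using the same two-point distributions as the NumPy snippet later in this post), take $P = (0.1, 0.9)$ and $Q = (0.8, 0.2)$:

$$H(P, Q) = -\left(0.1 \ln 0.8 + 0.9 \ln 0.2\right) \approx 0.0223 + 1.4485 \approx 1.4708 \text{ nats}$$

This is the value the NumPy example below reproduces.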

What is Kullback-Leibler Divergence?

Kullback-Leibler divergence, also known as relative entropy and often abbreviated as KL divergence or KLD, measures how one probability distribution diverges from a second, reference probability distribution. Note that this divergence is NOT a distance metric: it is not symmetric, and it does not satisfy the triangle inequality. For the same distributions $P$ and $Q$, the KL divergence is given by

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

Alternatively, KL divergence can also be expressed as

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log P(x) - \sum_{x} P(x) \log Q(x)$$
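Because the definition is not symmetric in $P$ and $Q$, swapping their roles generally gives a different value. Below is a minimal NumPy sketch, written for this tutorial and reusing the two-point distributions from the example later in the post, that illustrates the asymmetry:

import numpy as np

P = np.array([0.1, 0.9])
Q = np.array([0.8, 0.2])

# KL divergence in both directions
kl_pq = np.sum(P * np.log(P / Q))  # D_KL(P || Q)
kl_qp = np.sum(Q * np.log(Q / P))  # D_KL(Q || P)

print(f"D_KL(P || Q): {kl_pq:.4f}")  # ≈ 1.1457
print(f"D_KL(Q || P): {kl_qp:.4f}")  # ≈ 1.3627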

The Relationship Between Cross Entropy and KL Divergence

The relationship between cross entropy and KL divergence becomes clear when we decompose the cross entropy as

$$H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q)$$

where $H(P)$ is the entropy of the true distribution $P$, which is fixed in machine-learning optimization problems because we only optimize/learn the model distribution $Q$:

$$H(P) = -\sum_{x} P(x) \log P(x)$$

From this decomposition, it is evident that the cross entropy between $P$ and $Q$ consists of two parts:

  1. Entropy $H(P)$: The intrinsic entropy of the distribution $P$, which represents the minimum average number of bits required to encode events drawn from $P$ using an optimal code.
  2. KL Divergence $D_{\mathrm{KL}}(P \,\|\, Q)$: The extra number of bits required to encode events from $P$ when using the distribution $Q$ instead of the optimal distribution $P$.

In simpler terms, cross entropy measures the total "cost" of encoding the data when using the approximate distribution $Q$, while KL divergence measures the inefficiency introduced by using $Q$ instead of the true distribution $P$.

Python Example:

import numpy as np

# Example distributions
P = np.array([0.1, 0.9])
Q = np.array([0.8, 0.2])

# Cross Entropy
cross_entropy = -np.sum(P * np.log(Q))
print(f"Cross Entropy: {cross_entropy}")

# KL Divergence
kl_divergence = np.sum(P * np.log(P / Q))
print(f"KL Divergence: {kl_divergence}")

# Entropy of P
entropy_P = -np.sum(P * np.log(P))
print(f"Entropy of P: {entropy_P}")

# Cross Entropy = Entropy + KL Divergence
cross_entropy_calculated = entropy_P + kl_divergence
print(f"Calculated Cross Entropy: {cross_entropy_calculated}")
Cross Entropy: 1.4708084763221112
KL Divergence: 1.1457255029306632
Entropy of P: 0.3250829733914482
Calculated Cross Entropy: 1.4708084763221114

This verifies numerically that the decomposition above holds (up to floating-point rounding).

Practical Implications in Machine Learning

In machine learning, cross entropy is widely used as a loss function for classification tasks, especially with neural networks. During training, we aim to minimize the cross-entropy loss, which implicitly minimizes the KL divergence between the true distribution (often represented by one-hot encoded labels) and the predicted distribution (the output of the model). This minimization drives the model's predictions as close as possible to the true labels, thereby improving the model's accuracy. Since we showed in another tutorial that the maximum likelihood estimator is the minimizer of the KL divergence, we conclude that minimizing the cross-entropy loss yields the maximum likelihood estimate (MLE) of the network parameters.
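To make the MLE connection concrete: when the true distribution is a one-hot label, the cross-entropy sum collapses to the negative log of the probability the model assigns to the true class, i.e. the per-sample negative log-likelihood. Here is a small NumPy sketch, written for this tutorial, using the same predictions and labels as the PyTorch example that follows:

import numpy as np

# Same predictions and labels as the PyTorch example below
y_pred = np.array([[0.8, 0.2], [0.7, 0.3]])  # predicted class probabilities per sample
true_class = np.array([1, 0])                # true class index per sample

# With one-hot targets, cross entropy per sample = -log(probability of the true class)
nll = -np.log(y_pred[np.arange(len(true_class)), true_class])
print(f"Per-sample negative log-likelihood: {nll}")       # ≈ [1.6094, 0.3567]
print(f"Mean negative log-likelihood: {nll.mean():.4f}")  # ≈ 0.9831, matching the loss below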

Let's see this in a Python example:

import torch
import torch.nn as nn

# True labels (one-hot encoded)
# Each row corresponds to a sample. The first sample belongs to class 1 (second column),
# and the second sample belongs to class 0 (first column).
y_true = torch.tensor([[0, 1], [1, 0]], dtype=torch.float)

# Predicted probabilities (output of a model)
# Each row corresponds to a sample. These are the probabilities predicted by the model
# for each class. For example, the first sample has a 0.8 probability of being class 0
# and a 0.2 probability of being class 1.
y_pred = torch.tensor([[0.8, 0.2], [0.7, 0.3]], dtype=torch.float)

# Cross Entropy Loss
# nn.CrossEntropyLoss is a loss function commonly used in classification tasks.
# It expects raw scores (logits) rather than probabilities.
criterion = nn.CrossEntropyLoss()

# Note: nn.CrossEntropyLoss expects raw scores (logits), so we need to convert y_pred to logits.
# Logits are the raw, unnormalized scores output by the last layer of a neural network.
# Taking the log of the probabilities gives valid logits here: because each row of y_pred
# sums to 1, applying softmax to log(y_pred) recovers y_pred exactly.
logits = torch.log(y_pred)

# nn.CrossEntropyLoss expects the target labels to be in a specific format (class indices).
# torch.argmax(y_true, dim=1) converts the one-hot encoded labels to class indices.
loss = criterion(logits, torch.argmax(y_true, dim=1))

# Print the cross-entropy loss
print(f"Cross Entropy Loss: {loss.item()}")

# KL Divergence Loss
# nn.KLDivLoss is a loss function used to measure the Kullback-Leibler divergence.
# The reduction='batchmean' argument specifies that the output should be averaged over the batch.
kl_loss = nn.KLDivLoss(reduction='batchmean')

# The logits need to be converted to log-probabilities using log_softmax before computing KL divergence.
# log_softmax is used because nn.KLDivLoss expects its input to be log-probabilities
# (while the target, y_true here, is a plain probability distribution).
loss_kl = kl_loss(logits.log_softmax(dim=1), y_true)

# Print the KL divergence loss
print(f"KL Divergence Loss: {loss_kl.item()}")
Cross Entropy Loss: 0.9830564260482788
KL Divergence Loss: 0.9830564260482788

Explanation of Each Step

  1. Import Libraries: We start by importing the necessary libraries, torch for tensor operations and torch.nn for the neural network module, which includes loss functions.
  2. True Labels (One-Hot Encoded):
    • We define the true labels y_true using a tensor. The labels are one-hot encoded, meaning each row represents a sample, and the correct class is indicated by a 1 in the corresponding position.
    • Example: The first row [0, 1] indicates that the true label for the first sample is class 1.
  3. Predicted Probabilities:
    • We define the predicted probabilities y_pred as a tensor. These are the probabilities output by the model for each class.
    • Example: The first row [0.8, 0.2] indicates that the model predicts a 0.8 probability for class 0 and a 0.2 probability for class 1 for the first sample.
  4. Cross Entropy Loss:
    • We create an instance of nn.CrossEntropyLoss(), which is a loss function commonly used for classification tasks. It calculates the cross-entropy loss between the predicted logits and the true labels.
    • nn.CrossEntropyLoss expects raw scores (logits) rather than probabilities. Logits are the raw, unnormalized scores output by the last layer of a neural network before applying softmax (see the short equivalence sketch after this list).
  5. Convert Probabilities to Logits:
    • Since y_pred contains probabilities, we convert them to logits using the torch.log function. This step is necessary because nn.CrossEntropyLoss expects logits as input.
  6. Convert One-Hot Encoded Labels to Class Indices:
    • torch.argmax(y_true, dim=1) converts the one-hot encoded labels to class indices. This is required because nn.CrossEntropyLoss expects the target labels to be in the form of class indices.
  7. Compute Cross Entropy Loss:
    • We calculate the cross-entropy loss using criterion(logits, torch.argmax(y_true, dim=1)) and print the result.
  8. KL Divergence Loss:
    • We create an instance of nn.KLDivLoss(), which measures the Kullback-Leibler divergence between the predicted and true distributions.
    • The reduction='batchmean' argument specifies that the output should be averaged over the batch.
  9. Normalize Logits to Log-Probabilities:
    • Before computing the KL divergence, we normalize the logits to log-probabilities using logits.log_softmax(dim=1). This is because nn.KLDivLoss expects the input to be log-probabilities.
  10. Compute KL Divergence Loss:
    • We calculate the KL divergence loss using kl_loss(logits.log_softmax(dim=1), y_true) and print the result.
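As a quick check on steps 4-7, nn.CrossEntropyLoss applied to logits is equivalent to taking log_softmax of the logits and feeding the result to nn.NLLLoss. A minimal sketch, reusing the tensors defined in the example above:

import torch
import torch.nn as nn

# Reusing the tensors from the example above
y_true = torch.tensor([[0, 1], [1, 0]], dtype=torch.float)
y_pred = torch.tensor([[0.8, 0.2], [0.7, 0.3]], dtype=torch.float)
logits = torch.log(y_pred)
targets = torch.argmax(y_true, dim=1)

ce_loss = nn.CrossEntropyLoss()(logits, targets)
nll_loss = nn.NLLLoss()(logits.log_softmax(dim=1), targets)
print(f"CrossEntropyLoss: {ce_loss.item():.4f}")          # ≈ 0.9831
print(f"NLLLoss on log_softmax: {nll_loss.item():.4f}")   # ≈ 0.9831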

What This Code Proves

This code demonstrates the relationship between cross-entropy loss and KL divergence in a practical machine-learning context. It shows that:

  • Cross-entropy loss measures the "difference" (the expected encoding cost) between the predicted distribution (derived from the logits) and the true distribution (given by the class labels).
  • KL divergence measures how one probability distribution (the predicted probabilities) diverges from a second, true probability distribution.
  • By converting the predicted probabilities to logits and passing them to nn.CrossEntropyLoss, we are effectively minimizing the KL divergence between the predicted distribution and the true distribution.
  • The two printed losses are numerically identical here because the one-hot targets have zero entropy, so $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q)$ reduces to $H(P, Q) = D_{\mathrm{KL}}(P \,\|\, Q)$.
  • The provided Python examples illustrate the equivalence between minimizing the cross-entropy loss and minimizing the KL divergence, as both align the predicted distribution with the true distribution in classification tasks.

Comparing the Results of Minimizing the KLD and the Cross-Entropy

import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax

def cross_entropy(q, p):
    return -np.sum(p * np.log(q + 1e-15))

def kl_divergence(q, p):
    return np.sum(p * np.log((p + 1e-15) / (q + 1e-15)))

def entropy(p):
    return -np.sum(p * np.log(p + 1e-15))

def objective_ce(q, p):
    return cross_entropy(softmax(q), p)

def objective_kld(q, p):
    return kl_divergence(softmax(q), p)

# Set true distribution
p = np.array([0.7, 0.2, 0.1])

# Initial guess (in log space)
q0 = np.log(np.ones(len(p)) / len(p))

# Optimize using cross-entropy
result_ce = minimize(objective_ce, q0, args=(p,), method='BFGS')
optimized_q_ce = softmax(result_ce.x)

# Optimize using KL divergence
result_kld = minimize(objective_kld, q0, args=(p,), method='BFGS')
optimized_q_kld = softmax(result_kld.x)

print("Example: Optimizing distribution")
print(f"True distribution: {p}")
print(f"Optimized distribution (CE): {optimized_q_ce}")
print(f"Optimized distribution (KLD): {optimized_q_kld}")
print(f"KL divergence (CE): {kl_divergence(optimized_q_ce, p):.6f}")
print(f"KL divergence (KLD): {kl_divergence(optimized_q_kld, p):.6f}")
print(f"Cross-entropy (CE): {cross_entropy(optimized_q_ce, p):.6f}")
print(f"Cross-entropy (KLD): {cross_entropy(optimized_q_kld, p):.6f}")
print(f"Entropy of true distribution: {entropy(p):.6f}")
Example: Optimizing distribution
True distribution: [0.7 0.2 0.1]
Optimized distribution (CE): [0.69999988 0.20000022 0.09999989]
Optimized distribution (KLD): [0.69999987 0.20000022 0.0999999 ]
KL divergence (CE): 0.000000
KL divergence (KLD): 0.000000
Cross-entropy (CE): 0.801819
Cross-entropy (KLD): 0.801819
Entropy of true distribution: 0.801819

This code demonstrates the equivalence between minimizing cross-entropy and KL divergence in practice. Here's a step-by-step breakdown of the process:

  1. We define functions for cross-entropy, KL divergence, and entropy calculations.
  2. We create objective functions for both cross-entropy and KL divergence minimization. These functions use the softmax to ensure valid probability distributions.
  3. We set a true distribution p that we aim to approximate.
  4. We initialize our guess q0 in log space to avoid constraints on the optimization.
  5. We use SciPy's minimize function with the BFGS method to optimize both objective functions separately.
  6. After optimization, we apply softmax to the results to obtain the final probability distributions.
  7. We print the true distribution, optimized distributions, and various metrics (KL divergence, cross-entropy, and entropy) for comparison.

This process allows us to empirically verify that minimizing cross-entropy and KL divergence leads to the same optimal distribution, which closely approximates the true distribution. The results show that both methods converge to nearly identical distributions, with very small KL divergence values and cross-entropy values close to the entropy of the true distribution. This demonstrates the practical equivalence of these optimization objectives in machine learning tasks involving probability distributions.
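These numerical results mirror the decomposition from earlier in the tutorial: since $H(P)$ does not depend on $Q$, the two objectives differ only by a constant and therefore share the same minimizer,

$$\arg\min_{Q} H(P, Q) = \arg\min_{Q} \left[ H(P) + D_{\mathrm{KL}}(P \,\|\, Q) \right] = \arg\min_{Q} D_{\mathrm{KL}}(P \,\|\, Q)$$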

Conclusion

Understanding the connection between cross entropy, Kullback-Leibler divergence, and Maximum Likelihood Estimation is crucial for grasping the underlying principles of many machine learning algorithms. This comprehensive tutorial, supported by practical Python examples, demonstrates how minimizing cross entropy or KL divergence leads to parameter estimates that maximize the likelihood function, thereby bridging these fundamental concepts.

By leveraging these insights, data scientists and engineers can design and optimize models more effectively, ensuring robust performance across a wide array of applications.