Tutorial: Information Theory Fundamentals: Shannon Entropy and Mutual Information

By Dr. Nir Regev

For more tutorials like this, visit Circuit of Knowledge

1. Introduction

Information theory, pioneered by Claude Shannon in 1948, provides a mathematical framework for quantifying, storing, and communicating information. This tutorial covers key concepts including Shannon entropy, mutual information, and information gain, which form the basis for understanding more advanced concepts such as Kullback-Leibler divergence (KLD) and cross-entropy.

2. Shannon Entropy

Shannon entropy quantifies the average amount of information contained in a message. For a discrete random variable X with possible values {x₁, x₂, ..., xₙ} and probability mass function P(X), the Shannon entropy H(X) is defined as:

H(X) = -∑P(xᵢ) log₂ P(xᵢ)

Where:

  • The sum is over all possible values of X
  • log₂ is the base-2 logarithm
  • The unit of entropy is bits (when using log₂)

2.1. Python implementation:

import numpy as np

def shannon_entropy(p):
    """Compute Shannon entropy of a discrete probability distribution."""
    # Drop zero probabilities (0·log₂ 0 is treated as 0 by convention)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Example
p = np.array([0.5, 0.25, 0.25])
print(f"Shannon entropy: {shannon_entropy(p):.4f} bits")
Shannon entropy: 1.5000 bits
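
As a sanity check, the same value can be obtained with SciPy's entropy function (assuming SciPy is available); the snippet below also illustrates that the uniform distribution attains the maximum entropy of log₂(3) ≈ 1.585 bits for three outcomes:

from scipy.stats import entropy

# SciPy computes the same quantity when given base=2.
print(f"SciPy entropy: {entropy(p, base=2):.4f} bits")

# The uniform distribution maximizes entropy for a fixed number of outcomes.
p_uniform = np.array([1/3, 1/3, 1/3])
print(f"Uniform entropy: {shannon_entropy(p_uniform):.4f} bits")
SciPy entropy: 1.5000 bits
Uniform entropy: 1.5850 bits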

3. Joint Entropy

For two discrete random variables X and Y, the joint entropy H(X,Y) is defined as:

H(X,Y) = -∑∑P(x,y) log₂ P(x,y)

Where P(x,y) is the joint probability distribution of X and Y.

3.1. Python implementation:

import numpy as np

def joint_entropy(p_xy):
    """Compute joint entropy of two discrete random variables."""
    # Drop zero probabilities (this also flattens the 2-D joint distribution)
    p_xy = p_xy[p_xy > 0]
    return -np.sum(p_xy * np.log2(p_xy))

# Example
p_xy = np.array([[0.2, 0.1], [0.3, 0.4]])
print(f"Joint entropy: {joint_entropy(p_xy):.4f} bits")
Joint entropy: 1.8464 bits
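
If X and Y were independent, the joint distribution would factor into the product of the marginals and the joint entropy would equal H(X) + H(Y). A minimal check using the marginals of the example above:

p_x = np.sum(p_xy, axis=1)    # marginal of X: [0.3, 0.7]
p_y = np.sum(p_xy, axis=0)    # marginal of Y: [0.5, 0.5]
p_indep = np.outer(p_x, p_y)  # joint distribution X and Y would have if independent
print(f"H(X) + H(Y): {shannon_entropy(p_x) + shannon_entropy(p_y):.4f} bits")
print(f"Joint entropy under independence: {joint_entropy(p_indep):.4f} bits")
H(X) + H(Y): 1.8813 bits
Joint entropy under independence: 1.8813 bits

Both values are 1.8813 bits, slightly above the actual joint entropy of 1.8464 bits; the gap is the mutual information computed in Section 5.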

4. Conditional Entropy

The conditional entropy H(Y|X) quantifies the amount of information needed to describe Y given that X is known:

H(Y|X) = H(X,Y) - H(X)

4.1. Python implementation:

def conditional_entropy(p_xy):
    """Compute conditional entropy H(Y|X)."""
    p_x = np.sum(p_xy, axis=1)  # marginal distribution of X (rows of p_xy index X)
    return joint_entropy(p_xy) - shannon_entropy(p_x)

# Example
print(f"Conditional entropy H(Y|X): {conditional_entropy(p_xy):.4f} bits")
Conditional entropy H(Y|X): 0.9651 bits
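
Equivalently, H(Y|X) is the expected entropy of the conditional distributions P(Y|X=x), weighted by P(x). A short sketch verifying that this direct route gives the same answer as the chain-rule formula above:

def conditional_entropy_direct(p_xy):
    """Compute H(Y|X) as the P(x)-weighted average of H(Y|X=x)."""
    p_x = np.sum(p_xy, axis=1)
    h = 0.0
    for i, px in enumerate(p_x):
        if px > 0:
            p_y_given_x = p_xy[i] / px  # conditional distribution P(Y|X=x_i)
            h += px * shannon_entropy(p_y_given_x)
    return h

print(f"Direct H(Y|X): {conditional_entropy_direct(p_xy):.4f} bits")
Direct H(Y|X): 0.9651 bits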

5. Mutual Information

Mutual information I(X;Y) measures the mutual dependence between two variables. It quantifies the amount of information obtained about one variable by observing the other:

I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

5.1. Python implementation:

def mutual_information(p_xy):
    """Compute mutual information between X and Y."""
    p_x = np.sum(p_xy, axis=1)  # marginal distribution of X
    p_y = np.sum(p_xy, axis=0)  # marginal distribution of Y
    return shannon_entropy(p_x) + shannon_entropy(p_y) - joint_entropy(p_xy)

# Example
print(f"Mutual information: {mutual_information(p_xy):.4f} bits")
Mutual information: 0.0349 bits
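
Mutual information can also be written as the KL divergence (introduced in Section 7) between the joint distribution P(x,y) and the product of the marginals P(x)P(y); it is zero exactly when X and Y are independent. A minimal sketch checking this on the same example:

def mutual_information_kl(p_xy):
    """Compute I(X;Y) as KLD( P(x,y) || P(x)P(y) )."""
    p_x = np.sum(p_xy, axis=1)
    p_y = np.sum(p_xy, axis=0)
    p_indep = np.outer(p_x, p_y)  # joint distribution under independence
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / p_indep[mask]))

print(f"Mutual information (KL form): {mutual_information_kl(p_xy):.4f} bits")
Mutual information (KL form): 0.0349 bits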

6. Information Gain

Information gain is the change in entropy from a prior state to a state that takes some information as given. It's often used in decision tree algorithms to select the best feature for splitting:

IG(Y,X) = H(Y) - H(Y|X)

This is equivalent to the mutual information I(X;Y).

6.1. Python implementation:

def information_gain(p_y, p_xy):
    """Compute information gain."""
    return shannon_entropy(p_y) - conditional_entropy(p_xy)

# Example
p_y = np.sum(p_xy, axis=0)  # marginal distribution of Y
print(f"Information gain: {information_gain(p_y, p_xy):.4f} bits")
Information gain: 0.0349 bits
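
To illustrate how information gain drives feature selection in decision trees, the sketch below builds a joint distribution from a small, made-up table of counts (the feature/label counts are hypothetical, chosen only for illustration) and measures how much splitting on the feature reduces uncertainty about the label:

# Hypothetical counts: rows are values of a feature X, columns are class labels Y.
counts = np.array([[8, 2],   # X = x1: 8 samples labeled y1, 2 labeled y2
                   [3, 7]])  # X = x2: 3 samples labeled y1, 7 labeled y2
p_xy_toy = counts / counts.sum()    # joint distribution P(X, Y)
p_y_toy = np.sum(p_xy_toy, axis=0)  # marginal label distribution P(Y)

print(f"Label entropy H(Y): {shannon_entropy(p_y_toy):.4f} bits")
print(f"Information gain of splitting on X: {information_gain(p_y_toy, p_xy_toy):.4f} bits")
Label entropy H(Y): 0.9928 bits
Information gain of splitting on X: 0.1912 bits

A decision-tree learner would compare this gain against the gains of other candidate features and split on the one with the largest value.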

7. Relationship to KLD and Cross-Entropy

Kullback-Leibler Divergence (KLD) and Cross-Entropy are closely related to the concepts we've discussed:

7.1. KLD

Measures the difference between two probability distributions P and Q:

KLD(P||Q) = ∑P(x) log(P(x)/Q(x)) = ∑P(x) log P(x) - ∑P(x) log Q(x) = -H(P) + H(P,Q)

Where H(P,Q) is the cross-entropy of P and Q.

7.2. Cross-Entropy

Measures the average number of bits (when using log₂; nats when using the natural logarithm) needed to identify an event drawn from the true distribution P when the coding scheme is optimized for a given distribution Q rather than for P:

H(P,Q) = -∑P(x) log Q(x)

7.3. Python implementation:

def kl_divergence(p, q):
    """Compute Kullback-Leibler divergence KLD(P||Q) in bits."""
    # Assumes q > 0 wherever p > 0; terms with p = 0 contribute nothing (0·log 0 = 0).
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def cross_entropy(p, q):
    """Compute cross-entropy H(P,Q) in bits."""
    # Assumes q > 0 wherever p > 0.
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

# Example
p = np.array([0.5, 0.5])
q = np.array([0.8, 0.2])
print(f"KL divergence: {kl_divergence(p, q):.4f} bits")
print(f"Cross-entropy: {cross_entropy(p, q):.4f} bits")
KL divergence: 0.3219 bits
Cross-entropy: 1.3219 bits
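
The two quantities are linked by the identity H(P,Q) = H(P) + KLD(P||Q): the cross-entropy equals the entropy of the true distribution plus the extra bits paid for coding with the wrong distribution. A one-line check with the values above:

print(f"H(P) + KLD(P||Q): {shannon_entropy(p) + kl_divergence(p, q):.4f} bits")
H(P) + KLD(P||Q): 1.3219 bits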

8. Analysis

To summarize, these are the combined results from the examples above:

Shannon entropy: 1.5000 bits
Joint entropy: 1.8464 bits
Conditional entropy H(Y|X): 0.9651 bits
Mutual information: 0.0349 bits
Information gain: 0.0349 bits
KL divergence: 0.3219 bits
Cross-entropy: 1.3219 bits

These results provide valuable insights into the information content and relationships between the variables in our example. Let's analyze each result thoroughly:

  1. Shannon entropy: 1.5000 bits This value represents the average amount of information contained in the distribution p = [0.5, 0.25, 0.25]. The maximum entropy for a 3-event system is log₂(3) ≈ 1.5850 bits, so our distribution is close to, but not at, maximum uncertainty. This suggests a moderately balanced distribution with a slight bias towards one outcome (the 0.5 probability event).
  2. Joint entropy: 1.8464 bits This measures the total uncertainty of the joint distribution p_xy. The value is less than 2 bits (the maximum for two binary variables), reflecting both the non-uniform marginal of X and a slight dependency between X and Y; the dependency itself shows up in H(X,Y) = 1.8464 bits being smaller than H(X) + H(Y) ≈ 1.8813 bits.
  3. Conditional entropy H(Y|X): 0.9651 bits This represents the average uncertainty of Y given X. It's less than 1 bit (the maximum for a binary variable), suggesting that knowing X provides some information about Y, but doesn't completely determine it.
  4. Mutual information: 0.0349 bits (see information gain below).
  5. Information gain: 0.0349 bits These identical values (as expected, since mutual information is equivalent to information gain) indicate the amount of information shared between X and Y. The low value suggests a weak, but non-zero, dependency between the variables: knowing one variable reduces uncertainty about the other by only 0.0349 bits on average, so it would be hard to predict one from the other.
  6. KL divergence: 0.3219 bits This measures the difference between distributions p and q. The non-zero value indicates that q is not an accurate representation of p, but the relatively low value suggests the distributions are not drastically different.
  7. Cross-entropy: 1.3219 bits This represents the average number of bits needed to encode events from distribution p using a code that is optimal for distribution q. It's higher than the Shannon entropy of p (1 bit for a balanced binary distribution), which quantifies the inefficiency of using q to encode p.

Key insights:

  1. Weak dependency: The low mutual information/information gain (0.0349 bits) suggests that X and Y are only weakly dependent. This could indicate that in a predictive scenario, X would not be a strong predictor of Y.
  2. Remaining uncertainty: The conditional entropy H(Y|X) is still quite high (0.9651 bits), reinforcing that knowing X doesn't substantially reduce uncertainty about Y.
  3. Distribution mismatch: The KL divergence (0.3219 bits) shows that distributions p and q are notably different, but not extremely so. This mismatch leads to inefficiency in encoding, as seen in the cross-entropy value.
  4. Encoding efficiency: The cross-entropy (1.3219 bits) being higher than the optimal Shannon entropy (1 bit for a balanced binary distribution) quantifies the inefficiency in using distribution q to encode events from distribution p.
  5. Information structure: The joint entropy (1.8464 bits) is slightly less than the sum of the individual entropies (H(X) ≈ 0.8813 bits plus H(Y) = 1 bit, about 1.8813 bits), which is what the joint entropy would be if X and Y were independent. The gap of 0.0349 bits is exactly the mutual information, indicating a small amount of shared information structure between X and Y.

In a practical scenario, these results might suggest:

  • X and Y have a weak but non-zero relationship.
  • Using X to predict Y would provide only minimal improvement over random guessing.
  • If these represent features in a machine learning context, X might not be a very informative feature for predicting Y.

This analysis demonstrates how information theory metrics can provide quantitative insights into the relationships between variables and the efficiency of probability distributions for encoding information.

9. Connections to Machine Learning and AI

The concepts of information theory we've explored have profound implications for machine learning and AI:

  1. Feature Selection: Information gain is widely used in decision tree algorithms (like ID3, C4.5) to select the most informative features. Features with higher information gain are preferred as they reduce uncertainty about the target variable more effectively.
  2. Model Evaluation: KL divergence and cross-entropy serve as loss functions in many machine learning models, especially in classification tasks. For instance, minimizing cross-entropy is equivalent to maximizing the likelihood of the true labels under the model's predictions (a small numeric sketch follows after this list).
  3. Neural Networks: The cross-entropy loss is commonly used in neural networks, particularly for classification tasks. It provides a measure of the difference between the predicted probability distribution and the true distribution of labels.
  4. Dimensionality Reduction: Mutual information can guide feature extraction methods. Techniques like the Information Bottleneck method use mutual information to find a compressed representation of the data that preserves relevant information about the target variable.
  5. Reinforcement Learning: Information-theoretic concepts help in designing exploration strategies. For example, maximum entropy reinforcement learning encourages the agent to behave as randomly as possible while still maximizing rewards.
  6. Natural Language Processing: Concepts like entropy and mutual information are used in various NLP tasks, including language modeling, topic modeling, and word sense disambiguation.
  7. Anomaly Detection: KL divergence can be used to measure how much a new data point deviates from the expected distribution, potentially flagging anomalies.
  8. Generative Models: In generative adversarial networks (GANs) and variational autoencoders (VAEs), KL divergence is often part of the loss function, helping to match the generated distribution to the true data distribution.
  9. Clustering: Information-theoretic measures can be used to determine the optimal number of clusters or to measure the quality of clustering results.
  10. Transfer Learning: Mutual information can help in selecting which parts of a model to fine-tune when adapting to a new task, by identifying which layers contain the most task-relevant information.
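
To make point 2 concrete, here is a minimal numeric sketch (with made-up predicted probabilities) showing that the cross-entropy loss for one-hot labels is exactly the negative log-likelihood of the true classes. Machine learning frameworks typically use the natural logarithm, so the loss is measured in nats rather than bits:

# Hypothetical predicted class probabilities for 3 samples over 3 classes (rows sum to 1).
predictions = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.3, 0.3, 0.4]])
true_classes = np.array([0, 1, 2])  # index of the correct class for each sample

# With one-hot labels, the cross-entropy sum collapses to -log of the true-class probability.
one_hot = np.eye(3)[true_classes]
ce_loss = -np.mean(np.sum(one_hot * np.log(predictions), axis=1))
nll = -np.mean(np.log(predictions[np.arange(3), true_classes]))

print(f"Cross-entropy loss: {ce_loss:.4f} nats")
print(f"Negative log-likelihood: {nll:.4f} nats")
Cross-entropy loss: 0.4987 nats
Negative log-likelihood: 0.4987 nats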

These applications demonstrate how the fundamental concepts of information theory permeate modern machine learning and AI, providing both theoretical foundations and practical tools for developing and improving algorithms.

10. Conclusion

This tutorial has covered the fundamental concepts of information theory, including Shannon entropy, mutual information, and information gain. These concepts provide the theoretical foundation for understanding more advanced topics like KLD and cross-entropy, which are widely used in machine learning and data compression.

Understanding these information-theoretic measures allows us to quantify uncertainty, measure information content, and assess the relationships between variables. These tools are invaluable in various fields, including machine learning, data compression, and communication systems.