K-Means: Least Squares Quantization in PCM

Based on the paper by Lloyd, S. P. (1982), "Least squares quantization in PCM," IEEE Transactions on Information Theory, 28(2), 129–137.

Introduction

Quantization is a fundamental concept in signal processing and data compression, crucial for converting analog signals into digital form. This tutorial explores the Least Squares Quantization method in Pulse Code Modulation (PCM), based on the seminal paper by Stuart P. Lloyd. We'll delve into the mathematical foundations of the method and implement it in Python, providing detailed explanations and visualizations to enhance understanding.

Understanding Quantization

What is Quantization?

Quantization involves mapping a large set of input values to a smaller, finite set of output values. This process is essential in digital signal processing, where continuous signals must be represented in discrete form.

Applications in PCM

In Pulse Code Modulation (PCM), quantization plays a vital role in digitizing analog signals:

  1. Sampling: The analog signal is sampled at regular intervals.
  2. Quantization: Each sample is approximated to the nearest value within a finite set of levels.
  3. Encoding: Quantized values are converted into binary form for digital representation.

However, quantization introduces an error known as quantization error, which we aim to minimize to preserve signal fidelity.
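
To make this concrete, here is a minimal sketch of uniform quantization applied to a sampled sine wave. The 8-level quantizer and the signal parameters are arbitrary choices for illustration, not values from the paper:

import numpy as np

# Sample a 1 Hz sine wave at 100 samples over one second
t = np.linspace(0, 1, 100)
x = np.sin(2 * np.pi * t)

# Uniform quantizer: 8 evenly spaced levels spanning the signal range [-1, 1]
levels = np.linspace(-1, 1, 8)

# Map each sample to its nearest quantization level
x_q = levels[np.argmin(np.abs(x[:, None] - levels), axis=1)]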

Least Squares Quantization

Objective Function

Our goal is to design a quantizer that minimizes the total quantization error across all samples. The Mean Squared Error (MSE) is commonly used as the objective function:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - Q(x_i)\right)^2$$

where

  • $N$: Number of samples.
  • $x_i$: Original sample value.
  • $Q(x_i)$: Quantized value of $x_i$.

Quantization Error

Quantization error is defined as the difference between the input value and its quantized output:

$$e_i = x_i - Q(x_i)$$

Minimizing the MSE corresponds to finding the optimal quantization levels that reduce the overall quantization error.
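
Continuing the sine-wave sketch above, both quantities can be computed directly in NumPy:

# Quantization error per sample: e_i = x_i - Q(x_i)
e = x - x_q

# Mean squared error across all samples
mse = np.mean(e ** 2)
print(f'MSE: {mse:.6f}')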

Lloyd's Algorithm

Lloyd's Algorithm is an iterative method used to find the optimal set of quantization levels (codebook vectors) and partition the input space accordingly. It's widely used in vector quantization and clustering applications.

Steps of the Algorithm

  1. Initialization: Choose initial codebook vectors (quantization levels), often selected randomly from the dataset.
  2. Assignment Step: Assign each sample to the nearest codebook vector based on a distance metric (usually Euclidean distance).
  3. Update Step: Recompute the codebook vectors as the centroids (means) of all samples assigned to them.
  4. Convergence Check: Repeat steps 2 and 3 until the change in MSE or codebook vectors falls below a predefined threshold.

Mathematical Derivation

Optimal Quantizer for a Given Partition

Given a partition of the input space into regions $R_1, R_2, \dots, R_K$, the optimal quantizer minimizes the MSE by mapping each region to its centroid:

$$\hat{y}_k = \frac{1}{N_k}\sum_{x_i \in R_k} x_i$$

where:

  • $\hat{y}_k$: Codebook vector for region $R_k$.
  • $N_k$: Number of samples in region $R_k$.
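
In NumPy this centroid update is a single expression. A sketch, assuming data is an (N, d) array and assignments holds each sample's region index (these names anticipate the implementation below):

# Codebook vector for region k: the mean of the samples assigned to it
y_k = data[assignments == k].mean(axis=0)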

Optimal Partition for a Given Quantizer

Given codebook vectors $\hat{y}_1, \dots, \hat{y}_K$, the optimal partition assigns each sample $x$ to the nearest codebook vector:

$$R_k = \{\, x : d(x, \hat{y}_k) \le d(x, \hat{y}_j) \ \text{for all } j \,\}$$

where $d(x, \hat{y}_k)$ is the distance between sample $x$ and codebook vector $\hat{y}_k$.
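
The corresponding assignment step, again as a sketch with the same assumed array names:

# Distance from every sample to every codebook vector, shape (N, K)
distances = np.linalg.norm(data[:, np.newaxis] - codebook, axis=2)

# Each sample joins the region of its nearest codebook vector
assignments = np.argmin(distances, axis=1)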

Convergence

The iterative process of updating partitions and codebook vectors guarantees a non-increasing MSE at each step: the assignment step can only lower (or preserve) each sample's distance to its codebook vector, and the update step replaces each vector with the mean of its region, which minimizes the within-region squared error. The algorithm converges when the changes in the codebook vectors become negligible, though it is only guaranteed to reach a local minimum of the MSE, not the global one.
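
One way to see this empirically is to record the MSE after each update and confirm the sequence never increases. A small helper sketch (illustrative, not part of the implementation below):

def mse_of(data, codebook, assignments):
    # Mean squared distance from each sample to its assigned codebook vector
    return np.mean(np.sum((data - codebook[assignments]) ** 2, axis=1))

Calling this after every iteration should yield a monotonically non-increasing series of values.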

Python Implementation

We'll implement Lloyd's Algorithm in Python, providing detailed explanations of each code segment and plotting the results at every iteration to visualize the algorithm's progression.

Importing Libraries

import numpy as np
import matplotlib.pyplot as plt
  • numpy: Used for numerical computations and array manipulations.
  • matplotlib.pyplot: Used for creating visualizations.

Generating Sample Data

We generate synthetic 2D data points to simulate the input signal.

# Set a seed for reproducibility
np.random.seed(0)

# Generate data points from three different Gaussian distributions
data = np.vstack((
    np.random.normal(0, 0.5, (100, 2)),  # Cluster around (0, 0)
    np.random.normal(3, 0.5, (100, 2)),  # Cluster around (3, 3)
    np.random.normal(6, 0.5, (100, 2))   # Cluster around (6, 6)
))

Initializing the Codebook

We randomly select initial codebook vectors from the data points.

# Number of codebook vectors (clusters)
K = 3

# Randomly initialize codebook vectors from the data
initial_indices = np.random.choice(len(data), K, replace=False)
codebook = data[initial_indices]

Defining the Plotting Function

We create a function to plot the clustering result at each iteration.

def plot_iteration(data, assignments, codebook, iteration):
    plt.figure(figsize=(8, 6))
    plt.title(f'Iteration {iteration}')

    # Plot data points colored by their assigned cluster
    for k in range(K):
        cluster_data = data[assignments == k]
        plt.scatter(cluster_data[:, 0], cluster_data[:, 1], label=f'Cluster {k}')

    # Plot codebook vectors (centroids)
    plt.scatter(codebook[:, 0], codebook[:, 1], c='black', marker='x', s=100, label='Centroids')

    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.legend()
    plt.grid(True)
    plt.show()
  • Purpose: Visualizes the clustering process at each iteration.
  • Parameters:
    • data: The dataset.
    • assignments: Cluster assignments for each data point.
    • codebook: Current codebook vectors.
    • iteration: Current iteration number.

Implementing Lloyd's Algorithm

We define a function that implements Lloyd's Algorithm, including plotting at each iteration.

def lloyds_algorithm(data, codebook, max_iters=100, tol=1e-6):
    # Store the history of codebooks for plotting (optional)
    codebook_history = [codebook.copy()]

    for iteration in range(max_iters):
        # Assignment Step
        distances = np.linalg.norm(data[:, np.newaxis] - codebook, axis=2)
        assignments = np.argmin(distances, axis=1)

        # Update Step
        new_codebook = np.array([
            data[assignments == k].mean(axis=0) if len(data[assignments == k]) > 0 else codebook[k]
            for k in range(K)
        ])

        # Store the new codebook (optional)
        codebook_history.append(new_codebook.copy())

        # Plot the current state
        plot_iteration(data, assignments, new_codebook, iteration)

        # Convergence Check
        diff = np.linalg.norm(new_codebook - codebook)
        if diff < tol:
            print(f'Converged at iteration {iteration}')
            break

        codebook = new_codebook

    return codebook, assignments, codebook_history

Explanation

  • Assignment Step:
    • Computes distances between data points and codebook vectors.
    • Assigns each data point to the nearest codebook vector.
  • Update Step:
    • Recalculates codebook vectors as the mean of the assigned data points.
    • Handles empty clusters by retaining the previous codebook vector.
  • Convergence Check:
    • Calculates the change in codebook vectors.
    • Stops the algorithm if the change is below the tolerance level.
  • Visualization:
    • Calls plot_iteration() to visualize the clustering at each iteration.

Running the Algorithm

final_codebook, assignments, codebook_history = lloyds_algorithm(data, codebook)
  • Execution: Runs the algorithm with the generated data and initial codebook.
  • Outputs:
    • final_codebook: The converged codebook vectors.
    • assignments: Final cluster assignments of data points.
    • codebook_history: History of codebook vectors (optional).
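
A quick way to inspect these outputs (the exact centroid values depend on the random seed):

print('Final codebook vectors:')
print(final_codebook)

# Number of data points assigned to each cluster
for k in range(K):
    print(f'Cluster {k}: {np.sum(assignments == k)} points')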

Examples and Visualization

Observing the Clustering Process

At each iteration, the algorithm updates the cluster assignments and codebook vectors. The plots generated at each step allow us to observe:

  • Movement of Centroids: How the codebook vectors (centroids) move towards the true centers of the data clusters.
  • Reassignment of Data Points: How data points switch clusters as centroids move.

Interpreting the Plots

  • Data Points: Colored dots represent data points, with each color indicating a different cluster.
  • Centroids: Black 'x's represent the current codebook vectors (centroids).
  • Iteration Title: Indicates the current iteration number.

Sample Visualization

Iteration 0:

  • Initial random centroids.
  • Data points assigned based on proximity to these centroids.

Subsequent Iterations:

  • Centroids move towards the mean of assigned data points.
  • Cluster assignments update accordingly.
  • Convergence is observed when centroids stabilize.

Conclusion

In this comprehensive tutorial, we've explored the Least Squares Quantization method in Pulse Code Modulation, understanding both its mathematical foundations and practical implementation. By implementing Lloyd's Algorithm in Python and visualizing each iteration, we've gained deeper insight into how quantization levels (codebook vectors) are optimized to minimize quantization error.

Key Takeaways:

  • Quantization is essential for digital signal processing but introduces quantization error.
  • Least Squares Quantization aims to minimize the Mean Squared Error between original and quantized values.
  • Lloyd's Algorithm provides an effective iterative method to find optimal quantization levels.
  • Visualization enhances understanding by showing the dynamic process of clustering and convergence.

References

  • Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129-137.

Note: The complete code provided can be run in a Python environment with the necessary libraries installed. Running the code will display the clustering process at each iteration, offering a visual and practical understanding of Least Squares Quantization using Lloyd's Algorithm.

Complete Code with Detailed Comments:

import numpy as np
import matplotlib.pyplot as plt

# Generate Sample Data
np.random.seed(0)
data = np.vstack((
    np.random.normal(0, 0.5, (100, 2)),  # Cluster around (0, 0)
    np.random.normal(3, 0.5, (100, 2)),  # Cluster around (3, 3)
    np.random.normal(6, 0.5, (100, 2))   # Cluster around (6, 6)
))

# Initialize Codebook
K = 3  # Number of clusters
initial_indices = np.random.choice(len(data), K, replace=False)
codebook = data[initial_indices]

# Define the Plotting Function
def plot_iteration(data, assignments, codebook, iteration):
    plt.figure(figsize=(8, 6))
    plt.title(f'Iteration {iteration}')

    # Plot data points colored by their assigned cluster
    for k in range(K):
        cluster_data = data[assignments == k]
        plt.scatter(cluster_data[:, 0], cluster_data[:, 1], label=f'Cluster {k}')

    # Plot codebook vectors (centroids)
    plt.scatter(codebook[:, 0], codebook[:, 1], c='black', marker='x', s=100, label='Centroids')

    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.legend()
    plt.grid(True)
    plt.show()

# Implement Lloyd's Algorithm
def lloyds_algorithm(data, codebook, max_iters=100, tol=1e-6):
    # Store the history of codebooks for plotting (optional)
    codebook_history = [codebook.copy()]

    for iteration in range(max_iters):
        # Assignment Step
        distances = np.linalg.norm(data[:, np.newaxis] - codebook, axis=2)
        assignments = np.argmin(distances, axis=1)

        # Update Step
        new_codebook = np.array([
            data[assignments == k].mean(axis=0) if len(data[assignments == k]) > 0 else codebook[k]
            for k in range(K)
        ])

        # Store the new codebook (optional)
        codebook_history.append(new_codebook.copy())

        # Plot the current state
        plot_iteration(data, assignments, new_codebook, iteration)

        # Convergence Check
        diff = np.linalg.norm(new_codebook - codebook)
        if diff < tol:
            print(f'Converged at iteration {iteration}')
            break

        codebook = new_codebook

    return codebook, assignments, codebook_history

# Run the Algorithm
final_codebook, assignments, codebook_history = lloyds_algorithm(data, codebook)
Output:

Converged at iteration 2

[Plots for Iteration 0, Iteration 1, and Iteration 2 are displayed here, showing the cluster assignments and centroid positions at each step.]

Feel free to experiment with the code by changing parameters such as the number of clusters K, the maximum number of iterations max_iters, or the convergence tolerance tol. This hands-on experience will further solidify your understanding of the algorithm and its applications in quantization and clustering.
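
For instance, to compare several values of K, you can rerun the algorithm and record the final MSE for each. A sketch (note that K is read as a global inside lloyds_algorithm, and you may want to comment out the plot_iteration call to suppress the per-iteration plots):

for K in [2, 3, 4, 5]:  # rebinds the global K used inside the functions
    init = data[np.random.choice(len(data), K, replace=False)]
    cb, assign, _ = lloyds_algorithm(data, init)
    mse = np.mean(np.sum((data - cb[assign]) ** 2, axis=1))
    print(f'K={K}: final MSE = {mse:.4f}')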