Maximum Likelihood Training of Probabilistic Neural Networks by Roy L. Streit and Tod E. Luginbuhl

Abstract:

In this tutorial, we will dive deep into the concepts and implementation details of Probabilistic Neural Networks (PNNs) with Maximum Likelihood Training, as described in the paper "Maximum Likelihood Training of Probabilistic Neural Networks" by Roy L. Streit and Tod E. Luginbuhl. We will cover the mathematical foundations, key excerpts from the paper, and a comprehensive implementation in Python using TensorFlow.

Overview

Probabilistic Neural Networks (PNNs) offer a robust approach to classification by combining neural network architectures with probabilistic models. The key idea behind PNNs is to estimate the probability density function (PDF) of each class with a Gaussian mixture and then classify inputs by the maximum likelihood criterion. Framed as a neural network, this gives a streamlined way to tackle what was, at the time, the computationally demanding problem of fitting a GMM.

Mathematical Foundations

1. Gaussian Mixture Models (GMMs)

A Gaussian Mixture Model is a probabilistic model that assumes all data points are generated from a mixture of several Gaussian distributions with unknown parameters. The probability density function for a GMM is given by:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

where:

  • $\pi_k$ are the mixing coefficients.
  • $\mu_k$ are the means of the Gaussian components.
  • $\Sigma_k$ are the covariance matrices of the Gaussian components.
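To make the formula concrete, here is a minimal sketch (not from the paper or the implementation below) that evaluates a toy two-component GMM density with SciPy; the weights, means, and covariances are arbitrary illustrative values.

import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, weights, means, covs):
    """Evaluate p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

# Toy two-component mixture in 2-D (illustrative parameters only)
weights = [0.6, 0.4]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.diag([0.5, 2.0])]
print(gmm_pdf(np.array([1.0, 1.0]), weights, means, covs))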

2. Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation aims to find the parameters that maximize the likelihood of the observed data. For a PNN, this involves maximizing the log-likelihood function:

$$\mathcal{L}(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta)$$

where $\theta$ represents the parameters of the model (mixing coefficients, means, and covariances). Maximizing this log-likelihood is equivalent to minimizing the cross-entropy loss, as we showed in another tutorial.
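As a quick illustration (not from the paper), the snippet below fits a two-component GMM with scikit-learn, whose EM routine maximizes exactly this log-likelihood, and evaluates $\mathcal{L}(\theta)$ at the fitted parameters; the demo data are synthetic.

import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic demo data: two well-separated Gaussian blobs
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (200, 2)),
                    rng.normal(4.0, 1.0, (200, 2))])

# EM maximizes the log-likelihood L(theta) = sum_i log p(x_i | theta)
gm = GaussianMixture(n_components=2, random_state=0).fit(X_demo)
log_likelihood = gm.score_samples(X_demo).sum()
print(f"Log-likelihood at the EM solution: {log_likelihood:.1f}")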

3. Activation Functions

The PNN uses specific activation functions at each layer:

  • First Hidden Layer: Linear activation.
  • Second Hidden Layer: Absolute value activation.
  • Third Hidden Layer: Exponential activation.
  • Output Layer: Per-class likelihood.

Note: In our implementation we reduce the network to two hidden layers and use a softmax output; for reference, a sketch of the paper's original four-layer stack is shown below.
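This is a minimal, hypothetical Keras sketch of the four-layer design using the activations listed above; the layer widths are arbitrary, and the weight tying to the Gaussian mixture parameters described in the paper is omitted.

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Schematic four-layer PNN (layer sizes are illustrative, not from the paper)
inputs = Input(shape=(2,))
h1 = Dense(12, activation='linear')(inputs)   # first hidden layer: linear
h2 = Dense(8, activation=tf.abs)(h1)          # second hidden layer: absolute value
h3 = Dense(6, activation=tf.exp)(h2)          # third hidden layer: exponential
outputs = Dense(3, activation='linear')(h3)   # output layer: per-class likelihood
paper_style_pnn = Model(inputs, outputs)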

Excerpts from the Paper

Here are some key excerpts from the paper that highlight the core concepts and methodologies:

"The PNN models the PDF of each class as a mixture of Gaussian components. The parameters of these mixtures are estimated using the EM algorithm, which iteratively maximizes the likelihood of the observed data."
"The activation functions used in the network are designed to reflect the probabilistic nature of the model. The absolute value and exponential functions are particularly important in transforming the inputs into meaningful probabilistic quantities."

Network Architecture

Below is Fig. 1 from the paper, “Four-layer feed-forward probabilistic neural network”:

[Figure 1: Four-layer feed-forward probabilistic neural network]

Implementation in Python

Let's implement the PNN with Maximum Likelihood Training using TensorFlow. We will follow the architecture and activation functions described in the paper.

Step-by-Step Implementation

  1. Generate Synthetic Data:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Concatenate
from sklearn.mixture import GaussianMixture
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Generate synthetic data
X, y = make_classification(n_samples=1500, n_features=2, n_informative=2, n_redundant=0,
                           n_clusters_per_class=1, n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Normalize the input features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
  2. Fit Gaussian Mixture Models (GMMs) for Each Class:
# Fit Gaussian Mixture Model (GMM) with multiple components for each class
num_components = 3  # Number of components per class
class_models = []
class_priors = []
for i in np.unique(y_train):
    gm = GaussianMixture(n_components=num_components, covariance_type='full', random_state=42)
    gm.fit(X_train[y_train == i])
    class_models.append(gm)
    class_priors.append(len(y_train[y_train == i]) / len(y_train))
  3. Define Functions to Compute Component Likelihoods and Degrees of Freedom:
# Function to compute the component likelihoods
def compute_component_likelihood(X, class_models):
    component_likelihoods = []
    for gm in class_models:
        likelihoods = gm.score_samples(X).reshape(-1, 1)  # per-sample log-density under this class's mixture, as a 2D column
        component_likelihoods.append(likelihoods)
    return np.hstack(component_likelihoods)

# Function to compute the degree of freedom (posterior probabilities)
def compute_posterior_probabilities(X, class_models):
    degrees_of_freedom = []
    for gm in class_models:
        dof = gm.predict_proba(X)
        degrees_of_freedom.append(dof)
    return np.hstack(degrees_of_freedom)
  4. Precompute GMM Features:
# Precompute GMM features for the training and test sets
X_train_dof = compute_posterior_probabilities(X_train, class_models)
X_train_component_likelihood = compute_component_likelihood(X_train, class_models)
X_train_combined = np.hstack([X_train, X_train_dof, X_train_component_likelihood])

X_test_dof = compute_posterior_probabilities(X_test, class_models)
X_test_component_likelihood = compute_component_likelihood(X_test, class_models)
X_test_combined = np.hstack([X_test, X_test_dof, X_test_component_likelihood])

# Check GMM outputs
print("X_train_combined shape:", X_train_combined.shape)
print("X_test_combined shape:", X_test_combined.shape)
print("First 5 rows of X_train_combined:\\n", X_train_combined[:5])
X_train_combined shape: (1050, 14)
X_test_combined shape: (450, 14)
First 5 rows of X_train_combined:
 [[-7.60452631e-001 -1.15061306e+000  1.90311414e-001  2.37483167e-002
   7.85940270e-001  0.00000000e+000  2.96003471e-111  1.00000000e+000
   1.79892764e-005  9.42189724e-001  5.77922871e-002 -4.30885459e+000
  -2.24104585e+003 -1.01075165e+000]
 [ 7.35941826e-001  7.11872750e-001  4.84903651e-001  5.15036685e-001
   5.96645869e-005  0.00000000e+000  9.61851748e-001  3.81482517e-002
   2.41642232e-005  5.13220634e-007  9.99975323e-001 -5.15792382e+000
   1.76606779e+000 -1.00673250e+001]
 [-7.27528065e-001 -1.01510868e+000  3.59565805e-001  3.62102418e-002
   6.04223953e-001  0.00000000e+000  5.35062844e-103  1.00000000e+000
   5.00363562e-005  8.84504254e-001  1.15445709e-001 -4.85051573e+000
  -2.06967102e+003 -9.88759376e-001]
 [-2.16378351e+000 -6.34398930e-001  4.54275293e-001  1.49641324e-002
   5.30760575e-001  0.00000000e+000  2.63107693e-255  1.00000000e+000
   4.65102007e-001  1.20946020e-002  5.22803391e-001 -3.15922326e+001
  -5.40208516e+003 -2.51116180e+000]
 [ 5.15535214e-002 -1.42524813e+000  3.90500446e-001  2.69169505e-002
   5.82582604e-001  1.14402002e-216  3.80774189e-053  1.00000000e+000
   7.18606657e-010  9.99832856e-001  1.67142797e-004 -1.54477204e+000
  -1.02903221e+003 -2.94360646e+000]]
  5. Custom Absolute Value Activation Function:
# Custom absolute value activation function
def absolute_value_activation(x):
    return tf.abs(x)
  6. Define the PNN Model:
# Define a simpler PNN model
input_dim = X_train_combined.shape[1]
output_dim = len(np.unique(y_train))

input_layer = Input(shape=(input_dim,))
hidden_1 = Dense(32, activation='linear')(input_layer)
hidden_2 = Dense(16, activation=absolute_value_activation)(hidden_1)
output_layer = Dense(output_dim, activation='softmax')(hidden_2)
  7. Use Categorical Crossentropy Loss Function:
# Use CategoricalCrossentropy loss function
loss_fn = tf.keras.losses.CategoricalCrossentropy()
  8. Convert Labels to One-Hot Encoding:
# Convert y_train and y_test to one-hot encoding
y_train_one_hot = tf.keras.utils.to_categorical(y_train, num_classes=output_dim)
y_test_one_hot = tf.keras.utils.to_categorical(y_test, num_classes=output_dim)
  9. Create and Compile the Model:
# Create model
model = Model(inputs=input_layer, outputs=output_layer)
model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])
  10. Train the Model:
# Train model with debugging outputs
history = model.fit(X_train_combined, y_train_one_hot, epochs=500, batch_size=32, verbose=1)
Epoch 494/500
33/33 [==============================] - 0s 2ms/step - loss: 0.3369 - accuracy: 0.9314
Epoch 495/500
33/33 [==============================] - 0s 2ms/step - loss: 0.2185 - accuracy: 0.9314
Epoch 496/500
33/33 [==============================] - 0s 3ms/step - loss: 0.2392 - accuracy: 0.9333
Epoch 497/500
33/33 [==============================] - 0s 2ms/step - loss: 0.2522 - accuracy: 0.9343
Epoch 498/500
33/33 [==============================] - 0s 2ms/step - loss: 0.3630 - accuracy: 0.9238
Epoch 499/500
33/33 [==============================] - 0s 3ms/step - loss: 0.2730 - accuracy: 0.9371
Epoch 500/500
33/33 [==============================] - 0s 4ms/step - loss: 0.2106 - accuracy: 0.9371
  11. Evaluate the Model (see the confusion-matrix sketch after this list for a per-class view):
# Evaluate model
y_pred = model.predict(X_test_combined)
y_pred_classes = np.argmax(y_pred, axis=1)
accuracy = np.mean(y_pred_classes == y_test)
print(f"Classification accuracy on the test data: {accuracy:.3f}")
Classification accuracy on the test data: 0.936
  12. Plot Decision Boundaries:
# Plot decision boundaries
xx, yy = np.meshgrid(np.linspace(X_train[:, 0].min()-1, X_train[:, 0].max()+1, 500),
                     np.linspace(X_train[:, 1].min()-1, X_train[:, 1].max()+1, 500))
grid = np.c_[xx.ravel(), yy.ravel()]
grid_dof = compute_posterior_probabilities(grid, class_models)
grid_component_likelihood = compute_component_likelihood(grid, class_models)
grid_combined = np.hstack([grid, grid_dof, grid_component_likelihood])
probs = model.predict(grid_combined)
Z = np.argmax(probs, axis=1).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k', marker='o')
plt.title('PNN Decision Regions on Synthetic Data')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()
[Figure: PNN decision regions over the training data]
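Beyond a single accuracy number, it can be helpful to inspect per-class errors. A quick way to do that (not part of the original tutorial code) is a confusion matrix with scikit-learn, reusing the y_test and y_pred_classes arrays from the evaluation step:

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred_classes)
print("Confusion matrix:\n", cm)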

Explanation of the Code

  1. Data Generation:
    • Synthetic data with two features and three classes is generated using make_classification.
    • The data is split into training and test sets and standardized with StandardScaler.
  2. GMM Fitting:
    • A Gaussian Mixture Model (GMM) with three components is fitted per class to estimate that class's probability density function.
    • The fitted GMMs and class priors are stored for later use.
  3. Custom Functions:
    • The functions compute_component_likelihood and compute_posterior_probabilities compute the probabilistic quantities (per-class log-densities and component posteriors) fed into the PNN.
  4. Custom Activation:
    • A custom absolute value activation function is implemented using TensorFlow.
  5. Model Definition:
    • The PNN model is defined with input, hidden, and output layers, using linear, absolute value, and softmax activations (a reduced, two-hidden-layer version of the paper's linear / absolute value / exponential / likelihood stack).
  6. Loss Function:
    • Categorical cross-entropy is used, which corresponds to maximizing the log-likelihood of the observed labels.
  7. Model Training:
    • The model is trained on the training data for 500 epochs, updating the parameters to minimize the loss.
  8. Model Evaluation:
    • The trained model is evaluated on the test data and the classification accuracy is printed to the console.
  9. Plotting Decision Boundaries:
    • The decision regions of the trained model are visualized with a contour plot, overlaid with the training data points.

Since this dataset was drawn from Gaussian clusters, it is natural that a GMM classifier by itself achieves similar performance, and indeed that is what we get when we train an ordinary GMM classifier. In fact, the Probabilistic Neural Network constructed above is essentially a serial realization of a GMM classifier.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate synthetic data
X, y = make_classification(n_samples=1500, n_features=2, n_informative=2, n_redundant=0, 
                           n_clusters_per_class=1, n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)

# Fit Gaussian Mixture Model (GMM) for each class
models = []
for i in np.unique(y_train):
    gm = GaussianMixture(n_components=1, covariance_type='full', random_state=42)
    gm.fit(X_train[y_train == i])
    models.append(gm)

# Predict class probabilities for each sample in X_test
def predict_proba(X, models):
    probs = np.zeros((X.shape[0], len(models)))
    for i, model in enumerate(models):
        probs[:, i] = model.score_samples(X)
    return probs

# Predict class labels
probs = predict_proba(X_test, models)
y_pred = np.argmax(probs, axis=1)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Classification accuracy on the test data: {accuracy:.3f}")

# Plot decision boundaries
xx, yy = np.meshgrid(np.linspace(X_train[:, 0].min()-1, X_train[:, 0].max()+1, 500),
                     np.linspace(X_train[:, 1].min()-1, X_train[:, 1].max()+1, 500))
grid = np.c_[xx.ravel(), yy.ravel()]
probs = predict_proba(grid, models)
Z = np.argmax(probs, axis=1).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k', marker='o')
plt.title('GMM Classifier Decision Regions on Synthetic Data')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()
Classification accuracy on the test data: 0.947

Let’s Juice it Up

Let’s draw data from a heavier-tailed distribution in which outliers are more likely, such as the Laplace distribution.

Below is the GMM classifier applied to this case:

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Concatenate
from sklearn.mixture import GaussianMixture
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Generate synthetic data with multivariate Laplace distribution
def multivariate_laplace(mean, cov, size):
    """
    Generate samples from a multivariate Laplace distribution.
    """
    mean = np.asarray(mean)
    cov = np.asarray(cov)
    d = len(mean)
    
    # Step 1: Generate standard normal samples
    z = np.random.normal(size=(size, d))
    
    # Step 2: Generate exponential samples
    u = np.random.exponential(scale=1, size=size)
    
    # Step 3: Cholesky decomposition of the covariance matrix
    L = np.linalg.cholesky(cov)
    
    # Step 4: Scale by the square root of the exponential mixing variable and
    # correlate via the Cholesky factor (normal variance-mixture construction)
    samples = mean + np.sqrt(u)[:, None] * np.dot(z, L.T)
    
    return samples

# Parameters for the three classes
means = [np.array([5, 10]), np.array([-10, -2]), np.array([2, -3])]
covs = [
    [[0.1, 0.2], [0.2, 0.7]],  # Covariance matrix for class 0
    [[1, -0.3], [-0.3, 0.1]],  # Covariance matrix for class 1
    [[0.6, 0.4], [0.4, 3]]   # Covariance matrix for class 2
]

# Generate samples
n_samples_per_class = 500
X = []
y = []

for class_label, (mean, cov) in enumerate(zip(means, covs)):
    samples = multivariate_laplace(mean, cov, n_samples_per_class)
    X.append(samples)
    y.append(np.full(n_samples_per_class, class_label))

X = np.vstack(X)
y = np.concatenate(y)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


# Fit Gaussian Mixture Model (GMM) for each class
models = []
for i in np.unique(y_train):
    gm = GaussianMixture(n_components=1, covariance_type='full', random_state=42)
    gm.fit(X_train[y_train == i])
    models.append(gm)

# Predict class probabilities for each sample in X_test
def predict_proba(X, models):
    probs = np.zeros((X.shape[0], len(models)))
    for i, model in enumerate(models):
        probs[:, i] = model.score_samples(X)
    return probs

# Predict class labels
probs = predict_proba(X_test, models)
y_pred = np.argmax(probs, axis=1)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Classification accuracy on the test data: {accuracy:.3f}")

# Plot decision boundaries
xx, yy = np.meshgrid(np.linspace(X_train[:, 0].min()-1, X_train[:, 0].max()+1, 500),
                     np.linspace(X_train[:, 1].min()-1, X_train[:, 1].max()+1, 500))
grid = np.c_[xx.ravel(), yy.ravel()]
probs = predict_proba(grid, models)
Z = np.argmax(probs, axis=1).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k', marker='o')
plt.title('GMM Classifier Decision Regions on Laplace-Distributed Data')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()
Classification accuracy on the test data: 0.993
[Figure: GMM decision regions on the Laplace-distributed data]

Now let’s run the PNN on the same Laplace-distributed data, this time with different activation functions, and see a slight improvement. Only the model definition changes; the rest of the feature-building and training pipeline is repeated as before (a sketch follows the snippet):

# Define a PNN model
input_dim = X_train_combined.shape[1]
output_dim = len(np.unique(y_train))

input_layer = Input(shape=(input_dim,))
hidden_1 = Dense(32, activation='tanh')(input_layer)
hidden_2 = Dense(16, activation='relu')(hidden_1)
output_layer = Dense(output_dim, activation='softmax')(hidden_2)
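
The model definition above reuses X_train_combined, so the GMM-feature pipeline has to be rebuilt on the Laplace data first (ideally before the model definition, since input_dim is read from X_train_combined; here the feature dimensionality, 14, happens to be unchanged). A sketch of the remaining steps, assuming the same num_components and the compute_posterior_probabilities and compute_component_likelihood helpers from earlier, might look like this; it is not verbatim tutorial code.

# Refit one GMM per class on the (Laplace-distributed) training data
num_components = 3
class_models = []
for i in np.unique(y_train):
    gm = GaussianMixture(n_components=num_components, covariance_type='full', random_state=42)
    gm.fit(X_train[y_train == i])
    class_models.append(gm)

# Rebuild the combined GMM features for the train and test sets
X_train_combined = np.hstack([X_train,
                              compute_posterior_probabilities(X_train, class_models),
                              compute_component_likelihood(X_train, class_models)])
X_test_combined = np.hstack([X_test,
                             compute_posterior_probabilities(X_test, class_models),
                             compute_component_likelihood(X_test, class_models)])

# Compile and train the tanh/ReLU PNN defined above
y_train_one_hot = tf.keras.utils.to_categorical(y_train, num_classes=output_dim)
model = Model(inputs=input_layer, outputs=output_layer)
model.compile(optimizer='adam', loss=tf.keras.losses.CategoricalCrossentropy(), metrics=['accuracy'])
model.fit(X_train_combined, y_train_one_hot, epochs=500, batch_size=32, verbose=0)

# Evaluate on the test set
y_pred_classes = np.argmax(model.predict(X_test_combined), axis=1)
print(f"Classification accuracy on the test data: {np.mean(y_pred_classes == y_test):.3f}")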

The results are slightly better:

Classification accuracy on the test data: 0.996
[Figure: Laplace-distributed data classification boundaries]

Conclusion

This tutorial provides a comprehensive guide to understanding and implementing Probabilistic Neural Networks (PNNs) with Maximum Likelihood Training, based on the paper by Roy L. Streit and Tod E. Luginbuhl. By following the steps outlined above, you will gain a deep understanding of the theoretical foundations and practical implementation of PNNs.