Introduction
In this post, we will explore the application of the Generalized Likelihood Ratio Test (GLRT) to separate the Setosa species from the other two species (Versicolor and Virginica) in the Iris dataset. GLRT is a powerful statistical hypothesis testing method that compares the likelihood of two competing hypotheses based on the observed data.
Kernel Density Estimation (KDE)
Before diving into the GLRT, let's estimate the likelihood functions of the training data using Kernel Density Estimation (KDE). KDE is a non-parametric method for estimating the probability density function of a random variable based on a finite data sample.
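For intuition, here is a minimal sketch of the quantity a Gaussian KDE computes. The function below is illustrative (the name and structure are our own), but the formula it evaluates is the standard Gaussian-kernel density estimate, which is also what scikit-learn's `KernelDensity.score_samples` (used below) returns on a log scale:

```python
import numpy as np

def gaussian_kde_log_density(x, samples, bandwidth):
    """Log-density of a Gaussian KDE at a single d-dimensional point x.

    f(x) = (1/n) * sum_i (2*pi*h^2)^(-d/2) * exp(-||x - x_i||^2 / (2*h^2))
    """
    n, d = samples.shape
    # Squared Euclidean distance from x to every training sample
    sq_dists = np.sum((samples - x) ** 2, axis=1)
    # Log-contribution of each Gaussian kernel centered at a training sample
    log_kernels = -sq_dists / (2 * bandwidth**2) - 0.5 * d * np.log(2 * np.pi * bandwidth**2)
    # Average the kernels in log space (log-mean-exp) for numerical stability
    return np.logaddexp.reduce(log_kernels) - np.log(n)
```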
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KernelDensity
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Separate the training data into setosa and non-setosa classes
X_train_setosa = X_train[y_train == 0]
X_train_non_setosa = X_train[y_train != 0]
# Estimate the likelihood functions using KDE
kde_setosa = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X_train_setosa)
kde_non_setosa = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X_train_non_setosa)
In this code, we load the Iris dataset, split it into training and testing sets, and separate the training data into setosa and non-setosa classes. We then use KDE with a Gaussian kernel and a bandwidth of 0.5 to estimate the likelihood functions for each class.
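The bandwidth of 0.5 is a hand-picked value. As an aside, a common data-driven alternative is to choose the bandwidth that maximizes the cross-validated log-likelihood; here is a minimal sketch (the grid of candidate bandwidths is an arbitrary choice):

```python
from sklearn.model_selection import GridSearchCV

# Search a grid of candidate bandwidths by 5-fold cross-validated log-likelihood
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': np.linspace(0.1, 1.0, 10)},
                    cv=5)
grid.fit(X_train_setosa)
print(f"Best bandwidth for the setosa class: {grid.best_params_['bandwidth']:.2f}")
```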
Generalized Likelihood Ratio Test (GLRT)
Now, let's formulate the GLRT for separating the setosa species from the other two species.
Given a feature vector $x$, we want to decide between two hypotheses (classes):
- $H_0$: $x$ belongs to the setosa class
- $H_1$: $x$ belongs to the non-setosa class (versicolor or virginica)
The GLRT decides in favor of $H_1$ if the likelihood ratio exceeds a threshold $\gamma$:

$$\Lambda(x) = \frac{p(x \mid H_1)}{p(x \mid H_0)} > \gamma$$

where $p(x \mid H_0)$ and $p(x \mid H_1)$ are the likelihood functions under the null hypothesis (setosa) and the alternative hypothesis (non-setosa), respectively. Alternatively, we can take the logarithm of the equation above and compare the difference of the log-likelihoods to a new threshold $\gamma' = \ln \gamma$:

$$\ln p(x \mid H_1) - \ln p(x \mid H_0) > \gamma'$$
# Compute the log-likelihood ratios for the test data
log_likelihood_ratios = kde_non_setosa.score_samples(X_test) - kde_setosa.score_samples(X_test)
# Set a threshold for classification
threshold = 0.0
# Classify the test data based on the log-likelihood ratios
y_pred = (log_likelihood_ratios > threshold).astype(int)
# Calculate the accuracy
accuracy = np.mean(y_pred == (y_test != 0))
print(f"Accuracy: {accuracy:.2f}")
Output:
Accuracy: 1.00
In this code, we compute the log-likelihood ratios for the test data by subtracting the log-likelihoods under the setosa hypothesis from the log-likelihoods under the non-setosa hypothesis. We then set a threshold of 0.0 and classify each test sample based on whether its log-likelihood ratio exceeds the threshold: if it does, we choose $H_1$ (i.e., reject the null hypothesis $H_0$); if not, we choose the null hypothesis. We achieve perfect separation between the classes.
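As a quick sanity check (optional, and the exact numbers depend on the random split), we can verify this claim by looking at the range of the test-set log-likelihood ratios within each class; with perfect separation, every setosa ratio falls below the threshold and every non-setosa ratio lies above it:

```python
# Largest setosa ratio should be below 0; smallest non-setosa ratio above 0
ratios_setosa = log_likelihood_ratios[y_test == 0]
ratios_non_setosa = log_likelihood_ratios[y_test != 0]
print(f"Setosa:     max log-likelihood ratio = {ratios_setosa.max():.2f}")
print(f"Non-setosa: min log-likelihood ratio = {ratios_non_setosa.min():.2f}")
```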
ROC Curve and Threshold Selection
To confirm our result, we can generate an ROC curve and calculate the area under the curve (AUC). We expect it to be unity.
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y_test != 0, log_likelihood_ratios)
roc_auc = roc_auc_score(y_test != 0, log_likelihood_ratios)
# Plot the ROC curve
plt.figure()
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
Output: [ROC curve plot with AUC = 1.00]
In this code, we compute the ROC curve using the `roc_curve` function from scikit-learn, which takes the true labels (`y_test != 0`) and the log-likelihood ratios as input. We also compute the Area Under the ROC Curve (AUC) using `roc_auc_score` to quantify the overall performance of the GLRT. We see that the AUC is 1, which implies perfect separation of the two classes (hypotheses).
Histogram of the GLRT Statistic under Both Hypotheses
After computing the log-likelihood ratios using `KernelDensity.score_samples`, let's analyze the distribution of the GLRT statistic under both classes (setosa and non-setosa). The GLRT statistic is a function of a random variable (r.v.) and is therefore itself a (derived) random variable, so we can plot its distribution under each hypothesis.
# Compute the log-likelihood ratios for the training data
log_likelihood_ratios_train_setosa = kde_non_setosa.score_samples(X_train_setosa) - kde_setosa.score_samples(X_train_setosa)
log_likelihood_ratios_train_non_setosa = kde_non_setosa.score_samples(X_train_non_setosa) - kde_setosa.score_samples(X_train_non_setosa)
# Plot the distribution of log-likelihood ratios for both classes
plt.figure()
plt.hist(log_likelihood_ratios_train_setosa, bins=20, alpha=0.5, density=True, label='Setosa')
plt.hist(log_likelihood_ratios_train_non_setosa, bins=20, alpha=0.5, density=True, label='Non-Setosa')
plt.xlabel('Log-Likelihood Ratio')
plt.ylabel('Density')
plt.title('Distribution of Log-Likelihood Ratios')
plt.legend()
plt.show()
Output: [histograms of the log-likelihood ratios for the setosa and non-setosa classes]
In this code, we compute the log-likelihood ratios for the training data of both classes (setosa and non-setosa) using `KernelDensity.score_samples`. We then plot the distributions of the log-likelihood ratios using histograms with a density scale.
By examining the plot, we can gain insight into how well the GLRT separates the two classes. Ideally, we would expect the distributions of the log-likelihood ratios to be well separated for effective classification; significant overlap between the distributions would indicate that the GLRT may have difficulty distinguishing between the classes. Here there is no overlap, and hence perfect classification can be achieved (for a threshold of 0).
To further quantify the separation between the distributions, we can compute summary statistics such as the mean and standard deviation of the log-likelihood ratios for each class.
# Compute summary statistics of log-likelihood ratios for each class
mean_setosa = np.mean(log_likelihood_ratios_train_setosa)
std_setosa = np.std(log_likelihood_ratios_train_setosa)
mean_non_setosa = np.mean(log_likelihood_ratios_train_non_setosa)
std_non_setosa = np.std(log_likelihood_ratios_train_non_setosa)
print(f"Setosa - Mean: {mean_setosa:.2f}, Standard Deviation: {std_setosa:.2f}")
print(f"Non-Setosa - Mean: {mean_non_setosa:.2f}, Standard Deviation: {std_non_setosa:.2f}")
Output:
Setosa - Mean: -11.59, Standard Deviation: 1.62
Non-Setosa - Mean: 28.46, Standard Deviation: 14.73
By comparing the means and standard deviations of the log-likelihood ratios for each class, we can assess the separation between the distributions. A larger difference in means and smaller standard deviations indicate better separation and potentially higher classification accuracy. Even though the standard deviation of the non-setosa class is large, the means are so far apart that the classes can be perfectly separated.
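One common way to collapse this comparison into a single number is a d-prime-style statistic: the difference of the means scaled by the pooled standard deviation. A minimal sketch using the quantities computed above (with the printed values, this works out to roughly 3.8, i.e., the means sit almost four pooled standard deviations apart):

```python
# Separation index: difference of means over the pooled standard deviation
d_prime = abs(mean_non_setosa - mean_setosa) / np.sqrt((std_setosa**2 + std_non_setosa**2) / 2)
print(f"d-prime separation index: {d_prime:.2f}")
```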
Generalized Likelihood Ratio Test for Non-Setosa Iris Species Classification
We will now explore the application of the Generalized Likelihood Ratio Test (GLRT) for separating the two non-setosa classes (versicolor and virginica) in the Iris dataset. We will estimate the likelihood functions using Kernel Density Estimation (KDE), analyze the distribution of the GLRT statistic under both classes, and generate an interpolated ROC curve to evaluate the performance of the GLRT.
Kernel Density Estimation (KDE)
To estimate the likelihood functions of the versicolor and virginica classes, we will use Kernel Density Estimation (KDE) with a Gaussian kernel and a bandwidth of 0.5.
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KernelDensity
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Select only the non-setosa classes (versicolor and virginica)
X_non_setosa = X[y != 0]
y_non_setosa = y[y != 0] - 1
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_non_setosa, y_non_setosa, test_size=0.3, random_state=42)
# Separate the training data into versicolor and virginica classes
X_train_versicolor = X_train[y_train == 0]
X_train_virginica = X_train[y_train == 1]
# Estimate the likelihood functions using KDE
kde_versicolor = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X_train_versicolor)
kde_virginica = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X_train_virginica)
GLRT and Its Distribution
To analyze the distribution of the GLRT statistic under both classes, we will compute the log-likelihood ratios for the training data and plot smoothed kernel density estimates of their distributions.
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
# Compute the log-likelihood ratios for the training data
log_likelihood_ratios_train_versicolor = kde_virginica.score_samples(X_train_versicolor) - kde_versicolor.score_samples(X_train_versicolor)
log_likelihood_ratios_train_virginica = kde_virginica.score_samples(X_train_virginica) - kde_versicolor.score_samples(X_train_virginica)
# Estimate the kernel density for each class
kde_versicolor_dist = gaussian_kde(log_likelihood_ratios_train_versicolor)
kde_virginica_dist = gaussian_kde(log_likelihood_ratios_train_virginica)
# Generate points for plotting the smooth distributions
x_plot = np.linspace(min(log_likelihood_ratios_train_versicolor.min(), log_likelihood_ratios_train_virginica.min()),
max(log_likelihood_ratios_train_versicolor.max(), log_likelihood_ratios_train_virginica.max()), 100)
# Plot the smoothed distributions of log-likelihood ratios for both classes
plt.figure()
plt.plot(x_plot, kde_versicolor_dist(x_plot), label='Versicolor')
plt.plot(x_plot, kde_virginica_dist(x_plot), label='Virginica')
plt.xlabel('Log-Likelihood Ratio')
plt.ylabel('Density')
plt.title('Smoothed Distributions of Log-Likelihood Ratios')
plt.legend()
plt.show()
Output: [smoothed distributions of the log-likelihood ratios for versicolor and virginica]
By examining the smoothed distributions of the log-likelihood ratios, we can observe the overlap between the versicolor and virginica classes. The overlap indicates that the GLRT may have difficulty perfectly separating these two classes.
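One simple way to put a number on this overlap is the overlapping coefficient: the integral of the pointwise minimum of the two estimated densities (0 means disjoint distributions, 1 means identical). A sketch, reusing the `x_plot` grid and the `gaussian_kde` objects from above:

```python
from scipy.integrate import trapezoid

# Overlapping coefficient: area under the pointwise minimum of the two densities
overlap = trapezoid(np.minimum(kde_versicolor_dist(x_plot),
                               kde_virginica_dist(x_plot)), x_plot)
print(f"Overlapping coefficient: {overlap:.3f}")
```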
ROC Curve and Threshold Selection
To evaluate the performance of the GLRT in separating the versicolor and virginica classes, we will generate an interpolated ROC curve.
from sklearn.metrics import roc_curve, roc_auc_score
import numpy as np
# Compute the log-likelihood ratios for the test data
log_likelihood_ratios_test = kde_virginica.score_samples(X_test) - kde_versicolor.score_samples(X_test)
# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, log_likelihood_ratios_test)
roc_auc = roc_auc_score(y_test, log_likelihood_ratios_test)
# Interpolate the ROC curve
interp_fpr = np.linspace(0, 1, 100)
interp_tpr = np.interp(interp_fpr, fpr, tpr)
interp_tpr[0] = 0.0
# Plot the interpolated ROC curve
plt.figure()
plt.plot(interp_fpr, interp_tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Interpolated Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
Output: [interpolated ROC curve with AUC = 0.97]
The interpolated ROC curve provides a smoother representation of the GLRT's performance in separating the versicolor and virginica classes, and the Area Under the ROC Curve (AUC = 0.97) quantifies its overall performance. We see that we should choose an operating point with an FPR of about 0.07, which gives a TPR close to 1. We will calculate the corresponding threshold using the Neyman-Pearson theorem.
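Before doing that analytically, note that a purely empirical alternative is to read the threshold directly off the ROC arrays computed above, at the largest FPR that does not exceed the target. This relies on the finite test set, whereas the approach in the next section derives the threshold from the estimated null distribution:

```python
# Empirical threshold: largest operating point with FPR <= 0.07
target_fpr = 0.07
idx = np.where(fpr <= target_fpr)[0][-1]
print(f"Threshold {thresholds[idx]:.2f} gives FPR = {fpr[idx]:.2f}, TPR = {tpr[idx]:.2f}")
```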
Neyman-Pearson Theorem and Constant False Alarm Rate
The Neyman-Pearson theorem is a fundamental result in statistical hypothesis testing that provides a criterion for selecting the optimal decision rule based on a fixed probability of false alarm (Type I error). The theorem states that the most powerful test for a given probability of false alarm is the likelihood ratio test.
In the context of binary hypothesis testing, the Neyman-Pearson theorem allows us to design a test that maximizes the probability of detection (power) while keeping the probability of false alarm (Type I error) at a desired level. This is known as the constant false alarm rate (CFAR) criterion.
The CFAR approach involves setting a threshold for the likelihood ratio test based on a specified false alarm probability. The threshold is chosen such that the integral of the probability density function (PDF) of the test statistic under the null hypothesis (in this case, the "versicolor" class) from the threshold to infinity equals the desired false alarm probability.
Mathematically, if we denote the PDF of the test statistic under the null hypothesis as $p_0(t)$ and the desired false alarm probability as $\alpha$, the threshold $\gamma$ is determined by solving the equation:

$$\int_{\gamma}^{\infty} p_0(t)\, dt = \alpha$$

By setting the threshold in this way, we ensure that the probability of falsely rejecting the null hypothesis (classifying a sample as "virginica" when it actually belongs to the "versicolor" class) is controlled at the constant specified level $\alpha$.
Now, let's proceed with the code to find the threshold that achieves a false alarm probability of 0.07 using the Neyman-Pearson criteria.
from scipy.optimize import fsolve
# Set the desired false alarm probability
alpha = 0.07
# Define the function to solve for the threshold
def threshold_equation(x):
return kde_versicolor_dist.integrate_box_1d(x, np.inf) - alpha
# Find the threshold using fsolve
threshold = fsolve(threshold_equation, 0.0)[0]
print(f"Threshold for a false alarm probability of {alpha:.2f}: {threshold:.2f}")
# Plot the smoothed distributions and the threshold
plt.figure()
plt.plot(x_plot, kde_versicolor_dist(x_plot), label='Versicolor')
plt.plot(x_plot, kde_virginica_dist(x_plot), label='Virginica')
plt.axvline(x=threshold, color='r', linestyle='--', label=f'Threshold ({threshold:.2f})')
plt.fill_between(x_plot, kde_versicolor_dist(x_plot), where=(x_plot >= threshold), alpha=0.3, color='red')
plt.xlabel('Log-Likelihood Ratio')
plt.ylabel('Density')
plt.title(f'Smoothed Distributions and Threshold for False Alarm Probability of {alpha:.2f}')
plt.legend()
plt.show()
Output:
Threshold for a false alarm probability of 0.07: -0.46
In this code, we followed these steps:
- We set the desired false alarm probability `alpha` to 0.07.
- We define a function `threshold_equation` that represents the equation we need to solve for the threshold: it computes the integral of the "versicolor" PDF from the threshold to infinity and subtracts the desired false alarm probability.
- We use `fsolve` from SciPy's `optimize` module to solve the equation and find the threshold value.
- We print the threshold value.
- We plot the smoothed distributions of the log-likelihood ratios for both classes and the threshold as a vertical line.
- We fill the area under the "versicolor" PDF from the threshold to infinity with a translucent red color to visualize the false alarm region.
By finding the threshold using the Neyman-Pearson criterion, we ensure that the probability of false alarm is controlled at the specified level (0.07 in this case) while maximizing the probability of detection for the "virginica" class. Now let's run the classification with this threshold and analyze the results.
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import seaborn as sns
# Classify the test data based on the log-likelihood ratios and the threshold
y_pred = (log_likelihood_ratios_test > threshold).astype(int)
# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Create a confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Plot the confusion matrix using seaborn heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, cmap='Blues', fmt='d', cbar=False,
xticklabels=['Versicolor', 'Virginica'],
yticklabels=['Versicolor', 'Virginica'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title(f'Confusion Matrix (Accuracy: {accuracy:.2f})')
plt.show()
Output:
Accuracy: 0.93
Explanation:
- We classify the test data based on the log-likelihood ratios (`log_likelihood_ratios_test`) and the threshold obtained from the Neyman-Pearson criterion. If the log-likelihood ratio is greater than the threshold, the sample is classified as "virginica" (assigned a value of 1); otherwise it is classified as "versicolor" (assigned a value of 0).
- We calculate the accuracy using the `accuracy_score` function from scikit-learn, which compares the predicted labels (`y_pred`) with the true labels (`y_test`).
- We create a confusion matrix using the `confusion_matrix` function from scikit-learn, which compares the true labels (`y_test`) with the predicted labels (`y_pred`).
- We plot the confusion matrix using seaborn's `heatmap` function with the following parameters: `annot=True` to display the value in each cell, `cmap='Blues'` to set the color scheme to shades of blue, `fmt='d'` to format the values as integers, `cbar=False` to remove the color bar, and `xticklabels`/`yticklabels` to set the labels for the x-axis and y-axis.
- We set the axis labels using `plt.xlabel` and `plt.ylabel`, and set the title of the plot to "Confusion Matrix", including the accuracy via an f-string.
- Finally, we display the plot using `plt.show()`.
The confusion matrix provides a visual representation of the model's performance, showing the number of true positive, true negative, false positive, and false negative predictions. The accuracy is displayed in the title of the plot.
By using the threshold obtained from the Neyman-Pearson criterion, we ensure that the probability of false alarm is controlled at the specified level while maximizing the probability of detection for the "virginica" class. The confusion matrix and accuracy help us assess the classifier's performance at this threshold.
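As a final sanity check, we can read the empirical false alarm rate straight from the confusion matrix (with only a handful of versicolor samples in the test set, this estimate is noisy, so it will only approximately match the design value):

```python
# False alarms: true versicolor (row 0) predicted as virginica (column 1)
empirical_fpr = cm[0, 1] / cm[0].sum()
print(f"Empirical false alarm rate: {empirical_fpr:.2f} (design value: {alpha:.2f})")
```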
Conclusion
In this post, we explored the application of the Generalized Likelihood Ratio Test (GLRT) for separating the setosa species from the other two species in the Iris dataset. We estimated the likelihood functions using Kernel Density Estimation (KDE) on the training data and formulated the GLRT based on the likelihood ratio.
We also generated an ROC curve to evaluate the performance of the GLRT and select an appropriate threshold. The ROC curve provides insights into the trade-off between TPR and FPR, allowing us to choose a threshold that balances these metrics according to our specific needs.
Additionally, we explored the application of the Generalized Likelihood Ratio Test (GLRT) for separating the versicolor and virginica classes in the Iris dataset. We estimated the likelihood functions using Kernel Density Estimation (KDE) and analyzed the distribution of the GLRT test under both classes using smoothed kernel density estimates.
The overlap between the distributions of the log-likelihood ratios suggests that the GLRT cannot succeed in perfectly separating the versicolor and virginica classes. We generated an interpolated ROC curve to evaluate the performance of the GLRT, set the threshold and proceeded to analyze the classifier results.
GLRT is a useful approach for classification tasks, even when the classes are not perfectly separable. By understanding the limitations and interpreting the results accordingly, we can make informed decisions and assess the performance of the GLRT in real-world scenarios.