The Use of Multiple Measurements in Taxonomic Problems, by Ronald Fisher, 1936 — A short tutorial

Introduction:

In the introduction, Fisher highlights the importance of using multiple measurements to discriminate between populations or species. He mentions previous applications of this idea in craniometry and analyzing secular trends. The main goal of the paper is to illustrate how linear functions of multiple measurements can be used to maximize the distinction between groups, focusing on a taxonomic problem involving iris flower measurements.

To follow along with the examples in Python, we'll first import the necessary libraries:

import numpy as np
import pandas as pd
from scipy import linalg
from scipy.stats import f, norm
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Construct the variables for setosa and versicolor
setosa = X[y == 0]  # Data for Iris setosa
versicolor = X[y == 1]  # Data for Iris versicolor

# Optionally, print the shapes of the arrays to verify
print("Setosa data shape:", setosa.shape)
print("Versicolor data shape:", versicolor.shape)

Output:

Setosa data shape: (50, 4)
Versicolor data shape: (50, 4)

Arithmetic Procedure:

Fisher presents three tables of data:

  • Table I: Measurements of 4 flower attributes for 50 plants each of 2 iris species (setosa and versicolor); previewed below
  • Table II: Observed means and differences between species for each attribute
  • Table III: Sums of squares and cross-products of deviations from species means (with 98 degrees of freedom). This is what we would now call the pooled within-class scatter matrix.
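
For a quick preview of the raw measurements in Table I (the feature names come from the sklearn dataset object):

# Preview Fisher's Table I: raw measurements for Iris setosa
print(pd.DataFrame(setosa, columns=iris.feature_names).head())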

Table I is already loaded from the dataset; let's create the rest of the tables in Python:

# Table II
setosa_means = np.mean(setosa, axis=0)
versicolor_means = np.mean(versicolor, axis=0)
d = versicolor_means - setosa_means

# Table III
# Note: np.cov normalizes by n - 1 = 49, whereas Fisher's Table III holds
# raw sums of squares and products; this only changes the scale of the
# solution below, not its direction
S_setosa = np.cov(setosa, rowvar=False)
S_versicolor = np.cov(versicolor, rowvar=False)
S = S_setosa + S_versicolor  # pooled within-class scatter matrix (up to scale)
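
To mirror the layout of Fisher's Table II, the means and their difference can be tabulated (the column labels here are our own):

# Table II as a DataFrame: species means and their difference per attribute
table2 = pd.DataFrame(
    {"Setosa mean": setosa_means,
     "Versicolor mean": versicolor_means,
     "Difference": d},
    index=iris.feature_names,
)
print(table2)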

The main objective is to find a linear function X = λ1x1 + λ2x2 + λ3x3 + λ4x4 that maximizes the ratio of the difference between species means to the within-species standard deviation. This leads to solving a system of linear equations for the λ coefficients:
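
Concretely (a standard way to derive this): the difference between the projected species means is λᵀd, and the within-species variance of X is proportional to λᵀSλ, so we maximize the ratio

    (λᵀd)² / (λᵀSλ).

Setting its gradient with respect to λ to zero gives Sλ ∝ d, i.e. λ = S⁻¹d up to an arbitrary scale factor. In code: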

# Solve S @ lam = d for the discriminant coefficients
lam = linalg.solve(S, d)
print("Discriminant function coefficients:")
print(lam)

Output:

Discriminant function coefficients:
[-1.52638511 -9.01147968 10.88309735 15.42208247]
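
As a quick cross-check (not in the paper), sklearn's LinearDiscriminantAnalysis should recover the same direction, since for two classes its coefficient vector is proportional to S⁻¹d; normalizing removes the arbitrary scale:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Fit LDA on the two species used by Fisher
lda = LinearDiscriminantAnalysis()
lda.fit(np.vstack([setosa, versicolor]), np.repeat([0, 1], 50))

# Both lines should print the same normalized coefficients
print(lda.coef_[0] / lda.coef_[0, 0])
print(lam / lam[0])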

So, the discriminating linear function is (after normalization):

X = x1 + 5.90 x2 - 7.13 x3 - 10.10 x4.
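
To reproduce this normalization, divide the coefficient vector by its first component (the overall scale of λ is arbitrary):

# Normalize so the coefficient of x1 equals 1
lam_norm = lam / lam[0]
print("Normalized coefficients:", lam_norm)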

Let’s compare it with the result Fisher presented in the paper:

[Image: the discriminant function as reported in Fisher's paper]

Question to the reader: why are the unnormalized hyperplane coefficients different from those in the paper, while the normalized ones are the same?
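
As a hint, consider the scaling: np.cov divides by n - 1 = 49, while Fisher worked with raw sums of squares and products. Rescaling S rescales λ by the reciprocal factor, which the normalization cancels:

# Solving with the raw sums-of-squares matrix only rescales the solution
S_ss = 49 * S                  # undo np.cov's 1/(n-1) normalization
lam_ss = linalg.solve(S_ss, d)
print(np.allclose(49 * lam_ss, lam))                  # True: scale changed
print(np.allclose(lam_ss / lam_ss[0], lam / lam[0]))  # True: direction unchanged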

Plotting the Histograms:

## Plot histograms
# Project all data points onto the discriminant direction; the sign is
# flipped so the species appear in the same order as in Fisher's figure.
setosa_proj = -np.dot(setosa, lam)
versicolor_proj = -np.dot(versicolor, lam)
virginica = X[y == 2]  # Data for Iris virginica
virginica_proj = -np.dot(virginica, lam)

# Plot histograms of the LDA values for all three species
plt.figure(figsize=(10, 6))
plt.hist(setosa_proj, bins=10, alpha=0.7, label='Setosa', color='red')
plt.hist(versicolor_proj, bins=10, alpha=0.7, label='Versicolor', color='green')
plt.hist(virginica_proj, bins=10, alpha=0.7, label='Virginica', color='blue')
plt.xlabel('LDA Projection')
plt.ylabel('Frequency')
plt.title('Histogram of LDA Projections for Setosa, Versicolor, and Virginica')
plt.legend(loc='best')
plt.grid(True)
plt.show()

[Figure: histograms of the projected values for the three iris species]

Now let’s compare it to Fisher’s hand-computed histogram:

[Image: Fisher's original histogram of the discriminant values, from the 1936 paper]
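
Fisher's figure shows how cleanly the discriminant separates setosa from versicolor; we can inspect the projected ranges directly (a quick check of our own, not part of the paper's arithmetic):

# Compare the projected ranges of the two species used to fit the direction
print("Setosa:     [%.2f, %.2f]" % (setosa_proj.min(), setosa_proj.max()))
print("Versicolor: [%.2f, %.2f]" % (versicolor_proj.min(), versicolor_proj.max()))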

Conclusion:

Fisher's paper introduces a groundbreaking approach to discriminant analysis, demonstrating how linear combinations of multiple variables can be used to maximize the separation between groups. The ideas presented laid the foundation for many modern techniques in multivariate statistics and machine learning.

The paper clearly explains the core concepts and illustrates them using an interesting taxonomic problem with iris species. Fisher also discusses extensions to testing between-group differences and exploring phylogenetic hypotheses.

Throughout this tutorial, we have provided detailed explanations of the mathematical concepts and corresponding Python code to help readers gain a deeper understanding of Fisher's work. By following along with the code examples, readers can reproduce the analyses and gain hands-on experience with these important statistical techniques.