SVD

Understand the variables as vectors rather than scalars.

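SVD gives another route to the principal components. A minimal sketch with NumPy's np.linalg.svd (the data matrix below is made up for illustration): the right singular vectors of the centered data matrix are the principal directions, and the squared singular values divided by n − 1 are the eigenvalues of the covariance matrix.

import numpy as np

# Hypothetical data matrix: rows = samples (vectors), columns = features
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
Z = X - X.mean(axis=0)                      # center each feature

U, s, Vt = np.linalg.svd(Z, full_matrices=False)
print("Principal directions:\n", Vt.T)      # columns = eigenvectors of the covariance matrix
print("Eigenvalues:", s**2 / (len(X) - 1))  # squared singular values / (n - 1)
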
Step-by-step process to calculate eigenvalues in PCA
✅ Step 1: Standardize the data
Before applying PCA, standardize your data to have mean = 0 and standard deviation = 1 for each feature (unless already standardized).
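A quick sketch of this step in plain NumPy, on a made-up matrix X: subtract each feature's mean and divide by its standard deviation.

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])  # hypothetical data
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # each column now has mean 0, std 1
print(Z.mean(axis=0), Z.std(axis=0, ddof=1))      # ~[0, 0] and [1, 1]
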

✅ Step 2: Compute the covariance matrix
Let the standardized data matrix be Z (with shape n×p, where n is the number of samples and p the number of features). The covariance matrix is:

C = (1/(n − 1)) ZᵀZ

C will be a p×p matrix.
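To make the formula concrete, a small check (on the same kind of made-up data) that (1/(n − 1)) ZᵀZ matches what np.cov returns:

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])  # hypothetical data
Z = X - X.mean(axis=0)                 # centered data matrix (n x p)

C_manual = Z.T @ Z / (len(Z) - 1)      # C = (1/(n-1)) Z^T Z, a p x p matrix
C_numpy = np.cov(Z, rowvar=False)      # NumPy's sample covariance (also divides by n-1)
print(np.allclose(C_manual, C_numpy))  # True
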
✅ Step 3: Compute eigenvalues and eigenvectors
You now solve the eigenvalue problem for the covariance matrix:

C v = λ v

- λ are the eigenvalues (scalars)
- v are the eigenvectors (principal components)

The eigenvalues are the solutions to the characteristic equation:

det(C − λI) = 0

Use numerical methods (e.g., in Python with NumPy) to compute these.
💡 Interpretation
- Each eigenvalue λᵢ indicates the variance explained by the corresponding principal component.
- The sum of all eigenvalues equals the total variance in the dataset.
- Larger eigenvalues → more important principal components.
import numpy as np

# Example data (rows = samples, columns = features)
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

# Step 1: Center the data (subtract each feature's mean)
X_centered = X - np.mean(X, axis=0)

# Step 2: Covariance matrix (np.cov divides by n-1 by default)
cov_matrix = np.cov(X_centered, rowvar=False)

# Step 3: Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print("Eigenvalues:", eigenvalues)
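As a short continuation of the example above (reusing the eigenvalues variable), each eigenvalue divided by the total gives the fraction of variance explained by that component, matching the interpretation bullets:

# Continuing the example: fraction of total variance per principal component
explained_ratio = eigenvalues / eigenvalues.sum()
print("Explained variance ratio:", explained_ratio)
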
The reason we divide by n − 1 instead of n in Step 2 (when computing the covariance matrix in PCA) is that we're computing the sample covariance matrix, not the population covariance matrix.
🔍 Here's the key difference:
- Divide by n → population covariance (when you have all possible data points).
- Divide by n − 1 → sample covariance (when you're working with a sample drawn from a larger population).
📌 Why divide by n − 1 for samples?
This adjustment is known as Bessel's correction, and it's used to make the sample variance (and hence the sample covariance) an unbiased estimator of the true population variance/covariance.
Without Bessel's correction:

C = (1/n) ZᵀZ

This tends to underestimate the true covariance.
With Bessel's correction:

C = (1/(n − 1)) ZᵀZ

This is unbiased, meaning the expected value of the sample covariance equals the true population covariance.
🧠 In PCA:
Since PCA is usually applied to a sample dataset (not the entire population), we use:

C = (1/(n − 1)) ZᵀZ

This ensures that the eigenvalues correctly reflect the variance structure of the sample data.
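A small simulation sketch of this bias, drawing many samples from a made-up population with known variance 1: the divide-by-n estimate comes out low on average, while the divide-by-(n − 1) estimate is on target.

import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 100_000
samples = rng.normal(0.0, 1.0, size=(trials, n))  # population variance = 1

print(samples.var(axis=1, ddof=0).mean())  # divide by n:   ~0.8 (biased low)
print(samples.var(axis=1, ddof=1).mean())  # divide by n-1: ~1.0 (unbiased)
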



Rotating a complex number by 90 degrees means multiplying it by i: (4 + i) · i = 4i + i² = −1 + 4i.

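This is easy to verify with Python's built-in complex numbers (Python writes i as 1j):

z = 4 + 1j       # the complex number 4 + i
print(z * 1j)    # (-1+4j): rotated 90 degrees counterclockwise
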
Scree plot: plot the eigenvalues in descending order; the point where the curve levels off (the "elbow") suggests how many principal components to keep.
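A minimal scree-plot sketch with Matplotlib; the eigenvalues here are roughly those from the PCA example above:

import numpy as np
import matplotlib.pyplot as plt

eigenvalues = np.array([1.284, 0.049])    # e.g. from the PCA example above
order = np.argsort(eigenvalues)[::-1]     # sort in descending order

plt.plot(range(1, len(eigenvalues) + 1), eigenvalues[order], "o-")
plt.xlabel("Principal component")
plt.ylabel("Eigenvalue (variance explained)")
plt.title("Scree plot")
plt.show()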