Principal Component Analysis (PCA) is often described by intro machine learning courses as follows:

• PCA is dimensionality reduction.
• PCA is decorrelation of features.
• PCA involves Singular Value Decomposition / matrix factorization: $M = U \Sigma V^T$
• PCA is projecting the image onto a subspace spanned by the top $k$ eigenvectors of the images’ covariance matrix.

That’s a lot of words. I think we can build intuition for all of these ideas from the ground up, with a few animations to help. This post assumes a high-school understanding of statistics and linear algebra - nothing more.

These animations were created using a modified fork of manim - a Python library written by the guy behind 3blue1brown.

# The covariance matrix of an image

What is the covariance matrix of an image? Let’s start with its definition.

Say we have $X$, a single image, flattened from a square $D \times D$. As an example, if we started with an $8 \times 8$ square image, we would flatten to a single row of $1 \times 64$ pixels. We’ll take the second row and just join it up to the end of the first row, etc.

To keep things simple, assume we have no color channels (like RGB). This is a grayscale image, with a single value from 0 to 255 for each pixel. We can then normalize this scale to fall between 0 and 1.¹

An image X could then be represented with something like this:
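As a concrete sketch, here is that flatten-and-normalize step in NumPy (a random-noise image stands in for a real one):

```python
import numpy as np

# A made-up 8x8 grayscale image with integer values in 0..255.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(8, 8))

# Flatten row by row (row 2 joins the end of row 1, etc.) into a
# length-64 vector, then normalize every pixel to lie in [0, 1].
X = img.reshape(-1) / 255.0

print(X.shape)  # (64,)
```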

## Ok, cool. So what’s the covariance matrix?

Let’s ask Wikipedia. For a single pair of pixels $X_i$ and $X_j$, it is:

$$cov(X_i, X_j) = E[(X_i - E[X_i])(X_j - E[X_j])]$$

In general we can write this as a dot product:

## But wait… why do we even have an expectation? Or a $\mu$?

Aren’t we just talking about a single image right now?

Indeed, if we only had a single image, $X_i$ would be exactly equal to $E[X_i]$. And $X$ would be exactly equal to $\mu_X$. It wouldn’t be meaningful to calculate covariance for a single image, just like it wouldn’t be meaningful to compute an average when you only have one sample. So let’s keep going.

# What if we had more $\mathbf{X}$’s?

Let’s say we had many different observations of $X$’s. $N$ observations, to be precise. A single observation of $X$ is $1 \times 64$, but let’s drop the 1 to keep things clean.

So now we have a collection of $X$’s which is $N \times 64$. We’ll call this collection boldface $\mathbf{X}$.

$\mathbf{X}$ is getting complex, so let’s agree on some notation. $\mathbf{X}$ can be described by:

• An index $i$ which runs from 1 to $D^2$, describing each of the $D^2$ pixels.
    • Continuing with our example above, $D^2 = 64$.
    • We usually put $i$ as a subscript: $X_i$.
• An index $n$ running from 1 to $N$, describing each “observation”, or different image.
    • We usually put $n$ as a bracketed superscript: $X^{(n)}$

Ok! Now, $X_3^{(5)}$ means “the value of the 3rd pixel in the 5th image”.²
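In NumPy terms this notation is just array indexing (made-up random data stands in for real images here, and NumPy counts from 0 rather than 1):

```python
import numpy as np

# A hypothetical batch of N=10 flattened 8x8 images: shape (N, D^2).
rng = np.random.default_rng(0)
X = rng.random((10, 64))

# X_3^{(5)}: the 3rd pixel of the 5th image.
# NumPy is 0-indexed, so subtract 1 from both indices.
value = X[5 - 1, 3 - 1]
```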

# Calculating $\mu_X$

Now we can meaningfully calculate $E[X_i]$, the expected value for pixel $i$:

$$E[X_i] = \frac{1}{N} \sum_{n=1}^{N} X_i^{(n)}$$

What a horrible way to write a simple idea! All we are saying here is: take the $i$th pixel from all $N$ images, add up their values, and divide by $N$. That gives the average value for pixel $i$. Easy.

If we do this for every pixel, we’ll get 64 different averages. We can pack all of these into a single vector called $\mu_X$:

$$\mu_X = \begin{bmatrix} E[X_1] & E[X_2] & \cdots & E[X_{64}] \end{bmatrix}$$
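In NumPy this whole computation is one call - average over the image axis, leaving one mean per pixel (random stand-in data again):

```python
import numpy as np

# N=10 made-up images, 64 pixels each.
rng = np.random.default_rng(0)
X = rng.random((10, 64))

# Average each pixel over the N images: mu_X has one entry per pixel.
mu = X.mean(axis=0)

print(mu.shape)  # (64,)
```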

# Calculating $cov(\mathbf{X},\mathbf{X})$

Now, $cov(X_i, X_j) = E[(X_i - E[X_i])(X_j - E[X_j])]$ is meaningful too.

For any single picture, note how much its pixel $i$ differs from the average pixel $i$. Also note how much its pixel $j$ differs from the average pixel $j$. Multiply these two values together. Then average that product over all $N$ pictures - this average is $cov(X_i, X_j)$.

Note the following:

• This value is large and positive if whenever pixel i is higher than average, pixel j tends to be higher than average as well.
• This value is large and negative if whenever pixel i is higher than average, pixel j tends to be lower than average.
• This value is close to zero when there’s a pretty even split between these situations. In other words, seeing a high value for pixel i doesn’t give you much useful information about pixel j - it’s higher as often as it is lower.
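We can check this definition numerically. The sketch below computes the covariance of one pixel pair by hand and compares it against NumPy’s `np.cov` (with `bias=True` so both divide by $N$; the data is random stand-in for real images):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 64))   # made-up batch of 100 flattened images
mu = X.mean(axis=0)

i, j = 0, 1
# Average, over all images, of (deviation of pixel i) * (deviation of pixel j).
cov_ij = np.mean((X[:, i] - mu[i]) * (X[:, j] - mu[j]))

# Same quantity via the full covariance matrix.
C = np.cov(X, rowvar=False, bias=True)
assert np.isclose(cov_ij, C[i, j])
```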

# If we think of the covariance matrix as a transformation, what does it do?

We established above that a given $cov(X_i, X_j)$ gives us a measure of how “similar” two pixels tend to be. If pixel i is dark, does pixel j tend to be dark as well? Or does it tend to be light? Or does it give us no information at all?

The first row of the covariance matrix, then, is all of the $cov(X_i, X_j)$’s for $i=1$:

$$\begin{bmatrix} cov(X_1, X_1) & cov(X_1, X_2) & \cdots & cov(X_1, X_{64}) \end{bmatrix}$$

We can then imagine that the first row of the covariance matrix, dotted with an input image, gives us a sum that represents how much we should expect the first pixel of the output to be “on”, considering all the other pixels in the image. Same for the second row and the second pixel. And the third row and third pixel. Etc etc.

When we use the covariance matrix to transform an image, we do the following: “Given an image, look at every single pixel and use that information to darken or lighten other pixels. Then show the collective result.”
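A quick sketch of this row-by-row picture, using a covariance matrix built from random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 64))                # made-up batch of images
C = np.cov(X, rowvar=False, bias=True)   # 64x64 covariance matrix

img = rng.random(64)   # a made-up input image, flattened
out = C @ img          # transform the image with the covariance matrix

# Output pixel 0 is row 0 of C dotted with the image: a covariance-weighted
# sum saying how "on" pixel 0 should be, given every other pixel.
assert np.isclose(out[0], C[0] @ img)
```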

The covariance matrix for a set of images encodes patterns within those images. I almost think of them as children who have only learned to draw specific things. Imagine Child 1 has only ever seen smiley faces, and Child 2 has only ever seen cartoon cats. If you present them both with a picture of a circle and ask them to complete what is missing, Child 1 will add eyes and a mouth, while Child 2 will add cat ears and whiskers.

# So what are the eigenvectors of the covariance matrix?

Here’s some basic intuition about eigenvectors that we need:

• A matrix is a transformation.
• An eigenvector is able to pass through that transformation without changing direction - only magnitude.
• The corresponding eigenvalue is that change in magnitude.

(If you don’t have this intuition, stop right now and go watch this amazing video by 3blue1brown.)

So go back to the covariance matrix we found in the previous step. What are its eigenvectors?

They are images that are able to pass through the matrix without changing direction - only magnitude.

## “…what do you mean by change direction?”

To be fair, the human brain is bad at visualizing things in more than 3 dimensions. So let’s use images with only 3 pixels.

Here are the values [0, 0.2, 0.1], [0, 0.4, 0.2], and [0, 1.0, 0.5]. These all point in the same direction, as you can see on the 3d plot - they only change in magnitude:

[picture here]

Here are the values [0, 1, 1] and [1, 1, 0]. These do not point in the same direction, even though they have the same magnitude:

So in the context of a picture, “not changing direction” means the same pixels stay on. “Change in magnitude” means the pixels can get brighter or darker, but they maintain their proportions to each other.
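We can verify this with NumPy: two vectors point in the same direction exactly when, after scaling each to unit length, they are equal.

```python
import numpy as np

a = np.array([0.0, 0.2, 0.1])
b = np.array([0.0, 1.0, 0.5])   # same direction as a, bigger magnitude
c = np.array([0.0, 1.0, 1.0])   # same magnitude as [1, 1, 0], different direction

def direction(v):
    # Scale to unit length: magnitude is discarded, only direction remains.
    return v / np.linalg.norm(v)

print(np.allclose(direction(a), direction(b)))  # True: only magnitude differs
print(np.allclose(direction(a), direction(c)))  # False: the direction changed
```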

# So that’s why we use eigenvectors…

Now it should be a little more clear: the eigenvectors of the covariance matrix are images which are able to pass through that matrix and only get lighter or darker - the pattern itself does not change.

Specifically, an eigenvector with an eigenvalue greater than $1$ will get longer - or in the context of an image, its pixels only become more intense.

We can imagine that if we’ve calculated our covariance matrix on a specific set of images - i.e. only a certain digit - the eigenvectors will encode structures that are very representative of that digit.

In fact, we should be able to add a bunch of these eigenvectors up, in certain quantities, and get something that looks very much like our original image!
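Here’s a sketch that finds those eigenvectors with `np.linalg.eigh` (NumPy’s eigensolver for symmetric matrices) and checks both claims - eigenvectors only get rescaled, and a centered image is exactly a weighted sum of all of them. Random data stands in for real images:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 64))                 # made-up batch of flattened images
mu = X.mean(axis=0)
C = np.cov(X, rowvar=False, bias=True)

# eigh handles symmetric matrices; eigenvalues come back in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)

v = eigvecs[:, -1]   # eigenvector with the largest eigenvalue
# Passing through the matrix only rescales it by its eigenvalue:
assert np.allclose(C @ v, eigvals[-1] * v)

# A centered image is a weighted sum of ALL 64 eigenvectors:
img = X[0] - mu
weights = eigvecs.T @ img
assert np.allclose(eigvecs @ weights, img)
```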

# Decorrelation and dimensionality reduction

The covariance matrix was created by matrix-multiplying the centered data with its own transpose, so it is real and symmetric:

$$cov(\mathbf{X}, \mathbf{X}) = \frac{1}{N} (\mathbf{X} - \mu_X)^T (\mathbf{X} - \mu_X)$$

i.e. the first row is equal to the first column, second row to the second column, etc.

Eigenvectors of real symmetric matrices are (or can always be chosen to be) orthogonal to each other. Again, it’s hard to imagine a 90 degree angle in a space with more than 3 dimensions, but all it means is that there is no redundant information in the eigenvectors.

If we’re trying to construct our image by adding up a bunch of eigenvectors, each additional one that we throw in adds information that was not present in the others.

This is what it means for PCA to decorrelate features: by using eigenvectors as our new basis for representing the image, we automatically ensure that the basis vectors will be orthogonal to each other.
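We can confirm the orthogonality claim numerically: the matrix of all pairwise dot products of the eigenvectors should be the identity (1s for each eigenvector with itself, 0s everywhere else). Random stand-in data again:

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.cov(rng.random((200, 64)), rowvar=False, bias=True)
_, eigvecs = np.linalg.eigh(C)

# Every pair of distinct eigenvectors is at 90 degrees (dot product 0),
# and each eigenvector has unit length (dot product 1 with itself).
gram = eigvecs.T @ eigvecs
assert np.allclose(gram, np.eye(64))
```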

And what about dimensionality reduction? A $K \times K$ covariance matrix will have $K$ eigenvectors, but you certainly don’t need to use them all to get a decent reconstruction. Here’s a digit reconstructed with 1, 2, 3, and 4 eigenvectors:

[image]
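A sketch of that reconstruction with the top $k$ eigenvectors, for $k = 1$ through $4$. Random data stands in for real digits here, so the reconstructions won’t look like anything, but the key property still holds: each extra eigenvector can only shrink the reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 64))             # made-up batch of flattened images
mu = X.mean(axis=0)
C = np.cov(X, rowvar=False, bias=True)
eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order

img = X[0] - mu
errs = []
for k in (1, 2, 3, 4):
    top = eigvecs[:, -k:]             # the k largest-eigenvalue eigenvectors
    recon = top @ (top.T @ img)       # project onto the subspace, map back
    errs.append(np.linalg.norm(recon - img))

# Each added eigenvector can only improve (or keep) the reconstruction;
# with all 64 of them the error would hit zero.
assert all(e2 <= e1 + 1e-12 for e1, e2 in zip(errs, errs[1:]))
```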

(svd)

# How much of each eigenvector do we use?

(projection onto subspace)

1. Usually, $0$ represents black while $255$ represents white. For simplicity we flip this convention in this post: $1$ is a completely black pixel.

2. Note - this boldface $\mathbf{X}$ notation to represent a batch of datapoints $X$ is common but certainly not standard. ML literature notation tends to be all over the place. Always read carefully.