Thoughts on Principal Components Analysis

This is a post with more questions than answers.

I’ve been thinking about Principal Component Analysis (PCA) lately. Suppose we have m measurements each on N units, represented in a matrix $Y \in \mathbb{R}^{N \times m}$, where $m \ll N$. The idea in PCA is that the results of these m measurements are driven by k underlying factors ($k < m$). So each unit i can be described by a vector $a_i \in \mathbb{R}^k$, and each measurement j is affected by the factors in a way described by a vector $b_j \in \mathbb{R}^k$, so that the jth measurement on the ith unit is $a_i^T b_j$. In this case, $Y = A B^T$ has rank k (provided A and B have full column rank), where $A \in \mathbb{R}^{N \times k}$ has $a_i$ as the ith row, and $B \in \mathbb{R}^{m \times k}$ has $b_j$ as the jth row. This representation is not unique, because if C is any invertible matrix in $\mathbb{R}^{k \times k}$, then replacing $A \leftarrow AC$ and $B \leftarrow BC^{-T}$ leads to the same matrix Y, since $(AC)(BC^{-T})^T = A C C^{-1} B^T = AB^T$.
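To make the non-uniqueness concrete, here is a minimal NumPy sketch. The sizes, the random seed, and the particular choice of C are illustrative assumptions, not anything from a real data set.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, k = 1000, 20, 3  # illustrative sizes with k < m << N

# Rows of A are the unit vectors a_i; rows of B are the measurement vectors b_j.
A = rng.normal(size=(N, k))
B = rng.normal(size=(m, k))
Y = A @ B.T  # Y[i, j] = a_i^T b_j

# An invertible k x k matrix (a small perturbation of the identity,
# chosen only so that it is well conditioned).
C = np.eye(k) + 0.1 * rng.normal(size=(k, k))

# Transform the factors: A <- A C, B <- B C^{-T}.
A2 = A @ C
B2 = B @ np.linalg.inv(C).T

rank_Y = np.linalg.matrix_rank(Y)
same_Y = np.allclose(Y, A2 @ B2.T)
print(rank_Y, same_Y)
```

Both factor pairs reproduce the same Y up to floating point error, which is why some normalization is needed before talking about "the" factors.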

So let’s use the (reduced) singular value decomposition (SVD), which represents Y as the product of three matrices, $U \Sigma V^T$, where $U \in \mathbb{R}^{N \times k}$ and $V \in \mathbb{R}^{m \times k}$ have orthonormal columns, and $\Sigma \in \mathbb{R}^{k \times k}$ is diagonal. We’ll define $A = U\Sigma$ and $B = V$.

Now what we’d really like is to recover the vectors $a_i$. These tell us everything we care to know about the units. All we need for this is the matrix V, since $YV = U \Sigma V^T V = U\Sigma = A$, and we can just read off the rows. This is all standard PCA stuff.
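As a quick check of the identity $YV = U\Sigma = A$, here is a small NumPy sketch. The sizes and seed are arbitrary assumptions, and the SVD is truncated to the first k components by hand because `np.linalg.svd` returns min(N, m) of them.

```python
import numpy as np

rng = np.random.default_rng(1)
N, m, k = 500, 12, 3  # illustrative sizes

# A rank-k matrix built from random factors.
Y = rng.normal(size=(N, k)) @ rng.normal(size=(k, m))

# Thin SVD, then keep only the k leading components.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
U, s, Vt = U[:, :k], s[:k], Vt[:k, :]
V = Vt.T

A = U * s  # equivalent to U @ np.diag(s)

recovered = np.allclose(Y @ V, A)        # Y V = U Sigma
reconstructed = np.allclose(A @ V.T, Y)  # A V^T gives back Y
print(recovered, reconstructed)
```

The rows of `Y @ V` are the unit vectors in the normalization where B = V.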

The part that is interesting to me is when we only observe a random sample of the rows of Y, which are in a matrix $X \in \mathbb{R}^{n \times m}$, $n \ll N$. I’m wondering under what circumstances we can learn about k and V from X.

For example, I’m guessing that if $n \gg k$, then we can compute the SVD of X, look at how many singular values are meaningfully greater than zero, and get a good estimate of k. I’m also guessing that the right singular vectors calculated this way are a pretty good estimate of V, at least in the sense of spanning nearly the same subspace. But I’d love to have some statistical rigor behind this.
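Here is a minimal simulation of that guess, under assumptions that are mine rather than part of the setup above: Y is exactly rank k with no measurement noise, the factors are Gaussian, and "meaningfully greater than zero" is taken to mean "above a floating point tolerance". With noisy measurements the cutoff would be a genuinely statistical choice, which is the hard part of the question.

```python
import numpy as np

rng = np.random.default_rng(2)
N, m, k, n = 5000, 15, 4, 200  # illustrative sizes, n << N

# Full (normally unobserved) rank-k matrix and its right singular vectors.
Y = rng.normal(size=(N, k)) @ rng.normal(size=(k, m))
_, _, Vt = np.linalg.svd(Y, full_matrices=False)
V = Vt[:k].T

# Observe a random sample of n rows.
rows = rng.choice(N, size=n, replace=False)
X = Y[rows]

# Estimate k by counting singular values above a numerical tolerance.
_, s_x, Vt_x = np.linalg.svd(X, full_matrices=False)
tol = s_x[0] * max(X.shape) * np.finfo(float).eps
k_hat = int(np.sum(s_x > tol))
V_hat = Vt_x[:k_hat].T

# Individual singular vectors need not match, so compare the subspaces
# they span via the distance between the two projection matrices.
gap = np.linalg.norm(V @ V.T - V_hat @ V_hat.T)
print(k_hat, gap)
```

In this noiseless setting the sampled rows span the same row space as Y, so the subspace is recovered essentially exactly even though the columns of `V_hat` generally differ from those of `V`.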

Question 1: suppose we have a null hypothesis that $k = k_0$ (recall that Y has rank k). How do we use X to calculate a p-value against this null hypothesis? How can we get a confidence interval on k?

Question 2: how can we get a confidence region on V? If we had this, then we could directly calculate a confidence region on XV, whose rows are the vectors $a_i$ for the units on which we observe measurements. How does not knowing k affect this procedure?

I’ve had a glance in Jolliffe’s Principal Component Analysis, and it seems that when the rows of Y have a multivariate normal distribution, and k is known, we can do this sort of thing. But I’m specifically interested in inferences regarding the parameter k. And I’d much prefer a nonparametric approach.

As I said, I have no idea how to do this; it just seems interesting!

Update 2020-11-15

The reason I’ve been thinking about this is that certain aspects of PCA seem a little arbitrary to me. I’m no expert, but the general guidance I’ve seen on selecting the parameter k is based on either a “knee in the curve” in the singular values, or a threshold like getting 80% of the “energy” of the singular values. Typically PCA is used as an unsupervised learning technique. There is no “correct” answer on how many singular values to keep. I was trying to imagine a scenario where there actually is a correct answer, and this was what I came up with.
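For reference, the energy threshold rule can be sketched as below. I am assuming "energy" means the sum of squared singular values, which is one common convention; the function name `k_by_energy` and the example singular values are made up for illustration.

```python
import numpy as np

def k_by_energy(singular_values, threshold=0.8):
    """Smallest k whose leading singular values capture `threshold` of the
    total energy, with energy taken as the sum of squared singular values
    (an assumed convention)."""
    energy = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    # First index where the cumulative fraction reaches the threshold.
    return int(np.searchsorted(cumulative, threshold) + 1)

# Hypothetical singular values, sorted in decreasing order.
s = np.array([10.0, 5.0, 2.0, 1.0, 0.5])
k_80 = k_by_energy(s)
k_99 = k_by_energy(s, threshold=0.99)
print(k_80, k_99)
```

The answer moves with the threshold, which is exactly the arbitrariness described above.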

Bob Wilson
Marketing Data Scientist

The views expressed on this blog are Bob’s alone and do not necessarily reflect the positions of current or previous employers.