I’ve been thinking about Principal Component Analysis (PCA) lately. Suppose we
have $p$ measurements each on $n$ units, represented in a matrix $Y \in \mathbb{R}^{n \times p}$, where $n > p$. The idea in PCA is that the results
of these measurements are driven by $k$ underlying factors ($k < p$). So
each unit $i$ can be described by a vector $u_i \in \mathbb{R}^k$, and each
measurement $j$ is affected by the factors in a way described by a vector $v_j \in \mathbb{R}^k$, so that the $j$th measurement on the $i$th unit is $Y_{ij} = u_i \cdot v_j$. In this case, $Y = UV^T$ has rank $k$, where $U \in \mathbb{R}^{n \times k}$ has $u_i$ as the $i$th row, and $V \in \mathbb{R}^{p \times k}$ has
$v_j$ as the $j$th row. This representation is not unique, because if $A$ is
any invertible matrix in $\mathbb{R}^{k \times k}$, then $U' = UA$ and $V' = V(A^{-1})^T$ lead to the same matrix $Y$.
So let’s use the (reduced) singular value decomposition (SVD), which represents
$Y$ as the product of three matrices, $Y = P \Sigma Q^T$, where $P$ and $Q$ have orthonormal
columns, and $\Sigma$ is diagonal. We’ll define $U = P\Sigma$ and $V = Q$.
Now what we’d really like is to recover the vectors $u_i$. This tells us
everything we care to know about the units. All we need for this is the matrix
$V$, since $U = YV$, and we can just read off the
rows $u_i$. This is all standard PCA stuff.
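As a sanity check on this setup, here is a minimal numpy sketch (the sizes $n$, $p$, $k$ are made up for illustration) that builds a rank-$k$ matrix $Y = UV^T$, takes its reduced SVD, and verifies that multiplying by $V$ recovers the factor matrix $U$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n units, p measurements, k underlying factors.
n, p, k = 200, 10, 3

# Build a rank-k matrix Y = U V^T from random factors.
U_true = rng.standard_normal((n, k))
V_true = rng.standard_normal((p, k))
Y = U_true @ V_true.T

# Reduced SVD: Y = P diag(s) Q^T, with P and Q having orthonormal columns.
P, s, Qt = np.linalg.svd(Y, full_matrices=False)

# Only k singular values are nonzero; define U = P Sigma and V = Q.
U = P[:, :k] * s[:k]
V = Qt[:k, :].T

# Since V has orthonormal columns, Y V = U V^T V = U: the rows of Y V
# are exactly the unit vectors u_i.
assert np.allclose(Y @ V, U)
assert np.allclose(U @ V.T, Y)
```

Note that this recovered $U$ differs from `U_true` by exactly the invertible transformation described above; the non-uniqueness is resolved only up to that change of basis.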
The part that is interesting to me is when we only observe a random sample of
the rows of $Y$, which are in a matrix $Z \in \mathbb{R}^{m \times p}$, with $m < n$. I’m wondering under what circumstances we can learn about $k$ and $V$ from
$Z$.
For example, I’m guessing that if $m \gg k$, then we can compute the SVD of
$Z$, look at how many singular values are meaningfully greater than zero, and
get a good estimate of $k$. I’m guessing the right singular vectors we so
calculate are a pretty good estimate of $V$. But I’d love to have some
statistical rigor behind this.
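In the noiseless case this guess checks out mechanically: every row of $Z$ lies in the row space of $Y$, so a large enough sample recovers both the rank and the subspace spanned by the columns of $V$. A small sketch (sizes invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 500, 12, 3

# Rank-k ground truth, as in the factor model above.
Y = rng.standard_normal((n, k)) @ rng.standard_normal((k, p))

# Observe a random sample of m rows, with k << m << n.
m = 40
Z = Y[rng.choice(n, size=m, replace=False)]

# Estimate k: count singular values of Z meaningfully above zero.
sz = np.linalg.svd(Z, compute_uv=False)
k_hat = int(np.sum(sz > 1e-8 * sz[0]))

# Compare the leading right singular vectors of Z and Y: the cosines of
# the principal angles between the two k-dim subspaces should all be ~1.
Vz = np.linalg.svd(Z, full_matrices=False)[2][:k_hat].T
Vy = np.linalg.svd(Y, full_matrices=False)[2][:k].T
cosines = np.linalg.svd(Vz.T @ Vy, compute_uv=False)
```

With observation noise, though, none of the sampled singular values are exactly zero, and "meaningfully greater than zero" stops being a mechanical check; that is where the statistical questions start.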
Question 1: suppose we have a null hypothesis that $k = k_0$ (recall that
$Y$ has rank $k$). How do we use $Z$ to calculate a p-value against this null
hypothesis? How can we get a confidence interval on $k$?
Question 2: how can we get a confidence region on $V$? If we had this, then we
could directly calculate a confidence region on the $u_i$, the unit vectors for
which we observe measurements. How does not knowing $k$ affect this procedure?
I’ve had a glance in Jolliffe’s Principal Component Analysis, and it seems
that when the rows of $Y$ have a multivariate normal distribution, and $k$ is
known, we can do this sort of thing. But I’m specifically interested in
inferences regarding the parameter $k$. And I’d much prefer a nonparametric
approach.
As I said, I have no idea how to do this; it just seems interesting!
Update 2020-11-15
The reason I’ve been thinking about this is that certain aspects of PCA seem a
little arbitrary to me. I’m no expert, but the general guidance I’ve seen on
selecting the parameter $k$ is based on either a “knee in the curve” in the
singular values, or a threshold like getting 80% of the “energy” of the
singular values. Typically PCA is used as an unsupervised learning technique.
There is no “correct” answer on how many singular values to keep. I was trying
to imagine a scenario where there actually is a correct answer, and this was
what I came up with.
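For concreteness, here is what the "80% of the energy" rule looks like in numpy, on a made-up noisy low-rank example (the sizes, noise level, and the 0.80 threshold itself are all arbitrary choices, which is rather the point):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: a rank-3 signal plus a little measurement noise.
n, p, k_true = 300, 15, 3
Y = rng.standard_normal((n, k_true)) @ rng.standard_normal((k_true, p))
Y = Y + 0.05 * rng.standard_normal((n, p))

s = np.linalg.svd(Y, compute_uv=False)

# Keep the smallest number of components whose squared singular values
# account for at least 80% of the total "energy".
energy = np.cumsum(s**2) / np.sum(s**2)
k_energy = int(np.searchsorted(energy, 0.80) + 1)
```

Nothing privileges 0.80 over 0.75 or 0.90, and the answer it gives need not match the true rank of the signal, which is exactly the arbitrariness that made me want a setting with a correct answer.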