Thoughts on Principal Component Analysis
This is a post with more questions than answers.
I’ve been thinking about Principal Component Analysis (PCA) lately. Suppose we have $m$ measurements each on $N$ units, represented in a matrix $Y \in \mathbb{R}^{N \times m}$, where $m \ll N$. The idea in PCA is that the results of these $m$ measurements are driven by $k$ underlying factors ($k < m$). So each unit $i$ can be described by a vector $a_i \in \mathbb{R}^k$, and each measurement $j$ is affected by the factors in a way described by a vector $b_j \in \mathbb{R}^k$, so that the $j$th measurement on the $i$th unit is $a_i^T b_j$. In this case, $Y = A B^T$ has rank at most $k$ (and exactly $k$ when $A$ and $B$ have full column rank), where $A \in \mathbb{R}^{N \times k}$ has $a_i$ as its $i$th row, and $B \in \mathbb{R}^{m \times k}$ has $b_j$ as its $j$th row. This representation is not unique: if $C$ is any invertible matrix in $\mathbb{R}^{k \times k}$, then replacing $A \to A C$ and $B \to B C^{-T}$ yields the same matrix $Y$.
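To make the setup concrete, here’s a minimal numpy sketch of this generative picture (all sizes and names here are hypothetical, not from any real data):

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, k = 10_000, 50, 3          # hypothetical sizes, with k < m << N

A = rng.normal(size=(N, k))      # row i is a_i: the k factors for unit i
B = rng.normal(size=(m, k))      # row j is b_j: how measurement j loads on the factors
Y = A @ B.T                      # Y[i, j] = a_i^T b_j, so rank(Y) <= k
```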
So let’s use the (reduced) singular value decomposition (SVD), which represents $Y$ as the product of three matrices, $U \Sigma V^T$, where $U \in \mathbb{R}^{N \times k}$ and $V \in \mathbb{R}^{m \times k}$ have orthonormal columns, and $\Sigma \in \mathbb{R}^{k \times k}$ is diagonal. We’ll define $A = U \Sigma$ and $B = V$.
Now what we’d really like is to recover the vectors $a_i$. This tells us everything we care to know about the units. All we need for this is the matrix $V$, since $YV = U \Sigma V^T V = U \Sigma = A$, and we can just read off the rows. This is all standard PCA stuff.
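In numpy this looks like the following (continuing the sketch above; note that `np.linalg.svd` returns $V^T$, not $V$):

```python
# Reduced SVD of Y; only the first k singular values are (numerically) nonzero.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
V = Vt[:k].T                     # the k right singular vectors, as columns (m x k)

A_hat = Y @ V                    # equals U[:, :k] * s[:k], i.e. U @ Sigma
# A_hat recovers the a_i only up to the invertible-C ambiguity described
# above: it spans the same column space as A.
assert np.linalg.matrix_rank(np.hstack([A, A_hat])) == k
```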
The part that is interesting to me is when we only observe a random sample of the rows of $Y$, collected in a matrix $X \in \mathbb{R}^{n \times m}$, with $n \ll N$. I’m wondering under what circumstances we can learn about $k$ and $V$ from $X$.
For example, I’m guessing that if $n \gg k$, then we can compute the SVD of $X$, count how many singular values are meaningfully greater than zero, and get a good estimate of $k$. I’m also guessing that the right singular vectors computed this way are a pretty good estimate of $V$. But I’d love to have some statistical rigor behind this.
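Here’s how I’d sanity-check that guess empirically (again just a sketch, with an arbitrary numerical tolerance standing in for “meaningfully greater than zero”):

```python
n = 200                                    # hypothetical sample size, k << n << N
X = Y[rng.choice(N, size=n, replace=False)]

_, s_X, Vt_X = np.linalg.svd(X, full_matrices=False)
k_hat = int(np.sum(s_X > 1e-8 * s_X[0]))   # singular values above a relative tolerance
V_hat = Vt_X[:k_hat].T

# Compare the spans of V and V_hat: the singular values of V^T V_hat are the
# cosines of the principal angles between the two subspaces.
cosines = np.linalg.svd(V.T @ V_hat, compute_uv=False)
print(k_hat, cosines)                      # expect k_hat == k and cosines ~ 1
```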
Question 1: suppose we have a null hypothesis that $k = k_0$ (recall that $Y$ has rank $k$). How do we use $X$ to calculate a p-value against this null hypothesis? How can we get a confidence interval on $k$?
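I don’t have an answer, but to make the question concrete, here is one speculative nonparametric idea in the spirit of Horn’s parallel analysis (the procedure and all names below are my own sketch, not an established test): under $H_0$, the residual of $X$ after removing its top $k_0$ components should have no cross-column structure, so compare its leading singular value to what we get after independently permuting each column of the residual.

```python
def rank_pvalue(X, k0, n_perm=999, seed=1):
    """Speculative permutation test of H0: rank(Y) = k0, in the spirit of
    Horn's parallel analysis. A sketch, not a rigorously justified test;
    assumes k0 < min(X.shape), and with noiseless data the residual is ~0,
    making the test degenerate."""
    rng = np.random.default_rng(seed)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    resid = X - (U[:, :k0] * s[:k0]) @ Vt[:k0]   # strip the top-k0 structure
    stat = s[k0]                 # leading singular value of the residual
    exceed = 0
    for _ in range(n_perm):
        # Permute each column independently to break cross-column structure.
        perm = np.column_stack([rng.permutation(col) for col in resid.T])
        if np.linalg.svd(perm, compute_uv=False)[0] >= stat:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)
```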
Question 2: how can we get a confidence region on $V$? If we had this, then we could directly calculate a confidence region on $XV$, the factor vectors for the units whose measurements we actually observe. How does not knowing $k$ affect this procedure?
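Similarly speculative: a row bootstrap would at least measure the sampling variability of the estimated subspace, even if it falls short of a calibrated confidence region. Principal angles between the resampled and full-sample estimates seem like a natural yardstick, since $V$ is only identified up to rotation within its column span. A sketch:

```python
def bootstrap_subspace_spread(X, k, n_boot=500, seed=2):
    """Speculative row bootstrap: resample units, re-estimate the top-k right
    singular subspace, and record the largest principal angle to the
    full-sample estimate. Measures spread, not a calibrated confidence region."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    V_full = np.linalg.svd(X, full_matrices=False)[2][:k].T
    angles = []
    for _ in range(n_boot):
        Xb = X[rng.integers(0, n, size=n)]          # resample rows with replacement
        Vb = np.linalg.svd(Xb, full_matrices=False)[2][:k].T
        cos = np.linalg.svd(V_full.T @ Vb, compute_uv=False)
        angles.append(np.arccos(np.clip(cos.min(), -1.0, 1.0)))
    return np.quantile(angles, 0.95)   # e.g. 95th percentile of the largest angle
```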
I’ve had a glance at Jolliffe’s Principal Component Analysis, and it seems that when the rows of $Y$ have a multivariate normal distribution, and $k$ is known, we can do this sort of thing. But I’m specifically interested in inferences regarding the parameter $k$. And I’d much prefer a nonparametric approach.
As I said, I have no idea how to do this; it just seems interesting!
Update 2020-11-15
The reason I’ve been thinking about this is that certain aspects of PCA seem a little arbitrary to me. I’m no expert, but the general guidance I’ve seen on selecting the parameter $k$ is based either on a “knee in the curve” of the singular values or on a threshold like keeping enough components to capture 80% of the “energy” of the singular values (see the sketch below). Typically PCA is used as an unsupervised learning technique, so there is no “correct” answer on how many singular values to keep. I was trying to imagine a scenario where there actually is a correct answer, and this was what I came up with.
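For concreteness, the 80%-energy rule is easy to state in code. Here I take “energy” to mean squared singular values, which is one common convention (proportional to variance explained), though some people use the singular values themselves:

```python
def k_by_energy(s, frac=0.80):
    """Smallest k such that the top-k squared singular values
    capture at least `frac` of the total."""
    energy = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(energy, frac) + 1)

# e.g. k_by_energy(np.linalg.svd(X, compute_uv=False))
```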