PCA for High-Dimensional Heteroscedastic Data
Modern applications increasingly involve high-dimensional and heterogeneous data, e.g., datasets formed by combining numerous measurements from myriad sources. Principal Component Analysis (PCA) is a classical method for reducing dimensionality by projecting data onto a low-dimensional subspace capturing most of their variation, but it does not robustly recover underlying subspaces in the presence of heteroscedastic noise. Specifically, PCA suffers from treating all data samples as if they are equally informative. We will discuss the consequences of this on performance, which lead us naturally to consider weighting PCA in such a way that we give less influence to samples with larger noise variance. Doing so better recovers underlying principal components, but precisely how to choose the weights turns out to be an interesting problem. Surprisingly, we show that whitening the noise by using inverse noise variance is sub-optimal. Our analysis provides expressions for the asymptotic recovery of underlying low-dimensional components from samples with heteroscedastic noise in the high-dimensional regime. We derive optimal weights and characterize the performance of optimally weighted PCA. We also show the Probabilistic PCA problem corresponding to heteroscedastic noise, and discuss its optimization. This is work in collaboration with David Hong, Kyle Gilman, and Jeff Fessler.