Variable clustering with G-models: CORD vs PECOK
The problem of variable clustering is that of grouping similar components of a p-dimensional vector X=(X1,…,Xp), and estimating these groups from n independent copies of X. When cluster similarity is defined via G-latent models, in which groups of X-variables have a common latent generator and the groups correspond to an unknown partition G of the index set {1,…,p}, the most natural clustering strategy is K-means. We explain why this strategy cannot lead to perfect cluster recovery and offer a correction, based on semi-definite programming, that can be viewed as a penalized convex relaxation of K-means (PECOK). We introduce a cluster separation measure tailored to G-latent models, and derive its minimax lower bound for perfect cluster recovery. The clusters estimated by PECOK are shown to recover G at a near minimax optimal cluster separation rate, a result that holds true even if K, the number of clusters, is estimated adaptively from the data. We contrast this with cluster estimation via CORD, a method tailored to the more general class of G-block covariance models. We also compare PECOK with appropriate corrections of spectral clustering-type procedures, and show that the former outperforms the latter for perfect cluster recovery of minimally separated clusters.
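To make the phrase "penalized convex relaxation of K-means" concrete, the PECOK criterion can be sketched as a semi-definite program. The notation below is an assumption for illustration (it is not spelled out in this abstract): Σ̂ denotes the sample covariance of X, Γ̂ a diagonal correction estimating the noise levels (the "penalty" that distinguishes PECOK from an uncorrected relaxation), and B a p×p relaxed partnership matrix whose block structure encodes the partition.

```latex
% Hedged sketch of a PECOK-style SDP; Sigma-hat, Gamma-hat, B as assumed above.
\widehat{B} \;\in\; \arg\max_{B \in \mathbb{R}^{p\times p}}
  \;\big\langle \widehat{\Sigma} - \widehat{\Gamma},\, B \big\rangle
\quad \text{subject to} \quad
B \succeq 0, \qquad
\sum_{b=1}^{p} B_{ab} = 1 \ \ \forall a, \qquad
\operatorname{tr}(B) = K, \qquad
B_{ab} \ge 0 \ \ \forall a,b.
```

Under this reading, an estimated partition Ĝ is recovered from the (approximately block-constant) rows of B̂; dropping the correction term Γ̂ recovers a plain convex relaxation of K-means, which, as the abstract notes, cannot achieve perfect cluster recovery in G-latent models.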