Imputing Missing Data with the Low-Rank Gaussian Copula
Missing data imputation forms the first critical step of many data analysis pipelines. The challenge is greatest for mixed datasets, including real, Boolean, and ordinal data, where standard techniques for imputation (including low-rank models) fail basic sanity checks: for example, the imputed values may not follow the same distribution as the data. This talk introduces a new semiparametric algorithm to impute missing values. The algorithm models mixed data as a Gaussian copula. This model can fit arbitrary marginals for continuous variables and can handle ordinal variables with many levels, including Boolean variables as a special case. We develop an efficient approximate EM algorithm to estimate the copula parameters from incomplete mixed data, along with low-rank and online extensions of the method that scale to extremely large datasets. The resulting model reveals the statistical associations among variables. Experimental results on several synthetic and real datasets show that the proposed algorithm outperforms state-of-the-art imputation algorithms for mixed data.
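To make the copula idea concrete, here is a minimal sketch for continuous columns only. It is not the algorithm from the talk: it omits ordinal and Boolean variables and the low-rank and online extensions, and it replaces the approximate EM with the textbook EM for a zero-mean Gaussian with missing entries, rescaled to a correlation matrix after each step. The function names (`to_latent`, `fit_correlation`, `impute`) are hypothetical, and the sketch assumes every column has at least one observed value.

```python
import numpy as np
from scipy.stats import norm, rankdata


def to_latent(X):
    """Map each column's observed entries to normal scores via its empirical CDF."""
    Z = np.full(X.shape, np.nan)
    for j in range(X.shape[1]):
        obs = ~np.isnan(X[:, j])
        ranks = rankdata(X[obs, j])  # average ranks in 1..n_obs
        Z[obs, j] = norm.ppf(ranks / (obs.sum() + 1))
    return Z


def fit_correlation(Z, n_iter=30):
    """Textbook EM for a zero-mean Gaussian with missing entries (illustrative only)."""
    n, p = Z.shape
    Sigma = np.eye(p)
    for _ in range(n_iter):
        S = np.zeros((p, p))
        for i in range(n):
            o = ~np.isnan(Z[i])
            m = ~o
            z = np.where(o, Z[i], 0.0)  # missing entries start at the latent mean
            C = np.zeros((p, p))        # conditional covariance of the missing block
            if m.any() and o.any():
                A = np.linalg.solve(Sigma[np.ix_(o, o)], Sigma[np.ix_(o, m)])
                z[m] = A.T @ Z[i, o]
                C[np.ix_(m, m)] = Sigma[np.ix_(m, m)] - Sigma[np.ix_(m, o)] @ A
            elif m.any():
                C = Sigma.copy()        # fully missing row: no information
            S += np.outer(z, z) + C
        S /= n
        d = np.sqrt(np.diag(S))
        Sigma = S / np.outer(d, d)      # rescale to unit diagonal
    return Sigma


def impute(X, n_iter=30):
    """Fill NaNs with the latent conditional mean mapped back through empirical quantiles."""
    Z = to_latent(X)
    Sigma = fit_correlation(Z, n_iter)
    X_imp = X.copy()
    for i in range(X.shape[0]):
        o = ~np.isnan(X[i])
        m = ~o
        if not m.any():
            continue
        if o.any():
            z_m = Sigma[np.ix_(m, o)] @ np.linalg.solve(Sigma[np.ix_(o, o)], Z[i, o])
        else:
            z_m = np.zeros(m.sum())
        for j, z in zip(np.flatnonzero(m), z_m):
            col = X[~np.isnan(X[:, j]), j]
            X_imp[i, j] = np.quantile(col, norm.cdf(z))
    return X_imp, Sigma
```

Calling `impute(X)` on a float array with `np.nan` for missing entries returns the completed array and the estimated latent correlation matrix. Because each imputed value is an empirical quantile of its own column, it stays within that column's observed range, which is the kind of sanity check the abstract mentions.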