SCIENTIFIC PROGRAMS AND ACTIVITIES


THE FIELDS INSTITUTE FOR RESEARCH IN MATHEMATICAL SCIENCES

Thematic Program on Statistical Inference, Learning, and Models for Big Data, January to June, 2015

February 9 – 13, 2015
Workshop on Optimization and Matrix Methods in Big Data
Organizing Committee

Stephen Vavasis (Chair), Anima Anandkumar,
Petros Drineas, Michael Friedlander,
Nancy Reid, Martin Wainwright



Speaker Abstracts

Animashree Anandkumar, University of California, Irvine
Spectral Methods for Generative and Discriminative Learning with Latent Variables

Incorporating latent or hidden variables is a crucial aspect of statistical modeling. I will present a statistical and computational framework for guaranteed learning of a wide range of latent variable models under generative and discriminative settings. It is based on the method of moments and involves efficient methods for spectral/tensor decomposition of low-order observed moments (typically up to fourth order). We establish that the tensor method has low computational and sample complexities. In practice, these methods are fast to implement and embarrassingly parallel, and are thus scalable to large-scale datasets.

Recently, we have provided novel spectral-based approaches for learning discriminative latent variable models such as multi-layer feedforward neural networks and mixtures of classifiers. The moment tensors we construct are based on the label and higher order score functions of the input. The score functions characterize local variation of the probability density function of the input. Thus, incorporating features based on the generative input model is the key ingredient for learning discriminative latent variable models.
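
As a concrete illustration of the tensor approach described above, here is a minimal sketch (in Python/NumPy, not the speakers' implementation) of the tensor power method with deflation, run on a synthetic orthogonally decomposable third-order tensor standing in for a whitened moment tensor. The dimensions, weights, and iteration counts are illustrative.

import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 3
A = np.linalg.qr(rng.standard_normal((d, k)))[0]   # orthonormal components a_1, ..., a_k
w = np.array([2.0, 1.5, 1.0])                      # positive weights

# Synthetic whitened moment tensor T = sum_r w_r * a_r (x) a_r (x) a_r.
T = np.einsum('r,ir,jr,lr->ijl', w, A, A, A)

def tensor_power(T, n_iter=100):
    """Recover one component via the iteration v <- T(I, v, v) / ||T(I, v, v)||."""
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = np.einsum('ijl,j,l->i', T, v, v)
        v /= np.linalg.norm(v)
    lam = np.einsum('ijl,i,j,l->', T, v, v, v)      # recovered weight
    return lam, v

# Deflation: subtract each recovered rank-one term and repeat.
for _ in range(k):
    lam, v = tensor_power(T)
    print(round(float(lam), 3), np.round(v, 3))
    T = T - lam * np.einsum('i,j,l->ijl', v, v, v)

In practice the tensor would be estimated from observed moments and whitened using second-order moments before this step.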




Alexandre d'Aspremont, École Normale Supérieure

-

Quentin Berthet, California Institute of Technology
Statistical and Computational Tradeoffs for Sparse Principal Component Analysis

Sparse Principal Component Analysis has emerged as an extremely popular dimension reduction technique for high-dimensional data. In the context of the recent data revolution, the algorithmic aspect of inference methods cannot be neglected, and it can be a challenge in models that incorporate structure or sparsity. We investigate the fundamental tradeoffs between statistical and computational efficiency for the detection and estimation versions of this problem. We will show how these tradeoffs are established through a connection to the planted clique problem.
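
For intuition about the computational side of the tradeoff, the sketch below (illustrative only, not the authors' procedure) implements diagonal thresholding, a simple polynomial-time detection statistic for a spiked covariance model with a k-sparse leading component; the sample size, dimension, sparsity and signal strength are hypothetical choices.

import numpy as np

rng = np.random.default_rng(1)
n, d, k, theta = 500, 500, 5, 3.0

# Null model: pure noise.  Alternative: spiked covariance with a k-sparse leading component.
v = np.zeros(d)
v[:k] = 1.0 / np.sqrt(k)
X_null  = rng.standard_normal((n, d))
X_spike = rng.standard_normal((n, d)) + np.sqrt(theta) * rng.standard_normal((n, 1)) * v

for name, X in [("null", X_null), ("spiked", X_spike)]:
    S = X.T @ X / n                   # sample covariance
    stat = np.max(np.diag(S))         # diagonal-thresholding detection statistic
    print(name, round(float(stat), 3))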


Inderjit S. Dhillon, University of Texas
Sparse inverse covariance estimation for a million variables

The L1-regularized Gaussian maximum likelihood estimator has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix, even under high-dimensional settings. However, it requires solving a difficult non-smooth log-determinant program whose number of parameters scales quadratically with the number of Gaussian variables. State-of-the-art methods thus do not scale to problems with more than 20,000 variables. In this talk, I will describe a quadratic approximation method that can solve one-million-dimensional L1-regularized log-determinant problems (which thus have a trillion parameters) on a single computer. In order to do so, we carefully exploit the underlying structure of the problem. Our innovations include (i) a second-order Newton-like method, (ii) division of the variables into free and fixed sets, (iii) a block coordinate descent method, and (iv) a memory-efficient scheme that approximates key quantities within the algorithm. Even with the latter approximations, we show that the proposed BIGQUIC algorithm can achieve a quadratic convergence rate. Experimental results using synthetic and real application data demonstrate the improvements in performance over other state-of-the-art methods.

This is joint work with Cho-Jui Hsieh, Matyas Sustik, Pradeep Ravikumar and Russell Poldrack.
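
For orientation, the following is a minimal sketch of the underlying L1-regularized log-determinant objective, minimized here by plain proximal gradient with backtracking on a toy 20-variable problem; BIGQUIC replaces this with the Newton-like quadratic approximation, free/fixed variable sets, and block coordinate descent described above. The problem size and regularization parameter are illustrative, and this sketch penalizes the entire matrix, diagonal included.

import numpy as np

rng = np.random.default_rng(2)
d, n, lam = 20, 200, 0.1
Z = rng.standard_normal((n, d))
S = Z.T @ Z / n                                    # sample covariance

def objective(X):
    _, logdet = np.linalg.slogdet(X)
    return -logdet + np.trace(S @ X) + lam * np.abs(X).sum()

def soft_threshold(A, t):
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

X = np.eye(d)
for _ in range(200):
    grad = S - np.linalg.inv(X)                    # gradient of the smooth part
    f_X = objective(X)
    step = 1.0
    for _ in range(30):                            # backtrack: stay positive definite, do not increase f
        X_new = soft_threshold(X - step * grad, step * lam)
        if np.min(np.linalg.eigvalsh(X_new)) > 0 and objective(X_new) <= f_X + 1e-12:
            X = X_new
            break
        step *= 0.5

print("objective:", round(float(objective(X)), 4),
      "nonzeros:", int(np.count_nonzero(X)), "of", d * d)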

Petros Drineas, Rensselaer Polytechnic Institute
RandNLA: Randomization in Numerical Linear Algebra
The introduction of randomization in the design and analysis of algorithms for matrix computations (such as matrix multiplication, least-squares regression, the Singular Value Decomposition (SVD), etc.) over the last decade has provided a new paradigm and a complementary perspective to traditional numerical linear algebra approaches. These novel approaches were motivated by technological developments in many areas of scientific research that permit the automatic generation of large data sets, which are often modeled as matrices.

In this talk we will outline how such approaches can be used to approximately solve problems in numerical linear algebra, ranging from the Singular Value Decomposition (SVD) to the Column Subset Selection Problem and the related CX decomposition. Applications of the proposed algorithms to data analysis tasks in population genetics will also be discussed.
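
As a small example of the paradigm, the sketch below approximates a matrix product by sampling columns of A and the corresponding rows of B with the standard norm-based importance-sampling probabilities and rescaling for unbiasedness; the matrix sizes and number of samples are illustrative.

import numpy as np

rng = np.random.default_rng(3)
m, n, p, c = 300, 1000, 200, 100      # A is m x n, B is n x p; keep c of the n outer products

A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))

# Importance-sampling probabilities proportional to ||A[:, k]|| * ||B[k, :]||.
probs = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
probs /= probs.sum()

idx = rng.choice(n, size=c, replace=True, p=probs)
scale = 1.0 / (c * probs[idx])                    # rescaling makes the estimator unbiased
approx = (A[:, idx] * scale) @ B[idx, :]          # sum of c rescaled rank-one terms

exact = A @ B
print("relative Frobenius error:",
      round(float(np.linalg.norm(approx - exact) / np.linalg.norm(exact)), 3))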

Maryam Fazel, University of Washington

-


Michael Friedlander, University of California, Davis

-

Tamara Kolda, Sandia National Laboratories
Computing the Largest Entries in a Matrix Product via Sampling

We consider the problem of how to efficiently find the largest entries in the matrix product AB without explicitly computing the product. We are motivated by the problem of finding nodes in a graph that share many common neighbors. Consequently, we begin by explaining sampling techniques for graphs: counting triangles and 4-vertex patterns (Seshadhri, Pinar, Kolda, SDM13; Jha, Pinar, Seshadhri, WWW15). We explain how this leads naturally to sampling methods for computing the largest entries in the matrix product. Our techniques can be used for computing A^2 in the case that A is symmetric (undirected graph), A^T A in the case that A is rectangular (bipartite graph), and the general matrix product AB (tripartite graph). The techniques work for both binary and weighted matrices, even in the case that some weights are negative.

This is joint work with Grey Ballard, Ali Pinar, and C. Seshadhri.
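
The following is an illustrative sketch (not the speakers' exact sampler) of the basic idea for nonnegative matrices: draw index triples (i, k, j) with probability proportional to A[i,k]*B[k,j], so that pairs (i, j) corresponding to large entries of AB are sampled most often. A few heavy rows are planted so the effect is visible; all sizes are toy choices.

import numpy as np
from collections import Counter

rng = np.random.default_rng(4)
n, d, samples = 200, 50, 20000
A = rng.random((n, d))
A[:5] *= 10.0                          # plant a few heavy rows so A @ B has clearly large entries
B = rng.random((d, n))

col_w = A.sum(axis=0)                  # ||A[:, k]||_1
row_w = B.sum(axis=1)                  # ||B[k, :]||_1
k_probs = col_w * row_w / (col_w * row_w).sum()

counts = Counter()
for k in rng.choice(d, size=samples, p=k_probs):
    i = rng.choice(n, p=A[:, k] / col_w[k])        # pick i proportional to A[i, k]
    j = rng.choice(n, p=B[k, :] / row_w[k])        # pick j proportional to B[k, j]
    counts[(i, j)] += 1

candidates = [ij for ij, _ in counts.most_common(5)]
P = A @ B                              # formed here only to sanity-check the sketch
print("values at sampled candidates:", [round(float(P[ij]), 1) for ij in candidates])
print("largest entry of A @ B:      ", round(float(P.max()), 1))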


Po-Ling Loh, University of Pennsylvania
PDW methods for support recovery in nonconvex high-dimensional problems

The primal-dual witness (PDW) technique is a now-standard proof strategy for establishing variable selection consistency for sparse high-dimensional estimation problems when the objective function and regularizer are convex. The method proceeds by optimizing the objective function over the parameter space restricted to the true support of the unknown vector, then using a dual witness to certify that the resulting solution is also a global optimum of the unrestricted problem. We present a modified primal-dual witness framework that may be applied even to nonconvex, penalized objective functions that satisfy certain regularity conditions. Notably, our theory allows us to derive rigorous support recovery guarantees for local and global optima of regularized regression estimators involving nonconvex penalties such as the SCAD and MCP; these guarantees do not require the restrictive incoherence conditions of Lasso-based theory. Projected gradient descent methods may be used to obtain such optima efficiently.


This is joint work with Martin Wainwright.
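
As an illustration of the kind of estimator covered by this theory, here is a minimal sketch of proximal (composite) gradient descent for least squares with the nonconvex MCP penalty, using the closed-form firm-thresholding proximal operator; the design, signal strength, and tuning parameters are hypothetical and not taken from the paper.

import numpy as np

rng = np.random.default_rng(5)
n, d, s = 200, 500, 5
X = rng.standard_normal((n, d)) / np.sqrt(n)       # columns roughly unit norm
beta_true = np.zeros(d)
beta_true[:s] = 2.0
y = X @ beta_true + 0.1 * rng.standard_normal(n)

lam, gamma = 0.4, 3.0
step = 1.0 / np.linalg.norm(X, 2) ** 2             # 1 / Lipschitz constant of the gradient

def mcp_prox(z, t, lam, gamma):
    """Firm thresholding: proximal operator of t * MCP(lam, gamma), valid for gamma > t."""
    shrunk = np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0) / (1.0 - t / gamma)
    return np.where(np.abs(z) > gamma * lam, z, shrunk)

beta = np.zeros(d)
for _ in range(500):
    grad = X.T @ (X @ beta - y)
    beta = mcp_prox(beta - step * grad, step, lam, gamma)

print("recovered support:", np.flatnonzero(np.abs(beta) > 1e-8))
print("estimation error: ", round(float(np.linalg.norm(beta - beta_true)), 3))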



Michael Mahoney, University of California, Berkeley
Eigenvector localization, implicit regularization, and algorithmic anti-differentiation for large-scale graphs and matrix data

Although graphs and matrices are popular ways to model data drawn from many application domains, the size scale, sparsity properties, noise properties, and many other aspects of graphs and matrices arising in realistic machine learning and data analysis applications present fundamental challenges for traditional matrix and graph algorithms. Informally, this at root has to do with the observation that small-scale or local structure in many data sets is often better or more meaningful than large-scale or global structure, which is exactly the opposite of what many traditional asymptotic tools typically implicitly assume. For example, there might be meaningful small-scale clusters in social networks that are tree-like or expander-like at large size scales. Here, I will describe how eigenvector localization can be used to justify these sorts of claims, how localization can be engineered in a principled way into popular matrix and graph algorithms in order to characterize local properties in large data sets much more generally, and how seemingly-minor details in the approximate computation of localized (as well as non-localized) objectives can make a large difference in the usefulness of a given method in downstream applications. A common theme will be the critical role of what can be called implicit regularization---that is, regularization not in the usual sense of explicitly adding a regularization term and solving the modified objective exactly, but instead as an implicit side effect of heuristics that are often used to make approximation algorithms scale well---as well as the question of what objective function an approximation algorithm implicitly solves exactly.
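
As one concrete example of a localized approximation of the kind discussed above, the sketch below implements the push procedure for approximate personalized PageRank on a tiny undirected graph; the early termination controlled by eps is exactly the sort of algorithmic shortcut that acts as an implicit regularizer. The graph and tolerance are illustrative.

from collections import defaultdict, deque

def approx_ppr(adj, seed, alpha=0.15, eps=1e-4):
    """Approximate personalized PageRank via push; adj maps node -> list of neighbours."""
    p = defaultdict(float)                 # approximate PageRank mass
    r = defaultdict(float)                 # residual mass still to be pushed
    r[seed] = 1.0
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        deg = len(adj[u])
        if r[u] < eps * deg:
            continue                       # early termination: small residuals are never pushed
        p[u] += alpha * r[u]
        share = (1 - alpha) * r[u] / (2 * deg)
        r[u] = (1 - alpha) * r[u] / 2
        if r[u] >= eps * deg:
            queue.append(u)
        for v in adj[u]:
            old = r[v]
            r[v] += share
            if old < eps * len(adj[v]) <= r[v]:
                queue.append(v)
    return dict(p)

# Two triangles joined by a single edge; the mass stays concentrated on the seed's side.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print({node: round(mass, 4) for node, mass in sorted(approx_ppr(adj, seed=0).items())})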

Jakub Marecek, IBM Research
Coordinate Descent and Challenges therein

Since the publication of Nesterov's paper "Efficiency of coordinate descent methods on huge-scale optimization problems" (SIOPT, 2012), there has been much interest in randomised coordinate descent. Coordinate descent may be a particularly simple algorithmic schema, but both its analysis in realistic models of distributed computation and its application beyond composite optimisation remain a challenge. We present analyses and implementations of distributed coordinate descent, with both traditional applications to partially separable functions in sparse regression and the training of sparse support vector machines, and novel applications including matrix completion with inequalities and semidefinite-programming relaxations of polynomial optimisation problems. Engineering highlights include major progress on a 3 TB sparse regression instance, solved within 30 minutes of wall-clock time, and the solution of relaxations of alternating-current optimal power flow, a key problem in power systems analysis.

This talk is based on joint work with Bissan Ghaddar, Martin Mevissen, Peter Richtarik, and last but not least, Martin Takac.
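
For reference, here is a minimal serial sketch of randomised coordinate descent on a Lasso-type composite objective, the building block that the distributed variants in the talk scale up; problem sizes and the regularisation parameter are toy choices.

import numpy as np

rng = np.random.default_rng(6)
n, d, lam = 500, 200, 0.1
A = rng.standard_normal((n, d)) / np.sqrt(n)
x_true = np.zeros(d)
x_true[:10] = 1.0
b = A @ x_true + 0.05 * rng.standard_normal(n)

L = (A ** 2).sum(axis=0)                  # coordinate-wise Lipschitz constants
x = np.zeros(d)
residual = A @ x - b                      # maintained incrementally

for _ in range(20 * d):                   # roughly 20 epochs' worth of coordinate updates
    i = rng.integers(d)
    g = A[:, i] @ residual                # partial derivative of the smooth part wrt x[i]
    z = x[i] - g / L[i]
    x_i_new = np.sign(z) * max(abs(z) - lam / L[i], 0.0)   # prox of the L1 term
    residual += A[:, i] * (x_i_new - x[i])
    x[i] = x_i_new

obj = 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum()
print("objective:", round(float(obj), 4), "nonzeros:", int(np.count_nonzero(x)))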



Yaniv Plan, University of British Columbia
The generalized lasso with non-linear measurements

Consider signal estimation from non-linear measurements. A rough heuristic often used in practice postulates that "non-linear measurements may be treated as noisy linear measurements" and the signal may be reconstructed accordingly. We give a rigorous backing to this idea, with a focus on low-dimensional signals buried in high-dimensional spaces. Just as noise may be diminished by projecting onto the lower-dimensional space, the error from modeling non-linear measurements with linear ones is greatly reduced when the signal structure is used in the reconstruction. We assume a random Gaussian model for the measurement matrix, but allow the rows to have an unknown, and possibly ill-conditioned, covariance matrix. As a special case of our results, we give theoretical accuracy guarantees for 1-bit compressed sensing with an unknown covariance matrix of the measurement vectors.
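
To illustrate the heuristic in its simplest form, the sketch below generates 1-bit measurements y = sign(Ax) with a standard Gaussian A (the identity-covariance special case), back-projects them as if they were linear, and projects onto s-sparse vectors; this is an illustrative estimator in the spirit of the talk, not its exact generalized-lasso formulation.

import numpy as np

rng = np.random.default_rng(7)
n, d, s = 2000, 500, 10

x = np.zeros(d)
x[:s] = 1.0
x /= np.linalg.norm(x)                  # true signal: s-sparse, unit norm
A = rng.standard_normal((n, d))         # identity-covariance special case
y = np.sign(A @ x)                      # non-linear (1-bit) measurements

back = A.T @ y / n                      # treat y as if linear: a scaled, noisy copy of x
keep = np.argsort(np.abs(back))[-s:]    # project onto s-sparse vectors
x_hat = np.zeros(d)
x_hat[keep] = back[keep]
x_hat /= np.linalg.norm(x_hat)

print("correlation with the true signal:", round(float(x_hat @ x), 3))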

Ben Recht, University of California, Berkeley

-

Stephen Vavasis, University of Waterloo
Nonnegative matrix factorization for structured data

Nonnegative matrix factorization (NMF) has become a crucial tool for revealing hidden relationships in a large data matrix and is applicable to identifying topics in document collections, features in images, related genes in organisms and many other problems. Despite the wide usage of NMF and the existence of dozens of heuristic methods, the underlying problem is NP-hard. We describe our recent results in which convex optimization is guaranteed to compute the correct NMF in the case that the input data has certain separable or related structure. Parts of this talk represent joint work with X. V. Doan, N. Gillis, and K.-C. Toh.
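
As a small illustration of how separable structure makes the problem tractable, the sketch below runs the successive projection algorithm, a standard greedy method that recovers the anchor columns under separability; it is shown here for intuition only and is not the convex-optimization formulation of the talk. The factor sizes are illustrative.

import numpy as np

rng = np.random.default_rng(8)
d, r, n = 30, 4, 200
W = rng.random((d, r))                   # true nonnegative factors
H = rng.random((r, n))
H /= H.sum(axis=0)                       # every data column is a convex combination of W's columns
H[:, :r] = np.eye(r)                     # separability: the factors themselves appear as columns
M = W @ H

def spa(M, r):
    """Successive projection: greedily pick anchor columns, projecting each one out."""
    R = M.copy()
    anchors = []
    for _ in range(r):
        j = int(np.argmax((R ** 2).sum(axis=0)))   # column with the largest residual norm
        anchors.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(u, u @ R)                    # project out that direction
    return anchors

print("anchor columns found:", sorted(spa(M, r)))  # expected: [0, 1, 2, 3]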

Lin Xiao, Microsoft Research
Communication-Efficient Distributed Optimization of Self-Concordant Empirical Loss

We propose a communication-efficient distributed algorithm for solving large-scale empirical risk minimization problems in machine learning. The algorithm is based on an inexact damped Newton method, where the inexact Newton steps are computed by a distributed preconditioned conjugate gradient method. We analyze its iteration complexity and communication efficiency for minimizing self-concordant empirical loss functions, and discuss the implications for distributed ridge regression, logistic regression and binary classification with a smoothed hinge loss. In a standard setting for supervised learning where the problem condition number grows with the total sample size, the required number of communication rounds of our algorithm does not increase with the sample size, but only grows slowly with the number of machines in the distributed system. (This is joint work with Yuchen Zhang.)
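
The following is a single-machine, unpreconditioned skeleton of such a scheme: damped Newton steps for L2-regularized logistic regression, with the Newton direction computed inexactly by conjugate gradient. In the distributed setting of the talk, the Hessian-vector products inside CG are computed across machines and preconditioned; the data and parameters below are illustrative.

import numpy as np

rng = np.random.default_rng(9)
n, d, mu = 1000, 50, 1e-3
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = np.where(X @ w_star + 0.5 * rng.standard_normal(n) > 0, 1.0, -1.0)

def loss_grad_hessp(w):
    z = y * (X @ w)
    p = 1.0 / (1.0 + np.exp(z))                    # = sigmoid(-z)
    loss = np.mean(np.log1p(np.exp(-z))) + 0.5 * mu * (w @ w)
    grad = -X.T @ (y * p) / n + mu * w
    s = p * (1.0 - p)                              # Hessian weights
    hessp = lambda v: X.T @ (s * (X @ v)) / n + mu * v
    return loss, grad, hessp

def cg(hessp, b, iters=50, tol=1e-10):
    """Plain (unpreconditioned) conjugate gradient for hessp(v) = b."""
    v = np.zeros_like(b)
    r = b.copy()
    p_dir = r.copy()
    rs = r @ r
    for _ in range(iters):
        Hp = hessp(p_dir)
        alpha = rs / (p_dir @ Hp)
        v += alpha * p_dir
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p_dir = r + (rs_new / rs) * p_dir
        rs = rs_new
    return v

w = np.zeros(d)
for it in range(10):
    loss, grad, hessp = loss_grad_hessp(w)
    direction = cg(hessp, grad)                    # inexact Newton direction
    decrement = np.sqrt(direction @ hessp(direction))
    w -= direction / (1.0 + decrement)             # damped Newton step
    print(f"iteration {it}: loss = {loss:.6f}")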
