Program
Speaker Abstracts
Animashree Anandkumar, University of California,
Irvine
Spectral Methods for Generative and Discriminative Learning with Latent
Variables
Incorporating latent or hidden variables is a crucial aspect of statistical
modeling. I will present a statistical and a computational framework for guaranteed
learning of a wide range of latent variable models under generative and discriminative
settings. It is based on the method of moments, and involves efficient methods
for spectral/tensor decomposition of low-order observed moments (typically up
to fourth order). We establish that the tensor method has low computational
and sample complexities. In practice, these methods are fast to implement and
embarrassingly parallel, and are thus scalable to large-scale datasets.
Recently, we have provided novel spectral-based approaches for learning discriminative
latent variable models such as multi-layer feedforward neural networks and
mixtures of classifiers. The moment tensors we construct are based on the
label and higher order score functions of the input. The score functions characterize
local variation of the probability density function of the input. Thus, incorporating
features based on the generative input model is the key ingredient for learning
discriminative latent variable models.
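As an illustration of the tensor decomposition at the heart of such moment methods, here is a minimal sketch of the tensor power iteration, assuming the (whitened) third-order moment tensor admits an orthogonal symmetric decomposition; the tensor, dimensions, and weights below are synthetic stand-ins rather than moments estimated from data.

```python
# Minimal sketch: tensor power iteration for an orthogonally
# decomposable third-order moment tensor T = sum_k w_k a_k (x) a_k (x) a_k.
# In practice a whitening step reduces estimated moments to this form.
import numpy as np

rng = np.random.default_rng(0)

def random_orthonormal(d, k):
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q[:, :k]

d, k = 8, 3
A = random_orthonormal(d, k)                  # true components a_1..a_k
w = np.array([0.5, 0.3, 0.2])                 # positive weights
T = np.einsum('i,ji,ki,li->jkl', w, A, A, A)  # synthetic moment tensor

def tensor_power_iteration(T, n_iter=100):
    """Recover one component of an orthogonally decomposable tensor."""
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = np.einsum('ijk,j,k->i', T, v, v)     # v <- T(I, v, v)
        v /= np.linalg.norm(v)
    lam = np.einsum('ijk,i,j,k->', T, v, v, v)   # recovered weight
    return lam, v

lam, v = tensor_power_iteration(T)
# v should align (up to sign) with one column of A
print(lam, np.max(np.abs(A.T @ v)))
```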
Alexandre d'Aspremont, École Normale Supérieure
-
Quentin Berthet, California Institute of Technology
Statistical and Computational Tradeoffs for Sparse Principal Component
Analysis
Sparse Principal Component Analysis has emerged as an extremely popular dimension
reduction technique for high-dimensional data. In the context of the recent
data revolution, the algorithmic aspect of inference methods cannot be neglected,
which can be a challenge in models that incorporate structure or sparsity.
We investigate here the fundamental tradeoffs between statistical and computational
efficiency for the detection and estimation versions of this problem. We will
show how these tradeoffs arise by establishing a relationship to the planted clique problem.
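To make the estimation version of the problem concrete, here is a minimal sketch of one standard polynomial-time heuristic, truncated power iteration; this is not the detection statistic or the planted-clique reduction analyzed in the talk, and the sparsity level s is a hypothetical tuning parameter.

```python
# Sketch: truncated power iteration for an s-sparse leading eigenvector,
# one tractable heuristic for the estimation version of sparse PCA.
import numpy as np

def truncated_power_iteration(Sigma, s, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    d = Sigma.shape[0]
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = Sigma @ v
        keep = np.argsort(np.abs(v))[-s:]     # keep s largest coordinates
        mask = np.zeros(d, dtype=bool)
        mask[keep] = True
        v[~mask] = 0.0
        v /= np.linalg.norm(v)
    return v

# Example: spiked covariance with a 5-sparse leading component.
rng = np.random.default_rng(1)
d, s = 50, 5
u = np.zeros(d); u[:s] = 1.0 / np.sqrt(s)
Sigma = np.eye(d) + 3.0 * np.outer(u, u)
X = rng.multivariate_normal(np.zeros(d), Sigma, size=400)
v_hat = truncated_power_iteration(np.cov(X, rowvar=False), s)
print(abs(u @ v_hat))   # close to 1 when the support is recovered
```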
Inderjit S. Dhillon, University of Texas
Sparse inverse covariance estimation for a million variables
The L1-regularized Gaussian maximum likelihood estimator has been shown
to have strong statistical guarantees in recovering a sparse inverse covariance
matrix even under high-dimensional settings. However, it requires solving
a difficult non-smooth log-determinant program whose number of parameters
scales quadratically with the number of Gaussian variables. State-of-the-art
methods thus do not scale to problems with more than 20,000 variables. In
this talk, I will describe a quadratic approximation method that can solve
1-million dimensional L1-regularized log determinant problems (which would
thus have a trillion parameters) on a single computer. In order to do so,
we carefully exploit the underlying structure of the problem. Our innovations
include (i) a second-order Newton-like method, (ii) division of the variables
into free and fixed sets, (iii) a block co-ordinate descent method, and (iv)
a memory efficient scheme that approximates key quantities within the algorithm.
Even with the latter approximations, we show that the proposed BIGQUIC algorithm can
achieve a quadratic convergence rate. Experimental results using synthetic
and real application data demonstrate the improvements in performance over
other state-of-the-art methods.
This is joint work with Cho-Jui Hsieh, Matyas Sustik, Pradeep Ravikumar and
Russell Poldrack.
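As a toy illustration of the free/fixed-set idea mentioned above, the sketch below evaluates the L1-regularized log-determinant objective and flags which entries may move in the next sweep; it forms the dense inverse, which BIGQUIC explicitly avoids at scale, so it only conveys the variable-selection rule, not the full algorithm.

```python
# Sketch: objective and free/fixed-set selection for
# minimize -logdet(X) + tr(S X) + lam * ||X||_1 (elementwise, for simplicity).
import numpy as np

def objective(X, S, lam):
    sign, logdet = np.linalg.slogdet(X)
    assert sign > 0, "iterate must stay positive definite"
    return -logdet + np.trace(S @ X) + lam * np.sum(np.abs(X))

def free_set(X, S, lam):
    """Entries allowed to move in the next coordinate-descent sweep."""
    G = S - np.linalg.inv(X)        # gradient of the smooth part
    # An entry stays fixed at zero if its gradient lies inside the
    # subdifferential of lam*|.|; otherwise it is free.
    fixed = (X == 0) & (np.abs(G) <= lam)
    return ~fixed

rng = np.random.default_rng(0)
d = 20
A = rng.standard_normal((200, d))
S = np.cov(A, rowvar=False)
X = np.eye(d)                        # standard initial iterate
lam = 0.2
F = free_set(X, S, lam)
print("objective:", objective(X, S, lam))
print("free variables:", F.sum(), "of", d * d)
```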
Petros Drineas, Rensselaer Polytechnic Institute
RandNLA: Randomization in Numerical Linear Algebra
The introduction of randomization in the design and analysis of algorithms
for matrix computations (such as matrix multiplication, least-squares regression,
the Singular Value Decomposition (SVD), etc.) over the last decade provided
a new paradigm and a complementary perspective to traditional numerical linear
algebra approaches. These novel approaches were motivated by technological
developments in many areas of scientific research that permit the automatic
generation of large data sets, which are often modeled as matrices.
In this talk we will outline how such approaches can be used to approximately
solve problems in numerical linear algebra, ranging from the Singular Value
Decomposition (SVD) to the Column Subset Selection Problem and the related
CX decomposition. Applications of the proposed algorithms to data analysis
tasks in population genetics will also be discussed.
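As a concrete example of the RandNLA paradigm, here is a minimal sketch of a randomized SVD via a random range finder; the target rank k and oversampling p below are illustrative choices.

```python
# Sketch: randomized SVD -- sample the range of A with a random test
# matrix, orthonormalize, then compute a small exact SVD.
import numpy as np

def randomized_svd(A, k, p=10, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + p))   # random test matrix
    Y = A @ Omega                             # sample the range of A
    Q, _ = np.linalg.qr(Y)                    # orthonormal basis for range
    B = Q.T @ A                               # small (k+p) x n matrix
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_small[:, :k], s[:k], Vt[:k, :]

# Example on a matrix with rapidly decaying spectrum.
rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 30)) @ rng.standard_normal((30, 400))
U, s, Vt = randomized_svd(A, k=30)
print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))
```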
Maryam Fazel, University of Washington
-
Michael Friedlander, University of California, Davis
-
Tamara Kolda, Sandia National Laboratories
Computing the Largest Entries in a Matrix Product via Sampling
We consider the problem of how to efficiently find the largest entries
in the matrix product AB without explicitly computing the product. We are
motivated by the problem of finding nodes in a graph that share many common
neighbors. Consequently, we begin by explaining sampling techniques for graphs:
counting triangles and 4-vertex patterns (Seshadhri, Pinar, Kolda, SDM13;
Jha, Pinar, Seshadhri, WWW15). We explain how this leads naturally to sampling
methods for computing the largest entries in the matrix product. Our techniques
can be used for computing A^2 in the case that A is symmetric (undirected graph),
A'A in the case that A is rectangular (bipartite graph), and the general matrix
product AB (tripartite graph). They work for both binary and weighted matrices,
even in the case that some weights are negative.
This is joint work with Grey Ballard, Ali Pinar, and C. Seshadhri.
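The sketch below conveys the flavor of this sampling idea for the binary A'A case: sample a row with probability proportional to the number of column pairs ("wedges") it contains, then a uniformly random pair of its nonzero columns, so that a pair (i, j) appears with probability proportional to (A'A)_{ij}. This is only a toy version; the talk's method also handles weighted matrices and the general product AB.

```python
# Sketch: wedge sampling to surface the largest off-diagonal entries of
# A'A (number of common neighbors) without forming the product.
import numpy as np
from collections import Counter

def sample_large_entries(A, n_samples=20000, seed=0):
    rng = np.random.default_rng(seed)
    rows = [np.flatnonzero(A[k]) for k in range(A.shape[0])]
    wedges = np.array([len(r) * (len(r) - 1) / 2 for r in rows], dtype=float)
    probs = wedges / wedges.sum()
    counts = Counter()
    for k in rng.choice(len(rows), size=n_samples, p=probs):
        i, j = rng.choice(rows[k], size=2, replace=False)
        counts[(min(i, j), max(i, j))] += 1
    return counts.most_common(5)   # most frequent pairs ~ largest entries

# Example: random bipartite incidence matrix.
rng = np.random.default_rng(1)
A = (rng.random((300, 40)) < 0.2).astype(int)
print(sample_large_entries(A))
```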
Po-Ling Loh, University of Pennsylvania
PDW methods for support recovery in nonconvex high-dimensional problems
The primal-dual witness (PDW) technique is a now-standard proof strategy for
establishing variable selection consistency for sparse high-dimensional estimation
problems when the objective function and regularizer are convex. The method
proceeds by optimizing the objective function over the parameter space restricted
to the true support of the unknown vector, then using a dual witness to certify
that the resulting solution is also a global optimum of the unrestricted problem.
We present a modified primal-dual witness framework that may be applied even
to nonconvex, penalized objective functions that satisfy certain regularity
conditions. Notably, our theory allows us to derive rigorous support recovery
guarantees for local and global optima of regularized regression estimators
involving nonconvex penalties such as the SCAD and MCP, which do not involve
the restrictive incoherence conditions from Lasso-based theory. Projected
gradient descent methods may be used to obtain these optima in an efficient
manner.
This is joint work with Martin Wainwright.
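To make the nonconvex setting concrete, here is a minimal sketch of composite gradient descent for regression with the MCP penalty, written as a smooth part plus lam*||.||_1; the side l1 constraint used in the theory is omitted for brevity, and the step size and penalty parameters are illustrative.

```python
# Sketch: composite gradient descent for least squares with the MCP
# penalty, split as MCP(t) = q(t) + lam*|t| with q smooth.
import numpy as np

def mcp_smooth_part_grad(beta, lam, b):
    """Gradient of q(t) = MCP(t) - lam*|t| (smooth, 1/b-Lipschitz)."""
    return np.where(np.abs(beta) <= b * lam, -beta / b, -lam * np.sign(beta))

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def mcp_regression(X, y, lam=0.1, b=3.0, n_iter=500):
    n, d = X.shape
    eta = 1.0 / (np.linalg.norm(X, 2) ** 2 / n + 1.0 / b)   # 1 / Lipschitz
    beta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n + mcp_smooth_part_grad(beta, lam, b)
        beta = soft_threshold(beta - eta * grad, eta * lam)
    return beta

# Example: sparse linear model with n < d.
rng = np.random.default_rng(0)
n, d, s = 200, 400, 10
X = rng.standard_normal((n, d))
beta_true = np.zeros(d); beta_true[:s] = 1.0
y = X @ beta_true + 0.1 * rng.standard_normal(n)
beta_hat = mcp_regression(X, y, lam=0.1)
print(np.linalg.norm(beta_hat - beta_true) / np.linalg.norm(beta_true))
```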
Michael Mahoney, University of California,
Berkeley
Eigenvector localization, implicit regularization, and algorithmic anti-differentiation
for large-scale graphs and matrix data
Although graphs and matrices are popular ways to model data drawn from many
application domains, the size scale, sparsity properties, noise properties,
and many other aspects of graphs and matrices arising in realistic machine
learning and data analysis applications present fundamental challenges for
traditional matrix and graph algorithms. Informally, this at root has to do
with the observation that small-scale or local structure in many data sets
is often better or more meaningful than large-scale or global structure, which
is exactly the opposite of what many traditional asymptotic tools typically
implicitly assume. For example, there might be meaningful small-scale clusters
in social networks that are tree-like or expander-like at large size scales.
Here, I will describe how eigenvector localization can be used to justify
these sorts of claims, how localization can be engineered in a principled
way into popular matrix and graph algorithms in order to characterize local
properties in large data sets much more generally, and how seemingly-minor
details in the approximate computation of localized (as well as non-localized)
objectives can make a large difference in the usefulness of a given method
in downstream applications. A common theme will be the critical role of what
can be called implicit regularization---that is, regularization not in the
usual sense of explicitly adding a regularization term and solving the modified
objective exactly, but instead as an implicit side effect of heuristics that
are often used to make approximation algorithms scale well---as well as the
question of what objective function an approximation algorithm implicitly
solves exactly.
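As one concrete example of the kind of local, approximately computed procedure at issue, here is a sketch of the approximate personalized PageRank "push" routine; truncating residuals below eps keeps the computation local to the seed, which is one instance of implicit regularization. The graph, seed, alpha, and eps below are illustrative choices, not the talk's specific constructions.

```python
# Sketch: approximate personalized PageRank via the push procedure.
# Residuals below eps * degree are never pushed, so the computation
# touches only a neighborhood of the seed.
from collections import defaultdict

def approximate_ppr(adj, seed, alpha=0.15, eps=1e-4):
    """adj: dict node -> list of neighbors (undirected graph)."""
    p = defaultdict(float)                   # approximate PageRank mass
    r = defaultdict(float); r[seed] = 1.0    # residual mass
    queue = [seed]
    while queue:
        u = queue.pop()
        deg = len(adj[u])
        if r[u] < eps * deg:
            continue
        push = r[u]
        p[u] += alpha * push
        r[u] = (1 - alpha) * push / 2
        share = (1 - alpha) * push / (2 * deg)
        for v in adj[u]:
            r[v] += share
            if r[v] >= eps * len(adj[v]):
                queue.append(v)
        if r[u] >= eps * deg:
            queue.append(u)
    return dict(p)

# Example: two small cliques joined by a single edge.
adj = defaultdict(list)
edges = [(i, j) for i in range(5) for j in range(5) if i < j]
edges += [(i, j) for i in range(5, 10) for j in range(5, 10) if i < j]
edges += [(4, 5)]
for a, b in edges:
    adj[a].append(b); adj[b].append(a)
scores = approximate_ppr(adj, seed=0)
print(sorted(scores.items(), key=lambda kv: -kv[1])[:5])
```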
Jakub Marecek, IBM Research
Coordinate Descent and Challenges therein
Since the publication of Nesterov's paper "Efficiency of coordinate descent
methods on huge-scale optimization problems" (SIOPT, 2012), there has
been much interest in randomised coordinate descent. Coordinate descent may
be a particularly simple algorithm schema, but both analyses in realistic
models of distributed computation and applications beyond composite optimisation
remain a challenge. We present analyses and implementations of distributed
coordinate descent with both traditional applications to partially-separable
functions in sparse regression and training of sparse support vector machines,
and novel applications including matrix-completion with inequalities and semidefinite-programming
relaxations of polynomial optimisation problems. Engineering highlights include
major progress on a 3TB sparse regression instance within 30 minutes of
wall-clock time and the solution of relaxations of alternating-current optimal
power flows, a key problem in power systems analysis.
This talk is based on joint work with Bissan Ghaddar, Martin Mevissen, Peter
Richtarik, and last but not least Martin Takac.
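For reference, here is a minimal single-machine sketch of randomised coordinate descent on a composite objective, the Lasso; distributed variants partition the coordinates across machines. The penalty level and problem sizes below are illustrative.

```python
# Sketch: randomised coordinate descent for
# (1/2n)||y - X b||^2 + lam*||b||_1, one coordinate per step.
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * max(abs(x) - tau, 0.0)

def randomized_cd_lasso(X, y, lam=0.1, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    col_sq = (X ** 2).sum(axis=0) / n        # coordinate-wise curvature
    beta = np.zeros(d)
    resid = y.copy()                         # y - X @ beta, kept up to date
    for _ in range(n_epochs * d):
        j = rng.integers(d)
        if col_sq[j] == 0:
            continue
        rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
        new = soft_threshold(rho, lam) / col_sq[j]   # closed-form update
        resid += X[:, j] * (beta[j] - new)
        beta[j] = new
    return beta

rng = np.random.default_rng(1)
n, d = 300, 1000
X = rng.standard_normal((n, d))
beta_true = np.zeros(d); beta_true[:5] = 2.0
y = X @ beta_true + 0.1 * rng.standard_normal(n)
print(np.flatnonzero(np.abs(randomized_cd_lasso(X, y)) > 1e-6)[:10])
```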
Yaniv Plan, University of British Columbia
The generalized lasso with non-linear measurements
Consider signal estimation from non-linear measurements. A rough heuristic
often used in practice postulates that "non-linear measurements may be
treated as noisy linear measurements" and the signal may be reconstructed
accordingly. We give a rigorous backing to this idea, with a focus on low-dimensional
signals buried in a high-dimensional space. Just as noise may be diminished
by projecting onto the lower-dimensional space, the error from modeling non-linear
measurements with linear measurements will be greatly reduced when using the
signal structure in the reconstruction. We assume a random Gaussian model
for the measurement matrix, but allow the rows to have an unknown, and ill-conditioned,
covariance matrix. As a special case of our results, we give a theoretical accuracy
guarantee for 1-bit compressed sensing with an unknown covariance matrix of the
measurement vectors.
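The heuristic can be illustrated on the 1-bit special case: treat y = sign(Ax) as if it were linear data and solve a generalized Lasso over an l1 ball by projected gradient. The l1 radius, step size, and problem sizes below are illustrative choices.

```python
# Sketch: generalized Lasso for 1-bit measurements y = sign(A x),
# solved by projected gradient over an l1 ball.
import numpy as np

def project_l1_ball(v, R):
    """Euclidean projection onto {x : ||x||_1 <= R} (Duchi et al. style)."""
    if np.abs(v).sum() <= R:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > css - R)[0][-1]
    theta = (css[rho] - R) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def generalized_lasso_1bit(A, y, R, n_iter=300):
    m, d = A.shape
    step = 1.0 / (np.linalg.norm(A, 2) ** 2 / m)
    x = np.zeros(d)
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y) / m          # treat y as linear data
        x = project_l1_ball(x - step * grad, R)
    return x

# Example: s-sparse unit signal observed through 1-bit measurements.
rng = np.random.default_rng(0)
m, d, s = 500, 200, 5
x_true = np.zeros(d); x_true[:s] = 1.0 / np.sqrt(s)
A = rng.standard_normal((m, d))
y = np.sign(A @ x_true)
x_hat = generalized_lasso_1bit(A, y, R=np.sqrt(s))
# 1-bit measurements lose scale, so compare directions.
print(x_hat @ x_true / (np.linalg.norm(x_hat) * np.linalg.norm(x_true)))
```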
Ben Recht, University of California, Berkeley
-
Stephen Vavasis, University of Waterloo
Nonnegative matrix factorization for structured data
Nonnegative matrix factorization (NMF) has become a crucial tool for revealing
hidden relationships in a large data matrix and is applicable to identifying
topics in document collections, features in images, related genes in organisms
and many other problems. Despite the wide usage of NMF and the existence of
dozens of heuristic methods, the underlying problem is NP-hard. We describe
our recent results in which convex optimization is guaranteed to compute the
correct NMF in the case that the input data has certain separable or related
structure. Parts of this talk represent joint work with X. V. Doan, N. Gillis,
and K.-C. Toh.
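To illustrate what the separability assumption buys, here is a sketch of one standard algorithm that exploits it, the successive projection algorithm for picking anchor columns; this is an illustration of the structure, not the convex-optimization formulation described in the talk.

```python
# Sketch: successive projection for separable NMF, where every column
# of W appears as a column of M = W H.
import numpy as np

def successive_projection(M, r):
    """Return indices of r candidate anchor columns of M."""
    R = M.astype(float).copy()
    anchors = []
    for _ in range(r):
        j = int(np.argmax((R ** 2).sum(axis=0)))   # largest residual column
        anchors.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R = R - np.outer(u, u @ R)                 # project out direction u
    return anchors

# Example: a separable nonnegative matrix with hidden anchor columns.
rng = np.random.default_rng(0)
m, r, n = 30, 4, 50
W = rng.random((m, r))
H = rng.dirichlet(np.ones(r), size=n).T            # columns on the simplex
H[:, :r] = np.eye(r)                               # anchors are columns 0..r-1
M = W @ H
print(sorted(successive_projection(M, r)))         # expect [0, 1, 2, 3]
```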
Lin Xiao, Microsoft Research
Communication-Efficient Distributed Optimization of Self-Concordant Empirical
Loss
We propose a communication-efficient distributed algorithm for solving large-scale
empirical risk minimization problems in machine learning. The algorithm is
based on an inexact damped Newton method, where the inexact Newton steps are
computed by a distributed preconditioned conjugate gradient method. We analyze
its iteration complexity and communication efficiency for minimizing self-concordant
empirical loss functions, and discuss the implications for distributed ridge
regression, logistic regression and binary classification with a smoothed
hinge loss. In a standard setting for supervised learning where the problem
condition number grows with the total sample size, the required number of
communication rounds of our algorithm does not increase with the sample size,
but only grows slowly with the number of machines in the distributed system.
(This is joint work with Yuchen Zhang.)
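A single-machine sketch of the core iteration is given below: the Newton direction is computed inexactly by conjugate gradient using only Hessian-vector products, and the damped step 1/(1 + delta) follows the self-concordant analysis. The distributed, preconditioned aspects of the talk are omitted, and the regularization level and data are illustrative.

```python
# Sketch: inexact damped Newton for l2-regularized logistic regression,
# with the Newton step computed by conjugate gradient.
import numpy as np

def logistic_grad_hess_vec(w, X, y, gamma):
    z = X @ w
    p = 1.0 / (1.0 + np.exp(-z))
    grad = X.T @ (p - y) / len(y) + gamma * w
    d = p * (1 - p) / len(y)
    def hess_vec(v):                       # Hessian-vector product only
        return X.T @ (d * (X @ v)) + gamma * v
    return grad, hess_vec

def conjugate_gradient(hess_vec, b, tol=1e-8, max_iter=200):
    x = np.zeros_like(b)
    r = b.copy(); p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Hp = hess_vec(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def inexact_damped_newton(X, y, gamma=1e-3, n_iter=20):
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad, hess_vec = logistic_grad_hess_vec(w, X, y, gamma)
        v = conjugate_gradient(hess_vec, grad)   # approximate Newton step
        delta = np.sqrt(v @ hess_vec(v))         # Newton decrement
        w -= v / (1.0 + delta)                   # damped update
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
w_true = rng.standard_normal(20)
y = (X @ w_true + 0.5 * rng.standard_normal(500) > 0).astype(float)
w_hat = inexact_damped_newton(X, y)
print("training accuracy:", np.mean((X @ w_hat > 0) == (y > 0.5)))
```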