Applications of Statistical Science
Seminar Series
Rob Tibshirani, Stanford University
December 17th 1998
ABSTRACT
Boosting (Freund and Schapire, 1995) is one of the most important recent
developments in classification methodology. Boosting works by sequentially
applying a classification algorithm to reweighted versions of the training
data, and then taking a weighted majority vote of the sequence of classifiers
thus produced. For many classification algorithms, this simple strategy
results in dramatic improvements in performance. We show that this seemingly
mysterious phenomenon can be understood in terms of well known statistical
principles, namely additive modeling and maximum likelihood. For the
two-class problem, boosting can be viewed as an approximation to additive
modeling on the logistic scale using maximum Bernoulli likelihood as
a criterion. We develop more direct approximations and show that they
exhibit nearly identical results to boosting. Direct multi-class generalizations
based on multinomial likelihood are derived that exhibit performance
comparable to other recently proposed multi-class generalizations of
boosting in most situations, and far superior in some. We suggest a
minor modification to boosting that can reduce computation, often by
factors of 10 to 50. Finally, we apply these insights to produce an
alternative formulation of boosting decision trees. This approach, based
on best-first truncated tree induction, often leads to better performance,
and can provide interpretable descriptions of the aggregate decision
rule. It is also much faster computationally, making it more suitable
for large-scale data mining applications.
This is joint work with Jerome Friedman and Trevor Hastie.
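
For a rough illustration of the reweighting-and-voting scheme described
above, the Python sketch below implements a discrete AdaBoost-style loop.
The choice of decision stumps as the base classifier and of labels coded
as +1/-1 are illustrative assumptions, not details fixed by the abstract.

# A rough sketch of the boosting loop: repeatedly fit a base classifier to
# reweighted training data, then combine the fitted classifiers by a
# weighted majority vote.  Labels are assumed coded as +1/-1, and decision
# stumps serve as the (illustrative) base classifier.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, n_rounds=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                    # start with uniform weights
    classifiers, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)       # fit to the reweighted data
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # vote weight for this round
        w *= np.exp(-alpha * y * pred)         # up-weight misclassified points
        w /= w.sum()                           # renormalize
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def boosted_predict(classifiers, alphas, X):
    # weighted majority vote of the sequence of classifiers
    votes = sum(a * c.predict(X) for a, c in zip(alphas, classifiers))
    return np.sign(votes)

The point of the talk is that iterations of this kind can be read as
stagewise fitting of an additive model on the logistic scale.
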
Shelley Bull
Samuel Lunenfeld Research Institute of Mount Sinai Hospital and
Department of Public Health Sciences, University of Toronto
February 2nd 1999
ABSTRACT
The relationship between statistics and research in human genetics
goes back to the work of early statisticians such as Fisher, Haldane,
Pearson, and others. Because of random aspects of the process of meiosis
that produces human gametes, transmission of genetic information from
parents to offspring is inherently probabilistic. In many cases, the
subject of interest, the gene, is not directly observable and its location
on the genome is unknown, so inference is typically based on likelihoods
that consider all possibilities consistent with the known information.
Similarly, because of the importance of pedigrees that
extend across generations, data on some family members are often incomplete.
The development of new molecular technologies has dramatically changed
the nature and the volume of data that are available for the study of
complex human disease. It is estimated that the completion of the Human
Genome Project will reveal the location of 100,000 genes. As investigators
turn their attention from simple single-gene Mendelian disease to the
"genetic dissection of complex traits" such as diabetes and cardiovascular
disease that depend on many genes, statistical methods such as regression
and mixture distributions are required to model heterogeneity from known
and unspecified sources. I will review current statistical approaches
to finding genes for complex human disease and highlight some recent
advances in these approaches.
David Andrews, University of Toronto
March 2nd 1999
ABSTRACT
Election-night forecasting includes statistical components
for data collection, prediction and display. The evolution of these
components and their interaction is reviewed. The statistical aspects
of this activity have application to a broader class of problems. The
role of a mathematical scientist in complex activities is discussed.
Carl G. Amrhein, PhD
Department of Geography and The Graduate Program in Planning, Faculty
of Arts and Science, University of Toronto
(Joint work with David Buckeridge, MD, MSc
Department of Public Health Sciences, Faculty of Medicine, University
of Toronto)
April 6th 1999
ABSTRACT
The goal of this seminar is to provide an overview of current
problems in spatial analysis and to explore computational solutions
to some of these problems.
The seminar will begin with a review of the basic assumptions
underlying spatial statistics, then move on to briefly examine the issue
of spatial autocorrelation. Following this, a number of current problems
in spatial analysis will be presented, including edge effects, the overlay
problem, incompatible data sets, and aggregation effects. After this
overview, the problem of aggregation effects will be examined more closely.
Specifically, the importance of the problem will be discussed, and work
done in this area will be reviewed.
An epidemiological study will then be used to examine
some applied problems and solutions in spatial analysis with a focus
on spatial autocorrelation. In this study, a GLM is used to model the
relationship between hospital admission for respiratory illness and
estimated exposure to motor vehicle emissions. Issues associated with
data source overlay and georeferencing are briefly illustrated. Moran's
coefficient of spatial autocorrelation is used to describe the degree
of spatial autocorrelation in individual variables and to examine the
regression residuals, exploring the possible effect of spatial autocorrelation
on the standard errors of the regression parameter estimates. Finally, some
further issues will be briefly discussed such as regression models that
explicitly incorporate spatial autocorrelation, and the choice of connectivity
measures in spatial analysis.
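
For concreteness, a minimal sketch of Moran's coefficient is given below.
It assumes a precomputed n-by-n spatial connectivity (weight) matrix W,
whose construction is itself one of the design choices mentioned above,
and it can be applied to an individual variable or to the GLM residuals.

# A minimal sketch of Moran's I, the coefficient of spatial autocorrelation:
# I = (n / sum_ij W_ij) * (z' W z) / (z' z), where z holds deviations from
# the mean.  The connectivity matrix W is assumed given (contiguity,
# distance-based, etc.).
import numpy as np

def morans_i(x, W):
    x = np.asarray(x, dtype=float)
    W = np.asarray(W, dtype=float)
    z = x - x.mean()                      # deviations from the mean
    num = z @ W @ z                       # spatially weighted cross-products
    den = z @ z                           # total variation
    return (len(x) / W.sum()) * num / den

# Applied to regression residuals (hypothetical names), e.g.:
#   print(morans_i(glm_fit.resid_deviance, W))

Under no spatial autocorrelation the expected value of I is -1/(n-1);
clearly positive values indicate clustering of similar values among
neighbouring units.
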
Yoshua Bengio
Departement d'Informatique et Recherche Operationnelle and Centre
de Recherches Mathematiques, Université de Montréal
May 4th 1999
ABSTRACT
The objective of machine learning is not necessarily to
discover the underlying distribution of the observed data, but rather
to infer a function from data coming from an unknown distribution, such
that this function can be used to make predictions or take decisions
on new data from the same unknown distribution. In this talk we will
consider a particular class of machine learning problems in which,
contrary to what is usually assumed in econometrics and machine learning,
the data are not assumed to be stationary or IID (independently and
identically distributed). First we will discuss a framework in which
the usual notions of generalization error can be extended to this case.
Second we will discuss briefly how different results can be obtained
when testing hypotheses about the model rather than testing hypotheses
about the generalization error.
In the second part of this talk, I will present some of
our work on improving generalization error when the data are not IID,
based on the optimization of so-called hyper-parameters. Many machine
learning algorithms can be formulated as the minimization of a training
criterion which involves both training errors on each training example
and some hyper-parameters, which are kept fixed during this minimization.
When there is only a single hyper-parameter one can easily explore how
its value affects a model selection criterion (which is not the same
as the training criterion and is used to select the hyper-parameters).
We will briefly describe a new methodology to select many hyper-parameters,
based on the computation of the gradient of a model selection criterion
with respect to the hyper-parameters. We will present an application
of this methodology to the selection of hyper-parameters that control
how much weight is put on each training example in time-series prediction,
with more weight placed on more recent examples. Statistically
significant improvements were obtained in predicting future volatility
of Canadian stocks using this method, in a very simple setting.
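
To make the example-weighting idea concrete, the sketch below is a
deliberate simplification, not the method presented in the talk: a single
hyper-parameter lambda controls exponentially decaying weights on past
examples, a weighted least-squares model is fit under those weights, and
lambda is tuned against a held-out model selection criterion, with the
gradient with respect to lambda approximated by finite differences rather
than computed analytically.

# Illustrative sketch: tune a single example-weighting hyper-parameter
# against a model selection criterion (validation error on a later window).
import numpy as np

def fit_weighted(X, y, lam):
    # weight recent examples more: w_t proportional to exp(-lam * age),
    # where the most recent observation has age 0
    ages = np.arange(len(y))[::-1]
    w = np.exp(-lam * ages)
    # weighted least squares: minimize sum_t w_t * (y_t - x_t' beta)^2
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

def selection_criterion(lam, X_tr, y_tr, X_val, y_val):
    beta = fit_weighted(X_tr, y_tr, lam)
    return np.mean((y_val - X_val @ beta) ** 2)   # held-out squared error

def tune_lambda(X_tr, y_tr, X_val, y_val, lam=0.01, lr=1e-3, steps=100, eps=1e-5):
    # crude gradient descent on the selection criterion, with the gradient
    # with respect to lambda approximated by central finite differences
    for _ in range(steps):
        g = (selection_criterion(lam + eps, X_tr, y_tr, X_val, y_val)
             - selection_criterion(lam - eps, X_tr, y_tr, X_val, y_val)) / (2 * eps)
        lam = max(lam - lr * g, 0.0)
    return lam

In practice many hyper-parameters may be tuned at once, which is why the
gradient-based methodology described in the abstract matters.
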