My main focus of research is
currently the theory of statistical machine learning and its applications to
data mining issues.
The Computational Complexity of Statistical Learning
The complexity of machine
learning tasks can be measured along two parameters: information and
computation. Information complexity is concerned with the generalization
performance of learning algorithms, namely, how many training examples are
needed to achieve certain accuracy of prediction? How fast do empirical
estimates converge to the true population parameters? Etc.
The computational complexity
of learning concerns the computation applied to the training data in order to
extract from it the learner's predictions. Although computational resources are
a critical bottleneck in practical applications of machine learning tools,
until recently there were surprisingly few theoretical results on this issue.
In a series of papers (Barteltt Ben-David '99,
Ben-David, Eiron,
Long '00 Ben-David,
Simon '00) we have shown that many central tasks that arise in common
applications of machine learning techniques are computationally infeasible.
More precisely, we have shown that for each of a wide variety of commonly used
hypothesis classes, there exists a constant such that the task of (even)
approximating the best fit hypothesis in the class (for an input training data)
is NP-hard.
On the other hand, in an
attempt to provide mathematical understanding to the success of some of these
methods in practice, we have offered a relaxation of the worst-case complexity
measure that on one hand reflects
properties common to data encountered in practice, and on the other hand
suffices to allow computational feasibility of the basic learning task (Ben-David, Simon '01, Ben-David, Simon '00).
In some sense this relaxation is similar to the notion of "Smoothed
Complexity" offered very recently by Spielman.
I have been invited to present this research in a special tutorial at the
Thirteenth Annual Conference on Computational Learning Theory (COLT 2000) and
in a tutorial at the Alpine Workshop on Computational Aspects of Learning 2001.
A Theoretical Analysis of the Generalization Ability of Support Vector Machines
Kernel based learning
methods are among the most powerful and widely used learning paradigms. A large
body of research supports the algorithmic and computational complexity aspects
of kernel based learning, and in particular of the support vector machines
approach. When it comes to the information theoretic aspects, namely figuring
out needed sample sizes and generalization guarantees, the understanding of
these methods seems to be lacking. Most (if not all) of the existing
statistical significance results have an a posteriori nature – the bounds are
based on viewing the actual input training data, rather than on some a priori
assumptions about the nature of the learning problem.
As a contrasting example,
consider the PAC learning framework, or the empirical risk minimization
approach to statistical regression, where the fundamental generalization
results are of the form "if the target function (or labeling distribution)
is close enough to a member of our hypothesis class, then with high
probability, if we observe so many labeled examples, our hypothesis will have
error smaller than ... ." The question of a priori generalization
guarantees can be formulated as follows:
Given a class H of binary valued functions over some domain set X,
does there exist an embedding F of X into some Euclidean space Rn,
such that:
Note that this is a rather
weak version of a priori bounds. We only ask about the existence of a
successful embedding F and do not ask about the complexity of the search
for such an F. It follows that negative answers to this type of question
are quite strong, whereas a positive answer will require some further work
before it can be transformed into a useful algorithmic procedure.
There are three parameters
used to derive useful a-posteriori generalization bounds for SVM's: The
Euclidean dimension of the embedding's range space (or, equivalently, the VC
dimension of the space of hyper-planes), the margins obtained by a hypothesis
half-space (w.r.t. the image of the training data under the embedding), and the
sparsity of the hypothesis hyper-plane. The following question emerges as
fundamental to our evaluation of the SVM approach:
Can
any of these three parameters provide an a priori guarantee on the
generalization quality of a learning algorithm? That is, assuming the sample
generating distribution is close to a function in some restricted class of
domain dichotomies, does there exist a kernel embedding that is guaranteed to
give rise to a useful value of any of the above three parameters?
A significant progress in
this direction was recently obtained in
(Ben-David,
Eiron, Simon '01). It is shown there that most of the finite concept
classes of any fixed VC-dimension cannot be embedded into classes of Euclidean
half-spaces unless the resulting margins are very small and the Euclidean
dimension of the half-spaces is very large. These results imply that the known
generalization bounds that are based on either VC-dimension or margins cannot provide
a priori generalization guarantees for kernel based learning (even for classes
on which, without embedding them into half-spaces, empirical risk minimization
does enjoy such guarantees).
Scale Sensitive Dimension and Statistical Generalization
In typical learning
problems, the learner is presented with a finite sample of data generated by an
unknown source and has to find, within a given class, the model yielding best
predictions on future data generated by the same source. In a realistic scenario,
the information provided by the sample is incomplete, and therefore the learner
might settle for approximating the actual best model in the class within some
given accuracy. If the data source is probabilistic and the hypothesis class
consists of functions, a sample size sufficient for a given accuracy has been
shown to be dependent on different combinatorial notions of
"dimension," each measuring, in a certain sense, the complexity of
the learner's hypothesis class.
Whenever the learner is
allowed a low degree of accuracy, the complexity of the hypothesis class might
be measured on a coarse "scale" since, in this case, we do not need
the full power of the entire set of models. This position can be related to
Rissanen's MDL principle (Ris78), Vapnik's structural minimization method
(Vap82), and Guyon et al.'s notion of effective dimension (GVBBS91).
Intuitively, the "dimension" of a class of functions decreases as the
coarseness of the scale at which it is measured increases. Thus, by measuring
the complexity at the right "scale" (i.e., proportional to the
accuracy) the sample size sufficient for finding the best model within the
given accuracy might dramatically shrink. The results described below show that
the accuracy parameter plays a crucial role in determining the effective
complexity of the learner's hypothesis class.
Learnability in Valiant's
PAC learning model has been shown to be strongly related to the existence of
uniform laws of large numbers. These laws define a distribution-free convergence
property of means to expectations uniformly over classes of random variables.
Classes of real-valued functions enjoying such a property are also known as
uniform Glivenko-Cantelli classes. In Alon, Ben-David,
Cesa-Bianchi, Haussler '97 we prove, through a generalization of Sauer's
lemma that may be interesting in its own right, a new characterization of
uniform Glivenko-Cantelli classes. Our characterization yields Dudley, Giné,
and Zinn's previous characterization as a corollary. Furthermore, it is the
first characterization that is based on a simple combinatorial quantity
generalizing the Vapnik-Chervonenkis dimension. We apply this result to obtain
the weakest combinatorial condition known to imply PAC learnability in the
statistical regression (or "agnostic") framework.
These results have become a
key mathematical tool in the development of margin based generalization
bounds for a variety of central learning approaches such as Boosting, Neural
Nets and Support Vector Machines.