The current focus of my research is the theory of statistical machine learning and its applications to data mining problems.
The Computational Complexity of Statistical Learning
The complexity of machine learning tasks can be measured along two axes: information and computation. Information complexity concerns the generalization performance of learning algorithms: how many training examples are needed to achieve a given prediction accuracy? How fast do empirical estimates converge to the true population parameters? And so on.
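As a concrete illustration of information complexity (not drawn from the papers cited below), the classical PAC bound for a finite hypothesis class states that m >= (1/epsilon)(ln|H| + ln(1/delta)) examples suffice for a consistent learner. A minimal sketch:

```python
import math

def pac_sample_bound(h_size, epsilon, delta):
    """Classical PAC sample-size bound for a finite hypothesis class:
    m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples suffice for a
    consistent learner to achieve error <= epsilon with probability
    >= 1 - delta over the draw of the sample."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Example: |H| = 2**20 hypotheses, 5% target error, 1% failure probability.
m = pac_sample_bound(2**20, epsilon=0.05, delta=0.01)  # a few hundred examples
```

Note how the bound grows only logarithmically in |H| but linearly in 1/epsilon; this is the kind of quantitative trade-off that information-complexity analysis makes precise.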
The computational complexity of learning concerns the computation applied to the training data in order to extract from it the learner's predictions. Although computational resources are a critical bottleneck in practical applications of machine learning tools, until recently there were surprisingly few theoretical results on this issue. In a series of papers (Bartlett, Ben-David '99; Ben-David, Eiron, Long '00; Ben-David, Simon '00) we have shown that many central tasks arising in common applications of machine learning techniques are computationally infeasible. More precisely, we have shown that for each of a wide variety of commonly used hypothesis classes there exists a constant such that even approximating the best-fit hypothesis in the class (for a given input training sample) to within that factor is NP-hard.
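To make the object of these hardness results concrete, here is a hypothetical sketch (not an algorithm from the cited papers) of empirical risk minimization by brute-force enumeration over a finite hypothesis class. Its cost is linear in |H|, and for the rich geometric classes covered by the hardness results, |H| is effectively exponential in the natural parameters of the problem, which is exactly where the infeasibility bites:

```python
def erm_by_enumeration(hypotheses, sample):
    """Empirical risk minimization by brute force: scan every hypothesis
    and return one with the fewest mistakes on the labeled sample.
    Running time is |H| * |sample| hypothesis evaluations."""
    def empirical_error(h):
        return sum(1 for x, y in sample if h(x) != y)
    return min(hypotheses, key=empirical_error)

# Toy class (purely illustrative): threshold functions on one integer feature.
hypotheses = [lambda x, t=t: int(x >= t) for t in range(10)]
sample = [(2, 0), (3, 0), (7, 1), (8, 1)]
best = erm_by_enumeration(hypotheses, sample)  # a zero-error threshold exists here
```

For one-dimensional thresholds this search is cheap; the point of the hardness results is that for classes such as half-spaces with noise, no procedure, enumerative or otherwise, can even approximate the minimum empirical error in polynomial time unless P = NP.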
On the other hand, in an attempt to provide a mathematical understanding of the success of some of these methods in practice, we have offered a relaxation of the worst-case complexity measure that, on the one hand, reflects properties common to data encountered in practice and, on the other hand, suffices to make the basic learning task computationally feasible (Ben-David, Simon '01; Ben-David, Simon '00). In some sense this relaxation is similar to the notion of "smoothed complexity" offered very recently by Spielman and Teng. I have been invited to present this research in a special tutorial at the Thirteenth Annual Conference on Computational Learning Theory (COLT 2000) and in a tutorial at the Alpine Workshop on Computational Aspects of Learning 2001.
A Theoretical Analysis of the Generalization Ability of Support Vector Machines
Kernel-based learning methods are among the most powerful and widely used learning paradigms. A large body of research addresses the algorithmic and computational-complexity aspects of kernel-based learning, and in particular of the support vector machines approach. When it comes to the information-theoretic aspects, namely determining the needed sample sizes and generalization guarantees, the understanding of these methods seems to be lacking. Most (if not all) of the existing statistical generalization results have an a posteriori nature: the bounds are based on viewing the actual input training data, rather than on some a priori assumptions about the nature of the learning problem.
As a contrasting example, consider the PAC learning framework, or the empirical risk minimization approach to statistical regression, where the fundamental generalization results are of the form "if the target function (or labeling distribution) is close enough to a member of our hypothesis class, then with high probability, if we observe so many labeled examples, our hypothesis will have error smaller than ... ." The question of a priori generalization guarantees can be formulated as follows:
Given a class H of binary-valued functions over some domain set X, does there exist an embedding F of X into some Euclidean space Rn such that every function in H is realized by a half-space over the image of X under F, with guaranteed favorable values of the parameters (such as dimension and margin) that govern the known generalization bounds?
Note that this is a rather weak version of a priori bounds. We only ask about the existence of a successful embedding F and do not ask about the complexity of the search for such an F. It follows that negative answers to this type of question are quite strong, whereas a positive answer will require some further work before it can be transformed into a useful algorithmic procedure.
Three parameters are used to derive useful a posteriori generalization bounds for SVMs: the Euclidean dimension of the embedding's range space (or, equivalently, the VC dimension of the space of hyper-planes), the margin obtained by a hypothesis half-space (with respect to the image of the training data under the embedding), and the sparsity of the hypothesis hyper-plane. The following question emerges as fundamental to our evaluation of the SVM approach:
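For readers less familiar with the margin parameter, the following sketch (a generic illustration, not code from the cited work) computes the geometric margin of a fixed linear separator over an embedded, labeled sample; the a posteriori bounds improve as this quantity grows:

```python
import math

def geometric_margin(w, b, points):
    """Minimum signed distance from labeled points to the hyper-plane
    w . x + b = 0: margin = min_i y_i (w . x_i + b) / ||w||.
    A positive value means every point is correctly classified."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in points)

# Toy 2-D "embedded" sample with labels in {-1, +1}; w and b are
# hypothetical, standing in for the output of an SVM solver.
sample = [((2.0, 2.0), +1), ((3.0, 1.0), +1), ((-1.0, -1.0), -1)]
margin = geometric_margin((1.0, 1.0), 0.0, sample)  # sqrt(2) on this sample
```

The a priori question asked below is whether, for a given class of domain dichotomies, some embedding is guaranteed in advance to make this quantity (or the dimension, or the sparsity) large enough to be useful.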
Can any of these three parameters provide an a priori guarantee on the generalization quality of a learning algorithm? That is, assuming the sample generating distribution is close to a function in some restricted class of domain dichotomies, does there exist a kernel embedding that is guaranteed to give rise to a useful value of any of the above three parameters?
Significant progress in this direction was recently made in (Ben-David, Eiron, Simon '01). It is shown there that most finite concept classes of any fixed VC-dimension cannot be embedded into classes of Euclidean half-spaces unless the resulting margins are very small and the Euclidean dimension of the half-spaces is very large. These results imply that the known generalization bounds based on either VC-dimension or margins cannot provide a priori generalization guarantees for kernel-based learning (even for classes on which, without embedding them into half-spaces, empirical risk minimization does enjoy such guarantees).
Scale Sensitive Dimension and Statistical Generalization
In typical learning problems, the learner is presented with a finite sample of data generated by an unknown source and has to find, within a given class, the model yielding the best predictions on future data generated by the same source. In a realistic scenario, the information provided by the sample is incomplete, and therefore the learner might settle for approximating the actual best model in the class within some given accuracy. If the data source is probabilistic and the hypothesis class consists of functions, a sample size sufficient for a given accuracy has been shown to depend on various combinatorial notions of "dimension," each measuring, in a certain sense, the complexity of the learner's hypothesis class.
Whenever the learner is allowed a low degree of accuracy, the complexity of the hypothesis class might be measured on a coarse "scale" since, in this case, we do not need the full power of the entire set of models. This position can be related to Rissanen's MDL principle (Ris78), Vapnik's structural minimization method (Vap82), and Guyon et al.'s notion of effective dimension (GVBBS91). Intuitively, the "dimension" of a class of functions decreases as the coarseness of the scale at which it is measured increases. Thus, by measuring the complexity at the right "scale" (i.e., proportional to the accuracy) the sample size sufficient for finding the best model within the given accuracy might dramatically shrink. The results described below show that the accuracy parameter plays a crucial role in determining the effective complexity of the learner's hypothesis class.
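The idea of measuring complexity at a scale can be made concrete via the standard notion of gamma-shattering (the combinatorial core of scale-sensitive dimensions; the toy class below is my own illustration, not an example from the cited work). A point set is gamma-shattered if every sign pattern can be realized with a margin of at least gamma around fixed witness levels:

```python
from itertools import product

def gamma_shatters(functions, points, gamma, witnesses):
    """Check whether `points` are gamma-shattered by `functions` relative
    to the witness levels: for every sign pattern there must be a function
    clearing each witness by at least gamma in the prescribed direction."""
    for pattern in product((-1, +1), repeat=len(points)):
        if not any(all((f(x) >= r + gamma) if s > 0 else (f(x) <= r - gamma)
                       for x, r, s in zip(points, witnesses, pattern))
                   for f in functions):
            return False
    return True

# Toy class: the two linear functions f(x) = -0.5 x and f(x) = 0.5 x.
funcs = [lambda x, a=a: a * x for a in (-0.5, 0.5)]
# The single point x = 1 is shattered at scale 0.5 (witness level 0) ...
shattered_fine = gamma_shatters(funcs, [1.0], 0.5, [0.0])
# ... but not at the coarser scale 0.6: the dimension shrinks with scale.
shattered_coarse = gamma_shatters(funcs, [1.0], 0.6, [0.0])
```

This is exactly the phenomenon described above: the same class looks simpler when probed at a coarser scale, so a lower accuracy requirement effectively reduces the complexity the sample-size bounds must pay for.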
Learnability in Valiant's PAC learning model has been shown to be strongly related to the existence of uniform laws of large numbers. These laws define a distribution-free convergence property of means to expectations uniformly over classes of random variables. Classes of real-valued functions enjoying such a property are also known as uniform Glivenko-Cantelli classes. In Alon, Ben-David, Cesa-Bianchi, Haussler '97 we prove, through a generalization of Sauer's lemma that may be interesting in its own right, a new characterization of uniform Glivenko-Cantelli classes. Our characterization yields Dudley, Giné, and Zinn's previous characterization as a corollary. Furthermore, it is the first characterization that is based on a simple combinatorial quantity generalizing the Vapnik-Chervonenkis dimension. We apply this result to obtain the weakest combinatorial condition known to imply PAC learnability in the statistical regression (or "agnostic") framework.
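For context, the classical Sauer lemma (which the result above generalizes) bounds the number of distinct labelings a class of VC-dimension d can induce on m points; computing it makes the polynomial-versus-exponential gap vivid. A minimal sketch of the classical bound:

```python
import math

def sauer_bound(m, d):
    """Sauer's lemma: a class of VC-dimension d induces at most
    sum_{i=0}^{d} C(m, i) distinct labelings on any m points --
    polynomial in m, versus the 2**m labelings possible a priori."""
    return sum(math.comb(m, i) for i in range(min(d, m) + 1))

# With VC-dimension 3 and 20 points: 1351 labelings, versus 2**20 = 1048576.
bound = sauer_bound(20, 3)
```

It is this polynomial growth that drives uniform convergence in the binary-valued case; the characterization above extends the underlying combinatorics to real-valued classes via a scale-sensitive dimension.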
These results have become a key mathematical tool in the development of margin based generalization bounds for a variety of central learning approaches such as Boosting, Neural Nets and Support Vector Machines.