Abstract:
Exponential families of distributions are widely used in machine learning
and statistics. It is well known that such distributions can be interpreted
as maximum entropy models under empirical expectation constraints. We argue
that for classification tasks, however, mutual information is the correct
information-theoretic measure to optimize. A related fundamental question is
how to quantify the information in empirical observations. Here again,
mutual information should be minimized under the observation constraints.
We show how this principle of minimum mutual information (MinMI) generalizes
that of maximum entropy (MaxEnt) and is better motivated for learning. We
provide a comprehensive framework for using MinMI for building maximally
discriminative classifiers and introduce an iterative convex optimization
algorithm for finding such classifiers. We further provide generalization
error bounds for such classifiers and demonstrate their performance.
We also discuss how this principle can be used for estimating lower bounds
on information from limited data and for the analysis of neural codes.
Interesting relations with feature extraction and maximum margin classifiers
can also be obtained.
This is joint work with Amir Globerson.