Abstract:
The input of a supervised machine learning algorithm is a training set of
labeled examples. Most machine learning algorithms are designed with the
implicit assumption that the labels in the training set are provided by a
single "teacher". However, in practice, labels are often collected from
multiple teachers, with different levels of expertise, competence, and
motivation. In this talk, I will present a new family of machine learning
algorithms that benefit from the presence of multiple teachers. These
algorithms explicitly use the association between labels and teachers to
clean the training data and ultimately to learn more accurate models. First,
I will present the problem of learning from a crowd, where labeled data is
collected from the general public via a crowd-sourcing website (such as
galaxyzoo.org or mturk.com). In this setting, the set of teachers is very
large and each teacher provides only a handful of labels. Moreover, the
average label quality is poor, so algorithms that ignore the association
between labels and teachers are likely to produce inferior results. Within this
setting, I will focus on the problem of identifying low-quality teachers and
removing their labels from the training data. Next, I will go on to discuss
the problem of active learning from multiple teachers, where the learning
algorithm has to decide which examples should be labeled and which subset of
teachers should label them. I will present a new online learning algorithm
for this problem, which performs almost as well as each teacher does in that
teacher's area of expertise. For each algorithm, I will give a sketch of a
formal analysis and
present experimental results on real datasets.
Parts of this work were done in collaboration with Claudio Gentile, Ohad
Shamir, and Karthik Sridharan.