Abstract:
Machine Learning (ML) research is entering a new phase as the
deployment of applications incorporating ML technology exposes
interesting problems that need to be addressed. This
seminar will address two such issues in the development of
practical message classification systems. The first is the need
to handle concept drift in the incoming message stream. The
second is the need to attach confidence measures to
classifications.

In many classification problems the 'concept', once learned,
remains static. This is often not the case in message
classification, where the concept being learned by the
classifier typically changes over time. In this seminar,
ensemble-based and online learning approaches to concept drift
will be compared on the task of tracking concept drift in spam
filtering. It will also be argued that the failure-driven update
approach common in the concept drift research literature is not
adequate in this scenario.

Assessing classifier confidence is an issue in tracking concept
drift, and it is a requirement that arises in a number of guises
in message classification. It arises when Active Learning
techniques are used to bootstrap classifiers with the objective
of minimizing the labeling load on the user. It also arises in
message routing scenarios where users may be required to scan
lists of messages that may not have been routed correctly.
Surprisingly, classification scores from 'ranking' classifiers
such as Support Vector Machines, Nearest Neighbour Classifiers,
Naive Bayes and Logistic Regression are poor estimates of
classification confidence. Alternative aggregation techniques
will be presented in this seminar.