Anna Feldman (Ohio State University): Portable Language Technology: A resource-light approach to
morphological tagging
(joint work with Jiri Hana and Chris Brew)
Abstract
Part-of-speech tagging is essential for many NLP tasks, and is needed both for
resource-rich languages (such as English or Czech) and resource-poor languages
(such as Russian). Because of wide variation between languages and tagsets, it
cannot be assumed that the same methodology for tagging will be appropriate in
all cases (Elworthy 1995). But linguists do have useful knowledge of the
probable relationships between languages, so it is natural to wonder whether
these relationships can be pressed into service for the rapid development of
effective taggers. In this talk, I will describe a resource-light system for
the automatic morphological analysis and tagging of Russian. We eschew the use
of extensive resources (particularly, large annotated corpora and lexicons),
exploiting instead (1) pre-existing annotated corpora of Czech and (2) an
unannotated corpus of Russian. We use a resource-light morphological analyzer
and an automatically derived lexicon of Russian (both Hana 2004), combine their
output with the information derived from Czech, and run the TnT tagger (Brants
2000) in a number of different ways, including modes that use a committee-based
approach. We show that our approach has benefits, and present what we believe to
be one of the first full evaluations of a Russian tagger in the openly available
literature.
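
The abstract does not say how the committee-based modes combine their members, so the following is only a minimal sketch of one common scheme, per-token majority voting, and not necessarily the method used in this work. The function name committee_vote, the tie-breaking rule, and the tag labels in the usage note are illustrative assumptions.

```python
from collections import Counter


def committee_vote(predictions):
    """Combine tag sequences from several taggers by per-token majority vote.

    `predictions` is a list of tag sequences, one per committee member, all
    covering the same tokens. Ties are broken in favour of the member listed
    first (an assumption made for this sketch).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all committee members must tag the same tokens")

    voted = []
    for token_tags in zip(*predictions):
        counts = Counter(token_tags)
        best = max(counts.values())
        # Pick the first tag, in member order, that reaches the top count.
        voted.append(next(t for t in token_tags if counts[t] == best))
    return voted
```

For example, given three hypothetical members whose outputs for a three-token sentence are ["N", "V", "A"], ["N", "A", "A"] and ["V", "V", "N"], the call committee_vote applied to that list returns ["N", "V", "A"].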