A Hebrew morphological disambiguator based on an unsupervised morpheme-based stochastic model

Meni Adler (Ben Gurion University)

Abstract:

Morphological disambiguation is the process of assigning morphological features (such as gender, number, person, and part of speech) to each word in a text. In most cases a word is ambiguous, that is, it has several possible analyses, and a disambiguation procedure based on the word's context (its adjacent words) must be applied. In Hebrew, words can combine several free morphemes in both agglutinative and fusional ways. The agglutinative nature of Hebrew makes the data in a Hebrew corpus sparse: most word forms occur only rarely in the text. We propose an unsupervised stochastic model that addresses this data sparseness problem. The model is an extension of the Hidden Markov Model that can handle several possible output emissions by using morphemes, rather than words, as output emissions, which makes the learning process more efficient. Applying the model to Hebrew corpora, we found that 90% of the words were correctly analyzed with all morphological features, and 94% of the words were correctly segmented and tagged with part of speech.
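
To make the idea of morpheme-level emissions concrete, the following is a minimal sketch, not the model described in this abstract. It assumes a first-order tag-bigram HMM whose parameters are already given, and it shows only the decoding step: choosing one analysis per word when each analysis is a sequence of (morpheme, tag) pairs. The unsupervised training is not shown. The tag set, the transliterated morphemes, every probability value, and the names `TRANS`, `EMIT`, `LATTICE`, `score_analysis`, and `disambiguate` are hypothetical and chosen for illustration only.

```python
import math

# Hypothetical toy parameters (illustrative values, not from the paper).
# TRANS[(previous_tag, tag)] is a tag-bigram transition probability.
TRANS = {
    ("<s>", "DET"): 0.5,
    ("DET", "NOUN"): 0.9,
    ("NOUN", "PREP"): 0.4,
    ("PREP", "NOUN"): 0.8,
    ("NOUN", "NOUN"): 0.1,
}
# EMIT[(tag, morpheme)] is the probability that a tag emits a morpheme.
EMIT = {
    ("DET", "h"): 0.9,
    ("NOUN", "ild"): 0.1,
    ("PREP", "b"): 0.5,
    ("NOUN", "bit"): 0.1,
    ("NOUN", "bbit"): 0.001,
}

# One entry per word; each entry lists the candidate analyses of that word.
# An analysis is a sequence of (morpheme, tag) pairs.
LATTICE = [
    [[("h", "DET"), ("ild", "NOUN")]],
    [[("b", "PREP"), ("bit", "NOUN")], [("bbit", "NOUN")]],
]


def log_p(table, key, floor=1e-6):
    """Log-probability with a small floor for unseen events (an assumption)."""
    return math.log(table.get(key, floor))


def score_analysis(prev_tag, analysis):
    """Return (log-score, last tag) of one analysis given the previous tag."""
    total = 0.0
    for morpheme, tag in analysis:
        total += log_p(TRANS, (prev_tag, tag)) + log_p(EMIT, (tag, morpheme))
        prev_tag = tag
    return total, prev_tag


def disambiguate(lattice):
    """Viterbi search over words; the state is the last tag of the chosen analysis."""
    best = {"<s>": (0.0, [])}  # last tag -> (log-score, analyses chosen so far)
    for candidates in lattice:
        new_best = {}
        for prev_tag, (prev_score, path) in best.items():
            for analysis in candidates:
                score, last_tag = score_analysis(prev_tag, analysis)
                candidate = (prev_score + score, path + [analysis])
                if last_tag not in new_best or candidate[0] > new_best[last_tag][0]:
                    new_best[last_tag] = candidate
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


if __name__ == "__main__":
    for word_analysis in disambiguate(LATTICE):
        print(word_analysis)
```

Because probabilities are attached to morphemes such as the prefix `b` rather than to whole word forms, a word form that is rare in the corpus can still be scored from morphemes that are frequent, which is the intuition behind the sparseness argument above.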




Back to ISCOL'05 homepage