Corpus-Based Analysis of Hebrew
Principal investigators: Alon Itai and Yoad Winter
This work is part of a joint project conducted in collaboration with
the corpus-based NLP group of the Computer Science Institute of the
Hebrew University, Jerusalem, headed by Prof. Eli Shamir. The project is
sponsored by a grant from the Israel Ministry of Science.
The project studies how corpus-based techniques can yield better tools for
morphological and syntactic analysis of Hebrew.
The project consists of several stages:
-
A morphological analyzer of Hebrew, developed by Alon
Itai and Erel Segal.
-
An analyzer that finds all the parses of a Hebrew word written in the regular
undotted script.
-
A package that finds the correct parse of a Hebrew word in context. The
package combines several heuristics:
-
choosing the most common parse,
-
the similar words technique (see Ornan,
Levinger and Itai),
-
Brill-style correction rules and
-
a rudimentary syntax parser.
The package correctly parsed over 96% of the words in a test text. It
can be obtained at http://www.cs.technion.ac.il/~erelsgl/bxi/hmntx/teud.html
-
A statistical corpus-based syntax parser for Hebrew, developed by Alon
Altman, Alon Itai, Noa Nativ (Hebrew University), Khalil Sima'an (University
of Amsterdam) and Yoad Winter.
The aim is to learn a statistical parser from syntactically annotated
corpora (treebanks) using the Tree-gram Parsing model of Sima'an, which
is based on the Data-Oriented Language Processing framework of Scha (University
of Amsterdam). We are currently constructing a small morphologically and
syntactically annotated corpus of Ha'aretz reports, which will be used
for learning a Tree-gram parser using Sima'an's T-gram System.
-
Combining the statistical parser with the morphological package to reduce
the number of errors made by the morphological analyzer.
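
As a rough illustration of how two of the heuristics listed for the disambiguation package might fit together, the Python sketch below first assigns each word its most common parse and then applies a Brill-style contextual correction rule. It is not the package's code. The transliterations, parse labels, training counts, the single hand-written rule, and all function names are hypothetical and chosen only for illustration. In Brill's method the rules would be learned from the errors of the baseline rather than written by hand.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Rule(NamedTuple):
    """Brill-style rule: change from_parse to to_parse when the
    preceding word has the parse prev_parse."""
    from_parse: str
    to_parse: str
    prev_parse: str


# Hypothetical annotated data: (transliterated word, correct parse).
# The undotted form "spr" is ambiguous, e.g. "book" vs. "counted".
TRAINING = [
    ("spr", "noun:book"), ("spr", "noun:book"), ("spr", "verb:counted"),
    ("hwa", "pronoun:he"), ("hwa", "pronoun:he"),
]

# Hypothetical hand-written rule: after the pronoun "he",
# prefer the verb reading over the noun reading.
RULES = [Rule("noun:book", "verb:counted", "pronoun:he")]


def train_most_common(annotated):
    """Map each word form to the parse it received most often."""
    counts = defaultdict(Counter)
    for word, parse in annotated:
        counts[word][parse] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}


def disambiguate(words, lexicon, rules, unknown="UNKNOWN"):
    """Baseline most-common parse, then contextual corrections in order."""
    parses = [lexicon.get(w, unknown) for w in words]
    for rule in rules:
        for i in range(1, len(parses)):
            if (parses[i] == rule.from_parse
                    and parses[i - 1] == rule.prev_parse):
                parses[i] = rule.to_parse
    return parses


lexicon = train_most_common(TRAINING)
print(disambiguate(["hwa", "spr"], lexicon, RULES))
```

In this toy setting the baseline alone would label "spr" as a noun everywhere, and the correction rule overrides that choice only in the context it names. The similar words technique and the syntax parser mentioned above are not modeled here.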