Corpus Based Analysis of Hebrew

Principal investigators: Alon Itai and Yoad Winter

The project is part of a joint effort with the corpus based NLP group of the Computer Science Institute of the Hebrew University, Jerusalem, headed by Prof. Eli Shamir. It is sponsored by a grant from the Israel Ministry of Science.

The project aims to develop better tools for morphological and syntactic analysis of Hebrew using corpus based techniques.

The project consists of several stages:

A morphological analyzer for Hebrew, developed by Alon Itai and Erel Segal.

An analyzer that finds all the parses of a Hebrew word written in the regular undotted script.
A package that finds the correct parse of a Hebrew word in context. The package combines several heuristics:

choosing the most common parse,
the similar words technique (see Ornan, Levinger and Itai),
Brill-style correction rules and
a rudimentary syntax parser.

http://www.cs.technion.ac.il/~erelsgl/bxi/hmntx/teud.html
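As a rough illustration of how two of the heuristics listed above might fit together, the following sketch first picks the most common parse for each word and then applies a Brill-style correction rule. It is not the project's code: the function names, the coarse tags (PRON, NOUN, VERB), the transliterated word forms and the toy training data are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def count_parses(tagged_sentences):
    """Count how often each word was annotated with each parse."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, parse in sentence:
            counts[word][parse] += 1
    return counts


def choose_most_common(words, analyses, counts):
    """For each word, pick the candidate parse seen most often in training.

    `analyses` maps a word to all parses an analyzer proposes for it
    (a hypothetical interface, not the real analyzer's API).
    """
    return [max(analyses[w], key=lambda p: counts[w][p]) for w in words]


def apply_rules(parses, rules):
    """Apply Brill-style rules of the form (from_parse, to_parse, prev_parse):
    change from_parse to to_parse when the preceding parse is prev_parse."""
    out = list(parses)
    for from_parse, to_parse, prev_parse in rules:
        for i in range(1, len(out)):
            if out[i] == from_parse and out[i - 1] == prev_parse:
                out[i] = to_parse
    return out


# Toy data (illustrative only): "spr" is ambiguous between a noun and a verb.
training = [
    [("spr", "NOUN")],
    [("spr", "NOUN")],
    [("hwa", "PRON"), ("spr", "VERB")],
]
analyses = {"hwa": ["PRON"], "spr": ["NOUN", "VERB"]}
rules = [("NOUN", "VERB", "PRON")]  # hypothetical learned correction rule

counts = count_parses(training)
baseline = choose_most_common(["hwa", "spr"], analyses, counts)
corrected = apply_rules(baseline, rules)
print(baseline, corrected)
```

On this toy input the baseline chooses the noun reading because it is more frequent overall, and the correction rule then switches it to the verb reading after a pronoun. In a real system the rules would be learned from an annotated corpus rather than written by hand.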

A statistical corpus based syntax parser for Hebrew, developed by Alon Altman, Alon Itai, Noa Nativ (Hebrew University), Khalil Sima'an (University of Amsterdam) and Yoad Winter. The aim is to learn a statistical parser from syntactically annotated corpora (tree-banks) using the Tree-gram Parsing model of Sima'an, which is based on the Data-Oriented Language Processing framework of Scha (University of Amsterdam). We are currently constructing a small morphologically and syntactically annotated corpus of Ha'aretz reports, which will be used for learning a Tree-gram parser using Sima'an's T-gram System.
Combining the statistical parser with the morphological package to reduce the number of errors made by the morphological analyzer.
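One way such a combination could work, sketched here under strong assumptions, is to score every combination of candidate morphological parses for a sentence by both a morphological probability and a syntactic score, and keep the best one. The project description does not say how the two components are combined, so this is only an illustration: `best_joint_analysis`, `toy_syntax_score` and the bigram table are hypothetical, and the bigram table is merely a stand-in for a real statistical parser.

```python
import itertools
import math


def best_joint_analysis(candidates, syntax_score):
    """Pick the parse sequence maximizing morphological x syntactic score.

    candidates: for each word, a list of (parse, morphological_probability).
    syntax_score: function from a parse sequence to a positive score
                  (assumed here; a real parser would supply this).
    """
    best_parses, best_score = None, -math.inf
    for combo in itertools.product(*candidates):
        parses = [parse for parse, _ in combo]
        # Work in log space so the two sources of evidence simply add up.
        score = sum(math.log(p) for _, p in combo)
        score += math.log(syntax_score(parses))
        if score > best_score:
            best_parses, best_score = parses, score
    return best_parses


# Hypothetical stand-in for the parser: scores adjacent parse pairs.
BIGRAM = {("PRON", "VERB"): 0.6, ("PRON", "NOUN"): 0.1}


def toy_syntax_score(parses):
    score = 1.0
    for pair in zip(parses, parses[1:]):
        score *= BIGRAM.get(pair, 0.01)  # small default for unseen pairs
    return score


# The morphological package alone prefers NOUN (0.7) for the second word.
candidates = [[("PRON", 1.0)], [("NOUN", 0.7), ("VERB", 0.3)]]
result = best_joint_analysis(candidates, toy_syntax_score)
print(result)
```

Here the syntactic evidence overrides the morphological preference for the second word. Exhaustive enumeration is only practical for short sentences; a real system would need dynamic programming or pruning.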