Corpus Based Analysis of Hebrew

Principal investigators: Alon Itai and Yoad Winter

This work is part of a joint project conducted in collaboration with the corpus-based NLP group of the Computer Science Institute of the Hebrew University, Jerusalem, headed by Prof. Eli Shamir. It is sponsored by a grant from the Israel Ministry of Science.

The project's aim is to study ways to build better tools for the morphological and syntactic analysis of Hebrew using corpus-based techniques.

The project consists of several stages:

  1. A morphological analyzer of Hebrew, developed by Alon Itai and Erel Segal.
    1. An analyzer that finds all the parses of a Hebrew word written in the regular undotted script.
    2. A package that finds the correct parse of a Hebrew word in context. The package combines several heuristics:
    3. The package correctly parsed over 96% of the words of a test text. The package can be obtained at
  2. A statistical, corpus-based syntactic parser for Hebrew, developed by Alon Altman, Alon Itai, Noa Nativ (Hebrew University), Khalil Sima'an (University of Amsterdam) and Yoad Winter. The aim is to learn a statistical parser from syntactically annotated corpora (treebanks) using the Tree-gram parsing model of Sima'an, which is based on the Data-Oriented Language Processing framework of Scha (University of Amsterdam). We are currently constructing a small morphologically and syntactically annotated corpus of Ha'aretz reports, which will be used to train a probabilistic parser under the Tree-gram model.
  3. Combining the statistical parser with the morphological package to reduce the number of errors of the morphological analyzer.

A first version of the treebank, which contains 500 Hebrew sentences analyzed syntactically and morphologically, is available here.
The following paper describes this treebank and some experiments performed on it using the Tree-gram model of Sima'an.

K. Sima'an, A. Itai, Y. Winter, A. Altman, N. Nativ (2001): Building a Tree-Bank of Modern Hebrew Text. ps pdf