Technical Report MSC-2013-14

Title: Syntactic Annotation of the Hebrew CHILDES Corpora
Authors: Shai Gretz
Supervisors: Alon Itai, Shuly Wintner
Abstract: The CHILDES database is a large collection of child-adult spoken interactions in over 25 languages. Automatic annotation of these data facilitates research on child language development and acquisition by providing researchers with a large amount of accurate data. Recently, the English section of the CHILDES database was automatically annotated with labeled dependency relations using a state-of-the-art approach. We describe a similar endeavor, focusing on the Hebrew section of CHILDES. First, we design a novel annotation scheme of dependency relations that reflects the constructions of child and child-directed utterances, as well as the special phenomena of the Hebrew language. We then manually annotate a corpus with these dependency relations and use the annotated data to train a parser with which the rest of the corpora can be annotated. Finally, we evaluate the parsing accuracy and show the adaptability of our annotation scheme to the CHILDES corpora in numerous evaluation scenarios. We also examine different approaches to annotating linguistic issues that are relevant to several languages or unique to Hebrew, as well as the contribution of morphological features to the accuracy of dependency parsing of the Hebrew section of CHILDES. The result is the first syntactic parser of spoken Hebrew.
