Technical Report MSC-2007-01

TR#:MSC-2007-01
Class:MSC
Title: The Master-Slave Dependency Model and its Application to Understanding Hebrew
Authors: Yan Tsitrin
Supervisors: Uzzy Ornan
PDFMSC-2007-01.pdf
Abstract: Understanding of natural language texts is a classic example of a task humans are able to perform without de-fining it formally. But in order to teach a computer to cope with this problem we have to describe the under-standing process in terms of formal specifications understood by the computer.

In our research we describe such structures in terms of Dependency Grammar. The concept of Dependency Grammar includes a whole family of theories and formalisms based on the common assumption set. In these grammars an utterance structure is represented by a graph. The nodes of the graph stand for the utterance’s words and its edges symbolize the relations between them. The key asset of dependency grammar is that it al-lows for a clean separation of syntactic dependency and surface word order. For this reason these grammars seemed better suited for the analysis of relatively free word order languages, like Hebrew. In contrast to the Phrase Structure Grammars in this approach the structures produced by the analyzer do not include theoreti-cal variables (like NP or VP).

In this thesis we present a new model for sentences analysis called MASTER-SLAVE dependency model. In this formalism an utterance meaning is related as a tree-based structure which is called Meaning Tree or MOLECULE (Morphological-Lexical-Conceptual Entity).The tree’s nodes stand for the concepts expressed in the utterance (we call them semantic nuclei) and its edges (or connectors) represent conceptual or functional relations between the nuclei. Each nucleus or connector can be orthogonally decomposed into the space’s axes according to its perceptual, structural and conceptual features. A Meaning Tree is a formal meaning rep-resentation in the computer, so we say that the computer understands an utterance when it finds such a tree (or a set of trees for an ambiguous utterance).

Since the utterance analysis should cope with different kinds of ambiguity and for a single utterance several meaning structures can be found, we propose a structure allowing storing all these structures. Such a structure is called Utterance Graph. All the meaning trees of an utterance can be inferred from its utterance graph.

We also propose a hierarchy of grammatical categories of the semantic nuclei. The higher the nucleus’ cate-gory is in the hierarchy, the more important for the utterance meaning the nucleus is.

In contrast to the works of Tesniere and Fillmore, we do not demand a verb to be necessarily the most impor-tant nucleus for an utterance in order to find its meaning. The nucleus whose category is the highest in the hi-erarchy among other utterance’s tokens may stand in the center of the utterance’s meaning. This allows us to analysis incomplete sentences.

We have investigated also the internal structure of the nucleus that relates, among other things, its ability to establish connections with other sentence’s nuclei (nucleus’ valency). In our model we use the concept of valency for all the categories of nuclei, not only for verbs, as it is accepted in other works. It was figured out that the valency of a nucleus may depend on its internal features. We described this phenomenon in terms of adaptation of a nucleus to the environment of the analyzed sentence.

The utterance analyzing process is executed in two stages: 1. Utterance Graph creating – for all the tokens of the analyzed utterance we find all the nuclei the token can represent and then we create all possible connectors between these nuclei; 2. Based on the created Utterance Graph all the consistent Meaning Trees that the graph contains are inferred.

It should be noted that the connectors’ creation and their analysis are performed according to the hierarchy we have built and does not depend on the tokens’ order in the sentence. Since the second stage of our algorithm copes with abstract graph analysis it constitutes an appropriate platform for using parallel architectures for natural text understanding. We have developed algorithms for both the stages and have implemented them in the Java programming language. A set of connectors’ creation rules for a fragment of the Hebrew language was described. We also propose a method for inferring such rules from manually annotated corpora. The algo-rithms’ complexity has been analyzed and the program performance was evaluated. From the experiments we conducted it follows that the first (and correct) meaning-trees for a given utterance may be obtained quite quickly; however in order to find all the meaning-trees the algorithm needs unacceptable amount of time.

Many examples of sentences in different languages (Hebrew, English, Russian, German, Polish, etc) are pre-sented, together with the structures the analyzer assigns them. We consider, as well, the using of our program as a model in more general applications, such as Dialog Systems, Machine Translation Systems, Text Sum-marizers and Search Engines. As a conclusion we suggest some directions for extending and improving the project.

CopyrightThe above paper is copyright by the Technion, Author(s), or others. Please contact the author(s) for more information

Remark: Any link to this technical report should be to this page (http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-info.cgi/2007/MSC/MSC-2007-01), rather than to the URL of the PDF or PS files directly. The latter URLs may change without notice.

To the list of the MSC technical reports of 2007
To the main CS technical reports page

Computer science department, Technion