Technical Report MSC-2012-20

TR#: MSC-2012-20
Class: MSC
Title: Morphological and Lexical Decomposition as a Basis for Identifying Multiword Expressions
Authors: Daniel Hurwitz
Supervisors: Alon Itai
PDF: MSC-2012-20.pdf
Abstract: A multi-word expression (MWE) is a construct of more than one orthographic word which possesses a single idiosyncratic meaning. MWEs, such as "hot dog", "by and large", "kick the bucket", "spill the beans", and "look up", are extremely prevalent in our vernacular. The identification of MWEs is important for many practical applications, including translation and speech-to-text.

This work presents a method for improving the identification of MWEs using a concept termed Text Isolation. It focuses on dissecting (or "isolating") the morphological properties of words in order to discover potential MWEs. Movie subtitle files are exploited to align the individual subtitles between different translations of each movie, thus generating a multilingual (Spanish-English-Hebrew) parallel corpus. After this subtitle-level alignment is performed, the Text Isolation technique is applied to the corpus. A word-level alignment algorithm is then used to acquire Hebrew MWEs and their translations. This method improves both MWE identification and the alignments to those MWEs.
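
The abstract does not say how the subtitle-level alignment is computed. As a rough illustration only, the sketch below assumes that subtitles from two translations of the same movie can be paired by how much their display intervals overlap in time. The names Subtitle, overlap, align_subtitles and the min_ratio threshold are hypothetical and are not taken from the report, and the target texts are placeholders rather than real translations.

```python
from dataclasses import dataclass


@dataclass
class Subtitle:
    start: float  # display start time, in seconds
    end: float    # display end time, in seconds
    text: str


def overlap(a: Subtitle, b: Subtitle) -> float:
    """Length in seconds of the time interval shared by two subtitles."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def align_subtitles(src, tgt, min_ratio=0.5):
    """Pair each source subtitle with the target subtitle it overlaps most.

    Assumption (not from the report): a pair is kept only if the overlap
    covers at least `min_ratio` of the shorter subtitle's duration.
    Both lists must be sorted by start time.
    """
    pairs = []
    j = 0
    for s in src:
        # Skip target subtitles that end before this source subtitle starts.
        while j < len(tgt) and tgt[j].end <= s.start:
            j += 1
        best, best_ov = None, 0.0
        k = j
        # Scan the target subtitles that start before this one ends.
        while k < len(tgt) and tgt[k].start < s.end:
            ov = overlap(s, tgt[k])
            if ov > best_ov:
                best, best_ov = tgt[k], ov
            k += 1
        if best is not None:
            shorter = min(s.end - s.start, best.end - best.start)
            if shorter > 0 and best_ov / shorter >= min_ratio:
                pairs.append((s.text, best.text))
    return pairs


# Toy input: two English subtitles and two placeholder target subtitles.
english = [
    Subtitle(1.0, 3.0, "He kicked the bucket."),
    Subtitle(4.0, 6.0, "By and large, yes."),
]
hebrew = [
    Subtitle(1.2, 3.1, "<Hebrew subtitle 1>"),
    Subtitle(4.1, 5.8, "<Hebrew subtitle 2>"),
]
pairs = align_subtitles(english, hebrew)
```

Under these assumptions, each aligned pair would then serve as one parallel unit for the later Text Isolation and word-level alignment steps. Real subtitle files often differ in timing offsets and segmentation, so a practical aligner would need to handle drift and one-to-many matches.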

Copyright: The above paper is copyrighted by the Technion, the author(s), or others. Please contact the author(s) for more information.

Remark: Any link to this technical report should be to this page (http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-info.cgi/2012/MSC/MSC-2012-20), rather than to the URL of the PDF files directly. The latter URLs may change without notice.

Computer science department, Technion