Technical Report MSC-2012-20

Title: Morphological and Lexical Decomposition as a Basis for Identifying Multiword Expressions
Authors: Daniel Hurwitz
Supervisors: Alon Itai
Abstract: A multiword expression (MWE) is a construct of more than one orthographic word that carries a single idiosyncratic meaning. MWEs such as "hot dog", "by and large", "kick the bucket", "spill the beans", and "look up" are extremely common in everyday language. Identifying MWEs is important for many practical applications, including translation and speech-to-text.

This work presents a method for improving the identification of MWEs using a concept termed Text Isolation, which dissects (or "isolates") the morphological properties of words in order to discover potential MWEs. Movie subtitle files are used to build a multilingual (Spanish-English-Hebrew) parallel corpus by aligning the individual subtitles across the different translations of each movie. After this subtitle-level alignment, the Text Isolation technique is applied to the corpus, and a word-level alignment algorithm is then used to acquire Hebrew MWEs and their translations. The method improves both the identification of MWEs and the alignments to those MWEs.
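The abstract does not say how the subtitle-level alignment is computed. As an illustration only, the following minimal sketch assumes that each subtitle carries start and end times and that two subtitles from different translations of the same movie correspond when they overlap sufficiently in time. The names Subtitle, align_subtitles, and min_ratio are hypothetical and do not come from the report.

```python
from dataclasses import dataclass


@dataclass
class Subtitle:
    start: float  # display start time in seconds
    end: float    # display end time in seconds
    text: str


def overlap(a: Subtitle, b: Subtitle) -> float:
    """Length in seconds of the time interval shared by two subtitles."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def align_subtitles(src, tgt, min_ratio=0.5):
    """Pair each source subtitle with the target subtitle it overlaps most.

    Assumption (not from the report): both lists are sorted by start time,
    and subtitles within one list do not overlap each other. A pair is kept
    only if the overlap covers at least min_ratio of the source duration.
    """
    pairs = []
    j = 0
    for s in src:
        # Skip target subtitles that end before this source subtitle starts.
        while j < len(tgt) and tgt[j].end <= s.start:
            j += 1
        best, best_overlap = None, 0.0
        k = j
        # Scan every target subtitle that starts before the source one ends.
        while k < len(tgt) and tgt[k].start < s.end:
            shared = overlap(s, tgt[k])
            if shared > best_overlap:
                best, best_overlap = tgt[k], shared
            k += 1
        duration = s.end - s.start
        if best is not None and duration > 0 and best_overlap / duration >= min_ratio:
            pairs.append((s.text, best.text))
    return pairs
```

Pairs produced this way would form the subtitle-level parallel corpus on which later steps operate. A time-overlap criterion is only one plausible choice; subtitle files for different translations can differ in timing and segmentation, so a real system may need a more tolerant matching scheme.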

