|Title:||Hebrew Acronyms: Identification, Expansion, and Disambiguation
|Supervisors:||Alon Itai, Shuly Wintner
|Abstract:||Acronyms are words formed from the initial letters of a phrase. For example, CIA is a well-known acronym for the Central Intelligence Agency, though in other contexts it could mean the Culinary Institute of America or Rome's Ciampino Airport. Understanding acronyms is important for many natural language processing applications, including search and machine translation.
While hand-crafted acronym dictionaries exist, they are limited and require frequent updates. We developed a new machine learning method to automatically build a Modern Hebrew acronym dictionary from unstructured text documents. This is the first such technique, in any language, to specifically include acronyms whose expansions do not necessarily appear in the same documents. We also enhanced the dictionary with contextual information to help select the expansions most appropriate for a given acronym in context. When applied to acronym disambiguation, our dictionary achieved better results than dictionaries built using prior techniques.
Additionally, while acronyms have a long history in Hebrew and have previously been investigated from a linguistic perspective, they have never before been studied quantitatively. We discovered new statistically based linguistic insights about acronym usage in Modern Hebrew texts, of interest to Hebrew language aficionados and developers of Hebrew natural language processing systems.
|Copyright||The above paper is copyrighted by the Technion, the author(s), or others. Please contact the author(s) for more information.|
Remark: Any link to this technical report should be to this page (http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-info.cgi/2014/MSC/MSC-2014-13), rather than to the URL of the PDF files directly. The latter URLs may change without notice.