Unsupervised learning of natural languages

Zach Solan, David Horn, Eytan Ruppin (Tel-Aviv University), Shimon Edelman (Cornell)


We describe results of a novel algorithm for grammar induction from a large corpus. The ADIOS (Automatic DIstillation of Structure) algorithm searches for significant patterns, chosen according to context-dependent statistical criteria, and builds a hierarchy of such patterns according to a set of rules leading to structured generalization. The corpus is thus generalized into a context-free grammar (CFG), composed of patterns, equivalence classes, and words of the initial lexicon. While our general approach is that of Machine Learning, we do not follow the traditional method of testing a set of models and selecting the one best suited to representing the given data. Instead, we allow the data to dictate the model. Admittedly, the rules of our algorithm restrict the search to a particular family of models; it is the data, however, that point the way in a progressive hierarchical search.

In the tradition of Machine Learning, one considers a training set, on which the machine is trained, and a test set, on which its power of generalization is assessed. With linguistic corpora, such tests usually go in one direction only: checking whether the machine accepts a new sentence it has not seen before, thus judging that sentence to be grammatical. This defines what we call 'recall'. We insist, however, on also testing 'precision', i.e., whether new sentences generated by the machine are indeed grammatically correct. Precision can be measured relatively easily if the original corpus is generated from a known grammar, e.g., some artificial CFG. We have evaluated our method both on corpora generated by CFGs and on natural-language ones. The performance of ADIOS is judged by requiring both good recall (acceptance of correct novel sentences) and good precision (production of correct novel sentences). Acceptance in both cases means that the sentence in question can be generated exactly by the rules of the testing party.
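The distillation of equivalence classes from shared contexts can be illustrated with a toy sketch. This is a deliberate simplification, not the actual ADIOS significance criterion: the tiny corpus and the exact-match grouping rule below are our own illustrative assumptions.

```python
from collections import defaultdict

# Toy illustration of the distillation idea (NOT the ADIOS statistical
# criterion): words that occur in exactly the same set of (left, right)
# contexts are grouped into an equivalence class, which generalizes the
# corpus. The corpus below is a made-up example.
corpus = [
    ["where", "does", "the", "flight", "stop"],
    ["where", "does", "the", "plane", "stop"],
    ["where", "does", "the", "flight", "land"],
    ["where", "does", "the", "plane", "land"],
]

# Collect the set of (left neighbor, right neighbor) contexts per word.
contexts = defaultdict(set)
for sent in corpus:
    padded = ["<s>"] + sent + ["</s>"]
    for i in range(1, len(padded) - 1):
        contexts[padded[i]].add((padded[i - 1], padded[i + 1]))

# Group words whose context sets coincide exactly (a crude stand-in for
# a context-dependent significance test).
classes = defaultdict(list)
for word, ctx in contexts.items():
    classes[frozenset(ctx)].append(word)

equivalence_classes = [sorted(ws) for ws in classes.values() if len(ws) > 1]
print(equivalence_classes)  # → [['flight', 'plane'], ['land', 'stop']]
```

Here the learner generalizes beyond the four training sentences: substituting within the two discovered classes also licenses unseen combinations such as "where does the plane land", which is the kind of structured generalization the algorithm aims for.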
Learning a simple CFG: applied to a simple CFG (the TA1 grammar, 50 terminals and 28 rules), ADIOS performs well (above 90% recall and precision) even when only 200 sentences are used for training.

Learning a complex CFG: learning the ATIS CFG (357 terminals and 4592 rules) yielded 70% recall and 80% precision.

Learning from a natural-language corpus: to test the ability of ADIOS to generate acceptable novel sentences after learning from a natural-language corpus, we trained it on 12,700 sentences from ATIS-2 (a natural-language corpus of 13,043 sentences) and tested its recall on the 343 remaining sentences. The small size of the training corpus results in a relatively low recall of 40%, with a human-judged precision of 70%; for comparison, the ATIS-CFG grammar, hand-constructed to fit the ATIS-2 corpus (with a recall of 45% on the same data), produces over 99% ungrammatical sentences when used in a generative fashion.

These results indicate that, for many practical purposes, ADIOS can perform the task with a large degree of accuracy, achieving high levels of both precision and recall, and that the method is scalable to large corpora. To the best of our knowledge, no other method achieves comparable results. We have recently succeeded in applying it to interesting problems in bioinformatics, showing that its power of pattern and syntax extraction may help discern not only natural languages but also the language of Nature.
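The recall and precision measures reported above can be sketched in miniature. In this hypothetical illustration each "grammar" is a finite set of token tuples standing in for a membership test; evaluating a real CFG would instead require a parser to decide whether a sentence can be generated exactly by its rules.

```python
# Toy sketch of the recall/precision evaluation described in the text.
# The sentence sets are invented for illustration only.
reference = {("the", "cat", "sleeps"), ("the", "dog", "sleeps"),
             ("a", "cat", "barks"), ("a", "dog", "barks")}
learned = {("the", "cat", "sleeps"), ("the", "dog", "sleeps"),
           ("a", "dog", "barks"), ("the", "dog", "barks")}  # one overgeneration

def recall(accepts, held_out):
    """Fraction of unseen reference sentences that the learner accepts."""
    return sum(s in accepts for s in held_out) / len(held_out)

def precision(accepts, generated):
    """Fraction of learner-generated sentences that the reference accepts."""
    return sum(s in accepts for s in generated) / len(generated)

held_out = [("the", "dog", "sleeps"), ("a", "cat", "barks")]
generated = sorted(learned)

print(recall(learned, held_out))        # 0.5: one of two held-out sentences accepted
print(precision(reference, generated))  # 0.75: 3 of 4 generated sentences grammatical
```

The asymmetry in the ATIS-CFG comparison above is exactly this distinction: a hand-built grammar can score reasonable recall as an acceptor while scoring near-zero precision when run as a generator.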

Back to ISCOL'05 homepage