The Emile Program: Introduction

Human beings are remarkably good at working with natural languages. Even if someone has no knowledge of the formal structure of a language, he or she will be able to tell when `something' is like `something else'. For instance, Lewis Carroll's famous poem `Jabberwocky' starts with

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves
And the mome raths outgrabe.

Even without Humpty Dumpty's annotations, it is immediately obvious what the syntactic structure of the first sentence is: `brillig' and `slithy' are adjectives, `toves' is a noun, `gyre' and `gimble' are verbs, etcetera.

So how do we know such things? The short answer is `from context'. When a sentence starts with ` 'Twas', we are not surprised if the next word is an adjective. Similarly, if a sentence has the pattern `the (.) did (.) and (.) in the (.)', we expect the missing phrases to be a noun-phrase, two verb-phrases and another noun-phrase, respectively.

The notion of grammatical type has many possible definitions. For instance, if we have a context-free grammar of a language, we can view each non-terminal symbol as a grammatical type. In general, one of the properties of a grammatical type is that wherever some expression is used as an expression of that type, other expressions of that type can be substituted without making the sentence ungrammatical. This gives rise to a notion of a grammatical type as a set of expressions together with a set of contexts. For instance, the type `noun-phrase' could be represented by the set of all noun-phrases, together with the set of all contexts in which a noun-phrase can appear. Combining any of the expressions of a type with any of the contexts will yield a grammatical sentence. Many of these combinations might appear in actual texts, especially the short ones (in terms of number of words).
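As a minimal sketch of this view of a type (not EMILE's own data structure), a type can be modelled as two sets whose every combination is a sentence. The class name `GrammaticalType`, the representation of a context as a (left, right) pair of strings, and the toy noun-phrases are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import FrozenSet, Set, Tuple

# Hypothetical representation: a context is the text to the left and to the
# right of the gap in which an expression can be placed.
Context = Tuple[str, str]


@dataclass(frozen=True)
class GrammaticalType:
    expressions: FrozenSet[str]
    contexts: FrozenSet[Context]

    def sentences(self) -> Set[str]:
        """Combine every expression with every context."""
        return {
            " ".join(part for part in (left, expression, right) if part)
            for (left, right), expression in product(self.contexts, self.expressions)
        }


# Toy example, invented for illustration.
noun_phrase = GrammaticalType(
    expressions=frozenset({"the cat", "a dog"}),
    contexts=frozenset({("", "sleeps"), ("I see", "")}),
)
generated = noun_phrase.sentences()
```

With two expressions and two contexts, `generated` holds four sentences, such as `I see a dog`.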

In this terminology, we can describe the above phenomenon as the existence of contexts which are characteristic of a type, meaning that whenever something appears in that context, we assume it also belongs to that type. Some types may also have characteristic expressions, with the analogous property.
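The role of a characteristic context can be sketched as follows: any expression observed in such a context is added to the type. This is only an illustration of the idea, assuming that contexts are (left, right) pairs of strings and that observations arrive as (context, expression) pairs; the function name and the data are hypothetical.

```python
from typing import Iterable, Set, Tuple

Context = Tuple[str, str]             # (text left of the gap, text right of it)
Observation = Tuple[Context, str]     # (context, expression) seen in a sample


def extend_with_characteristic_contexts(
    expressions: Set[str],
    characteristic_contexts: Set[Context],
    observed: Iterable[Observation],
) -> Set[str]:
    """Return the type's expressions plus everything seen in a characteristic context."""
    extended = set(expressions)
    for context, expression in observed:
        if context in characteristic_contexts:
            extended.add(expression)
    return extended


# Toy data, invented for illustration.
nouns = {"toves"}
characteristic = {("the", "did gyre")}
sample = [
    (("the", "did gyre"), "borogoves"),   # characteristic context: accepted
    (("all", "were the toves"), "mimsy"), # some other context: ignored
]
extended = extend_with_characteristic_contexts(nouns, characteristic, sample)
```

Characteristic expressions could be handled symmetrically, by adding every context in which such an expression is observed.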

EMILE 4.1 is a program based on the above concepts. It attempts to learn the grammatical structure of a language from sentences of that language, without being given any prior knowledge of the grammar. For any type in any valid grammar for the language, we can expect context/expression combinations to show up in a sufficiently large sample of sentences of the language. EMILE searches for such clusters of expressions and contexts in the sample, and interprets them as grammatical types. It then tries to find characteristic contexts and expressions, and uses them to extend the types. Finally, it formulates derivation rules based on the types found, in the manner of the rules of a context-free grammar. The program can present the grammatical structure found in several ways, as well as use it to parse other sentences or generate new ones.
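The search for clusters can be illustrated with a deliberately naive sketch. It is not the algorithm EMILE uses, which this introduction does not spell out. It assumes that every contiguous run of words counts as an expression, that a context is a (left, right) pair of strings, and that a candidate type is any group of expressions sharing at least one context; the function name `candidate_types` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Context = Tuple[str, str]


def candidate_types(sample: List[str]) -> List[Tuple[Set[str], Set[Context]]]:
    """Group expressions that occur in a common context into candidate types."""
    by_context: Dict[Context, Set[str]] = defaultdict(set)
    by_expression: Dict[str, Set[Context]] = defaultdict(set)

    # Record which expressions occur in which contexts.
    for sentence in sample:
        words = sentence.split()
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                if end - start == len(words):
                    continue  # skip the whole sentence in the empty context
                context = (" ".join(words[:start]), " ".join(words[end:]))
                expression = " ".join(words[start:end])
                by_context[context].add(expression)
                by_expression[expression].add(context)

    found: List[Tuple[Set[str], Set[Context]]] = []
    for expressions in by_context.values():
        if len(expressions) < 2:
            continue  # a single expression gives no evidence of a shared type
        # Keep only the contexts in which every expression of the group occurs.
        shared = set.intersection(*(by_expression[e] for e in expressions))
        candidate = (set(expressions), shared)
        if candidate not in found:
            found.append(candidate)
    return found


# Toy sample, invented for illustration.
types = candidate_types(["John walks", "Mary walks", "John talks", "Mary talks"])
```

On this toy sample the sketch finds two candidate types: one grouping `John` and `Mary`, and one grouping `walks` and `talks`. Real text is much sparser, so exact matching of this kind would miss most types.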

More information on EMILE and its precursors may be found in the manual, as well as in the following publications:

P. Adriaans, Language Learning from a Categorial Perspective, PhD thesis, University of Amsterdam, 1992.
P. Adriaans, Learning Shallow Context-Free Languages under Simple Distributions, ILLC Research Report PP-1999-13, Institute for Logic, Language and Computation, Amsterdam, 1999.
P. Adriaans, M. Trautwein, M. Vervoort, The Emile Approach to Grammar Induction put into Practice, in preparation.
E. Dörnenburg, Extension of the EMILE algorithm for inductive learning of context-free grammars for natural languages, Master's Thesis, University of Dortmund, 1997.

Marco Vervoort

Last modified: Wed Jun 27 12:50:09 MET DST 2001