Project: Learning Stochastic Tree-Grammars from Treebanks (LeStoGram)

NWO Project Page
Status: Project Completed
Publications and Theses
Full Proposal


Principal Investigator:
  Khalil Sima'an

People involved:
Detlef Prescher (postdoc), Remko Scha (co-leader) and Khalil Sima'an (leader).
M.Sc. Students graduated:
Andreas Zollmann (now CMU, USA), Thuy Linh Nguyen (now CMU, USA) and Luciano Buratto (now U. of Warwick, UK).

Duration: 3 years (October 2003-September 2006).

Funding: Exacte Wetenschappen, The Netherlands Organization for Scientific Research (NWO).

Ambiguity constitutes a major obstacle for the application of natural language parsers. To resolve ambiguity, current parsing practice employs stochastic grammars learned from tree-banks, i.e., samples of syntactically annotated utterances. For correct disambiguation, it is important that the probabilities of the grammar productions are conditioned on suitable contextual evidence. Stochastic Tree-Grammars (STGs), in particular, provide extended, yet local, contextual evidence, expressed in productions that are partial parse-trees. Parsers based on STGs currently achieve state-of-the-art performance, which warrants our interest.
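
To make "productions that are partial parse-trees" concrete, the following is a minimal sketch that enumerates such fragments from a single toy tree. It assumes a DOP-style notion of fragment (a connected piece of a parse tree in which each node keeps either all of its children or none of them); the tuple encoding of trees, the function names and the example sentence are hypothetical choices made for illustration, not part of the project's software.

```python
from itertools import product

# A tree is a tuple (label, child, child, ...); a child is either another
# tuple (a subtree) or a string (a word). A one-element tuple such as
# ("NP",) stands for a frontier nonterminal, i.e. a node that was cut off.

def fragments_rooted_at(node):
    """Return every fragment whose root is `node`."""
    label, *children = node
    options = []
    for child in children:
        if isinstance(child, str):
            # Words are always kept together with their parent.
            options.append([child])
        else:
            # Either cut below this child (leaving a frontier nonterminal)
            # or continue with any fragment rooted at the child.
            options.append([(child[0],)] + fragments_rooted_at(child))
    return [(label, *combo) for combo in product(*options)]

def all_fragments(tree):
    """Return the fragments rooted at every nonterminal node of `tree`."""
    frags = fragments_rooted_at(tree)
    for child in tree[1:]:
        if not isinstance(child, str):
            frags.extend(all_fragments(child))
    return frags

# Hypothetical toy parse tree for "Mary sleeps".
tree = ("S", ("NP", "Mary"), ("VP", ("V", "sleeps")))

for fragment in all_fragments(tree):
    print(fragment)
```

Even this two-word tree yields ten fragments, six of them rooted at S, which hints at why the production set of such a grammar grows quickly with tree size.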

However, the problem of learning STGs from tree-banks is especially complex, precisely because of their complex productions. In particular, the direct application of Maximum-Likelihood (ML) for estimating the STG probabilities from a tree-bank could result in biased and inconsistent STGs. In fact, a consistent, non-trivial and efficient probability estimation method for learning STGs from tree-banks is not yet available.
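
As an illustration of what probability estimation for such a grammar involves, the sketch below turns fragment counts into probabilities by simple relative frequency: each fragment's count is divided by the total count of fragments with the same root label. This is only a naive count-based baseline shown under stated assumptions, not the estimator developed in this project; the function name, the (root, fragment) key format and the counts themselves are invented for the example.

```python
from collections import defaultdict

def relative_frequency(fragment_counts):
    """Map each (root, fragment) key to count / total count for that root."""
    root_totals = defaultdict(int)
    for (root, _fragment), count in fragment_counts.items():
        root_totals[root] += count
    return {
        (root, fragment): count / root_totals[root]
        for (root, fragment), count in fragment_counts.items()
    }

# Invented toy counts; fragments are written as bracketed strings.
toy_counts = {
    ("S", "(S NP VP)"): 2,
    ("S", "(S (NP Mary) VP)"): 1,
    ("S", "(S (NP John) VP)"): 1,
    ("NP", "(NP Mary)"): 1,
    ("NP", "(NP John)"): 1,
    ("VP", "(VP V)"): 2,
    ("VP", "(VP (V sleeps))"): 2,
}

probs = relative_frequency(toy_counts)
for key, p in sorted(probs.items()):
    print(key, p)
```

By construction the probabilities of all fragments sharing a root sum to one. Whether an estimate of this kind is unbiased and consistent is a separate question, and it is that question the project addresses for STGs.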

This project aims at developing and empirically evaluating a correct and efficient ML method for probability estimation for STGs. On the theoretical side, the project will result in a general algorithm and a study of its consistency. On the practical side, it will result in implementations of approximations of this algorithm that vary in efficiency, and in empirical studies of the application of these algorithms to learning parsers under existing STG-based parsing models. The project will yield prototype implementations of the learning algorithm, and of parsers based on STGs learned using this algorithm.

Publications [project completed Oct. 2006]

Related Work Employing the Outcomes of the Project

Rens Bod. Is the End of Supervised Parsing in Sight? ACL 2007 Proceedings. This paper extends the DOP* model (Zollmann & Sima'an 2006, above) into an unsupervised version (U-DOP*) that employs the same consistent estimator. The paper demonstrates the performance and efficiency advantages that this estimator provides when the training data grows very large.

MSc thesis in this project: