Principal Investigator: Khalil Sima'an
(simaan@science.uva.nl).
People involved:
M.Sc. Students graduated:
Andreas Zollmann (now CMU,
USA), Thuy Linh Nguyen (now CMU, USA) and Luciano Buratto (now U. of
Warwick, UK).
Duration: 3 years (October 2003 to September 2006).
Funding: Exacte
Wetenschappen, The Netherlands Organization for Scientific Research
(NWO).
Description:
Ambiguity constitutes a major
obstacle for the application of natural language parsers. To resolve
ambiguity,
current parsing practice employs stochastic grammars, learned from
treebanks, i.e., samples of syntactically annotated utterances. For
correct disambiguation, it is important that the probabilities of the
grammar productions are
conditioned on suitable contextual evidence. Stochastic Tree-Grammars
(STGs), in particular, provide extended, yet local, contextual
evidence, expressed in productions that are partial parse-trees. The
parsers based on STGs currently achieve state-of-the-art performance,
warranting our interest.
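As a point of reference for "stochastic grammars learned from treebanks", the sketch below estimates the rule probabilities of a plain PCFG by relative frequency. It is only an illustration under stated assumptions: the toy treebank, the nested-tuple tree encoding and the names `productions` and `estimate_pcfg` are hypothetical and are not taken from the project. It also covers only the simple case in which each production is a single rewrite step; it is not an estimator for STGs, whose productions are partial parse-trees.

```python
from collections import Counter

# Hypothetical toy treebank. A tree is (label, child, child, ...);
# a leaf (word) is a plain string.
TREEBANK = [
    ("S", ("NP", "she"), ("VP", ("V", "saw"), ("NP", "stars"))),
    ("S", ("NP", "she"), ("VP", ("V", "slept"))),
]


def productions(tree):
    """Yield one (lhs, rhs) pair for every internal node of a tree."""
    label, children = tree[0], tree[1:]
    # The right-hand side lists child labels, or the word itself for a leaf.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


def estimate_pcfg(treebank):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    counts = Counter(p for tree in treebank for p in productions(tree))
    lhs_totals = Counter()
    for (lhs, _rhs), n in counts.items():
        lhs_totals[lhs] += n
    return {(lhs, rhs): n / lhs_totals[lhs]
            for (lhs, rhs), n in counts.items()}


PROBS = estimate_pcfg(TREEBANK)
for (lhs, rhs), p in sorted(PROBS.items()):
    print(f"{lhs} -> {' '.join(rhs)}  {p:.3f}")
```

In this sketch the probabilities of all productions sharing a left-hand side sum to one, and each production is conditioned only on its left-hand side label, which is the limited contextual evidence that larger tree-shaped productions are meant to extend.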
However, the problem of learning STGs from treebanks is especially
complex, precisely because of their complex productions. In particular,
the direct application of Maximum-Likelihood (ML) for estimating the
STG probabilities from a treebank could result in biased and
inconsistent STGs. In fact, a consistent, non-trivial and efficient
probability estimation method for learning STGs from treebanks is not
yet available.
This project aims at developing and empirically evaluating a correct
and efficient ML method for probability estimation for STGs. On the
theoretical side, the project will result in a general algorithm, and a
study of its consistency. On the practical side, it will result in
implementations of approximations of this algorithm that vary in
efficiency, and in empirical studies that apply these approximations
to learning parsers under existing STG-based parsing models.
The project will yield prototype implementations of the learning
algorithm, and of parsers based on STGs learned using this algorithm.

Publications [project completed October 2006]
K. Sima'an and L. Buratto (2003). Backoff
Parameter Estimation for the DOP Model. In Proceedings
of the European Conference on Machine Learning (ECML'03),
Dubrovnik, Croatia.
M. Hearne and K. Sima'an (2003). Structured
Parameter Estimation for LFG-DOP by Backoff. In Proceedings
of the International Conference on Recent Advances in Natural Language
Processing (RANLP'03), Bulgaria.
D. Prescher, R. Scha, K. Sima'an and A. Zollmann (2004). On
the Statistical Consistency of DOP Estimators.
In Proceedings of Computational Linguistics in The Netherlands
(CLIN'04).
D. Prescher, R. Scha, K. Sima'an and A. Zollmann (2004). Treebank
Grammars and Other Infinite Parameter Models. ILLC report,
University of Amsterdam.
Andreas Zollmann and Khalil Sima'an (2005). A
Consistent and Efficient Estimator for Data-Oriented Parsing.
Journal of Automata, Languages and Combinatorics (JALC), 2005.
D. Prescher (2005). Head-Driven
PCFGs with Latent-Head Statistics. In Proceedings of the
9th International Workshop on Parsing
Technologies (IWPT 2005).
D. Prescher (2005). Inducing Head-Driven PCFGs
with Latent Heads: Refining a Treebank Grammar for Parsing.
In Proceedings of the 16th European
Conference on Machine Learning (ECML 2005), LNCS series.
D. Prescher, R. Scha, K. Sima'an and A. Zollmann (2004). What
Are Treebank Grammars? In Proceedings of BNAIC 2006.
Related Work Employing the
Outcomes of the Project
Rens Bod. Is the End of
Supervised Parsing in Sight? ACL 2007 Proceedings. This paper
extends the DOP* model (Zollmann & Sima'an 2006 above) into an
unsupervised version (UDOP*) that employs the same consistent
estimator. The paper demonstrates the performance and efficiency
advantages that this estimator provides when the training data grows
very large.
MSc thesis in this project:
