Christof Monz - Research

University of Amsterdam
Informatics Institute

The Challenge of Information Access

With the widespread availability of computers, storing vast amounts of data has become efficient and inexpensive. The availability of these resources also raises a number of questions: How do we access these large amounts of data in order to find the information we are interested in? How do we present large amounts of data in a concise way that allows one to focus on information that is essential to a given request? How do we access information that is authored in a foreign language?

My research interests cover the areas of information retrieval, document summarization, and machine translation, all of which address aspects of the issues listed above. In particular, I am interested in researching how statistical NLP techniques can help improve information processing applications, such as information retrieval, document analysis, and question answering.

Machine Translation

My work on machine translation focuses on several aspects of statistical machine translation. One of the major components of every statistical machine translation system is word alignment, where word translation probabilities are estimated. In joint work with colleagues at the University of Maryland, we showed that the alignment error rate can be reduced significantly by using transformation-based learning, a machine learning technique that iteratively reduces the alignment error by searching possible alignments constrained by rule templates [1].
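To make the alignment error rate concrete, here is a minimal sketch of the commonly used definition, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the predicted link set, S the sure gold links, and P the possible gold links. The function name and the toy link sets are assumptions made for illustration; they are not taken from [1].

```python
def alignment_error_rate(predicted, sure, possible):
    """Compute AER for sets of (source_index, target_index) links.

    Sure links are treated as a subset of the possible links.
    """
    possible = possible | sure  # every sure link also counts as possible
    if not predicted and not sure:
        return 0.0  # nothing predicted and nothing expected
    hits = len(predicted & sure) + len(predicted & possible)
    return 1.0 - hits / (len(predicted) + len(sure))


# Hypothetical toy data: two sure links, one extra possible link.
sure = {(0, 0), (1, 1)}
possible = {(2, 1)}
predicted = {(0, 0), (1, 1), (2, 2)}

aer = alignment_error_rate(predicted, sure, possible)
print(f"AER = {aer:.2f}")
```

Lower values are better; a prediction that reproduces exactly the sure links has an AER of 0.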

In [2] we combined several alignment strategies to improve alignment quality. This approach falls under the machine learning paradigm of classifier ensembles, which offers a more flexible strategy for combining evidence from different sources, including linguistic information. This again led to significant improvements in alignment quality.
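As a rough illustration of the ensemble idea, the sketch below combines the outputs of several aligners by simple voting. This is a deliberately simplified stand-in: the work in [2] combines alignments with neural networks, not with a vote count, and the function name, the `min_votes` parameter, and the toy alignments are assumptions made for this example.

```python
from collections import Counter


def combine_by_vote(alignments, min_votes):
    """Keep every link proposed by at least `min_votes` of the input alignments."""
    votes = Counter(link for alignment in alignments for link in set(alignment))
    return {link for link, count in votes.items() if count >= min_votes}


# Hypothetical outputs of three different aligners for one sentence pair.
aligner_a = {(0, 0), (1, 1)}
aligner_b = {(0, 0), (1, 2)}
aligner_c = {(0, 0), (1, 1), (2, 2)}

combined = combine_by_vote([aligner_a, aligner_b, aligner_c], min_votes=2)
print(sorted(combined))
```

Setting `min_votes` to 1 yields the union of all links and setting it to the number of aligners yields their intersection, so the threshold trades recall against precision.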

Another question is to what extent linguistic information can be beneficial to generating translations. Several approaches have been proposed in recent years. In collaboration with colleagues from Columbia University and the University of Maryland, we suggested a new approach to enrich a linguistically-driven, symbolic machine translation system with statistical information [3]. In this approach, statistically learned phrase translations were used in a symbolic system to improve its coverage and robustness.

References

[1] Necip Fazil Ayan, Bonnie J. Dorr, Christof Monz. Alignment Link Projection using Transformation-Based Learning. In Proceedings of HLT-EMNLP 2005. Pages 185-192, 2005.
[2] Necip Fazil Ayan, Bonnie J. Dorr, Christof Monz. NeurAlign: Combining Word Alignments Using Neural Networks. In Proceedings of HLT-EMNLP 2005. Pages 65-72, 2005.
[3] Nizar Habash, Bonnie J. Dorr and Christof Monz. Challenges in Building an Arabic-English GHMT System with SMT Components. In Proceedings of the 7th Biennial Meeting of the Association for Machine Translation in the Americas (AMTA-06). Pages 56-65, 2006.

Information Retrieval

I have previously pursued several strands of work in this area focusing on exploiting statistical machine translation techniques for cross-language information retrieval. One of the major issues in this area is the estimation of translation probabilities in the absence of a sizable parallel corpus. In [1] we have shown how these probabilities can be learned automatically by just inspecting corpora in the target language.

In joint work with colleagues at the University of Amsterdam I analyzed the impact of morphological normalization on monolingual document retrieval for several European languages [2,3]. This comprehensive study compared numerous morphological pre-processing strategies, including n-gram splitting, decompounding, and stemming, and their impact on retrieval effectiveness for the eight languages covered in the study.
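To illustrate one of these pre-processing strategies, the following sketch splits a word into overlapping character n-grams, which can then be indexed in place of the full word form. The n-gram length of 4 and the handling of short words are assumptions made for this example; the settings used in [2,3] are not restated here.

```python
def char_ngrams(word, n=4):
    """Split a word into overlapping character n-grams.

    Words no longer than n characters are returned unchanged.
    """
    word = word.lower()
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(char_ngrams("retrieval"))
print(char_ngrams("web"))
```

Because morphological variants of a word share most of their n-grams, this kind of splitting gives a language-independent approximation of stemming and decompounding.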

In addition to cross-language retrieval, colleagues at the University of Amsterdam and I also compared a number of retrieval strategies for web retrieval, in particular strategies for finding home pages, as opposed to arbitrary web pages [4]. This work showed that combining evidence from different sources, including the document layout and the context of a document, i.e. information from pages with incoming links, improves retrieval performance.

References

[1] Christof Monz and Bonnie J. Dorr. Iterative Translation Disambiguation for Cross-Language Information Retrieval. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Pages 520-527, 2005.
[2] Vera Hollink, Jaap Kamps, Christof Monz, and Maarten de Rijke. Monolingual Document Retrieval for European Languages. Information Retrieval 7(1), pp. 33-52.
[3] Christof Monz and Maarten de Rijke. Shallow Morphological Analysis in Monolingual Information Retrieval for Dutch, German and Italian. In: Post-Conference Proceedings of the Cross Language Evaluation Forum Workshop (CLEF 2001), Springer Lecture Notes in Computer Science.
[4] Jaap Kamps, Christof Monz, Maarten de Rijke, and Börkur Sigurbjörnsson. Approaches to Robust and Web Retrieval. In: Proceedings of TREC 2003, pp. 594-600.

Question Answering

The research area of Question Answering (QA) is a more specific case of information retrieval. Instead of returning a list of documents in response to a user's query, the aim of a QA system is to return an actual answer to a specific question. This requires methods that are able to analyze the user's information need, i.e. the question, and the content of documents potentially holding an answer at a much deeper level than standard document retrieval. As deeper analysis is computationally expensive, most state-of-the-art QA systems use document retrieval techniques to identify a smaller subset of documents that are likely to contain an answer and restrict deeper analysis to these documents [1]. My work on Question Answering investigated which document retrieval techniques are best suited to accomplish this task.

In [2] I showed that some retrieval approaches that are known to improve document retrieval effectiveness, such as blind relevance feedback, perform less well in the context of question answering and can even substantially decrease effectiveness. One of the main differences from standard document retrieval is that, for finding documents that contain an answer to a question, the relevant information is expressed much more locally, in a sentence or two. In [3] I introduced minimal span weighting, a novel and robust proximity-based measure that is sensitive to the degree to which relevant terms occur in close proximity.
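The notion of proximity behind a span-based measure can be sketched as follows: find the shortest window of document tokens that covers every query term occurring in the document, then relate the number of matched terms to the window length. This is only an illustration of the underlying idea; the function names and the simple ratio are assumptions made for this example and do not reproduce the weighting scheme defined in [3].

```python
from collections import Counter


def minimal_matching_span(doc_tokens, query_terms):
    """Return (start, end) of the shortest window covering all query terms
    that occur in the document, or None if no query term occurs."""
    needed = set(query_terms) & set(doc_tokens)
    if not needed:
        return None
    counts = Counter()
    covered = 0
    best = None
    left = 0
    for right, token in enumerate(doc_tokens):
        if token not in needed:
            continue
        counts[token] += 1
        if counts[token] == 1:
            covered += 1
        # Shrink the window from the left while it still covers all terms.
        while covered == len(needed):
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)
            left_token = doc_tokens[left]
            if left_token in needed:
                counts[left_token] -= 1
                if counts[left_token] == 0:
                    covered -= 1
            left += 1
    return best


def span_ratio(doc_tokens, query_terms):
    """Illustrative proximity score: matched query terms per token of the span."""
    span = minimal_matching_span(doc_tokens, query_terms)
    if span is None:
        return 0.0
    matched = len(set(query_terms) & set(doc_tokens))
    return matched / (span[1] - span[0] + 1)


# Hypothetical document and query.
doc = "the capital of france is paris and paris is large".split()
query = ["capital", "france", "paris", "berlin"]

span = minimal_matching_span(doc, query)
score = span_ratio(doc, query)
print(span, score)
```

A document in which the matching terms sit close together gets a ratio near 1, while a document in which they are scattered gets a ratio near 0, which matches the observation that answers tend to be expressed within a sentence or two.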

References

[1] Christof Monz and Maarten de Rijke. Tequesta: The University of Amsterdam's Textual Question Answering System. In: Proceedings of Tenth Text Retrieval Conference (TREC-10), 2001, pp. 513-522.
[2] Christof Monz. Document Retrieval in the Context of Question Answering. In: Fabrizio Sebastiani (Ed.) Proceedings of the 25th European Conference on Information Retrieval Research (ECIR-03), Lecture Notes in Computer Science 2633, Springer, 2003, pp. 571-579.
[3] Christof Monz. Minimal Span Weighting Retrieval for Question Answering. In: Rob Gaizauskas, Mark Greenwood, and Mark Hepple (Eds.) Proceedings of the SIGIR Workshop on Information Retrieval for Question Answering, pp. 23-30, 2004.

Summarization

The aim of summarization is to provide a user with an automatically generated summary that captures the content of the document. In the context of information retrieval, summarization allows a user to determine quickly whether a retrieved document is relevant with respect to the user's information need. Together with colleagues at the University of Maryland and BBN Technologies, we compared metrics for evaluating summarization performance in an extrinsic task in a novel way that reduces inter-assessor disagreement [1].

In a second line of research on summarization, we applied sentence trimming techniques based on parsing information to generate multi-document summaries [2]. The same techniques have also been applied to summarize email threads from the Enron email corpus. In earlier work [3], I developed an approach that merges multiple topically related news items, thereby reducing the redundancy between them while increasing the coverage.

References

[1] Stacy President Hobson, Bonnie J. Dorr, Christof Monz and Richard Schwartz. Task-Based Evaluation of Text Summarization using Relevance Prediction. Information Processing and Management, Special Issue on Summarization, Volume 43(6), pp. 1482-1499, 2007.
[2] David Zajic, Bonnie Dorr, Richard Schwartz, Christof Monz, Jimmy Lin. A Sentence-Trimming Approach to Multi-Document Summarization. In Proceedings of the HLT-EMNLP 2005 Workshop on Text Summarization. 2005.
[3] Christof Monz. Document Fusion for Comprehensive Event Description. In: M. Maybury (Ed.) Proceedings of the ACL 2001 Workshop on Human Language Technology and Knowledge Management, 2001.

Document Analysis

Together with colleagues at the University of Amsterdam, we developed an approach for detecting reading order in scanned document images [1,2,3]. Our approach was the first attempt to successfully combine layout information and textual content. Earlier approaches relied on templates for specific journals or magazines, whereas this work is more general and only assumes a left-to-right, top-to-bottom reading direction. Our model integrates insights from different research areas: logical labeling, constraint satisfaction, and natural language processing.

References

[1] Marco Aiello, Christof Monz, Leon Todoran and M. Worring. Document Understanding for a Broad Class of Documents. International Journal of Document Analysis and Recognition 5(1), 2002, pp. 1-16.
[2] Leon Todoran, Marco Aiello, Christof Monz and Marcel Worring. Logical Structure Detection for Heterogeneous Document Classes. In: Proceedings of Document Recognition and Retrieval VIII, SPIE, 2001, pp. 99-110.
[3] Marco Aiello, Christof Monz and Leon Todoran. Combining Linguistic and Spatial Information for Document Analysis. In: J. Mariani and D. Harman (Eds.) Proceedings of RIAO'2000 Content-Based Multimedia Information Access, CID, 2000, pp. 266-275.