The Challenge of Information Access
With the widespread availability of computers, storing vast amounts of
data has become efficient and inexpensive. The availability of these
resources also raises a number of questions: How do we access these
large amounts of data in order to find the information we are
interested in? How do we present large amounts of data in a concise way
that allows one to focus on information that is essential to a given
request? How do we access information that is authored in a foreign
language?
My research interests cover the areas of information retrieval,
document summarization, and machine translation, all of which address
aspects of the issues listed above. In particular, I am interested in
researching how statistical NLP techniques can help improve
information processing applications, such as information retrieval,
document analysis, and question answering.
Machine Translation
My work on machine translation focuses on a number of aspects of
statistical machine translation. One of the major components of each
statistical machine translation system is word alignment, where word
translation probabilities are estimated. In joint work with colleagues
at the University of Maryland, we showed [1] that the
alignment error rate can be reduced significantly by using
transformation-based learning, a machine learning technique that aims
to iteratively reduce the alignment error by searching possible
alignments constrained by rule templates.
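As a point of reference for the metric mentioned above, the following is a minimal sketch of how alignment error rate is commonly computed in the word alignment literature, from a set of "sure" gold links S, a set of "possible" gold links P, and a hypothesized alignment A. The function name and the toy link sets are illustrative assumptions; the text above does not state which exact variant of the metric was used in [1].

```python
def alignment_error_rate(hypothesis, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|).

    Links are (source_index, target_index) pairs. Sure links are
    treated as possible links too, as is conventional.
    """
    a = set(hypothesis)
    s = set(sure)
    p = set(possible) | s
    if not a and not s:
        return 0.0  # nothing to align and nothing proposed
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


# Toy data (hypothetical): two sure links, one possible link.
sure = {(0, 0), (1, 1)}
possible = {(2, 1)}
hypothesis = {(0, 0), (1, 1), (2, 2)}  # one spurious link: (2, 2)

aer = alignment_error_rate(hypothesis, sure, possible)
```

Lower values are better; an alignment that reproduces exactly the sure links has an error rate of 0.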
In [2] we combined several alignment strategies to improve
alignment quality. This approach falls under the machine learning
paradigm of classifier ensembles, which offers a more flexible strategy
for combining evidence from different sources, including linguistic
information. This again led to significant improvements in
alignment quality.
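To illustrate the general idea of combining the output of several aligners, here is a deliberately simple sketch that keeps a link if enough aligners propose it. Note that this swaps in plain voting for illustration only: as its title indicates, the system in [2] combines alignments with neural networks, which is not reproduced here. The function name, the `min_votes` parameter, and the toy alignments are all assumptions.

```python
from collections import Counter


def combine_alignments(alignments, min_votes):
    """Keep every link proposed by at least `min_votes` input alignments.

    Each alignment is a collection of (source_index, target_index) links.
    """
    votes = Counter(link for alignment in alignments for link in set(alignment))
    return {link for link, count in votes.items() if count >= min_votes}


# Three hypothetical aligner outputs for the same sentence pair.
a1 = {(0, 0), (1, 1), (2, 2)}
a2 = {(0, 0), (1, 2), (2, 2)}
a3 = {(0, 0), (1, 1)}

combined = combine_alignments([a1, a2, a3], min_votes=2)
```

A learned combiner can go beyond voting by conditioning on additional features of each link, which is where linguistic information could enter.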
Another question is to what extent linguistic information can be
beneficial to generating translations. Several approaches have been
proposed in recent years. In collaboration with colleagues from
Columbia University and the University of Maryland, we suggested a new
approach that enriches a linguistically driven, symbolic machine
translation system with statistical information [3]. In this approach,
statistically learned phrase translations were used in a symbolic
system to improve its coverage and robustness.
References
- Necip Fazil Ayan, Bonnie J. Dorr, Christof Monz.
Alignment Link Projection using Transformation-Based Learning.
In Proceedings of HLT-EMNLP 2005. Pages 185-192, 2005.
- Necip Fazil Ayan, Bonnie J. Dorr, Christof Monz.
NeurAlign: Combining Word Alignments Using Neural Networks.
In Proceedings of HLT-EMNLP 2005. Pages 65-72, 2005.
- Nizar Habash, Bonnie J. Dorr and Christof Monz.
Challenges in Building an Arabic-English GHMT System
with SMT Components. In Proceedings of the 7th Biennial Meeting of the
Association for Machine Translation in the Americas (AMTA-06).
Pages 56-65, 2006.
Information Retrieval
I have previously pursued several strands of work in this area
focusing on exploiting statistical machine translation techniques for
cross-language information retrieval. One of the major issues in this
area is the estimation of translation probabilities in the absence of
a sizable parallel corpus. In [1] we have shown how these
probabilities can be learned automatically by just inspecting corpora
in the target language.
In joint work with colleagues at the University of Amsterdam, I
analyzed the impact of morphological normalization on monolingual
document retrieval for several European languages [2,3]. This
comprehensive study compared numerous morphological pre-processing
strategies, including n-gram splitting, decompounding, and stemming,
and their impact on retrieval effectiveness for the eight languages
covered in the study.
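Of the pre-processing strategies listed above, n-gram splitting is the easiest to show in a few lines: each word is replaced by its overlapping character n-grams, so that morphological variants share index terms without any language-specific rules. The sketch below is a generic illustration; the n-gram length of 4 and the handling of short words are assumptions, not the settings used in the study.

```python
def char_ngrams(word, n=4):
    """Return the overlapping character n-grams of a word.

    Words no longer than n are kept whole (an assumption of this sketch).
    """
    word = word.lower()
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_tokens(text, n=4):
    """Split whitespace-separated text into character n-gram index terms."""
    return [gram for word in text.split() for gram in char_ngrams(word, n)]


grams = char_ngrams("retrieval")
```

Here "retrieval" and "retrieve" share the terms "retr", "etri", "trie", and "riev", so a query using one form can still match documents using the other.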
In addition to cross-language retrieval, colleagues at
the University of Amsterdam and I compared a number of retrieval
strategies for web retrieval, in particular strategies for finding
home pages, as opposed to arbitrary web pages [4]. This work showed
that combining evidence from different sources, including the document
layout and the context of a document, i.e. information from pages with
incoming links, improves retrieval performance.
References
- Christof Monz and Bonnie J. Dorr. Iterative Translation Disambiguation
for Cross-Language Information Retrieval. In Proceedings of the
28th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval. Pages 520-527, 2005.
- Vera Hollink, Jaap Kamps, Christof Monz, and Maarten de Rijke.
Monolingual Document Retrieval for European Languages.
Information Retrieval 7(1), p. 33-52.
- Christof Monz and Maarten de Rijke. Shallow Morphological Analysis in
Monolingual Information Retrieval for Dutch, German and Italian.
In: Post-Conference Proceedings of the
Cross Language Evaluation Forum Workshop (CLEF 2001),
Springer Lecture Notes in Computer Science.
- Jaap Kamps, Christof Monz, Maarten de Rijke, and Börkur Sigurbjörnsson.
Approaches to Robust and Web Retrieval.
In: Proceedings of TREC 2003, p. 594-600.
Question Answering
The research area of Question Answering (QA) is a more specific case
of information retrieval. Instead of returning a list of documents to
a user's query, the aim of a QA system is to return an actual answer
to a specific question. This requires methods that can analyze
the user's information need, i.e. his or her question, and the content
of documents potentially holding an answer at a much deeper level than
standard document retrieval does. As deeper analysis is computationally
expensive, most state-of-the-art QA systems use document retrieval
techniques to identify a smaller subset of documents that are likely
to contain an answer and restrict deeper analysis to these documents
[1]. My work on Question Answering investigated
which document retrieval techniques are best suited to accomplish this
task.
In [2] I showed that some retrieval approaches that are known to
improve document retrieval effectiveness, such as blind relevance
feedback, perform less well in the context of question answering
and can even substantially decrease effectiveness. One of the main
differences from standard document retrieval is that for finding
documents that contain an answer to a question, the relevant
information is expressed much more locally, in a sentence or two. In [3]
I introduced minimal span weighting, a novel and robust proximity-based
measure that is sensitive to the degree to which relevant terms occur
in close proximity.
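To make the notion of proximity concrete, the sketch below finds the shortest token window in a document that contains every query term occurring in that document, and turns it into a simple density score. This is one plausible reading of "minimal span" and is offered only as an illustration: the function names and the final ratio are assumptions, and the score is not the weighting scheme defined in [3].

```python
from collections import Counter


def minimal_matching_span(query_terms, doc_tokens):
    """Return (start, end) token indices of the shortest window containing
    every query term that occurs in the document, or None if none occurs."""
    targets = set(query_terms) & set(doc_tokens)
    if not targets:
        return None
    counts = Counter()
    covered = 0   # number of distinct target terms inside the window
    best = None
    left = 0
    for right, token in enumerate(doc_tokens):
        if token not in targets:
            continue
        counts[token] += 1
        if counts[token] == 1:
            covered += 1
        # Shrink from the left while the window still covers all targets.
        while covered == len(targets):
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)
            left_token = doc_tokens[left]
            if left_token in targets:
                counts[left_token] -= 1
                if counts[left_token] == 0:
                    covered -= 1
            left += 1
    return best


def span_proximity(query_terms, doc_tokens):
    """Illustrative score: matched query terms divided by span length."""
    span = minimal_matching_span(query_terms, doc_tokens)
    if span is None:
        return 0.0
    matched = len(set(query_terms) & set(doc_tokens))
    return matched / (span[1] - span[0] + 1)


doc = "the capital of france is paris and paris is large".split()
query = ["capital", "france", "paris"]
span = minimal_matching_span(query, doc)
score = span_proximity(query, doc)
```

A document in which the matching terms are packed into one sentence receives a higher score than one in which the same terms are scattered, which reflects the locality of answers described above.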
References
- Christof Monz and Maarten de Rijke. Tequesta: The University of
Amsterdam's Textual Question Answering System.
In: Proceedings of Tenth Text Retrieval Conference (TREC-10), 2001,
pp. 513-522.
- Christof Monz. Document Retrieval in the Context of Question Answering.
In: Fabrizio Sebastiani (Ed.) Proceedings of the 25th European Conference
on Information Retrieval Research (ECIR-03), Lecture Notes in Computer
Science 2633, Springer, 2003, pages 571-579.
- Christof Monz. Minimal Span Weighting Retrieval for Question Answering.
In: Rob Gaizauskas, Mark Greenwood, and Mark Hepple (eds.) Proceedings of
the SIGIR Workshop on Information Retrieval for Question Answering,
p. 23-30, 2004.
Summarization
The aim of summarization is to provide a user with an automatically
generated summary that captures the content of the document. In the
context of information retrieval, summarization allows a user to
determine quickly whether a retrieved document is relevant with
respect to the user's information need.
Together with colleagues at the University
of Maryland and BBN Technologies, we compared metrics for evaluating
summarization performance in an extrinsic task in a novel way that reduces
inter-assessor disagreement [1].
In a second line of research on summarization, we
applied sentence trimming techniques based on parsing information to
generate multi-document summaries [2]. The same techniques have also been
applied to summarize email threads from the Enron email corpus. In
earlier work [3], I developed an approach that merges
multiple topically related news items, thereby
reducing the amount of redundancy between them while increasing
coverage.
References
- Stacy President Hobson, Bonnie J. Dorr, Christof Monz and Richard Schwartz.
Task-Based Evaluation of Text Summarization using Relevance Prediction.
Information Processing and Management, Special Issue on Summarization,
Volume 43(6), pp. 1482-1499, 2007.
- David Zajic, Bonnie Dorr, Richard Schwartz, Christof Monz, Jimmy Lin.
A Sentence-Trimming Approach to Multi-Document Summarization.
In Proceedings of the HLT-EMNLP 2005 Workshop on Text Summarization.
2005.
- Christof Monz. Document Fusion for Comprehensive Event Description.
In: M. Maybury (ed.) Proceedings of the ACL 2001 Workshop on Human
Language Technology and Knowledge Management, 2001.
Document Analysis
Together with colleagues at the University of Amsterdam we developed
an approach for detecting reading order in scanned document images
[1,2,3]. Our approach was the first attempt to successfully
combine layout information and textual content. Earlier approaches
relied on templates for specific journals/magazines, whereas this work
is more general, and only assumes a left-to-right, top-to-bottom
reading direction. Our model nicely integrates insights from different
research areas: logical labeling, constraint satisfaction and natural
language processing.
References
- Marco Aiello, Christof Monz, Leon Todoran and Marcel Worring.
Document Understanding for a Broad Class of
Documents. International Journal of Document Analysis and
Recognition 5(1), 2002, pages 1-16.
- Leon Todoran, Marco Aiello, Christof Monz and Marcel Worring.
Logical Structure Detection for Heterogeneous Document Classes.
In: Proceedings of Document Recognition and Retrieval VIII, SPIE,
2001, pp. 99-110.
- Marco Aiello, Christof Monz and Leon Todoran. Combining Linguistic
and Spatial Information for Document Analysis. In: J. Mariani and
D. Harman (Eds.) Proceedings of RIAO'2000 Content-Based Multimedia
Information Access, CID, 2000. pp. 266-275.