Language is Empirical Evidence,
The Rest is Belief

Why Natural Language Processing (NLP) or Computational Linguistics (CL)
in Computer Science (CS)?
Although CL and NLP have been mainstream CS topics for some time
now in many places around the world, some CS colleagues still
wonder why CL/NLP squarely belong in the CS curriculum. Those who are
unfamiliar with CL/NLP might be able to find
a suitable motivation for this in the following arguments.
How can we make such fantastic computer programs that conduct earthly
tasks such as text/speech translation or summarization, take
spoken/written orders from humans, or conduct intelligent search for
information in large text collections? The CL and NLP communities develop
theories, models and techniques of such human cognitive skills,
implement them in computer systems, and empirically evaluate these
theories and models against the gold standard: human language use. While
this study is
multidisciplinary by definition, topics such as the statistical modeling
of incomplete data belong also in
CS (e.g., under machine learning).
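To make the notion of modeling incomplete data concrete, here is a minimal sketch (in Python, chosen only for illustration) of Expectation-Maximization on a toy problem: two biased coins whose identity per session is unobserved. The counts and initial guesses are invented, not taken from any of the readings below.

    import math

    # Heads observed in 10 flips per session; which of the two coins was
    # flipped in each session is the missing (unobserved) part of the data.
    sessions = [9, 8, 2, 7, 1]
    flips = 10
    theta = [0.6, 0.5]  # initial guesses for the biases of coins A and B

    def likelihood(heads, n, p):
        # Binomial likelihood of seeing `heads` heads in `n` flips.
        return math.comb(n, heads) * p**heads * (1 - p)**(n - heads)

    for _ in range(20):
        # E-step: posterior responsibility of coin A for each session.
        h_a = t_a = h_b = t_b = 0.0
        for h in sessions:
            la = likelihood(h, flips, theta[0])
            lb = likelihood(h, flips, theta[1])
            wa = la / (la + lb)
            h_a += wa * h
            t_a += wa * (flips - h)
            h_b += (1 - wa) * h
            t_b += (1 - wa) * (flips - h)
        # M-step: re-estimate each bias from its expected counts.
        theta = [h_a / (h_a + t_a), h_b / (h_b + t_b)]

    print(theta)  # one estimate drifts high, the other low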
The concept of a language
is central to Computer Science: formal languages (or sets) constitute
the mathematical foundations of computer programs. Crucially, natural
language processing can be seen as a means for reviewing how this
apparatus is put to the
test, and how it can be insufficient
for real-world tasks (e.g., cognitive capacities like visual and
language perception, predicting physical phenomena like the weather,
or modeling expert behavior such as medical diagnosis). I think that this
is instructive for a computer scientist: formal
languages are the paradigm in ``traditional" Computer Science, while
natural languages
may constitute a paradigm for real-world tasks in ``modern" Computer
Science.
Uncertainty is the main problem in such
tasks. In NLP, uncertainty manifests itself at all levels in the form
of ambiguity (both in the input to and output from a system).
I think that language ambiguity constitutes a natural
manifestation of the problem of uncertainty in general. Solutions to
ambiguity in NLP are based on probabilistic models, and on the interplay
between algorithms, statistics and linguistic knowledge, which might be an
example of how to devise systems that mimic human behavior, as opposed to
traditional, synthetic systems that carry their test of success within.
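As a toy illustration of resolving ambiguity probabilistically, consider choosing a part-of-speech tag for an ambiguous word by relative frequency. The word, tagset and counts below are hypothetical; a real system would estimate them from an annotated corpus.

    from collections import Counter

    # Hypothetical (word, tag) counts; a real system would gather these
    # from a manually annotated corpus.
    counts = Counter({("saw", "VERB"): 180, ("saw", "NOUN"): 20})

    def best_tag(word, tags=("VERB", "NOUN")):
        total = sum(counts[(word, t)] for t in tags)
        # Relative-frequency estimate of P(tag | word).
        scores = {t: counts[(word, t)] / total for t in tags}
        return max(scores, key=scores.get), scores

    print(best_tag("saw"))  # ('VERB', {'VERB': 0.9, 'NOUN': 0.1})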
Linguistic constructs (e.g., grammars) are not too wild for computer
scientists.
It is exactly the combination of such knowledge-based constructs with
algorithms and statistics that makes NLP so interesting: it offers an
excellent demonstration
that, despite the ``arbitrariness" of linguistic categories, these
categories might still be
good approximations of prior knowledge, especially when this
prior knowledge is incorporated in a probabilistic model, and weighted by
statistical estimation from actual data.
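The following sketch illustrates this interplay under invented numbers: grammar rules supply the prior linguistic categories, while hypothetical treebank counts supply the weights, here as PCFG rule probabilities used to compare two parses of the classic PP-attachment ambiguity.

    # Hypothetical treebank counts for a handful of grammar rules; the
    # categories (VP, NP, ...) are the knowledge-based constructs, the
    # counts supply the statistical weights.
    rule_counts = {
        ("VP", ("V", "NP")): 50,
        ("VP", ("V", "NP", "PP")): 30,
        ("NP", ("Det", "N")): 70,
        ("NP", ("NP", "PP")): 10,
    }

    def prob(rule):
        # Relative-frequency estimate of P(rhs | lhs).
        lhs = rule[0]
        total = sum(c for (l, _), c in rule_counts.items() if l == lhs)
        return rule_counts[rule] / total

    # "... saw the man with the telescope": does the PP modify the VP or
    # the NP? Rules shared by both parses cancel out of the comparison.
    vp_attach = prob(("VP", ("V", "NP", "PP")))
    np_attach = prob(("VP", ("V", "NP"))) * prob(("NP", ("NP", "PP")))
    print(vp_attach, np_attach)  # 0.375 vs 0.078: VP attachment wins here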
Statistics is all about models that can predict what happens with future
events by
analogy to events encountered in the past; i.e., it is models that
learn from experience (data; see also Machine Learning) that
are so important for natural language processing.
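A minimal sketch of such learning from experience, assuming nothing beyond a made-up toy corpus: a bigram model whose predictions about the next word are just relative frequencies of past events.

    from collections import Counter

    # A hypothetical toy corpus standing in for "past experience".
    corpus = "the dog barks . the dog sleeps . the cat sleeps .".split()
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)

    def p_next(word, nxt):
        # Relative-frequency estimate of P(next word | current word);
        # no smoothing, so unseen events get probability zero.
        return bigrams[(word, nxt)] / unigrams[word]

    print(p_next("the", "dog"))  # 2/3: the past favors "dog" after "the"
    print(p_next("the", "cat"))  # 1/3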
Some readings (Why
Probabilistic Models in NLP):
- Fred Jelinek. Five speculations and a divertimento on the themes
of H ...
- Fernando Pereira. Formal
grammar and information theory: Together again? Philosophical
Transactions of the Royal Society,
358(1769):1239-1253, April 2000.
- Khalil Sima'an. Empirical
validity and technological viability: Probabilistic models of Natural
Language Processing. In R. Bernardi and M. Moortgat (eds.),
Corpora and Logic Based Grammar Formalisms,
CoLogNET Area 6, 2003.
- Steven Abney. Statistical methods. Encyclopedia
of Cognitive Science,
Macmillan. To appear.
- Christopher D. Manning. 2003. Probabilistic Syntax.
In Rens Bod, Jennifer Hay, and Stefanie Jannedy (eds.), Probabilistic
Linguistics, pp. 289-341. Cambridge, MA: MIT Press.
[The version here is from 12 Jan 2002; it doesn't reflect final
revisions, corrections, and copyediting that appear in the published
version.]
- Hermann Ney. Corpus-Based
Statistical Methods in Speech and Language Processing. In S. Young
and G. Bloothooft (eds.), Corpus-Based Statistical Methods in Speech
and Language Processing, pp. 1-26.