Statistics is Empirical Evidence, The Rest is Belief
Why Teach Natural Language Processing (NLP) or Computational Linguistics (CL) in Computer Science (CS)?
While CL and NLP have been mainstream CS topics for some time now in many places around the world, some CS colleagues still wonder why CL/NLP belong squarely in the CS curriculum. Those who are unfamiliar with CL/NLP may find suitable motivation in the following arguments.
- How else can we make such fantastic computer programs that carry out everyday tasks such as text/speech translation or summarization, take spoken or written orders from humans, or conduct intelligent search for information in large text collections? The CL and NLP communities develop theories, models and techniques of such human cognitive skills, implement them in computer systems, and empirically evaluate these theories and models against the gold standard of human language use. While this study is multidisciplinary by definition, topics such as statistical modeling of incomplete data also belong in CS (e.g., under machine learning).
- The concept of a language is central to Computer Science: formal languages (i.e., sets of strings) constitute the mathematical foundations of computer programs. Crucially, natural language processing can be seen as a means for reviewing how this apparatus is put to the test, and how it can be insufficient for real-world tasks (e.g., cognitive capacities like visual and language perception, predicting physical phenomena like the weather, or modeling expert behavior such as medical diagnosis). I think that this is instructive for a computer science student!
- As formal languages are the paradigm in "traditional" Computer Science, natural languages may constitute a paradigm for real-world tasks in "modern" Computer Science. Uncertainty is the main problem in such complex tasks. In NLP, uncertainty manifests itself at all levels in the form of ambiguity (both in the input to and the output from a system). I think that language ambiguity constitutes a natural, intuitive manifestation of the problem of uncertainty in general. Solutions to ambiguity in NLP are based on probabilistic models, and on the interplay between learning algorithms, statistics and linguistic knowledge, which might be an instructive example of how to devise systems that mimic human behavior, as opposed to traditional, synthetic systems that carry their own test of success within.
- Linguistic constructs (e.g., grammars) are not too wild for computer scientists. It is exactly the combination of such knowledge-based constructs with probability and statistics that makes NLP so interesting: it offers an excellent example that, despite the "arbitrariness" of linguistic categories, these categories might still be good approximations of prior knowledge, especially when this prior knowledge is incorporated in a probabilistic model and weighted by statistical estimation from actual data (see the sketch after this list).
- In short, it is all about models that can predict future events by analogy to events encountered in the past; i.e., it is models that learn from experience (data; see also Machine Learning) that are so important for natural language processing.
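As a minimal sketch of this interplay (in plain Python; the grammar rules and counts are invented here purely for illustration, not taken from the readings below): a few hand-written rules serve as the prior linguistic knowledge, their probabilities are estimated by relative frequency from toy counts, and the resulting model resolves the classic PP-attachment ambiguity in "saw the man with the telescope" by preferring the more probable analysis.

```python
from collections import defaultdict

# Toy rule counts, as if gathered from a small treebank (invented numbers).
# The PP-attachment ambiguity: a PP may modify the verb (VP -> V NP PP)
# or the noun (NP -> NP PP).
rule_counts = {
    ("VP", ("V", "NP", "PP")): 30,  # verb attachment: saw [the man] [with the telescope]
    ("VP", ("V", "NP")): 70,
    ("NP", ("NP", "PP")): 20,       # noun attachment: [the man [with the telescope]]
    ("NP", ("Det", "N")): 80,
}

# Relative-frequency estimation: P(rule) = count(rule) / count(left-hand side).
lhs_totals = defaultdict(int)
for (lhs, _), count in rule_counts.items():
    lhs_totals[lhs] += count
prob = {rule: count / lhs_totals[rule[0]] for rule, count in rule_counts.items()}

# The two competing analyses, listed as the grammar rules each one uses.
analyses = {
    "verb attachment": [("VP", ("V", "NP", "PP")), ("NP", ("Det", "N"))],
    "noun attachment": [("VP", ("V", "NP")), ("NP", ("NP", "PP")), ("NP", ("Det", "N"))],
}

def derivation_probability(rules_used):
    """Probability of a derivation = product of its rule probabilities."""
    p = 1.0
    for rule in rules_used:
        p *= prob[rule]
    return p

for name, rules_used in analyses.items():
    print(f"{name}: {derivation_probability(rules_used):.3f}")
```

With these toy counts the model prefers verb attachment (0.240 vs. 0.112); change the counts and the preference flips. That is the sense in which the statistics supply the empirical evidence, while the grammar encodes the beliefs.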
Khalil Sima'an
Some readings (Why Probabilistic Models in NLP):
- Frederick Jelinek. Five speculations and a divertimento on the themes of H ...
- Fernando Pereira. Formal grammar and information theory: Together again? Philosophical Transactions of the Royal Society, 358(1769):1239-1253, April 2000.
- Khalil Sima'an. Empirical validity and technological viability: Probabilistic models of Natural Language Processing. In R. Bernardi and M. Moortgat (eds.), Linguistic Corpora and Logic Based Grammar Formalisms, CoLogNET Area 6, 2003.
- Steven Abney. Statistical methods. Encyclopedia of Cognitive Science, Nature Publishing Group, Macmillan. To appear.
- Christopher D. Manning. 2003. Probabilistic Syntax. In Rens Bod, Jennifer Hay, and Stefanie Jannedy (eds.), Probabilistic Linguistics, pp. 289-341. Cambridge, MA: MIT Press. [The version here is from 12 Jan 2002; it doesn't reflect final revisions, corrections, and copyediting that appear in the published version.]
- Hermann Ney. Corpus-Based Statistical Methods in Speech and Language Processing. In S. Young and G. Bloothooft (eds.), Corpus-Based Statistical Methods in Speech and Language Processing, pp. 1-26.