Statistical Structure in NLP 2013
Dr Khalil Sima’an
Dr Bart Mellebeek
Sophie Arnoult (MSc)
Format This is a research-oriented course. A number of lectures are given by the lecturers and by invited speakers. Depending on the number of students, the aim is to have the students work in pairs on an article from the list below (see the schedule):
- prepare a 30-minute presentation of the article,
- present the article to fellow students,
- lead and contribute to a discussion about the article.
We will likely have a workshop at the end of May where all students present their project work.
Homework (for all students, every session) Read and prepare the articles before each session. You are expected to participate in the discussions during the session and to prepare questions about the articles. You may be questioned about the articles, or even asked to sketch some technical aspects of an article on the board, should the need arise.
Projects There will be two main programming projects, on which students work in pairs. Details will follow soon. The first project takes place in April and the second in May.
Grades An interpolation of the grades for the presentation, homework and project work. The interpolation factors will be determined once the course format is finalized (see format).
Schedule Below is the schedule for the meetings. Each entry gives the lecture topic, with the names of the lecturer or chair (when students present) in brackets. The schedule is still somewhat fluid because of invited speakers from academia and industry.
Schedule (under construction)
Project I (Delivery date 28 April; Deliver report according to guidelines)
- Given: Language pair Dutch-English from Europarl with word alignments.
- Build an efficient tool for extracting all phrase pairs up to length 4 from a given word-aligned training parallel corpus, together with their joint and conditional (two-directional) probability estimates.
- Explore the coverage (sparsity) of the phrase table by computing the percentage of phrase pairs in a held-out set which
- (a) are available in the training set phrase table, or
- (b)** can be built by concatenating (in any order) n > 0 phrase pairs from the training-set phrase table; it is important to set an upper bound on n, e.g., n <= 3, which makes this more feasible to execute.
- Build a baseline phrase-based MT system: start from the word alignment of the parallel corpus and proceed towards a full system using Giza++ and Moses, then deploy BLEU evaluation tools on test and development sets.
- Deliver an article (max. 4 pages) describing the project work in article format: abstract, introduction/motivation/background, problem description, possible solutions and the selected solution, details of your own solution(s) with algorithmic aspects, discussion of existing work on the same or related topics, implementation issues and experimental setting, results and analysis, conclusions.
** This is difficult to do efficiently, but worthwhile exploring to understand the difficulty of reordering (think about permutations of a range of integers). What is the complexity of building all possible permutations (orders) of the string of integers <1, 2, …, n>?
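The phrase-extraction step above can be sketched with the standard consistency criterion from the Koehn, Och and Marcu (2003) paper read in this course: a source span and a target span form a phrase pair if at least one alignment link connects them and no alignment link leaves the pair. The function names and the alignment representation (a set of (source index, target index) pairs) below are our own choices, and the sketch omits the usual extension with unaligned boundary words:

```python
from collections import Counter

def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """Extract all phrase pairs up to max_len words per side that are
    consistent with the word alignment (no link crosses the pair boundary)."""
    pairs = []
    n = len(src)
    for i1 in range(n):
        for i2 in range(i1, min(i1 + max_len, n)):
            # Target positions linked to the source span [i1, i2].
            linked = [t for (s, t) in alignment if i1 <= s <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: no target word in [j1, j2] may link outside [i1, i2].
            if any(j1 <= t <= j2 and not (i1 <= s <= i2) for (s, t) in alignment):
                continue
            pairs.append((tuple(src[i1:i2 + 1]), tuple(tgt[j1:j2 + 1])))
    return pairs

def conditional_probs(all_pairs):
    """Relative-frequency estimate p(f | e) over extracted (e, f) pairs;
    the reverse direction p(e | f) is obtained symmetrically."""
    pair_counts = Counter(all_pairs)
    src_counts = Counter(e for (e, f) in all_pairs)
    return {(e, f): c / src_counts[e] for (e, f), c in pair_counts.items()}
```

For example, with src = "het huis".split(), tgt = "the house".split() and alignment {(0, 0), (1, 1)}, the extractor yields three pairs: ("het", "the"), ("huis", "house") and ("het huis", "the house"). In the real project the counts would be accumulated over the whole Europarl corpus before estimating probabilities.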
12 April 9:00-11:00
Basics Phrase-Based SMT and related models (One Session Sophie + Khalil + Bart)
- Phrase based SMT (Koehn, Och and Marcu 2003)
Joint Phrase-based SMT models (Milos + Khalil + Sophie)
- Joint Phrase-based SMT (Wang and Marcu 2005)
- OR Data Oriented Translation (Poutsma 2000)
15 April 9:00-11:00
Hierarchical MT basics (Khalil + Gideon + Sophie+Milos)
- ITG (Dekai Wu's 1997 CL-J paper): Dekai Wu. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora.
19 April 9:00-11:00
Hierarchical MT basics
- Hiero model (David Chiang 2007)
- Extra: David Chiang's SCFG tutorial: An introduction to synchronous grammars, part of a tutorial given at ACL 2006 with Kevin Knight.
22 April 9:00-11:00
Syntax driven SMT (first sample)
- Yamada and Knight 2001 "A Syntax-Based Statistical Translation Model"
- OR Tree-let MT models (Microsoft Research: Quirk, Menezes and Cherry ACL 2005)
26 April 9:00-11:00 NO LECTURE EXAM WEEK
3 May 9:00-11:00 (Sophie + Bart)
Meeting dedicated to wrap-up first project and discussion of the second project.
6 May 9:00-11:00
MT and Data Adaptation (One session: Bart + Milos)
- Mixture-Model Adaptation for SMT (Foster and Kuhn, 2007)
- OR Instance Weighting for Domain Adaptation in NLP (Jiang and Zhai, 2007)
13 May 9:00-11:00 (Milos + Bart + Sophie)
MT and discriminative models + global lexical selection
- Srinivas Bangalore, Patrick Haffner, Stephan Kanthak: Statistical Machine Translation through Global Lexical Selection and Sentence Reconstruction. ACL 2007
- Extra: Abraham Ittycheriah, Salim Roukos: Direct Translation Model 2. HLT-NAACL
17 May 9:00-11:00
MT and Learning source reordering models (Maxim+Khalil+Bart+Sophie)
Guest lecture by Dr. Maxim Khalilov (TAUS Labs.):
- Source-reordering with fixed rules (Collins, Koehn and Kucerova 2005)
- Extra: Source reordering and the permutation problem (Tromble and Eisner 2009)
- Source reordering with syntax by learning transducers: (Khalilov and Sima'an 2013)
21 May 13:00-15:00
Syntax for SMT (Sample 2): Lexical Syntax (CCG and TAG)
Introduction to lexical syntax (TAG, CCG)
- Learning ITG reordering with syntax (Mylonakis and Sima'an 2010; 2011).
- Hassan et al approach ACL 2007: Syntax for phrase-based SMT
26 May 9:00-11:00
Last lecture: MT and paraphrasing + Wrap up of course material.
- Paraphrasing for increasing MT coverage (Chris Callison-Burch et al.)
- OR Paraphrasing via MT (Chris Callison-Burch et al.)
Reserve theoretical papers
Factorization of Permutations, word alignments into synchronous trees (One session)
- Normalized Decomposition Trees by Zhang, Gildea and Chiang (Coling 2008).
- Optional: Hierarchical Alignment Trees: word alignments as synchronous trees (Sima'an and MdBW). To be supplied by the lecturers.
- Factorization of synchronous context-free grammars in linear time (2007) by Hao Zhang and Daniel Gildea.
- Reserve paper: 'An Overview of Probabilistic Tree Transducers for Natural Language Processing' (Knight and Graehl, CICLing 2005)
Student lectures (dates and papers)
15 April Cabot and Nugteren (ITG)
19 April Iepsma and van Bellen (Hiero)
22 April Voskaridis and van Belle (Yamada and Knight)
13 May Mennes and Jongstra (Bangalore et al)
17 May van Balen and Schulz (Tromble and Eisner)
21 May Gooijer and van Rossum (Hassan et al ACL 2007)??
26 May Nagel and Ridzik (Paraphrasing)??
Tasks and procedure
- Everyone reads the papers being presented, not only the presenters.
- 20 minutes per student, approximately 12-14 slides: do not make dense slides.
- Build your lecture up like the paper does: start from the general background, explain the problem, and move towards related work, the suggested solution, experiments, analysis, and conclusions.
- If you have any questions, talk to the assistants
- Put the slides on Blackboard at least two days ahead of the lecture so we can have a look at them and make suggestions for improvement.
- Send a notification to Khalil, Bart and Sophie as soon as you put the slides on Blackboard
- We will keep track of the quality of the slides, the lecture, the content and the discussion brought up, and the questions that students have.
Chapter 4 of this 2007 EuroMatrix report: