Statistical Structure in NLP 2013

Dr Khalil Sima’an 

Dr Bart Mellebeek

Sophie Arnoult (MSc)

Format  This is a research-oriented course. A number of lectures are given by the lecturers and invited speakers. Depending on the number of students, the aim is to have students work in pairs on an article from the list below (see the schedule):

  1. prepare a 30-minute presentation of the article,
  2. present the article to fellow students,
  3. lead and contribute to a discussion about the article.

We will likely have a workshop on a day at the end of May where all students present their project work.

Homework (for all students, every session)  Read and prepare the articles before the session. You are expected to prepare questions about the articles and to participate in the discussions during the session. You may be questioned about the articles, or asked to sketch some technical aspects of an article on the board, should the need arise.

Projects There will be two main programming projects, on which students work in pairs. Details will follow soon. The first project is in April and the second in May.

Grades The final grade is a weighted combination (interpolation) of the grades for presentation, homework and project work. The weights will be set as soon as the format is finalized (see Format).

Schedule Below is the schedule for the meetings. Each entry gives the lecture topic followed, in brackets, by the names of the lecturer or of the chair (when students present). The schedule is still somewhat fluid because of invited speakers from academia and industry.


Schedule (under construction)

2 April 13:00-15:00 

Introduction, schedule and basics of SMT (Khalil)

An Overview of Latent structure in Machine Translation Models: from word-alignments to tree-alignments through phrase-based models (slides)

  1. Read the IBM models paper. Extra: the reference paper on the IBM alignment models.
  2. Follow the practical session on Thursday: a worked-out example of IBM Models I (and II) with EM.
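
As a preview of the kind of example covered in the practical session, the EM procedure for IBM Model I can be sketched as below. This is a minimal illustration and not the course material itself: the function name `train_ibm1`, the three-sentence German-English toy corpus, and the omission of the NULL word are all simplifying assumptions.

```python
from collections import defaultdict

def train_ibm1(corpus, iterations=10):
    """Estimate lexical probabilities t(f | e) with EM.

    corpus: list of (f_tokens, e_tokens) sentence pairs.
    The NULL word is left out to keep the sketch short.
    """
    f_vocab = {f for f_sent, _ in corpus for f in f_sent}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialisation

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of e being linked
        # E-step: distribute each f over the e words of its sentence pair
        for f_sent, e_sent in corpus:
            for f in f_sent:
                z = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise the expected counts per e
        t = defaultdict(float, {fe: c / total[fe[1]] for fe, c in count.items()})
    return t

toy_corpus = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
t = train_ibm1(toy_corpus)
```

On this toy corpus the probability mass for "house" shifts from "das" towards "Haus" over the iterations, because "das" is better explained by "the", which co-occurs with it twice.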

5 April 9:00-11:00   

Word Alignment Models and Symmetrized alignments.

  1. HMM alignment model (Och & Ney 2003): Franz Josef Och, Hermann Ney: A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics 29(1): 19-51 (2003). In this paper, read the first sections about word alignment. Of the seven models presented, you will read only three: the HMM model and IBM Models I and II. Have a look at IBM Model III as well. Study the symmetrization techniques in Section 4, and read the experiments and results.

8 April 9:00-11:00 

Background of EBMT

  1. Example-Based MT (Makoto Nagao, 1984)

MT evaluation (One Session -- Bart + Sophie)

  1. Evaluation metrics: BLEU, METEOR and a few other metrics. See chapter 4 of the linked EuroMatrix project report.
  2. Extra: Linguistic measures for MT Evaluation (Gimenez and Marquez 2008)
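
To make the BLEU metric concrete, here is a minimal sketch of corpus-level BLEU: clipped n-gram precisions for n = 1..4, combined by a geometric mean and multiplied by a brevity penalty. It assumes a single reference per sentence, pre-tokenized input and no smoothing; the function names `ngrams` and `bleu` are illustrative and do not correspond to any particular evaluation tool.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with one reference per candidate sentence."""
    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # candidate n-grams, per order
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        cand_len += len(cand)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            c, r = ngrams(cand, n), ngrams(ref, n)
            # clip each candidate n-gram count by its count in the reference
            matches[n - 1] += sum(min(cnt, r[g]) for g, cnt in c.items())
            totals[n - 1] += sum(c.values())
    if min(matches) == 0:
        return 0.0  # unsmoothed: any empty order gives a zero score
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # brevity penalty: penalise candidates shorter than the reference
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return bp * math.exp(log_prec)

ref = "the cat sat on the mat".split()
score_same = bleu([ref], [ref])
score_short = bleu(["the cat sat on".split()], [ref])
```

In the second call every n-gram of the candidate occurs in the reference, so the score is lowered by the brevity penalty alone.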

Project I (Delivery date 28 April; Deliver report according to guidelines)

  1. Given: the Dutch-English language pair from Europarl, with word alignments.
  2. Build an efficient tool that extracts all phrase pairs up to length 4 from a given word-aligned parallel training corpus, together with their joint and conditional (both directions) probability estimates.
  3. Explore the coverage (sparsity) of the phrase table by computing the percentage of phrase pairs in a held-out set which
     (a) are available in the training set phrase table, or
     (b)** can be built by concatenating (in any order) n>0 phrase pairs from the training set phrase table. It is important to set an upper bound on n, e.g., n<=3, which makes this more feasible to compute.
  4. Build a baseline (phrase-based) MT system: start from the word alignment of the parallel corpus and proceed towards a full system using Giza++ and Moses, then run the BLEU evaluation tools on the test and development sets.
  5. Deliver an article (max 4 pages) that describes the project work in article format: abstract, introduction/motivation/background, problem description, possible solutions and selected solution, details of your own solution(s) with algorithmic aspects, discussion of existing work on the same or a related topic, implementation issues and experimental setting, results and analysis, conclusions.

** This is difficult to do efficiently, but worth exploring to understand the difficulty of reordering (think about permutations of a range of integers). What is the complexity of building all possible permutations (orders) of the string of integers <1, 2, …, n>?
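
As a starting point for step 2 of Project I, the sketch below extracts the phrase pairs that are consistent with a word alignment and estimates their probabilities by relative frequency. It is a simple, unoptimized illustration rather than the required efficient tool: the names `extract_phrase_pairs` and `estimate`, the representation of the alignment as a set of (source position, target position) links, and the two-word example are all assumptions.

```python
from collections import Counter

def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """Phrase pairs (as strings) consistent with the alignment links (i, j)."""
    aligned_tgt = {j for _, j in alignment}
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # target positions linked to the source span [i1, i2]
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            # consistency: no target word in [j1, j2] is linked outside the span
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            # also allow unaligned target words at either boundary
            lo = j1
            while lo >= 0 and (lo == j1 or lo not in aligned_tgt):
                hi = j2
                while hi < len(tgt) and (hi == j2 or hi not in aligned_tgt):
                    if hi - lo + 1 <= max_len:
                        pairs.append((" ".join(src[i1:i2 + 1]),
                                      " ".join(tgt[lo:hi + 1])))
                    hi += 1
                lo -= 1
    return pairs

def estimate(pairs):
    """Map each pair (s, t) to (p(s, t), p(t | s), p(s | t)) by relative frequency."""
    joint = Counter(pairs)
    src_count = Counter(s for s, _ in pairs)
    tgt_count = Counter(t for _, t in pairs)
    n = sum(joint.values())
    return {p: (c / n, c / src_count[p[0]], c / tgt_count[p[1]])
            for p, c in joint.items()}

pairs = extract_phrase_pairs(["het", "huis"], ["the", "house"], {(0, 0), (1, 1)})
table = estimate(pairs)
```

For a corpus, the pairs from all sentence pairs would be pooled before calling `estimate`. The inner scan over all alignment links for every span is what a more efficient tool would avoid.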

12 April 9:00-11:00 

Basics of Phrase-Based SMT and related models (One Session: Sophie + Khalil + Bart)

  1. Phrase-based SMT (Koehn, Och and Marcu 2003)

Joint Phrase-based SMT models (Milos + Khalil + Sophie)

  1. Joint Phrase-based SMT (Wang and Marcu 2005)
  2. OR Data Oriented Translation (Poutsma 2000)

15 April 9:00-11:00

Hierarchical MT basics (Khalil + Gideon + Sophie + Milos)

  1. ITG (Dekai Wu's 1997 Computational Linguistics paper): Dekai Wu. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora.

19 April 9:00-11:00  

Hierarchical MT basics

  1. Hiero model (David Chiang 2007)
  2. Extra: David Chiang's SCFG tutorial: An introduction to synchronous grammars (part of a tutorial given at ACL 2006 with Kevin Knight).

22 April 9:00-11:00

Syntax driven SMT (first sample)

  1. Yamada and Knight 2001 "A Syntax-Based Statistical Translation Model"
  2. OR Treelet MT models (Microsoft Research: Quirk, Menezes and Cherry, ACL 2005)

26 April 9:00-11:00 NO LECTURE (EXAM WEEK)

3 May 9:00-11:00 (Sophie + Bart)

Meeting dedicated to wrapping up the first project and discussing the second project.

6 May 9:00-11:00

MT and Data Adaptation (One session: Bart + Milos)

  1. Mixture-Model Adaptation for SMT (Foster and Kuhn, 2007)
  2. OR Instance Weighting for Domain Adaptation in NLP (Jiang and Zhai, 2007)

13 May 9:00-11:00 (Milos + Bart + Sophie)

MT and discriminative models + global lexical selection

  1. Srinivas Bangalore, Patrick Haffner, Stephan Kanthak: Statistical Machine Translation through Global Lexical Selection and Sentence Reconstruction. ACL 2007
  2. Extra: Abraham Ittycheriah, Salim Roukos: Direct Translation Model 2. HLT-NAACL

17 May 9:00-11:00

MT and learning source reordering models (Maxim + Khalil + Bart + Sophie)

    Guest lecture by Dr. Maxim Khalilov (TAUS Labs.):

  1. Source reordering with fixed rules (Collins, Koehn and Kucerova 2005)
  2. Extra: Source reordering and the permutation problem (Tromble and Eisner 2009)
  3. Source reordering with syntax by learning transducers (Khalilov and Sima'an 2013)

21 May 13:00-15:00

Syntax for SMT (Sample 2): Lexical Syntax (CCG and TAG)

  Introduction to lexical syntax (TAG, CCG)

  1. Learning ITG reordering with syntax (Mylonakis and Sima'an 2010; 2011).
  2. Syntax for phrase-based SMT (Hassan et al., ACL 2007).

26 May 9:00-11:00

Last lecture: MT and paraphrasing + wrap-up of course material.

  1. Paraphrasing for increasing MT coverage (Chris Callison-Burch et al.)
  2. OR Paraphrasing via MT (Chris Callison-Burch et al.)

Reserve theoretical papers

Factorization of permutations and word alignments into synchronous trees (One session)

  1. Normalized Decomposition Trees by Zhang, Gildea and Chiang (Coling 2008).
  2. Optional: Hierarchical Alignment Trees: Word Alignments as synchronous trees (Sima'an and MdBW). To be supplied by the lecturers.
  3. Factorization of synchronous context-free grammars in linear time (2007) by Hao Zhang and Daniel Gildea.
  4. Reserve paper: 'An Overview of Probabilistic Tree Transducers for Natural Language Processing' (Knight and Graehl, CICLing 2005)

Reserve Extra: Techniques for optimal disambiguation/decoding algorithms

  1. Joshua Goodman. Parsing algorithms and metrics. In Proceedings of the 34th Annual Meeting of the ACL, pages 177-183, Santa Cruz, CA, June 1996.
  2. Khalil Sima'an. On Maximizing Metrics for Syntactic Disambiguation. In Proceedings of the International Workshop on Parsing Technologies (IWPT'03), Nancy, France, April 2003.
  3. Advanced reading: MBR for MT: MBR (Kumar and Byrne 2004); Lattice MBR

Student lectures (dates and papers)

15 April   Cabot and Nugteren (ITG)

19 April   Iepsma  and van Bellen (Hiero)

22 April   Voskaridis and van Belle (Yamada and Knight)

13 May    Mennes and Jongstra (Bangalore et al)

17 May    van Balen and Schulz (Tromble and Eisner)

21 May    Gooijer  and van Rossum (Hassan et al ACL 2007)??

26 May    Nagel and Ridzik (Paraphrasing)??

Tasks and procedure

  1. Everyone reads the papers being presented, not only the presenters.
  2. 20 minutes per student, approximately 12-14 slides; avoid dense slides.
  3. Structure your lecture the way the paper does: start from the general context, explain the problem, then move on to related work, the proposed solution, experiments, analysis and conclusions.
  4. If you have any questions, talk to the assistants.
  5. Put the slides on Blackboard at least two days ahead of the lecture and let us know so we can have a look at the slides and make suggestions for improvement.
  6. Send a notification to Khalil, Bart and Sophie as soon as you put the slides on Blackboard
  7. We will keep track of the quality of the slides, the lecture, the content and the discussion brought up, and the questions that students have.

[a] Sophie Arnoult: Chapter 4 of this 2007 EuroMatrix report: http://www.euromatrix.net/deliverables/Euromatrix_D1.3_Revised.pdf