Neural machine translation (NMT) systems have difficulty translating idiomatic expressions. Little work has been done on building parallel corpora annotated with idioms, which are necessary to investigate this problem more systematically. We automatically build a new bilingual data set for idiom translation, extracted from an existing general-purpose German↔English parallel corpus.
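To give a rough idea of what extracting idiom sentence pairs from a parallel corpus can look like, here is a minimal sketch. It is not the procedure used in the paper: it assumes a hypothetical list of idiom phrases and does plain case-insensitive phrase matching on the source side, so it would miss inflected or reordered idioms. The function name `find_idiom_pairs` and the toy sentences are made up for illustration.

```python
import re

def find_idiom_pairs(pairs, idioms):
    """Return (source, target, idiom) triples whose source sentence contains an idiom.

    Assumption: idioms appear verbatim in the source sentence. Real idioms are
    often inflected or split, which this naive matcher does not handle.
    """
    patterns = [
        (idiom, re.compile(r"\b" + re.escape(idiom) + r"\b", re.IGNORECASE))
        for idiom in idioms
    ]
    found = []
    for src, tgt in pairs:
        for idiom, pattern in patterns:
            if pattern.search(src):
                found.append((src, tgt, idiom))
    return found

# Toy parallel data (hypothetical, not taken from the released data set).
pairs = [
    ("This is only the tip of the iceberg.", "Das ist nur die Spitze des Eisbergs."),
    ("The iceberg is very large.", "Der Eisberg ist sehr groß."),
]
idioms = ["tip of the iceberg"]  # hypothetical idiom list

matches = find_idiom_pairs(pairs, idioms)
for src, tgt, idiom in matches:
    print(idiom, "|", src, "|", tgt)
```

Only the first toy pair is selected, because the second sentence mentions an iceberg literally without containing the idiom phrase.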

You can find the published work here.

This example showcases the problem of idiom translation:

The statistics of the data:

The data is available here.

To cite this paper:

   @ARTICLE{2018arXiv180204681F,
   author = {Fadaee, Marzieh  and  Bisazza, Arianna  and  Monz, Christof},
   title = "{Examining the Tip of the Iceberg: A Data Set for Idiom Translation}",
   journal = {ArXiv e-prints},
   archivePrefix = "arXiv",
   eprint = {1802.04681},
   primaryClass = "cs.CL",
   keywords = {Computer Science - Computation and Language},
   year = 2018,
   month = feb,
   adsurl = {http://adsabs.harvard.edu/abs/2018arXiv180204681F},
   adsnote = {Provided by the SAO/NASA Astrophysics Data System}
   }