Neural Machine Translation models perform best when an abundance of parallel data is available, but acquiring human translations for low-resource language pairs is costly. With little training data, low-frequency words are seen in too few contexts, so their translations are difficult to learn and often inaccurate.

In this approach, we alter existing parallel sentences so that they contain low-frequency words. This augments the data with new, diverse contexts for those words and their corresponding translations.
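The substitution step described above can be sketched roughly as follows. This is a minimal illustration, not the published implementation: the helper names (`rare_words`, `augment_pair`), the dictionary-based alignment and lexicon formats, and the toy sentence pair are all assumptions made for the example. How the substitution position and the new word are chosen is not described here, so they are simply passed in as arguments.

```python
from collections import Counter


def rare_words(sentences, max_count):
    """Return the set of tokens that occur at most `max_count` times."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return {tok for tok, c in counts.items() if c <= max_count}


def augment_pair(src, tgt, alignment, position, new_src_word, lexicon):
    """Put `new_src_word` at `position` in the source sentence and replace
    the aligned target token with its translation.

    alignment: dict mapping source index -> target index (hypothetical format)
    lexicon:   dict mapping source word -> target word (hypothetical format)
    Returns (new_src, new_tgt), or None if the position is unaligned or the
    new word has no translation in the lexicon.
    """
    src_toks, tgt_toks = src.split(), tgt.split()
    if position not in alignment or new_src_word not in lexicon:
        return None
    src_toks[position] = new_src_word
    tgt_toks[alignment[position]] = lexicon[new_src_word]
    return " ".join(src_toks), " ".join(tgt_toks)


# Toy example: place the (assumed rare) word "goat" in a new context.
src = "the cat sat on the mat"
tgt = "die Katze saß auf der Matte"
alignment = {i: i for i in range(6)}   # monotone one-to-one alignment
lexicon = {"goat": "Ziege"}            # toy translation table

new_pair = augment_pair(src, tgt, alignment, 1, "goat", lexicon)
print(new_pair)
```

A real pipeline would also need to check that the altered sentences remain fluent on both sides; this sketch leaves that check out.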

Training with the augmented bitext achieves significant BLEU improvements in a simulated low-resource English⬌German translation setting.

You can find the published work here.

The model we propose:

We evaluate translation quality on the English⬌German translation task. Here are the results:

As a result of our approach, rare words are generated more often in the correct context:

We observe that increasing the size of the training data by diversifying the context of rare words yields better translations. The model also generates correct rare words more often during translation, and the attention scores of rare words are on average 8.8% higher than in the baseline model. The ratio of generated translation length to reference length is on average 7% higher.
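For readers who want to compute similar statistics on their own system output, here is a small sketch. It assumes whitespace-tokenized hypotheses and references, and it computes the length ratio at corpus level. Both choices are assumptions, as are the function names and the definition of a "correct" rare word as one that also appears in the reference; the exact procedure behind the figures above is not specified here.

```python
def length_ratio(hypotheses, references):
    """Total hypothesis tokens divided by total reference tokens."""
    hyp_len = sum(len(h.split()) for h in hypotheses)
    ref_len = sum(len(r.split()) for r in references)
    return hyp_len / ref_len


def rare_word_hits(hypotheses, references, rare):
    """Count rare tokens in each hypothesis that also occur in its reference."""
    hits = 0
    for hyp, ref in zip(hypotheses, references):
        ref_toks = set(ref.split())
        hits += sum(1 for tok in hyp.split() if tok in rare and tok in ref_toks)
    return hits


# Toy data, for illustration only.
hyps = ["die Ziege saß auf der Matte", "ein Hund"]
refs = ["die Ziege saß auf der Matte", "ein Hund bellt"]
rare = {"Ziege", "bellt"}

ratio = length_ratio(hyps, refs)
hits = rare_word_hits(hyps, refs, rare)
print(ratio, hits)
```

Comparing these numbers between a baseline and an augmented system gives a rough view of whether translations became longer and whether more rare words were produced in the right sentences.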

The code is available here.

To cite this paper:

 author    = {Fadaee, Marzieh  and  Bisazza, Arianna  and  Monz, Christof},
 title     = {Data Augmentation for Low-Resource Neural Machine Translation},
 booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
 month     = {July},
 year      = {2017},
 address   = {Vancouver, Canada},
 publisher = {Association for Computational Linguistics},
 pages     = {567--573},
 url       = {}