My research interests mainly concern statistical machine translation (SMT) and, to a lesser extent, speech recognition.

Improving SMT between distant languages

On many language pairs, the phrase-based SMT framework remains highly competitive despite recent advances in tree-based approaches. Most of my work so far has focused on phrase-based SMT and on two of its most challenging aspects: word reordering and morphology.

Considering that in many languages parsing resources are still scarce and inaccurate, I have refrained from using syntactic trees in both pre-processing and decoding phases. Instead, I have used other kinds of annotation that are simpler to automate, less prone to errors, and available in more languages: namely, part-of-speech tagging and shallow syntax chunking.

Word reordering.   While short- and medium-range reordering is handled reasonably well by the phrase-based approach, either within or across adjacent phrases, long-range reordering still represents a challenge for current SMT systems. A major cause of this problem is that existing reordering models and constraints cope poorly with the distribution of reordering phenomena in different language pairs. During my PhD I developed several techniques to improve the performance of existing phrase-based systems on language pairs, like Arabic-English and German-English, where long-range reordering concentrates on a few patterns.

In the first technique (Bisazza and Federico, WMT, 2010), likely long-range word movements for a given input sentence are predicted by a small set of chunk-based rules, and represented as a word reordering lattice. The lattices are then translated by a regular decoder that can perform additional short and medium reordering. In (Bisazza et al., 2012) further improvements were achieved by pruning the lattices with an SVM-based classifier.

The second technique (Bisazza and Federico, ACL, 2012) aims at adjusting the reordering constraints for each new input sentence. The same chunk-based rules are used to predict likely reorderings, which are suggested to the decoder in a novel way: namely, by modifying the distortion cost matrix for specific pairs of input words. Compared to the lattice solution, modified distortion matrices provide a more compact, implicit way to encode likely reorderings in a sentence-specific fashion.

The third technique (Bisazza and Federico, TACL, 2013) does not require any language-specific reordering rule, but learns to dynamically shape the reordering search space in a fully data-driven way. More specifically, the reordering options admitted by loose reordering constraints are pruned on-the-fly through a binary classifier that predicts whether a given input word should be translated right after another.
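The idea of a sentence-specific distortion matrix can be illustrated with a minimal sketch. It assumes the standard distance-based distortion cost |j - i - 1| for translating source position j right after position i. The function names, the reduced cost of 0, and the example jumps are assumptions made for illustration; they are not taken from the cited papers.

```python
from typing import List, Tuple


def distortion_matrix(n: int) -> List[List[int]]:
    """Distance-based distortion cost: cost[i][j] = |j - i - 1| is the cost
    of translating source position j right after source position i."""
    return [[abs(j - i - 1) for j in range(n)] for i in range(n)]


def apply_suggested_jumps(
    cost: List[List[int]],
    jumps: List[Tuple[int, int]],
    reduced_cost: int = 0,  # assumed value, chosen for illustration
) -> List[List[int]]:
    """Return a copy of the matrix in which every suggested jump (i, j)
    costs at most reduced_cost. The input matrix is left unchanged."""
    modified = [row[:] for row in cost]
    for i, j in jumps:
        modified[i][j] = min(modified[i][j], reduced_cost)
    return modified


# Toy 6-word input with a clause-final verb at position 5 that should be
# translated right after position 1; decoding then resumes at position 2.
base = distortion_matrix(6)
suggested = [(1, 5), (5, 2)]  # hypothetical output of the reordering rules
modified = apply_suggested_jumps(base, suggested)

print(base[1][5], modified[1][5])  # cost of the long jump before and after
```

Only the cells for the suggested word pairs change, so all other reorderings keep their usual cost and the decoder itself needs no modification beyond reading the matrix.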

Morphology.   Minimizing differences between lexical granularities of source and target language is an effective way to obtain finer word alignments and to better model the translation task. In (Bisazza and Federico, IWSLT, 2009) I addressed the problem of complex agglutinative morphology on the source side of SMT. I focused on the Turkish-English pair and designed several schemes of morphological segmentation to apply to the Turkish training and test data.
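As a rough illustration of what a segmentation scheme does, the sketch below splits selected suffixes off a Turkish word so that its granularity is closer to English. The input format (a list of morph and tag pairs), the tag names, and the schemes are hypothetical; a real system would obtain the analysis from a morphological analyzer, and the schemes in the cited paper may differ.

```python
from typing import List, Set, Tuple

Analysis = List[Tuple[str, str]]  # (surface morph, tag), root first


def segment(analysis: Analysis, split_tags: Set[str]) -> List[str]:
    """Split off every suffix whose tag is in split_tags (marked with '+');
    all other suffixes stay attached to the preceding token."""
    tokens: List[str] = []
    for k, (morph, tag) in enumerate(analysis):
        if k == 0:
            tokens.append(morph)
        elif tag in split_tags:
            tokens.append("+" + morph)
        else:
            tokens[-1] += morph
    return tokens


# "evlerimizde" = ev (house) + ler (plural) + imiz (our) + de (in)
word = [("ev", "ROOT"), ("ler", "PL"), ("imiz", "POSS"), ("de", "LOC")]

print(segment(word, set()))            # no segmentation
print(segment(word, {"LOC"}))          # split the case suffix only
print(segment(word, {"POSS", "LOC"}))  # split possessive and case suffixes
```

With the last scheme, the three tokens line up more naturally with the English words in "in our houses", which is the kind of effect that finer word alignments rely on.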

More recently I have approached the challenges of translating from English into morphologically rich languages (MRL). When rich morphology is on the target side, one has to deal with data sparsity (e.g. surface forms missing from the models) but also with a notably harder translation selection task. For instance, in highly inflectional languages like Russian, the choice of the right inflection often depends on source and/or target clues that lie outside the current phrase pair or target n-gram.

To capture morphological dependencies on the target side, I have started to work within the well-established framework of class-based language modeling (LM). In (Bisazza and Monz, COLING, 2014) I presented the first systematic comparison of different forms of class-LMs and different class-LM combination methods in the context of SMT into an MRL. The results of this work suggest that class-LM is a promising technique to improve translation into MRL. However, the best performance was obtained with standard Brown classes and not with morphologically motivated classes, which shows that more work is needed to properly integrate morphology in the target LM.
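For readers unfamiliar with class-based LMs, here is a minimal sketch of the classic class bigram factorization, P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) · P(w_i | c_i). It uses unsmoothed maximum-likelihood estimates and a hand-made word-to-class map standing in for Brown clusters. The class and method names are assumptions, and this is not necessarily one of the model forms compared in the cited paper.

```python
import math
from collections import Counter
from typing import Dict, List

BOS = "<s>"  # sentence-start symbol used as the first class context


class ClassBigramLM:
    """Unsmoothed class bigram LM: P(c_i | c_{i-1}) * P(w_i | c_i)."""

    def __init__(self, word2class: Dict[str, str]) -> None:
        self.word2class = word2class
        self.class_bigram: Counter = Counter()
        self.class_context: Counter = Counter()
        self.word_count: Counter = Counter()
        self.class_count: Counter = Counter()

    def train(self, sentences: List[List[str]]) -> None:
        for sent in sentences:
            prev = BOS
            for w in sent:
                c = self.word2class[w]
                self.class_bigram[(prev, c)] += 1
                self.class_context[prev] += 1
                self.word_count[w] += 1
                self.class_count[c] += 1
                prev = c

    def logprob(self, sent: List[str]) -> float:
        """Log-probability of a sentence. No smoothing: unseen class
        transitions or unseen words raise an error in this toy version."""
        total = 0.0
        prev = BOS
        for w in sent:
            c = self.word2class[w]
            p_class = self.class_bigram[(prev, c)] / self.class_context[prev]
            p_word = self.word_count[w] / self.class_count[c]
            total += math.log(p_class * p_word)
            prev = c
        return total


# Hypothetical classes; in practice they would come from Brown clustering.
classes = {"the": "D", "a": "D", "cat": "N", "dog": "N",
           "sleeps": "V", "runs": "V"}
lm = ClassBigramLM(classes)
lm.train([["the", "cat", "sleeps"], ["a", "dog", "runs"]])

# "the dog" never occurs in training, yet the class sequence D N V does.
score = lm.logprob(["the", "dog", "sleeps"])
print(math.exp(score))
```

The example shows why classes help with sparsity: a word sequence that was never observed still receives probability mass as long as its class sequence was seen.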

Finally, exploiting source-side context for inflection prediction is the subject of our work on neural network (NN) translation modeling (Tran, Bisazza and Monz, EMNLP, 2014). In this paper, we proposed a novel NN architecture to predict the stem and suffix translation of a word in context. We obtained very good results in terms of prediction accuracy and showed how this model can be tightly integrated into a phrase-based SMT system, leading to promising improvements in translation quality.

Other research topics

My other research contributions lie in two areas: domain and style adaptation for SMT, and speech recognition for morphologically rich languages.

Domain and genre adaptation for SMT.   In (Bisazza et al., IWSLT, 2011) I considered the common scenario where SMT models can only be trained on a small amount of in-domain data but large generic corpora are also available. I empirically compared several ways to combine multiple translation models and found a simple but effective technique, called fill-up, to exploit background knowledge and improve model coverage, while preserving the more reliable information coming from the in-domain model.
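The core of fill-up can be sketched in a few lines: in-domain entries are kept as they are, and background entries are added only for phrase pairs the in-domain table lacks. The dictionary-based phrase table, the function name, and the extra provenance score (1.0 for in-domain, e for background) are assumptions of this sketch rather than a description of the actual implementation.

```python
import math
from typing import Dict, List, Tuple

PhrasePair = Tuple[str, str]               # (source phrase, target phrase)
PhraseTable = Dict[PhrasePair, List[float]]  # pair -> feature scores


def fill_up(in_domain: PhraseTable, background: PhraseTable) -> PhraseTable:
    """Merge two phrase tables. In-domain scores always win; background
    entries only fill the gaps. An extra score marks each entry's origin
    (assumed encoding: exp(0) = 1.0 in-domain, exp(1) = e background)."""
    merged: PhraseTable = {}
    for pair, scores in in_domain.items():
        merged[pair] = scores + [1.0]
    for pair, scores in background.items():
        if pair not in merged:
            merged[pair] = scores + [math.e]
    return merged


# Toy tables with two made-up scores per phrase pair.
in_dom = {("casa", "house"): [0.7, 0.6]}
backgr = {("casa", "house"): [0.4, 0.3], ("casa", "home"): [0.2, 0.1]}

table = fill_up(in_dom, backgr)
for pair, scores in table.items():
    print(pair, scores)
```

Because overlapping entries are never interpolated, the in-domain estimates remain untouched, while the provenance score lets the tuning step learn how much to trust the filled-in entries.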

In (Bisazza and Federico, EACL, 2012) I addressed the problem of language style modeling for the translation of public conference talks. Given the diversity of topics in this task, the in-domain data alone cannot ensure sufficient coverage for the SMT models. Therefore, large out-of-domain data collections are typically used. To counteract the bias towards an unsuitable language style caused by the large background LM, I proposed the use of a hybrid class-based LM, where only infrequent words are mapped into classes.
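The hybrid mapping can be sketched as a preprocessing step applied to the in-domain text before LM training: frequent words are kept as they are, and the rest are replaced by a class label. The frequency threshold, the class inventory, and the function names are hypothetical choices made for this example.

```python
from collections import Counter
from typing import Dict, List, Set


def frequent_words(sentences: List[List[str]], min_count: int) -> Set[str]:
    """Words that occur at least min_count times in the training text."""
    freq = Counter(w for sent in sentences for w in sent)
    return {w for w, n in freq.items() if n >= min_count}


def to_hybrid(sent: List[str], keep: Set[str],
              word2class: Dict[str, str]) -> List[str]:
    """Keep frequent words; map every other word to its class
    (or to '<unk>' when no class is known for it)."""
    return [w if w in keep else word2class.get(w, "<unk>") for w in sent]


# Toy in-domain text: style-bearing words repeat, topic words do not.
talks = [
    ["i", "think", "that", "photosynthesis", "matters"],
    ["i", "think", "that", "democracy", "matters"],
]
classes = {"photosynthesis": "NOUN", "democracy": "NOUN"}  # assumed classes

keep = frequent_words(talks, min_count=2)
hybrid = [to_hybrid(sent, keep, classes) for sent in talks]
print(hybrid[0])
```

An n-gram LM trained on the mapped text captures recurring stylistic patterns such as "i think that NOUN matters" without being tied to the specific topic words of each talk.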

More recently, we have worked on clarifying the concept of domain, which is commonly used but only vaguely defined in the SMT adaptation literature. More specifically, in (van der Wees et al., ACL, 2015) we disentangled the effect of two distinct properties of text -- genre and topic -- using a special data set with controlled topic and genre distributions. In (van der Wees et al., ACL-WNUT, 2015) we looked more closely at the challenging task of translating user-generated text, and analyzed in detail the behavior of a state-of-the-art SMT system on five different types of informal text.

Speech recognition for morphologically rich languages.   During my years at Fondazione Bruno Kessler I also had the chance to take part in a few speech recognition projects. I adapted an existing ASR system to the Arabic and Turkish languages, working primarily at the level of morphological segmentation for language modeling. In particular, for Turkish, I designed an unsupervised method to lexicalize suffixes and recover their surface form in context, to address the problem of data sparseness caused by suffix allomorphy (Bisazza and Gretter, LREC-Workshop, 2012). For Arabic, I developed a method to perform diacritization/vocalization of the recognition grammar on-the-fly (Bisazza and Gretter, LTC, 2013).