Word & Sentence Representations and Responsible AI

There is currently much discussion (finally!) about the need to study the social consequences of the widespread adoption of natural language processing technology, and to pay much more attention to ethics in our education. One important issue is the effect of racist, sexist and other biases present in the data that we train our NLP algorithms on. Are the AI systems we build as biased as humans are? Or even worse? Does that lead to real-world consequences?

These are difficult questions, it turns out. But if identifying bias is hard, fixing it is even harder. Simple fixes that try to balance the dataset often don’t work well, because the accuracy of our systems drops when data is removed or artificial data is added.

There is an interesting parallel between the need to both identify and correct bias (w.r.t. ethnicity, gender, sexual orientation, religion, class, education), and the quest to identify and influence how modern deep learning systems represent linguistic category information (number, gender, determinacy, animacy, case, etc.). In this post I would like to explore the extent to which the approach we have been developing for the latter can also be applied to achieve the former.

Diagnostic classification

Let’s start with that prior work. The approach we have developed to answer linguistic questions about deep learning systems is called ‘diagnostic classification’. The idea is that we train a deep learning model (henceforth the ‘target model’) for some natural language task, for instance predicting the next word in a sentence. With current techniques (e.g., two-layer LSTMs, a clever training regime, and enormous datasets) we have become really good at that task. But what linguistic information is the deep learning model using to make its excellent predictions?

To figure that out we have tried all the visualization and ‘ablation’ tricks (i.e., systematically damaging the trained network and seeing what happens) from the literature, but found that they are only of limited use. LSTMs and other deep learning models are (i) high dimensional, and (ii) highly nonlinear. This means that visualization is of little use, because the solutions the LSTM finds have information distributed over hundreds or thousands of dimensions, which our eyes cannot track all at once. Moreover, the solutions often involve interactions between parts, such that the function of a part is typically different for each configuration of the other parts; knocking out components one by one is therefore not likely to reveal what is really going on.

The solution we found (inspired by lots of earlier work from other groups, and in parallel to other groups) is to develop a series of meta-models to help figure out the inner workings of the target model. The function of the meta-models is in the first place to diagnose what is going on, and often these models are classifiers (although sometimes they are regressors or models producing complex, structured output); hence, we refer to them as diagnostic classifiers.
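To make the idea concrete, here is a minimal sketch of what such a meta-model could look like: a simple logistic-regression classifier that tries to read a binary property (say, singular vs. plural) off the hidden states of a target model. Everything in it is an assumption made for illustration. The hidden states are synthetic random vectors standing in for real LSTM activations, the function names are hypothetical, and a linear classifier is only one of many possible choices.

```python
import numpy as np

def train_diagnostic_classifier(hidden, labels, lr=0.5, epochs=500):
    """Fit a logistic-regression meta-model: hidden state -> binary label."""
    n, d = hidden.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        logits = hidden @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - labels              # gradient of the log loss w.r.t. the logits
        w -= lr * (hidden.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def accuracy(hidden, labels, w, b):
    """Fraction of hidden states for which the classifier recovers the label."""
    preds = (hidden @ w + b) > 0
    return float((preds == labels.astype(bool)).mean())

# Synthetic stand-in for hidden states recorded from a target model
# (assumption: in a real experiment these would be LSTM activations).
rng = np.random.default_rng(0)
n, d = 2000, 50
labels = rng.integers(0, 2, size=n).astype(float)   # e.g. 0 = singular, 1 = plural
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
# The label is encoded along one direction, hidden among d noisy dimensions.
hidden = rng.normal(size=(n, d)) + 2.0 * np.outer(2 * labels - 1, direction)

# Train on one part of the data, evaluate on held-out states.
train_h, test_h = hidden[:1500], hidden[1500:]
train_y, test_y = labels[:1500], labels[1500:]
w, b = train_diagnostic_classifier(train_h, train_y)
test_acc = accuracy(test_h, test_y, w, b)

# Baseline: always guess the most frequent label.
majority_acc = max(test_y.mean(), 1 - test_y.mean())
print(f"diagnostic accuracy: {test_acc:.3f}, majority baseline: {majority_acc:.3f}")
```

If the classifier clearly beats the baseline on held-out states, that suggests the information is recoverable from the representation. It does not, by itself, show that the target model actually uses it.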

We published the first paper on diagnostic classifiers in 2016 (introducing the term), where we focused on networks trained to perform simple arithmetic, with only addition, subtraction and brackets. In 2017 we published a paper that used diagnostic classifiers on the same task, but went on to use the insights gained to change the training regime. By adding closeness to the nearest symbolic solution to the loss function, we managed to ‘guide’ the target network to even better performance (“symbolic guidance”).
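As a rough illustration of what adding such a term to the loss could look like, the sketch below combines an ordinary task loss with an auxiliary penalty that measures how far a linear read-out of the hidden states is from the values a symbolic strategy would compute. This is not the formulation from the paper. The function name, the squared-error losses, the linear read-out and the `guidance_weight` parameter are all assumptions chosen to keep the example small.

```python
import numpy as np

def guided_loss(task_pred, task_target, hidden, probe_w, probe_b,
                symbolic_target, guidance_weight=0.1):
    """Task loss plus a weighted auxiliary 'guidance' term (hypothetical sketch)."""
    # Ordinary task loss: mean squared error on the model's output.
    task_loss = np.mean((task_pred - task_target) ** 2)
    # Linear read-out of the hidden states (the diagnostic model).
    probe_out = hidden @ probe_w + probe_b
    # Distance between the read-out and the symbolic strategy's values.
    guidance_loss = np.mean((probe_out - symbolic_target) ** 2)
    return task_loss + guidance_weight * guidance_loss

# Tiny worked example with made-up numbers.
loss = guided_loss(
    task_pred=np.array([1.0, 2.0]),
    task_target=np.array([1.0, 4.0]),
    hidden=np.array([[1.0, 0.0], [0.0, 1.0]]),
    probe_w=np.array([1.0, 1.0]),
    probe_b=0.0,
    symbolic_target=np.array([1.0, 3.0]),
    guidance_weight=0.5,
)
print(loss)  # task loss 2.0 + 0.5 * guidance loss 2.0 = 3.0
```

In an actual training setup the gradient of the second term would flow back into the target network, nudging its hidden states towards the symbolic solution.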

Our latest paper, to be presented at the upcoming BlackboxNLP workshop, applies the whole framework to language modelling. We build on the work of Linzen et al. (2016) and Gulordava et al. (2018), who studied the ability of LSTM-based language models to learn about number agreement and other syntactic dependencies between words in a sentence.



Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.
Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 1195–

Hupkes et al. 2016, 2017, 2018

Other resources

The Mexican restaurant example is mentioned here:

The Science paper on gender and other biases is discussed and linked to here:

And here are some general resources on ethics & AI (in particular week 6: Fairness) https://github.com/sblodgett/ai-ethics

Outnumbered (https://www.amazon.com/Outnumbered-Exploring-Algorithms-Control-Lives/dp/147294741X)