I have added a recurrent attention layer on top of the LSTM network. It makes the training a little faster and more stable, and it also increases the performance of a simple LSTM network by about 1 percent. I am trying to use this model to analyse how the LSTM network handles the dependency relationships in a sentence.
Besides the fact that recurrent LSTM models know nothing about sentence structure, in contrast to recursive LSTM models that take the structure of the sentence as input, there is another big difference between the two models: the error signals. In the recursive LSTM model there is a sentiment label, and therefore an error signal, for each sub-tree in the network, whereas in the recurrent model there is only one valid sentiment label, at the end of the sentence. This makes training the recurrent LSTM model much harder.
There are three possible approaches to deal with this problem:
- consider the sentiment label at the end of the sequence valid for all steps.
- consider the sentiment label at the end of the sequence valid for all steps, but when calculating the error signal give higher weight to the error at steps closer to the end of the sentence.
- add an attention layer to the network so that the network learns the contribution of each step toward generating the final sentiment of the sentence.
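The first two approaches amount to choosing fixed per-step weights for the loss. Here is a minimal Python sketch of that idea, assuming a per-step cross-entropy loss against the sentence label. The weighting schedule for the second approach is not specified above, so the linear ramp, the function names, and the toy probabilities are all illustrative assumptions:

```python
import math

def uniform_weights(n_steps):
    # Approach 1: the final label counts equally at every step.
    return [1.0 / n_steps] * n_steps

def ramp_weights(n_steps):
    # Approach 2 (assumed schedule): weight grows linearly toward the last step.
    total = n_steps * (n_steps + 1) / 2
    return [(t + 1) / total for t in range(n_steps)]

def weighted_loss(step_probs, label, weights):
    # Weighted sum of per-step cross-entropy against the sentence-level label.
    return sum(-w * math.log(probs[label]) for w, probs in zip(weights, step_probs))

# Toy per-step class probabilities for a 4-step sentence with true label 1.
step_probs = [[0.6, 0.4], [0.5, 0.5], [0.3, 0.7], [0.1, 0.9]]
loss_uniform = weighted_loss(step_probs, 1, uniform_weights(4))
loss_ramp = weighted_loss(step_probs, 1, ramp_weights(4))
```

In this toy case the early steps predict poorly, so the ramp gives a lower loss than uniform weighting because it penalises those early steps less.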
We have tested all these solutions. The first one, which uses the final sentiment label for all steps, does not work well. In fact, it leads to a worse result.
The second approach, which can be seen as a special case of the third one, can actually stabilise the training and speed it up. Adding the attention layer, which assigns the weights dynamically, does a little better than using fixed weights.
What is interesting when we employ an attention model is that although it speeds up the training, it does not improve the final results significantly. More interestingly, once the training is done, monitoring the attention values for different steps indicates that in the end all the attention is given to the last step. This means the recurrent LSTM cell carries all the information it needs through to the last step.
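One simple way to quantify this observation is to measure how much of the attention mass falls on the final step, averaged over sentences. The sketch below is only an illustration: the helper name and the attention values are hypothetical, not taken from our experiments.

```python
def last_step_mass(attention_per_sentence):
    # Average fraction of attention assigned to the final step of each sentence.
    masses = [weights[-1] / sum(weights) for weights in attention_per_sentence]
    return sum(masses) / len(masses)

# Hypothetical attention weights for two sentences of different lengths.
attention = [
    [0.01, 0.02, 0.97],
    [0.00, 0.01, 0.01, 0.98],
]
mass = last_step_mass(attention)
```

A value close to 1 would indicate that the attention has collapsed onto the last step.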
The attention layer we have employed is actually a parallel recurrent LSTM network that takes the sentence as input and outputs the attention value to be assigned to each step. In this model, the output is calculated based on the outputs from the two recurrent LSTM networks, and these networks are trained jointly. Figure ?? is a schematic of our attended LSTM model. The difference between our attention model and the one proposed in is that they use a feed-forward network to calculate the attention and they feed it the hidden states of the main network. Note that we have applied the attention model to a unidirectional LSTM, which may be why all the attention is given to the last step: it is the only step that contains information about the whole sentence.
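The combination step can be sketched as follows. The text above does not give the exact formula, so this sketch assumes the attention network emits one scalar score per step, that the scores are normalised with a softmax, and that the sentence representation is the attention-weighted sum of the main network's per-step outputs. The lists below are stand-ins for the outputs of the two LSTMs, and all names are illustrative.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scalar scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(main_outputs, attention_scores):
    # Weight each step's output vector by its attention and sum over steps.
    weights = softmax(attention_scores)
    dim = len(main_outputs[0])
    combined = [
        sum(w * out[d] for w, out in zip(weights, main_outputs))
        for d in range(dim)
    ]
    return combined, weights

# Stand-ins for the two networks' per-step outputs on a 3-step sentence.
main_outputs = [[0.2, -0.1], [0.4, 0.3], [0.9, 0.8]]  # main LSTM (assumed values)
attention_scores = [-2.0, -1.0, 4.0]                  # parallel attention LSTM (assumed values)
sentence_vector, weights = attend(main_outputs, attention_scores)
```

When one score dominates, as here, the weighted sum is almost identical to the last step's output. That matches the behaviour described above, where the attention collapses onto the final step.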