Explain My Surprise: Learning Efficient Long-Term Memory by Predicting Uncertain Outcomes
About
In many sequential tasks, a model needs to remember relevant events from the distant past to make correct predictions. Unfortunately, a straightforward application of gradient-based training requires intermediate computations to be stored for every element of a sequence. If a sequence consists of thousands or even millions of elements, the intermediate data becomes prohibitively large to store, which makes learning very long-term dependencies infeasible. However, the majority of sequence elements can usually be predicted from temporally local information alone. Predictions affected by long-term dependencies, on the other hand, are sparse and characterized by high uncertainty when only local information is available. We propose MemUP, a new training method that learns long-term dependencies without backpropagating gradients through the whole sequence at once. The method can potentially be applied to any recurrent architecture. An LSTM network trained with MemUP performs better than or comparably to baselines while storing less intermediate data.
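The selection step described above — training memory only on the sequence elements that are hardest to predict from local context — can be sketched as follows. This is an illustrative example, not the paper's implementation: the entropy-based uncertainty score, the `select_uncertain_steps` helper, and the top-k selection are assumptions made for the sketch.

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_uncertain_steps(local_predictions, k):
    """Pick the k timesteps whose local predictions are most uncertain.

    local_predictions: one probability distribution per timestep, produced
    by a predictor that sees only temporally local information.
    Returns the indices of the k highest-entropy steps -- the sparse set of
    predictions that likely depend on long-term memory and therefore serve
    as training targets for the memory module.
    """
    scores = [(entropy(p), t) for t, p in enumerate(local_predictions)]
    scores.sort(reverse=True)
    return sorted(t for _, t in scores[:k])

# Steps 0 and 2 are confidently predicted from local context alone;
# step 1 is near-uniform, so it becomes a memory training target.
preds = [[0.9, 0.1], [0.5, 0.5], [0.95, 0.05]]
targets = select_uncertain_steps(preds, k=1)  # -> [1]
```

Because only these few selected targets receive memory-related gradients, the rest of the sequence does not need its intermediate computations retained, which is what lets the method avoid backpropagation through the entire sequence.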
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Permuted Image Classification | pMNIST 3136 | Accuracy | 94.3 | 7 |
| Sequence Copying | Copy Length 120 | Accuracy | 100 | 7 |
| Permuted Image Classification | pMNIST 784 | Accuracy | 95.4 | 7 |
| Scattered Sequence Copying | Scattered Copy Length 5020 | Accuracy | 0.999 | 6 |
| Sequence Copying | Copy Length 5020 | Accuracy | 99.3 | 6 |
| Sequence Addition | Add Length 5000 | MSE | 0.0053 | 6 |