04/07/2020

Do Transformers Need Deep Long-Range Memory?

Jack Rae, Ali Razavi

Keywords: data modelling, language modelling, Transformers, deep models

Abstract: Deep attention models have advanced the modelling of sequential data across many domains. For language modelling in particular, the Transformer-XL --- a Transformer augmented with a long-range memory of past activations --- has been shown to be state-of-the-art across a variety of well-studied benchmarks. The Transformer-XL incorporates a long-range memory at every layer of the network, which makes its state thousands of times larger than that of RNN predecessors. However, it is unclear whether this is necessary. We perform a set of interventions to show that comparable performance can be obtained with 6x fewer long-range memories, and that better performance can be obtained by limiting the range of attention in the lower layers of the network.
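The intervention described in the abstract amounts to a per-layer choice of attention memory length: only a few layers keep a long-range memory, while the rest attend over a short window. The sketch below illustrates one way such a configuration could be expressed; the split point (long-range memory in the top layers only) and the 512/8192 lengths are illustrative assumptions, not the exact values used in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryConfig:
    """Per-layer attention memory lengths for a Transformer-XL-style model."""
    mem_lens: List[int]


def build_memory_config(num_layers: int,
                        num_long_range_layers: int,
                        short_len: int = 512,
                        long_len: int = 8192) -> MemoryConfig:
    """Give only the top `num_long_range_layers` layers a long-range memory;
    the remaining (lower) layers attend over a short window.

    Note: the placement of long-range memories in the upper layers and the
    512/8192 lengths are assumptions for illustration, not values taken
    from the paper.
    """
    mem_lens = []
    for layer in range(num_layers):
        is_upper = layer >= num_layers - num_long_range_layers
        mem_lens.append(long_len if is_upper else short_len)
    return MemoryConfig(mem_lens)


# Example: a 24-layer model where only 4 layers keep a long-range memory,
# roughly a 6x reduction in the number of long-range memories.
config = build_memory_config(num_layers=24, num_long_range_layers=4)
print(config.mem_lens)
```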

The talk and the paper are published at the ACL 2020 virtual conference.

