
RWKV: Reinventing RNNs for the Transformer Era

About

Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the performance of Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, which parallelizes computation during training and maintains constant computational and memory complexity during inference. We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find that RWKV performs on par with similarly sized Transformers, suggesting that future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling the trade-offs between computational efficiency and model performance in sequence processing tasks.

Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Jiaju Lin, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guangyu Song, Xiangru Tang, Bolun Wang, Johan S. Wind, Stanislaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Qinghua Zhou, Jian Zhu, Rui-Jie Zhu • 2023
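
The abstract's claim that one model can be written either as a Transformer or as an RNN can be illustrated with a deliberately simplified sketch. The code below is not the RWKV formulation from the paper: it is a generic single-channel linear attention with an exponential time decay, and the function names, the `decay` parameter, and the sample inputs are all assumptions made for illustration. It only shows the general idea that the same outputs can be computed from the full history at once (convenient for training) or from a fixed-size running state (convenient for inference).

```python
import math


def decayed_attention_parallel(keys, values, decay):
    """Attention-style form: each position looks at its whole history.

    Every position t is computed independently of the others, so the
    outer loop could run in parallel. Total work grows as O(T^2).
    """
    outputs = []
    for t in range(len(keys)):
        # Older positions are down-weighted by exp(-(t - i) * decay).
        weights = [math.exp(-(t - i) * decay + keys[i]) for i in range(t + 1)]
        numerator = sum(w * values[i] for i, w in enumerate(weights))
        outputs.append(numerator / sum(weights))
    return outputs


def decayed_attention_recurrent(keys, values, decay):
    """RNN-style form: same outputs from a two-number running state.

    Memory and per-step work stay constant regardless of sequence length.
    """
    numerator, denominator = 0.0, 0.0
    outputs = []
    for k, v in zip(keys, values):
        # Decay the old state once per step, then add the new token.
        numerator = math.exp(-decay) * numerator + math.exp(k) * v
        denominator = math.exp(-decay) * denominator + math.exp(k)
        outputs.append(numerator / denominator)
    return outputs


# Hypothetical inputs, chosen only to exercise both forms.
keys = [0.1, -0.3, 0.7, 0.2]
values = [1.0, 2.0, -1.0, 0.5]
parallel = decayed_attention_parallel(keys, values, decay=0.5)
recurrent = decayed_attention_recurrent(keys, values, decay=0.5)
assert all(
    math.isclose(p, r, rel_tol=1e-9, abs_tol=1e-12)
    for p, r in zip(parallel, recurrent)
)
```

The two functions agree because multiplying the running sums by `exp(-decay)` at each step reproduces the `exp(-(t - i) * decay)` factor of the history-wide form. The sketch leaves out everything else an actual architecture would need, such as gating, multiple channels, and numerical stabilization of the exponentials.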

Related benchmarks

Task                               Dataset                       Result                Rank
Commonsense Reasoning              HellaSwag                     Accuracy 70.8         1460
Multi-task Language Understanding  MMLU                          --                    842
Commonsense Reasoning              WinoGrande                    Accuracy 68.4         776
Question Answering                 ARC Challenge                 Accuracy 46.1         749
Commonsense Reasoning              PIQA                          Accuracy 77.3         647
Language Modeling                  WikiText-103 (test)           Perplexity 25.07      524
Question Answering                 ARC Easy                      Normalized Acc 74.9   385
Physical Commonsense Reasoning     PIQA                          Accuracy 77.09        329
Boolean Question Answering         BoolQ                         Accuracy 62.72        307
Commonsense Reasoning              Common Sense Reasoning Tasks  Avg Score 50.56       241
Showing 10 of 37 rows
