
Value Residual Learning

About

While Transformer models have achieved remarkable success in various domains, the effectiveness of information propagation through deep networks remains a critical challenge. Standard hidden state residuals often fail to adequately preserve initial token-level information in deeper layers. This paper introduces ResFormer, a novel architecture that enhances information flow by incorporating value residual connections in addition to hidden state residuals. It also presents SVFormer, a variant in which all layers share the first layer's value embedding. Comprehensive empirical evidence demonstrates that ResFormer achieves equivalent validation loss with 16.11% fewer model parameters and 20.3% less training data than a standard Transformer, while maintaining similar memory usage and computational cost. In addition, SVFormer reduces KV cache size by nearly half with only a small performance penalty and can be combined with other KV-efficient methods for further reductions in KV cache, with performance influenced by sequence length and cumulative learning rate.
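The core idea of a value residual connection can be sketched in a few lines. The following is a minimal, illustrative NumPy sketch, not the paper's implementation: each layer mixes its own value projection with the first layer's value (here via a hypothetical mixing weight `lam`; the paper's exact mixing rule may differ), while the usual hidden-state residual is kept. SVFormer corresponds to the extreme case of reusing the first layer's value directly in every later layer.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_value_residual(x, Wq, Wk, Wv, v_first=None, lam=0.5):
    """Single-head attention with a value residual connection (sketch).

    Besides the standard hidden-state residual, each layer after the first
    mixes its own value projection with the first layer's value. `lam` is a
    hypothetical mixing weight chosen for illustration.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    if v_first is None:
        v_first = v  # first layer: its value is the residual source for later layers
    else:
        v = lam * v + (1 - lam) * v_first  # value residual connection
    scores = softmax(q @ k.T / np.sqrt(x.shape[-1]))
    out = scores @ v
    return x + out, v_first  # hidden-state residual kept as usual
```

Stacking layers then amounts to threading `v_first` from the first call into every subsequent one; setting `lam = 0` in later layers reproduces the SVFormer-style sharing of the first layer's value.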

Zhanchao Zhou, Tianyi Wu, Zhiyun Jiang, Fares Obeid, Zhenzhong Lan · 2024

Related benchmarks

Task                                      Dataset          Metric     Result   Rank
Physical Interaction Question Answering   PIQA             Accuracy   68.88    323
Language Modeling                         LAMBADA          Accuracy   53.5     183
Reasoning                                 ARC Easy         Accuracy   60.9     183
Common Sense Reasoning                    HellaSwag        Accuracy   36.3     164
Reasoning                                 PIQA             Accuracy   67.5     133
Common Sense Reasoning                    BoolQ            Accuracy   32.4     131
Multiple-choice Question Answering        ARC Easy         Accuracy   64.86    122
Multiple-choice Question Answering        ARC Challenge    Accuracy   33.62    106
Question Answering                        WinoGrande (WG)  Accuracy   52.64    98
Reasoning                                 WinoGrande (WG)  Accuracy   51.2     87

Showing 10 of 17 rows
