Value Residual Learning
About
While Transformer models have achieved remarkable success across various domains, effective information propagation through deep networks remains a critical challenge: standard hidden-state residuals often fail to preserve initial token-level information in deeper layers. This paper introduces ResFormer, a novel architecture that enhances information flow by adding value residual connections alongside the usual hidden-state residuals, and a variant, SVFormer, in which all layers share the first layer's value embedding. Comprehensive empirical evidence demonstrates that ResFormer reaches the same validation loss as the Transformer with 16.11% fewer model parameters and 20.3% less training data, while maintaining similar memory usage and computational cost. SVFormer, in turn, reduces the KV cache size by nearly half at only a small performance penalty, and it can be integrated with other KV-efficient methods for further cache reductions, with performance influenced by sequence length and cumulative learning rate.
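The core mechanism can be illustrated with a minimal single-head sketch. Here the value residual is modeled as the simplest additive form, adding the first layer's value embedding to each later layer's values; the paper may use a learned or weighted combination instead, and all function and variable names below (`resformer_layer`, `v1`, the weight matrices) are illustrative, not from the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention for a single head.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def resformer_layer(x, wq, wk, wv, v1=None):
    """One attention layer with an (assumed additive) value residual.

    v1 is the value embedding from the first layer; passing it in adds
    a value residual connection on top of the hidden-state residual.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    if v1 is not None:
        v = v + v1  # value residual: mix in the first layer's values
    return x + attention(q, k, v), v  # hidden-state residual + this layer's v

# Tiny demo: layer 1 produces v1, deeper layers reuse it.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))                      # 4 tokens, dim 8
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))

out1, v1 = resformer_layer(x, wq, wk, wv)            # first layer, no residual
out2, _ = resformer_layer(out1, wq, wk, wv, v1=v1)   # later layer adds v1
```

Under the same assumptions, SVFormer corresponds to skipping the per-layer value projection entirely in later layers and using `v1` directly as `v`, which is why only the first layer's values need to be cached.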
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Physical Interaction Question Answering | PIQA | Accuracy: 68.88 | 323 |
| Language Modeling | LAMBADA | Accuracy: 53.5 | 183 |
| Reasoning | ARC Easy | Accuracy: 60.9 | 183 |
| Common Sense Reasoning | HellaSwag | Accuracy: 36.3 | 164 |
| Reasoning | PIQA | Accuracy: 67.5 | 133 |
| Common Sense Reasoning | BoolQ | Accuracy: 32.4 | 131 |
| Multiple-choice Question Answering | ARC Easy | Accuracy: 64.86 | 122 |
| Multiple-choice Question Answering | ARC Challenge | Accuracy: 33.62 | 106 |
| Question Answering | WinoGrande (WG) | Accuracy: 52.64 | 98 |
| Reasoning | WinoGrande (WG) | Accuracy: 51.2 | 87 |