Differential Transformer

About

The Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which has been considered a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture to advance large language models.

Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei • 2024
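
The differential attention described in the abstract (attention scores computed as the difference between two separate softmax attention maps) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it is single-head, has no causal mask, and uses random weights. The function names, the shapes, and the fixed scalar `lam` that weights the second map are all assumptions made for illustration; the abstract itself does not state how the second map is scaled.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def diff_attention(x, w_q, w_k, w_v, lam=0.5):
    """Hypothetical single-head differential attention.

    x:        (n, d_model) input sequence
    w_q, w_k: (d_model, 2 * d_head) projections, split into two halves
    w_v:      (d_model, d_v) value projection
    lam:      assumed fixed weight on the second attention map
    """
    # Project once, then split queries and keys into two groups,
    # one group per softmax attention map.
    q1, q2 = np.split(x @ w_q, 2, axis=-1)
    k1, k2 = np.split(x @ w_k, 2, axis=-1)
    v = x @ w_v
    d = q1.shape[-1]

    map1 = softmax(q1 @ k1.T / np.sqrt(d))
    map2 = softmax(q2 @ k2.T / np.sqrt(d))

    # Attention scores are the difference between the two maps,
    # so weight that both maps assign to the same position cancels.
    scores = map1 - lam * map2
    return scores @ v, scores


rng = np.random.default_rng(0)
n, d_model, d_head = 6, 16, 4
x = rng.normal(size=(n, d_model))
w_q = rng.normal(size=(d_model, 2 * d_head))
w_k = rng.normal(size=(d_model, 2 * d_head))
w_v = rng.normal(size=(d_model, 2 * d_head))

out, scores = diff_attention(x, w_q, w_k, w_v, lam=0.5)
print(out.shape)
```

Because each softmax map sums to 1 along a row, every row of `scores` sums to `1 - lam`, and individual entries can be negative. Setting `lam=0` reduces the sketch to ordinary softmax attention over the first query/key group.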

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Commonsense Reasoning	HellaSwag	Accuracy	41.29	1896
Commonsense Reasoning	WinoGrande	Accuracy	56.12	1442
Question Answering	ARC Challenge	Accuracy	27.82	906
Multivariate Forecasting	ETTh1	MSE	0.446	830
Physical Commonsense Reasoning	PIQA	Accuracy	71.76	696
Question Answering	ARC Easy	Accuracy	60.69	597
Multivariate Time-series Forecasting	Weather	MSE	0.259	409
Question Answering	OBQA	Accuracy	22.2	347
Multivariate Time-series Forecasting	Traffic	MSE	0.429	310
Image Classification	CIFAR-100 (test)	Accuracy	60.03	295

Showing 10 of 42 rows
