
Round and Round We Go! What makes Rotary Positional Encodings useful?

About

Positional Encodings (PEs) are a critical component of Transformer-based Large Language Models (LLMs), providing the attention mechanism with important sequence-position information. One of the most popular encodings used in LLMs today is Rotary Positional Encodings (RoPE), which rotates the queries and keys based on their relative distance. A common belief is that RoPE is useful because it helps to decay token dependency as relative distance increases. In this work, we argue that this is unlikely to be the core reason. We study the internals of a trained Gemma 7B model to understand how RoPE is being used at a mechanical level. We find that Gemma learns to use RoPE to construct robust "positional" attention patterns by exploiting the highest frequencies. We also find that, in general, Gemma strongly prefers to use the lowest frequencies of RoPE, which we suspect are used to carry semantic information. We mathematically prove interesting behaviours of RoPE and conduct experiments to verify our findings, proposing a modification of RoPE that fixes some highlighted issues and improves performance. We believe that this work represents an interesting step towards better understanding PEs in LLMs, which holds crucial value for scaling LLMs to large sizes and context lengths.
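To make the mechanism in the abstract concrete, here is a minimal NumPy sketch (not the paper's code; `rope_rotate` and its parameters are our own illustrative names) of how RoPE rotates pairs of query/key dimensions at position-dependent frequencies, so that the dot product depends only on relative distance:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply a RoPE-style rotation to vector x at sequence position pos.

    Dimension pairs (2i, 2i+1) are rotated by angle pos * theta_i with
    theta_i = base ** (-2i / d): small i gives the highest frequencies,
    large i the lowest (the ones the paper suggests carry semantics).
    """
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE rotates dimensions in pairs"
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)        # per-pair rotation frequency
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]             # even/odd members of each pair
    out = np.empty_like(x, dtype=float)
    out[0::2] = x1 * cos - x2 * sin       # 2x2 rotation applied pairwise
    out[1::2] = x1 * sin + x2 * cos
    return out

# Key property: the rotated dot product depends only on relative distance.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
s1 = rope_rotate(q, 5) @ rope_rotate(k, 3)    # positions 5 and 3 (gap 2)
s2 = rope_rotate(q, 12) @ rope_rotate(k, 10)  # positions 12 and 10 (gap 2)
print(np.isclose(s1, s2))  # prints True: same gap, same attention score
```

The pairwise 2x2 rotations mean each frequency channel is independent, which is what lets a model exploit the highest channels for positional patterns and the lowest for content.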

Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos, Razvan Pascanu, Petar Veličković • 2024

Related benchmarks

Task                           | Dataset           | Metric              | Result | Rank
Commonsense Reasoning          | HellaSwag         | Accuracy            | 53.87  | 1891
Language Understanding         | MMLU              | Accuracy            | 62.35  |  825
Reasoning                      | BBH               | Accuracy            | 64.48  |  672
Question Answering             | GPQA              | Accuracy            | 29.67  |  258
Commonsense Reasoning          | ARC-C             | Accuracy            | 32.17  |  172
Language Understanding         | MMLU-Pro          | Accuracy            | 33.95  |   87
Commonsense Reasoning          | PIQA              | Accuracy            | 72.03  |   71
Long-context Retrieval         | RULER             | Accuracy            | 51.28  |   47
Long-context Understanding     | LongBench English | Accuracy            | 19.42  |   30
Long-context Language Modeling | HELMET            | Summarization Score | 25.7   |   27

(Showing 10 of 14 rows)
