On the Nature of Attention Sink that Shapes Decoding Strategy in Omni-LLMs

About

The goal of this paper is to strengthen the reasoning of Omnimodal Large Language Models (Omni-LLMs) at inference time, without additional training. These models jointly process video, audio, and text, and given the large number of tokens they consume, how attention is routed across them is central to their behaviour. We focus specifically on attention sinks, tokens that absorb a disproportionate share of attention mass regardless of their semantic content, to understand how this routing unfolds. To this end, we conduct a systematic analysis of sink behaviour in Omni-LLMs. Our analysis yields two key findings: (i) high sink attention does not solely indicate head redundancy, suggesting that sink value representations play additional functional roles; (ii) the sink value vector acts as a shared bias added to every token's output, serving as a global signal that organises the representation as a whole. Building on this, we propose OutRo, which correspondingly aligns non-sink token representations with the sink in feature space, and relaxes the causal mask for sink tokens at an early layer to sharpen this bias before the rest of decoding proceeds. This design enhances the reasoning process without requiring additional forward passes or access to attention maps. Based on extensive experiments, OutRo consistently improves performance on seven video QA benchmarks and demonstrates strong generalisation, while incurring only a 1.1x decoding overhead.
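Finding (ii) can be illustrated with a small numerical sketch. The code below is not OutRo and is not taken from the paper. It is a toy single-head causal attention in NumPy, under several assumptions made only for illustration: the sink is the first token, the sink is emulated by adding a constant `sink_logit_boost` to its attention logits, and all queries, keys, and values are random. It shows that each token's attention output splits exactly into a sink term (the sink's value vector scaled by that token's sink attention weight) plus the contribution of the remaining tokens, so a heavily attended sink behaves like a bias shared by all tokens.

```python
import numpy as np

def causal_attention(q, k, v, sink_logit_boost=0.0):
    """Toy single-head causal softmax attention.

    `sink_logit_boost` is a hypothetical knob (not from the paper) that
    raises every query's logit for token 0 to emulate an attention sink.
    Returns (weights, output).
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores[:, 0] += sink_logit_boost                    # emulate the sink at position 0
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)          # causal mask
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights, weights @ v

rng = np.random.default_rng(0)
n_tokens, d_head = 6, 8
q = rng.normal(size=(n_tokens, d_head))
k = rng.normal(size=(n_tokens, d_head))
v = rng.normal(size=(n_tokens, d_head))

weights, output = causal_attention(q, k, v, sink_logit_boost=8.0)

# Split each token's output into the sink contribution and everything else.
sink_term = weights[:, :1] * v[0]      # (n, 1) * (d,) -> (n, d): scaled copies of v_sink
rest_term = weights[:, 1:] @ v[1:]     # contribution of the non-sink tokens

print("mean attention on sink:", weights[:, 0].mean())
print("decomposition holds:", np.allclose(output, sink_term + rest_term))
```

Because every row of `sink_term` is the same vector `v[0]` up to a scalar weight, the sink contributes a common direction to all token outputs in this toy setting. How the paper's method aligns non-sink representations with that direction, and how it relaxes the causal mask, is not reproduced here.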

Suho Yoo, Youngjoon Jang, Joon Son Chung • 2026

Related benchmarks

Task	Dataset	Result
Video Question Answering	ActivityNet-QA	Accuracy 44.41	438
Video Question Answering	VideoMME	Accuracy 68.33	254
Video Question Answering	ActivityNet	Accuracy 49.92	32
Video Question Answering	Video-Holmes	Average Score 46.76	12
Audio-Visual QA	OmniBench	Accuracy 48.25	6
Audio-Visual QA	AVUT	Accuracy 66.57	6
Audio-Visual QA	AVHBench	Accuracy 73.78	6
Audio-Visual QA	DailyOmni	Accuracy 55.56	6
Visual-Only QA	VideoHolmes	Accuracy 47.63	6
Visual-Only QA	VideoMME Medium	Accuracy 73.89	6
