DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU

About

Hybrid attention architectures are becoming an increasingly important paradigm for improving LLM inference efficiency while preserving model quality, making hybrid architecture design a central problem. Existing designs often rely on manual empirical rules or proxy-based selector signals for layer-wise operator allocation. Recent NAS-style systems such as Jet-Nemotron demonstrate the promise of automated hybrid architecture search. However, Jet-Nemotron's PostNAS search stages alone use 200B tokens, making such search pipelines difficult to use as routine methods for hybrid architecture design. We introduce DASH, a fast differentiable search framework for hybrid attention architecture design, which relaxes discrete layer-wise attention operator placement into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and performs architecture-only search with model and operator weights frozen to significantly enhance search efficiency. On Qwen2.5-3B-Instruct, DASH consistently outperforms a comprehensive suite of existing selector-style hybrid attention design baselines, showing that direct differentiable search can discover stronger hybrid architectures. Moreover, DASH achieves stronger RULER performance than released Jet-Nemotron models while remaining competitive on overlapping short-context and general benchmarks. Notably, each DASH search run uses only 12.3M tokens and takes about 20 minutes on a single RTX Pro 6000 GPU, corresponding to merely 0.006% of the PostNAS search tokens reported by Jet-Nemotron. These results suggest that high-quality hybrid attention architectures can be obtained through minutes-level differentiable search, providing a promising direction for hybrid architecture design.
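The core idea in the abstract (relaxing each layer's discrete operator choice into continuous architecture logits and optimizing only those logits while all other weights stay frozen) can be illustrated with a toy sketch. This is not the DASH implementation. The per-layer error values, the relative costs, the trade-off weight `lam`, and the decoupled per-layer objective are all hypothetical, chosen only to make the mechanism visible in a few lines of plain Python.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a small list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical frozen quantities (toy numbers, not taken from the paper):
# quality loss incurred if layer l uses the linear candidate instead of full attention.
LINEAR_ERROR = [0.9, 0.1, 0.7, 0.05, 0.3, 0.02]
FULL_COST = 1.0     # assumed relative inference cost of a full-attention layer
LINEAR_COST = 0.2   # assumed relative inference cost of a linear-attention layer

def search(linear_error, steps=200, lr=0.5, lam=0.5):
    """Gradient descent on architecture logits only; nothing else is trained."""
    # logits[l] = [logit for full attention, logit for linear attention]
    logits = [[0.0, 0.0] for _ in linear_error]
    for _ in range(steps):
        for l, err in enumerate(linear_error):
            probs = softmax(logits[l])
            # Toy objective value of each choice: quality loss + lam * cost.
            values = [lam * FULL_COST, err + lam * LINEAR_COST]
            expected = sum(p * v for p, v in zip(probs, values))
            # d(expected)/d(logit_i) = p_i * (v_i - expected) for a softmax mixture.
            grads = [p * (v - expected) for p, v in zip(probs, values)]
            logits[l] = [a - lr * g for a, g in zip(logits[l], grads)]
    return logits

def discretize(logits):
    """Pick the operator with the larger logit in each layer."""
    return ["full" if full >= lin else "linear" for full, lin in logits]

if __name__ == "__main__":
    print(discretize(search(LINEAR_ERROR)))
```

In this toy, layers where the linear candidate is cheap to substitute drift toward "linear" and the rest keep full attention. A real search would presumably replace the hand-written per-layer values with a loss computed by running the frozen model on data.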

Weizhe Chen, Miao Zhang, Junpeng Jiang, Yaping Li, Weili Guan, Liqiang Nie • 2026

Related benchmarks

Task	Dataset	Result
Commonsense Reasoning	WinoGrande	--	1581
Physical Commonsense Reasoning	PIQA	Accuracy 78.07	724
Science Question Answering	ARC-E	Accuracy 76.98	240
Multiple-choice Question Answering	MMLU	Accuracy 63.95	222
Long-context language modeling	RULER	RULER Score 0.9142	204
Science Question Answering	ARC-C	ARC-C Score 51.88	43
Long-context Understanding	RULER	RULER Score 91.42	5
