DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

About

Modern RL post-training methods such as GRPO and DAPO train on N response sequences of R tokens sampled from a shared prompt of P tokens, but standard FlashAttention replicates all P prompt tokens N times across both forward and backward passes -- duplicating compute and memory on identical hidden states. In large-rollout, long-context RL training (N>=16, P>=8K), this redundancy dominates the policy update cost. We observe that in decoder-only models, causal masking makes prompt representations invariant across sequences at every layer, so all per-token operations (norms, projections, MLP) and attention can process the prompt once -- a property not yet exploited at the kernel level for training. We propose DualKV, the first FlashAttention kernel variant that eliminates shared-prompt replication during RL training, via (1) fused CUDA forward and backward kernels that iterate over two disjoint KV regions -- shared context and per-sequence response -- in a single kernel launch, and (2) a data-pipeline redesign in veRL that repacks N(P+R) tokens into P+NR tokens per micro-batch, extending the token reduction from attention to the entire model by a factor rho = N(P+R)/(P+NR). DualKV is mathematically equivalent to standard attention and introduces no approximation. On Qwen3-8B GRPO training with 8xH100 GPUs (N=32, 8K-context), DualKV achieves 1.63--2.09x policy-update speedup, enables 2x larger micro-batches, and raises MFU from 36% to 76%. Similar gains hold for DAPO (2.47x speedup, 77% MFU). At 30B MoE scale on 16xH100, DualKV achieves 3.82x policy-update and 3.38x end-to-end step speedup over FlashAttention (which requires 4-way Ulysses sequence parallelism to avoid OOM). DualKV also extends to hybrid sliding/global attention with head dimension 512 (which FA2 does not support) and integrates with Ulysses sequence parallelism, demonstrated on Gemma-4-31B GRPO at 64K context.
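The two quantitative ideas in the abstract, the token-reduction factor rho = N(P+R)/(P+NR) and the claim that attending over two disjoint KV regions is exactly equivalent to standard attention, can be illustrated with a small pure-Python sketch. This is not the authors' CUDA kernel. It assumes the two regions are combined with the usual log-sum-exp (online softmax) rescaling used by FlashAttention-style kernels, and every function name below (`token_reduction`, `attend_full`, `attend_dual`, `max_abs_error`) is hypothetical and chosen only for illustration.

```python
import math
import random


def token_reduction(n, p, r):
    """rho = N(P+R) / (P+NR): tokens before repacking over tokens after."""
    return n * (p + r) / (p + n * r)


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _partial(q, keys, values):
    """Unnormalised attention over one KV region.

    Returns (row max, sum of exp weights, weighted sum of values).
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [_dot(q, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    acc = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return m, sum(weights), acc


def attend_full(q, keys, values):
    """Reference: softmax attention over a single concatenated KV list."""
    _, total, acc = _partial(q, keys, values)
    return [a / total for a in acc]


def attend_dual(q, shared_kv, response_kv):
    """Attention over two disjoint KV regions, merged by log-sum-exp rescaling.

    Assumption: this merge rule is the standard online-softmax one; the
    source does not spell out how the fused kernel combines the regions.
    """
    m1, s1, a1 = _partial(q, *shared_kv)
    m2, s2, a2 = _partial(q, *response_kv)
    m = max(m1, m2)
    c1, c2 = math.exp(m1 - m), math.exp(m2 - m)
    denom = c1 * s1 + c2 * s2
    return [(c1 * x + c2 * y) / denom for x, y in zip(a1, a2)]


def max_abs_error(n=3, p=5, r=4, dim=8, seed=0):
    """Compare both paths for every response token of N sequences sharing one prompt."""
    rng = random.Random(seed)

    def vec():
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    # The prompt K/V are stored once and shared by all N sequences.
    prompt_k = [vec() for _ in range(p)]
    prompt_v = [vec() for _ in range(p)]
    worst = 0.0
    for _ in range(n):
        resp_q = [vec() for _ in range(r)]
        resp_k = [vec() for _ in range(r)]
        resp_v = [vec() for _ in range(r)]
        for i in range(r):
            # Causal mask: response token i sees the whole prompt plus
            # its own sequence's response tokens 0..i.
            own_k, own_v = resp_k[: i + 1], resp_v[: i + 1]
            ref = attend_full(resp_q[i], prompt_k + own_k, prompt_v + own_v)
            out = attend_dual(resp_q[i], (prompt_k, prompt_v), (own_k, own_v))
            worst = max(worst, max(abs(a - b) for a, b in zip(ref, out)))
    return worst
```

Calling `max_abs_error()` should return a value at floating-point rounding level, which is what "mathematically equivalent" means in practice. As an illustration of the reduction factor, `token_reduction(32, 8192, 1024)` evaluates to 7.2; the response length R = 1024 here is an assumed value, since the abstract does not state R for its experiments.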

Jiading Gai, Shuai Zhang, Xiang Song, Bernie Wang, George Karypis • 2026

Related benchmarks

Task	Dataset	Result
Efficiency Benchmarking	Qwen3-8B Single-layer forward+backward setup	Time (ms): 36.6	57
GRPO Training	GSM8K synthetic variants short-prompt (val)	Peak Memory (GB): 26.5	8
Kernel-level Attention Speed and Memory Analysis	Qwen3-8B model dimensions (H=32, Hk=8, d=128, GQA 4:1) on A100 GPU (test)	Forward Pass Time (ms): 27.1	7
RL Training	LongReason	Peak Memory (GB): 80	6
Reasoning	LongReason (val)	Accuracy (val): 78	4
