DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts
About
Modern RL post-training methods such as GRPO and DAPO train on $N$ response sequences of $R$ tokens sampled from a shared prompt of $P$ tokens, but standard FlashAttention replicates all $P$ prompt tokens $N$ times across both forward and backward passes -- duplicating compute and memory on identical hidden states. In large-rollout, long-context RL training ($N{\geq}16$, $P{\geq}8\text{K}$), this redundancy dominates the policy update cost. We observe that in decoder-only models, causal masking makes prompt representations invariant across sequences at every layer, so all per-token operations (norms, projections, MLP) and attention can process the prompt once -- a property not yet exploited at the kernel level for training. We propose \textbf{DualKV}, the first FlashAttention kernel variant that eliminates shared-prompt replication during RL training, via (1)~fused CUDA forward and backward kernels that iterate over two disjoint KV regions -- shared context and per-sequence response -- in a single kernel launch, and (2)~a data-pipeline redesign in veRL that repacks $N(P{+}R)$ tokens into $P{+}NR$ tokens per micro-batch, extending the token reduction from attention to the entire model by a factor $\rho = N(P{+}R)/(P{+}NR)$. DualKV is mathematically equivalent to standard attention and introduces no approximation. On Qwen3-8B GRPO training with 8$\times$H100 GPUs ($N{=}32$, 8K-context), DualKV achieves $1.63$--$2.09\times$ policy-update speedup, enables $2\times$ larger micro-batches, and raises MFU from $36\%$ to $76\%$. Similar gains hold for DAPO ($2.47\times$ speedup, $77\%$ MFU). At 30B MoE scale on 16$\times$H100, DualKV achieves $3.82\times$ policy-update and $3.38\times$ end-to-end step speedup over FlashAttention (which requires 4-way Ulysses sequence parallelism to avoid OOM).
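The two ideas in the abstract can be illustrated with a minimal, CPU-only sketch; it is not the DualKV CUDA kernel. It assumes the two KV regions are merged with the usual FlashAttention-style online-softmax rescaling, uses a single query with scalar values for brevity, and picks a response length of $R{=}1024$ purely for illustration, since the abstract does not state $R$. The function names `token_reduction`, `reference_attend`, and `two_region_attend` are hypothetical.

```python
import math


def token_reduction(n, p, r):
    """rho = N(P+R) / (P+NR): token count before repacking over count after."""
    return n * (p + r) / (p + n * r)


def reference_attend(scores, values):
    """Baseline: a single softmax over every key for one query."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def two_region_attend(regions):
    """Accumulate attention over disjoint (scores, values) regions in turn.

    Each region updates a running max, denominator, and weighted sum,
    rescaling earlier partial results whenever the running max grows.
    """
    m, denom, acc = -math.inf, 0.0, 0.0
    for scores, values in regions:
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)  # equals 0.0 on the first region
        weights = [math.exp(s - m_new) for s in scores]
        denom = denom * scale + sum(weights)
        acc = acc * scale + sum(w * v for w, v in zip(weights, values))
        m = m_new
    return acc / denom


# Illustrative data: a shared-prompt region and one response region.
prompt_scores, prompt_values = [0.5, -1.0, 2.0], [1.0, 2.0, 3.0]
resp_scores, resp_values = [0.1, 1.5], [4.0, 5.0]

merged = two_region_attend(
    [(prompt_scores, prompt_values), (resp_scores, resp_values)]
)
baseline = reference_attend(
    prompt_scores + resp_scores, prompt_values + resp_values
)
rho = token_reduction(n=32, p=8192, r=1024)  # R = 1024 is an assumed value
```

Under these assumptions, `merged` agrees with `baseline` up to floating-point rounding, which is the sense in which splitting the keys into a shared region and a per-sequence region introduces no approximation. With the assumed $R{=}1024$, `rho` evaluates to 7.2; the actual reduction depends on the response lengths used in training.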
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Efficiency Benchmarking | Qwen3-8B Single-layer forward+backward setup | Time (ms) | 36.6 | 57 |
| GRPO Training | GSM8K synthetic variants short-prompt (val) | Peak Memory (GB) | 26.5 | 8 |
| Kernel-level Attention Speed and Memory Analysis | Qwen3-8B model dimensions (H=32, Hk=8, d=128, GQA 4:1) on A100 GPU (test) | Forward Pass Time (ms) | 27.1 | 7 |
| RL Training | LongReason | Peak Memory (GB) | 80 | 6 |
| Reasoning | LongReason (val) | Accuracy (val) | 78 | 4 |