
DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

About

Modern RL post-training methods such as GRPO and DAPO train on $N$ response sequences of $R$ tokens sampled from a shared prompt of $P$ tokens, but standard FlashAttention replicates all $P$ prompt tokens $N$ times across both forward and backward passes -- duplicating compute and memory on identical hidden states. In large-rollout, long-context RL training ($N{\geq}16$, $P{\geq}8\text{K}$), this redundancy dominates the policy update cost. We observe that in decoder-only models, causal masking makes prompt representations invariant across sequences at every layer, so all per-token operations (norms, projections, MLP) and attention can process the prompt once -- a property not yet exploited at the kernel level for training. We propose \textbf{DualKV}, the first FlashAttention kernel variant that eliminates shared-prompt replication during RL training, via (1)~fused CUDA forward and backward kernels that iterate over two disjoint KV regions -- shared context and per-sequence response -- in a single kernel launch, and (2)~a data-pipeline redesign in veRL that repacks $N(P{+}R)$ tokens into $P{+}NR$ tokens per micro-batch, extending the token reduction from attention to the entire model by a factor $\rho = N(P{+}R)/(P{+}NR)$. DualKV is mathematically equivalent to standard attention and introduces no approximation. On Qwen3-8B GRPO training with 8$\times$H100 GPUs ($N{=}32$, 8K-context), DualKV achieves $1.63$--$2.09\times$ policy-update speedup, enables $2\times$ larger micro-batches, and raises MFU from $36\%$ to $76\%$. Similar gains hold for DAPO ($2.47\times$ speedup, $77\%$ MFU). At 30B MoE scale on 16$\times$H100, DualKV achieves $3.82\times$ policy-update and $3.38\times$ end-to-end step speedup over FlashAttention (which requires 4-way Ulysses sequence parallelism to avoid OOM).
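The two ideas in the abstract can be illustrated with a small pure-Python sketch: the token-reduction factor $\rho = N(P{+}R)/(P{+}NR)$, and the claim that attending over two disjoint KV regions (shared prompt, per-sequence response) is exactly equivalent to standard attention. This is not the DualKV CUDA kernel. It is a single-head, single-query reference written under assumptions: the function names (`token_reduction`, `partial_state`, `two_region_attend`) are hypothetical, the merge uses the usual running-max softmax rescaling, and the response length `R = 1024` in the usage note is an assumed value, since the abstract does not state $R$.

```python
import math
import random


def token_reduction(n, p, r):
    """rho = N(P+R) / (P+NR): tokens before repacking over tokens after."""
    return n * (p + r) / (p + n * r)


def scores_for(q, keys):
    """Scaled dot-product scores of one query against a list of keys."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]


def attend(q, keys, values):
    """Reference softmax attention for one query over one contiguous KV list."""
    s = scores_for(q, keys)
    m = max(s)
    w = [math.exp(x - m) for x in s]
    total = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / total
            for j in range(len(values[0]))]


def partial_state(q, keys, values):
    """Unnormalized softmax state for one KV region: (max, sum, weighted values)."""
    s = scores_for(q, keys)
    m = max(s)
    w = [math.exp(x - m) for x in s]
    acc = [sum(wi * v[j] for wi, v in zip(w, values))
           for j in range(len(values[0]))]
    return m, sum(w), acc


def two_region_attend(q, shared_kv, response_kv):
    """Attend over shared-prompt KV and response KV separately, then merge."""
    m_a, s_a, acc_a = partial_state(q, *shared_kv)
    m_b, s_b, acc_b = partial_state(q, *response_kv)
    m = max(m_a, m_b)
    c_a, c_b = math.exp(m_a - m), math.exp(m_b - m)  # rescale to a common max
    total = c_a * s_a + c_b * s_b
    return [(c_a * x + c_b * y) / total for x, y in zip(acc_a, acc_b)]


def random_rows(n, d):
    return [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


random.seed(0)
d, P, R = 4, 6, 3                      # toy sizes, chosen for illustration only
prompt_k, prompt_v = random_rows(P, d), random_rows(P, d)
resp_k, resp_v = random_rows(R, d), random_rows(R, d)
queries = random_rows(R, d)            # one query per response token

max_err = 0.0
for t, q in enumerate(queries):
    # Causal mask: response token t sees the whole prompt and responses 0..t.
    ref = attend(q, prompt_k + resp_k[:t + 1], prompt_v + resp_v[:t + 1])
    out = two_region_attend(q, (prompt_k, prompt_v),
                            (resp_k[:t + 1], resp_v[:t + 1]))
    max_err = max(max_err, max(abs(x - y) for x, y in zip(ref, out)))

rho = token_reduction(32, 8192, 1024)  # R = 1024 is an assumed value
```

With the assumed sizes $N{=}32$, $P{=}8192$, $R{=}1024$, `rho` evaluates to 7.2: the packed layout holds 40,960 tokens instead of 294,912. Because the prompt keys and values appear once in `two_region_attend` regardless of how many response sequences reuse them, the same merge can be repeated per sequence without replicating the prompt, and `max_err` stays at floating-point rounding level.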

Jiading Gai, Shuai Zhang, Xiang Song, Bernie Wang, George Karypis • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Efficiency Benchmarking | Qwen3-8B Single-layer forward+backward setup | Time (ms): 36.6 | 57 |
| GRPO Training | GSM8K synthetic variants short-prompt (val) | Peak Memory (GB): 26.5 | 8 |
| Kernel-level Attention Speed and Memory Analysis | Qwen3-8B model dimensions (H=32, Hk=8, d=128, GQA 4:1) on A100 GPU (test) | Forward Pass Time (ms): 27.1 | 7 |
| RL Training | LongReason | Peak Memory (GB): 80 | 6 |
| Reasoning | LongReason (val) | Accuracy (val): 78 | 4 |
