SparseBalance: Load-Balanced Long Context Training with Dynamic Sparse Attention
About
While sparse attention mitigates the computational bottleneck of long-context LLM training, its distributed training process exhibits extreme heterogeneity in both (1) sequence length and (2) sparsity sensitivity, leading to severe load imbalance and sub-optimal model accuracy. Existing algorithms and training frameworks typically focus on a single issue and fail to co-optimize the two problems systematically. We therefore propose SparseBalance, a novel algorithm-system co-design framework that exploits sparsity and sequence heterogeneity to jointly optimize model accuracy and system efficiency. First, we propose workload-aware dynamic sparsity tuning, which employs bidirectional sparsity adjustment to eliminate stragglers and exploit inherent bubbles for accuracy gains at no extra cost. Second, we propose a sparsity-aware batching strategy that achieves coarse-grained balance and complements dynamic sparsity tuning. Experimental results demonstrate that SparseBalance achieves up to a 1.33× end-to-end speedup while also improving long-context capability by 0.46% on the LongBench benchmark.
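To make the coarse-grained balancing idea concrete, here is a minimal, hypothetical sketch (not the authors' implementation) of sparsity-aware batching: each sequence's attention cost is estimated from its length and sparsity density, and sequences are assigned greedily to the least-loaded data-parallel rank. The cost model `density * L^2` and the function names are illustrative assumptions.

```python
# Hypothetical sketch of sparsity-aware batching: balance the estimated
# attention workload across data-parallel ranks. The cost model
# (density * seq_len**2, i.e. the dense attention cost scaled by the
# fraction of attended entries) is an assumption for illustration.

def estimate_cost(seq_len, density):
    """Estimated attention cost of one sequence under a given sparsity density."""
    return density * seq_len ** 2

def sparsity_aware_batches(sequences, num_ranks):
    """Greedy longest-processing-time assignment: sort sequences by
    descending estimated cost, then place each one on the rank with the
    smallest accumulated load, yielding a coarse-grained balance."""
    batches = [[] for _ in range(num_ranks)]
    loads = [0.0] * num_ranks
    for seq in sorted(sequences, key=lambda s: -estimate_cost(*s)):
        r = min(range(num_ranks), key=loads.__getitem__)  # least-loaded rank
        batches[r].append(seq)
        loads[r] += estimate_cost(*seq)
    return batches, loads

if __name__ == "__main__":
    # (seq_len, density) pairs with heterogeneous lengths and sparsity levels.
    seqs = [(8192, 0.1), (4096, 0.5), (2048, 1.0), (1024, 1.0)]
    batches, loads = sparsity_aware_batches(seqs, num_ranks=2)
    print(loads)
```

Balancing on estimated cost rather than raw sequence length is the point: a long but highly sparse sequence can be cheaper than a short dense one, so length-based batching alone would still leave stragglers.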
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Retrieval | Needle-in-a-Haystack L=8k | Accuracy | 100 | 18 |
| Exact Retrieval | Needle-in-a-Haystack (NIAH) 16K | Average Accuracy | 100 | 5 |
| Exact Retrieval | Needle-in-a-Haystack (NIAH) 32K | Average Accuracy | 100 | 5 |
| Exact Retrieval | Needle-in-a-Haystack (NIAH) 64K | Average Accuracy | 97.53 | 5 |