
SparseBalance: Load-Balanced Long Context Training with Dynamic Sparse Attention

About

While sparse attention mitigates the computational bottleneck of long-context LLM training, its distributed training process exhibits extreme heterogeneity in both (1) sequence length and (2) sparsity sensitivity, leading to severe load imbalance and sub-optimal model accuracy. Existing algorithms and training frameworks typically focus on a single issue, failing to co-optimize the two systematically. We therefore propose SparseBalance, a novel algorithm-system co-design framework that exploits sparsity and sequence heterogeneity to jointly optimize model accuracy and system efficiency. First, we propose workload-aware dynamic sparsity tuning, which employs bidirectional sparsity adjustment to eliminate stragglers and exploit inherent bubbles for free accuracy gains. Second, we propose a sparsity-aware batching strategy that achieves coarse-grained balance and complements dynamic sparsity tuning. Experimental results demonstrate that SparseBalance achieves up to a 1.33× end-to-end speedup while improving long-context capability by 0.46% on the LongBench benchmark.
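To make the batching idea concrete, the following is a minimal illustrative sketch, not the paper's actual implementation: a greedy, cost-balanced assignment in the spirit of "sparsity-aware batching." It assumes each sequence's attention cost can be estimated as length² × density (sparse attention touches only a fraction of the score matrix), and places sequences longest-cost-first onto the least-loaded worker. The function name and cost model are hypothetical.

```python
import heapq

def sparsity_aware_batches(seqs, num_workers):
    """Greedily balance estimated attention cost across workers.

    seqs: list of (length, density) pairs, with density in (0, 1].
    Returns a list of per-worker lists of sequence indices.
    """
    # Hypothetical cost model: quadratic in length, scaled by sparsity density.
    costed = sorted(
        ((length * length * density, i) for i, (length, density) in enumerate(seqs)),
        reverse=True,  # place the heaviest items first (LPT-style greedy)
    )
    # Min-heap of (accumulated_cost, worker_id): pop the least-loaded worker.
    heap = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    batches = [[] for _ in range(num_workers)]
    for cost, i in costed:
        load, w = heapq.heappop(heap)
        batches[w].append(i)
        heapq.heappush(heap, (load + cost, w))
    return batches
```

This greedy pass only yields the coarse-grained balance the abstract describes; per-step stragglers would still be handled by the finer-grained dynamic sparsity tuning.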

Hongtao Xu, Jianchao Tan, Yuxuan Hu, Pengju Lu, Hongyu Wang, Pingwei Sun, Yerui Sun, Yuchen Xie, Xunliang Cai, Mingzhen Li, Weile Jia • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Retrieval | Needle-in-a-Haystack L=8k | Accuracy: 100 | 18 |
| Exact Retrieval | Needle-in-a-Haystack (NIAH) 16K | Average Accuracy: 100 | 5 |
| Exact Retrieval | Needle-in-a-Haystack (NIAH) 32K | Average Accuracy: 100 | 5 |
| Exact Retrieval | Needle-in-a-Haystack (NIAH) 64K | Average Accuracy: 97.53 | 5 |
