
LLMBoost: Make Large Language Models Stronger with Boosting

About

Ensemble learning of LLMs has emerged as a promising way to enhance performance, but existing approaches typically treat models as black boxes, combining their inputs or final outputs while overlooking the rich internal representations and interactions across models. In this work, we introduce LLMBoost, a novel ensemble fine-tuning framework that breaks this barrier by explicitly leveraging the intermediate states of LLMs. Inspired by the boosting paradigm, LLMBoost incorporates three key innovations. First, a cross-model attention mechanism enables successor models to access and fuse hidden states from their predecessors, facilitating hierarchical error correction and knowledge transfer. Second, a chain training paradigm progressively fine-tunes the connected models with an error-suppression objective, ensuring that each model rectifies the mispredictions of its predecessor with minimal additional computation. Third, a near-parallel inference design pipelines hidden states across models layer by layer, achieving inference efficiency approaching single-model decoding. We further establish the theoretical foundations of LLMBoost, proving that sequential integration guarantees monotonic improvements under bounded-correction assumptions. Extensive experiments on commonsense reasoning and arithmetic reasoning tasks demonstrate that LLMBoost consistently boosts accuracy while reducing inference latency.
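The cross-model attention idea above — a successor model attending over a predecessor's hidden states and fusing them into its own — can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: the single-head attention without learned query/key/value projections, the residual fusion, and the tensor shapes are all simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_model_attention(h_succ, h_pred):
    """Fuse predecessor hidden states into the successor's.

    h_succ: (T, d) hidden states of the successor model (used as queries)
    h_pred: (T, d) hidden states of the predecessor model (keys and values)
    Returns (T, d) fused states: successor states plus an attention-weighted
    read of the predecessor's states (residual fusion).
    """
    d = h_succ.shape[-1]
    # Scaled dot-product attention: successor queries over predecessor keys.
    scores = (h_succ @ h_pred.T) / np.sqrt(d)   # (T, T)
    attn = softmax(scores, axis=-1)             # rows sum to 1
    # Residual fusion: add the attended predecessor context to the successor.
    return h_succ + attn @ h_pred

# Tiny usage example with random states for two models in the chain.
rng = np.random.default_rng(0)
h_pred = rng.normal(size=(4, 8))   # predecessor: 4 tokens, dim 8
h_succ = rng.normal(size=(4, 8))   # successor: same sequence, same dim
fused = cross_model_attention(h_succ, h_pred)
```

In the hierarchical setting the abstract describes, each successor in the chain would apply such a fusion to its predecessor's states before producing its own, letting later models correct earlier mistakes rather than merely averaging outputs.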

Zehao Chen, Tianxiang Ai, Yifei Li, Gongxun Li, Yuyang Wei, Wang Zhou, Guanghui Li, Bin Yu, Zhijun Chen, Hailong Sun, Fuzhen Zhuang, Jianxin Li, Deqing Wang, Yikun Ban • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Arithmetic Reasoning | GSM8K (test) | Accuracy: 78.9 | 129 |
| Mathematical Reasoning | MAWPS (test) | Accuracy: 94.1 | 87 |
| Arithmetic Reasoning | AQuA (test) | Accuracy: 61.8 | 58 |
| Arithmetic Reasoning | SVAMP (test) | Accuracy: 88.8 | 54 |
| Commonsense Reasoning | PIQA, WinoGrande, HellaSwag, BoolQ, SIQA, OBQA (test) | PIQA Accuracy: 89.9 | 32 |
| Efficiency Evaluation | Efficiency Profiling Workload (G.7 Detailed Efficiency Results) | End-to-End Latency (s): 13.75 | 9 |
| Agent Toolchain Scheduling | CCAD | Accuracy: 55 | 8 |
| Arithmetic Reasoning | AQuA, GSM8K, MAWPS, SVAMP | AQuA Accuracy: 60.2 | 7 |
| Commonsense Reasoning | PIQA, WinoGrande, HellaSwag, BoolQ, SocialIQA, OpenBookQA | PIQA Accuracy: 89.5 | 7 |
