
Scalable LLM Reasoning Acceleration with Low-rank Distillation

About

Due to long generations, math reasoning with large language models (LLMs) demands significant computational resources and time. While many existing efficient inference methods preserve performance well on language tasks, they often severely degrade math performance. In this paper, we propose Caprese, a resource-efficient distillation method that recovers the capabilities lost when deploying efficient inference methods, focused primarily on feedforward blocks. With the original weights unperturbed, roughly 1% additional parameters, and only 20K synthetic training samples, Caprese recovers much, if not all, of the reasoning capability lost to efficient inference for thinking LLMs, without harming language-task performance for instruct LLMs. Moreover, Caprese slashes the number of active parameters (~2B cut for Gemma 2 9B and Llama 3.1 8B) and integrates cleanly into existing model layers to reduce latency (>16% reduction in time to next token) while encouraging response brevity (up to 8.5% fewer tokens).
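The abstract describes the core idea at a high level: keep the original feedforward weights frozen, run them through an efficient inference path, and distill a small low-rank correction (roughly 1% extra parameters) so the block's outputs match the original model on synthetic data. The following PyTorch sketch illustrates that pattern only; it is not the paper's implementation, and names such as LowRankCorrection, CorrectedFFN, hidden_dim, and rank are illustrative assumptions.

```python
# Sketch: frozen (efficient) feedforward block + trainable low-rank correction,
# distilled to match the original dense block's outputs on synthetic samples.
import torch
import torch.nn as nn

class LowRankCorrection(nn.Module):
    """Small trainable low-rank path added next to a frozen feedforward block."""
    def __init__(self, hidden_dim: int, rank: int):
        super().__init__()
        self.down = nn.Linear(hidden_dim, rank, bias=False)  # d -> r
        self.up = nn.Linear(rank, hidden_dim, bias=False)    # r -> d
        nn.init.zeros_(self.up.weight)  # start as a no-op so behavior is unchanged at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))

class CorrectedFFN(nn.Module):
    """Efficient FFN (original weights frozen) plus the low-rank correction."""
    def __init__(self, efficient_ffn: nn.Module, hidden_dim: int, rank: int):
        super().__init__()
        self.efficient_ffn = efficient_ffn
        for p in self.efficient_ffn.parameters():
            p.requires_grad_(False)  # original weights stay unperturbed
        self.correction = LowRankCorrection(hidden_dim, rank)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.efficient_ffn(x) + self.correction(x)

def distill_step(student: CorrectedFFN, teacher_ffn: nn.Module,
                 batch: torch.Tensor, opt: torch.optim.Optimizer) -> float:
    """One distillation step: only the low-rank correction receives gradients."""
    with torch.no_grad():
        target = teacher_ffn(batch)  # original (unaccelerated) block's output
    loss = nn.functional.mse_loss(student(batch), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Because the correction is d-by-r plus r-by-d rather than d-by-d, its parameter count stays a small fraction of the frozen block's, which is consistent with the ~1% overhead and unchanged original weights claimed above.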

Harry Dong, Bilge Acun, Beidi Chen, Yuejie Chi • 2025

Related benchmarks

Task                   | Dataset       | Metric              | Result | Rank
Mathematical Reasoning | MATH          | Accuracy            | 28.16  | 535
Reasoning              | MATH 500      | Accuracy (%)        | 92     | 59
Reasoning              | AIME 24       | Accuracy on AIME 24 | 61.67  | 41
Reasoning              | GPQA          | Accuracy            | 55.05  | 38
Language Generation    | CoQA          | Accuracy            | 65.5   | 35
Language Generation    | Qasper        | Accuracy            | 15.35  | 35
Language Generation    | Xsum          | Accuracy            | 24.89  | 35
Language Generation    | CNN/DailyMail | Accuracy            | 27.16  | 35
Reasoning              | AMC 2023      | Accuracy (AMC 2023) | 88.75  | 21
Reasoning              | BRUMO 2025    | Accuracy            | 53     | 21
Showing 10 of 13 rows.
