
SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization

About

As large language models (LLMs) continue to scale up, their performance on various downstream tasks has significantly improved. However, evaluating their capabilities has become increasingly expensive, as performing inference on a large number of benchmark samples incurs high computational costs. In this paper, we revisit the model-item performance matrix and show that it exhibits sparsity, that representative items can be selected as anchors, and that the task of efficient benchmarking can be formulated as a sparse optimization problem. Based on these insights, we propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection. We utilize the representation capacity of MLPs to handle sparse optimization and propose the Anchor Importance Score and Candidate Importance Score to evaluate the value of each item for task-aware refinement. Extensive experiments demonstrate the low estimation error and high Kendall's τ of our method across a variety of benchmarks, showcasing its superior robustness and practicality in real-world scenarios. Code is available at https://github.com/taolinzhang/SparseEval.
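The paper's full pipeline (MLP-based sparse optimization, iterative refinement via the Anchor and Candidate Importance Scores) is not reproduced here. As a minimal sketch of the underlying anchor-weight idea only, assuming a model-item performance matrix `P`, a fixed anchor set, and a simple L1-penalized least-squares objective (all illustrative choices, not the authors' exact formulation), gradient descent can fit weights that map a model's anchor scores to its full-benchmark accuracy:

```python
import numpy as np

def fit_anchor_weights(P, anchor_idx, lr=0.1, l1=1e-3, steps=2000):
    """Learn weights over anchor items so that the weighted anchor scores
    approximate each model's full-benchmark accuracy.

    P          : (n_models, n_items) model-item performance matrix
    anchor_idx : indices of the selected anchor items
    l1         : L1 penalty encouraging a sparse weight vector
    """
    X = P[:, anchor_idx]                # each model's scores on the anchors
    y = P.mean(axis=1)                  # true full-benchmark accuracy
    w = np.full(len(anchor_idx), 1.0 / len(anchor_idx))  # uniform init
    for _ in range(steps):
        residual = X @ w - y
        # gradient of 0.5*MSE plus L1 subgradient
        grad = X.T @ residual / len(y) + l1 * np.sign(w)
        w -= lr * grad
    return w

# Synthetic demo: 50 models, 200 items, 20 randomly chosen anchors.
rng = np.random.default_rng(0)
P = rng.random((50, 200))
anchors = rng.choice(200, size=20, replace=False)
w = fit_anchor_weights(P, anchors)
est = P[:, anchors] @ w                 # estimated full-benchmark accuracy
mae = np.abs(est - P.mean(axis=1)).mean()
```

With weights in hand, a new model needs inference only on the 20 anchor items rather than all 200, which is the source of the evaluation savings; SparseEval's importance scores then decide which anchors to keep or swap, a step omitted in this sketch.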

Taolin Zhang, Hang Guo, Wang Lu, Tao Dai, Shu-Tao Xia, Jindong Wang• 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Model Performance Prediction | DeepSeek Model Families (Hold-out) | MAE 0.156 | 45 |
| LLM Performance Estimation | ARC (test) | MAE (%) 1.165 | 20 |
| LLM Performance Estimation | GSM8K (test) | MAE (%) 1.619 | 20 |
| LLM Performance Estimation | HellaSwag (test) | MAE (%) 0.827 | 20 |
| LLM Performance Estimation | MMLU (test) | MAE (%) 0.842 | 20 |
| LLM Performance Estimation | TruthfulQA (test) | MAE (%) 1.027 | 20 |
| LLM Performance Estimation | WinoGrande (test) | MAE 1.027 | 20 |
