AVG

Benchmarks

Task Name	Dataset Name	SOTA Result
Question Answering	AVG.	EM 47.5	48
Speculative Decoding	Avg.	Speedup 6.73	32
Mathematical and Scientific Reasoning	AVG GSM8K MATH-500 AIME 2024 OlympiadBench GPQA Diamond	Accuracy 78.4	24
Mathematical Reasoning	Avg. (held out)	Accuracy 62.8	24
Change detection	Avg across SYSU, LEVIR, GVLM, CLCD, OSCD	Precision 84.8	23
General Reasoning	AVG Reasoning Suite	Accuracy 77.4	18
Low-Light Image Enhancement	AVG. DICM, MEF, LIME, NPE, VV	NIQE 3.589	17
Document Reranking	AVG	NDCG@5 54.373	14
Reasoning Performance (Aggregate)	AVG	TPF 351	14
Question Answering	AVG. Aggregate of NQ, TQA, HQA, 2WIKI (test)	EM 42.5	14
Reasoning	AVG AMC 2023, AIME 2025, HumanEval, LiveCodeBench	Accuracy 83.4	12
Emotion Recognition in Conversation	AVG IEMOCAP, MELD, EmoryNLP	W-F1 60.91	11
Selective classification	Avg (all)	AURC (10^-2 Scale) 0.215	11
Selective classification	Avg 1K	AURC (Scale 10^-2) 0.248	11
Geolocation	AVG (test)	City Acc (25km) 8.3	10
Detection	AVG	AUC 0.897	10
Sparse-view reconstruction	Avg. across four cases Combined (test)	PSNR 40.63	9
Multi-task Language Understanding	AVG Across All Benchmarks	Throughput 12.89	8
Scene Text Recognition	AVG 12 benchmarks	Word Accuracy 91.33	8
Image Classification	Avg-14	Top-1 Clean Accuracy 76.19	7
Category-level pose estimation	Avg. All five datasets	Median Error (Med) 25.5	5
Preference Alignment	Avg.	GRA 70.6	4
Machine Translation	Avg (News, Flores, Subtitle, Travel) German-English (test aggregate)	DA 87.77	4
