Models Know Models Best: Evaluation via Model-Preferred Formats

About

Performance of Large Language Models (LLMs) on multiple-choice tasks differs markedly between symbol-based and cloze-style evaluation formats. The observed discrepancies are systematically attributable to task characteristics: natural language continuation benefits from likelihood scoring, whereas explicit comparison is better suited to symbol-based selection. These trends are consistent across various decoder-based LLMs, indicating model-agnostic effects. To address these inconsistencies, a dynamic format-alignment strategy is introduced that employs a lightweight classifier trained on latent model-preference signals. In contrast to human-designed heuristics, which often degrade performance, this approach uses model-generated signals to determine the optimal format for each problem instance. The proposed method achieves substantial and consistent improvements in zero-shot accuracy across reasoning and knowledge benchmarks, better revealing the models' latent capabilities.
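The abstract contrasts two multiple-choice evaluation formats and proposes routing each instance to the one the model prefers. A minimal sketch of the idea is below, assuming a `loglik` scoring callable and a toy routing rule in place of the paper's trained classifier; all function names and the `requires_comparison` feature are illustrative assumptions, not the authors' implementation.

```python
# Sketch: per-instance choice between cloze-style scoring (likelihood of the
# full answer text as a continuation) and symbol-based scoring (likelihood of
# the option letter after an enumerated prompt). Hypothetical, simplified.
from dataclasses import dataclass

@dataclass
class MCItem:
    question: str
    options: list               # answer strings
    requires_comparison: bool   # toy stand-in for latent preference features

def cloze_pick(item, loglik):
    """Cloze-style: score each option's text as a natural continuation."""
    scores = [loglik(item.question + " " + opt) for opt in item.options]
    return max(range(len(scores)), key=scores.__getitem__)

def symbol_pick(item, loglik):
    """Symbol-based: enumerate options as A/B/C/D and score only the letter."""
    letters = "ABCD"
    prompt = item.question + "\n" + "\n".join(
        f"{letters[i]}. {opt}" for i, opt in enumerate(item.options)
    ) + "\nAnswer:"
    scores = [loglik(prompt + " " + letters[i]) for i in range(len(item.options))]
    return max(range(len(scores)), key=scores.__getitem__)

def choose_format(item):
    """Stand-in for the lightweight classifier: route explicit-comparison
    items to symbol-based selection, continuation-style items to cloze."""
    return symbol_pick if item.requires_comparison else cloze_pick
```

In the paper, `choose_format` would be a classifier trained on model-generated preference signals rather than a hand-written rule; the rule here only mirrors the reported trend (comparison tasks favor symbols, continuation tasks favor likelihood).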

Joonhak Lee, Sungmok Jung, Jongyeon Park, Jaejin Lee • 2026

Related benchmarks

Task                                Dataset        Result          Rank
Commonsense Reasoning               HellaSwag      Accuracy 87.2   1891
Commonsense Reasoning               WinoGrande     Accuracy 81.7   1085
Question Answering                  ARC Challenge  Accuracy 93.9   906
Question Answering                  ARC Easy       Accuracy 98.2   597
Physical Commonsense Reasoning      PIQA           Accuracy 88.2   572
Question Answering                  OpenBookQA     Accuracy 84.4   465
Multitask Language Understanding    MMLU           Accuracy 73.5   413
Question Answering                  OpenBookQA     Accuracy 84.4   126
Social Commonsense Reasoning        SocialIQA      Accuracy 79.2   100
Physical Commonsense Reasoning      PIQA           Accuracy 91.0   56

Showing 10 of 14 rows
