Models Know Models Best: Evaluation via Model-Preferred Formats
About
Performance of Large Language Models (LLMs) on multiple-choice tasks differs markedly between symbol-based and cloze-style evaluation formats. The observed discrepancies are systematically attributable to task characteristics: natural language continuation benefits from likelihood scoring, whereas explicit comparison is better suited to symbol-based selection. These trends are consistent across various decoder-based LLMs, indicating model-agnostic effects. To address these inconsistencies, a dynamic format-alignment strategy is introduced that employs a lightweight classifier trained on latent model-preference signals. In contrast to human-designed heuristics, which often degrade performance, this approach uses model-generated signals to determine the optimal format for each problem instance. The proposed method achieves substantial and consistent improvements in zero-shot accuracy across reasoning and knowledge benchmarks, better revealing the models' latent capabilities.
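The idea can be illustrated with a minimal sketch: score each multiple-choice instance under both a cloze-style (likelihood of the option text) and a symbol-based (likelihood of the answer letter) format, and let a lightweight classifier route each instance to one of the two. The sketch below is illustrative only and is not the paper's exact method: the `gpt2` checkpoint is a placeholder for any decoder-only LM, the `instance_features` feature set is made up for demonstration, and `LogisticRegression` merely stands in for the "lightweight classifier"; in the paper the router is trained on model-preference signals, which here would correspond to labels indicating which format answered each held-out instance correctly.

```python
# Sketch: per-instance routing between cloze-style and symbol-based MCQ scoring.
# Assumes a HuggingFace decoder-only LM; features, router, and labels are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "gpt2"  # placeholder; any decoder-only LM works
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
lm = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def cloze_score(question: str, option: str) -> float:
    """Log-likelihood of the option text as a continuation of the question (cloze style)."""
    q_ids = tok(question, return_tensors="pt").input_ids
    full = tok(question + " " + option, return_tensors="pt").input_ids
    logits = lm(full).logits[0, :-1]          # position t predicts token t+1
    targets = full[0, 1:]
    logprobs = torch.log_softmax(logits, dim=-1)
    # Option tokens start where the question prefix ends (assumes prefix tokenization matches).
    span = slice(q_ids.shape[1] - 1, targets.shape[0])
    return logprobs[span].gather(-1, targets[span, None]).sum().item()

@torch.no_grad()
def symbol_scores(question: str, options: list[str]) -> list[float]:
    """Logits of the answer letters after an explicitly enumerated prompt (symbol-based)."""
    letters = [chr(ord("A") + i) for i in range(len(options))]
    prompt = question + "\n" + "\n".join(f"{l}. {o}" for l, o in zip(letters, options)) + "\nAnswer:"
    logits = lm(tok(prompt, return_tensors="pt").input_ids).logits[0, -1]
    return [logits[tok(" " + l, add_special_tokens=False).input_ids[0]].item() for l in letters]

def instance_features(question: str, options: list[str]) -> list[float]:
    """Cheap surface features the router conditions on (purely illustrative)."""
    return [float(len(question.split())),
            sum(len(o.split()) for o in options) / len(options),
            float(question.rstrip().endswith("?"))]

# Router trained offline on model-preference labels
# (1 = symbol format answered correctly, 0 = cloze format answered correctly).
router = LogisticRegression()
# Toy fit so the sketch runs end to end; real labels come from held-out evaluation.
router.fit([[5, 3, 1.0], [40, 12, 0.0], [8, 2, 1.0], [60, 20, 0.0]], [1, 0, 1, 0])

def predict(question: str, options: list[str]) -> int:
    """Pick the format per instance, then return the index of the highest-scoring option."""
    use_symbol = router.predict([instance_features(question, options)])[0] == 1
    scores = symbol_scores(question, options) if use_symbol else [cloze_score(question, o) for o in options]
    return max(range(len(options)), key=lambda i: scores[i])
```

In this reading, the routing decision is learned from the model's own behavior rather than hand-written rules, which is the contrast the abstract draws with human-designed heuristics.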
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 87.2 | 1891 |
| Commonsense Reasoning | WinoGrande | Accuracy | 81.7 | 1085 |
| Question Answering | ARC Challenge | Accuracy | 93.9 | 906 |
| Question Answering | ARC Easy | Accuracy | 98.2 | 597 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 88.2 | 572 |
| Question Answering | OpenBookQA | Accuracy | 84.4 | 465 |
| Multitask Language Understanding | MMLU | Accuracy | 73.5 | 413 |
| Question Answering | OpenBookQA | Accuracy | 84.4 | 126 |
| Social Commonsense Reasoning | SocialIQA | Accuracy | 79.2 | 100 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 91.0 | 56 |