Models Know Models Best: Evaluation via Model-Preferred Formats
About
Performance of Large Language Models (LLMs) on multiple-choice tasks differs markedly between symbol-based and cloze-style evaluation formats. The observed discrepancies are systematically attributable to task characteristics: natural language continuation benefits from likelihood scoring, whereas explicit comparison is better suited to symbol-based selection. These trends are consistent across various decoder-based LLMs, indicating model-agnostic effects. To address these inconsistencies, a dynamic format-alignment strategy is introduced that employs a lightweight classifier trained on latent model-preference signals. In contrast to human-designed heuristics, which often degrade performance, this approach uses model-generated signals to determine the optimal format for each problem instance. The proposed method achieves substantial and consistent improvements in zero-shot accuracy across reasoning and knowledge benchmarks, better revealing the models' latent capabilities.
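The dynamic format-alignment idea can be sketched in a few lines. The snippet below is an illustrative reconstruction, not the paper's implementation: `cloze_score`, `pick_by_symbol`, and the perceptron-style `FormatClassifier` are hypothetical names, and the "latent preference signal" is simulated as a per-instance label saying which format the model is assumed to answer correctly. The two scorers stand in for likelihood-based cloze evaluation and symbol-based selection; the classifier routes each instance to a format based on simple instance features.

```python
# Hedged sketch of per-instance format selection (names illustrative).
# Two evaluation formats for a multiple-choice instance:
#  - cloze: score each option by its length-normalized token log-likelihood
#  - symbol: score the answer letters directly, given all options in context

def cloze_score(option_logprobs):
    """Length-normalized sum of token log-probs for one option."""
    return sum(option_logprobs) / len(option_logprobs)

def pick_by_cloze(options_logprobs):
    """Return the index of the option with the best cloze score."""
    scores = [cloze_score(lp) for lp in options_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)

def pick_by_symbol(symbol_logits):
    """Return the index of the highest-scoring answer symbol (A/B/C/...)."""
    return max(range(len(symbol_logits)), key=symbol_logits.__getitem__)

class FormatClassifier:
    """Tiny perceptron over instance features -> {'cloze', 'symbol'}.

    Trained on labels derived from which format the model got right,
    standing in for the paper's 'latent model-preference signals'.
    """
    def __init__(self, n_features):
        self.w = [0.0] * n_features
        self.b = 0.0

    def predict(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return "symbol" if z > 0 else "cloze"

    def update(self, x, label, lr=0.1):
        # Standard perceptron update on a misclassified instance.
        if self.predict(x) != label:
            y = 1.0 if label == "symbol" else -1.0
            self.w = [wi + lr * y * xi for wi, xi in zip(self.w, x)]
            self.b += lr * y

# Toy preference data: feature[0] crudely encodes how much the instance
# demands explicit comparison (high -> symbol format preferred).
train = [([0.9, 1.0], "symbol"), ([0.8, 1.0], "symbol"),
         ([0.1, 1.0], "cloze"), ([0.2, 1.0], "cloze")]
clf = FormatClassifier(n_features=2)
for _ in range(20):
    for x, label in train:
        clf.update(x, label)
```

At evaluation time, `clf.predict` would choose the format for each instance, after which the corresponding scorer (`pick_by_cloze` or `pick_by_symbol`) selects the answer.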
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 87.2 | 1460 |
| Commonsense Reasoning | WinoGrande | Accuracy | 81.7 | 776 |
| Question Answering | ARC Challenge | Accuracy | 93.9 | 749 |
| Question Answering | OpenBookQA | Accuracy | 84.4 | 465 |
| Question Answering | ARC Easy | Accuracy | 98.2 | 386 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 88.2 | 329 |
| Multitask Language Understanding | MMLU | Accuracy | 73.5 | 206 |
| Question Answering | OpenBookQA | Accuracy | 84.4 | 84 |
| Social Commonsense Reasoning | SocialIQA | Accuracy | 79.2 | 68 |
| Physical Reasoning | PIQA | Accuracy | 91.3 | 34 |