HEART: Emotionally-Driven Test-Time Scaling of Language Models
About
Test-time scaling has significantly improved how AI models solve problems, yet current methods often get stuck in repetitive, incorrect patterns of thought. We introduce HEART, a framework that uses emotional cues to guide the model's focus, much like how feelings contribute to human decision-making. By alternating between critical tones to sharpen error detection and encouraging tones to spark new ideas, HEART helps the model break out of dead-end reasoning and find the right solution. We evaluate HEART across seven high-difficulty benchmarks--including Humanity's Last Exam, GPQA Diamond, and LiveCodeBench--demonstrating robustness across diverse models. Results show that emotion facilitates deeper reasoning, yielding consistent accuracy gains over affect-sterile baselines. These findings suggest that the next frontier in machine reasoning lies in the strategic integration of affective regulation to guide logical synthesis.
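The alternation described above can be sketched as a small refinement loop. This is an illustrative sketch only, not the authors' implementation: the cue texts, the `emotional_refine` function, and the `generate` and `is_correct` callables are all hypothetical names introduced here, and the abstract does not specify how HEART phrases its cues or decides when to stop.

```python
from typing import Callable, Optional

# Hypothetical cue texts; the source does not give HEART's actual prompts.
CRITICAL_CUE = "I am disappointed in this answer. Find the error in it."
ENCOURAGING_CUE = "Good effort so far. Try a fresh approach with confidence."


def emotional_refine(
    question: str,
    generate: Callable[[str], str],
    is_correct: Callable[[str], bool],
    max_rounds: int = 4,
) -> Optional[str]:
    """Alternate critical and encouraging cues until a candidate passes.

    `generate` stands in for a language-model call and `is_correct` for
    any verifier; both are assumptions made for this sketch.
    """
    answer = generate(question)
    for round_idx in range(max_rounds):
        if is_correct(answer):
            return answer
        # Even rounds use the critical tone, odd rounds the encouraging one.
        cue = CRITICAL_CUE if round_idx % 2 == 0 else ENCOURAGING_CUE
        prompt = f"{question}\n\nPrevious attempt:\n{answer}\n\nFeedback: {cue}"
        answer = generate(prompt)
    # Accept the final attempt only if it passes the check.
    return answer if is_correct(answer) else None
```

Passing the model and the verifier in as plain callables keeps the sketch independent of any particular API, so a stub can replace either one when testing.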
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Question Answering | SimpleQA | Accuracy | 75.29 | 92 |
| Mathematical Reasoning | OlympiadBench Math | Accuracy | 99.86 | 84 |
| Question Answering | SimpleQA Verified | Accuracy | 82.5 | 60 |
| Mathematical Problem Solving | AIME 2024 | Accuracy | 100 | 60 |
| Reasoning | Humanity's Last Exam | Accuracy | 84.61 | 46 |
| Mathematical Reasoning | AIME 2025 | Accuracy | 91.67 | 37 |
| Scientific Reasoning | GPQA Diamond (test) | Accuracy | 99.37 | 32 |
| Code Generation | LiveCodeBench Medium | Accuracy | 96.78 | 23 |
| Code Generation | LiveCodeBench Hard | Pass@1 | 60.76 | 21 |
| Code Generation | LiveCodeBench | Overall Accuracy | 87.24 | 15 |