HEART: Emotionally-Driven Test-Time Scaling of Language Models

About

Test-time scaling has significantly improved how AI models solve problems, yet current methods often get stuck in repetitive, incorrect patterns of thought. We introduce HEART, a framework that uses emotional cues to guide the model's focus, much like how feelings contribute to human decision-making. By alternating between critical tones to sharpen error detection and encouraging tones to spark new ideas, HEART helps the model break out of dead-end reasoning and find the right solution. We evaluate HEART across seven high-difficulty benchmarks, including Humanity's Last Exam, GPQA Diamond, and LiveCodeBench, demonstrating robustness across diverse models. Results show that emotion facilitates deeper reasoning, yielding consistent accuracy gains over affect-sterile baselines. These findings suggest that the next frontier in machine reasoning lies in the strategic integration of affective regulation to guide logical synthesis.
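The alternation between critical and encouraging tones described above can be pictured as a simple refinement loop. The Python below is only an illustrative sketch under stated assumptions, not the authors' implementation: the cue wording, the `generate` and `is_correct` callables, and the strict even/odd alternation schedule are all hypothetical placeholders, since this page does not give the actual prompts or control logic.

```python
from typing import Callable, List, Tuple

# Hypothetical cue texts; the page does not give the actual prompts.
CRITICAL_CUE = "I am disappointed: this answer looks wrong. Find the error."
ENCOURAGING_CUE = "Good effort, you are close. Try a fresh approach."


def alternating_cue_refinement(
    question: str,
    generate: Callable[[str], str],      # assumed model call: prompt -> answer
    is_correct: Callable[[str], bool],   # assumed verifier: answer -> accepted?
    max_rounds: int = 4,
) -> Tuple[str, List[str]]:
    """Refine an answer, alternating critical and encouraging feedback cues."""
    cues_used: List[str] = []
    answer = generate(question)
    for round_idx in range(max_rounds):
        if is_correct(answer):
            break
        # Even rounds use the critical cue, odd rounds the encouraging one.
        cue = CRITICAL_CUE if round_idx % 2 == 0 else ENCOURAGING_CUE
        cues_used.append(cue)
        prompt = f"{question}\n\nPrevious answer: {answer}\n\nFeedback: {cue}"
        answer = generate(prompt)
    return answer, cues_used
```

In this sketch any model client can be passed as `generate` and any checker as `is_correct`; the loop stops as soon as the checker accepts an answer or the round budget is spent.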

Gabriela Pinto, Palash Goyal, Mihir Parmar, Yiwen Song, Souradip Chakraborty, Zifeng Wang, Jinsung Yoon, Tomas Pfister, Hamid Palangi • 2025

Related benchmarks

Task                          Dataset               Result                  Rank
Question Answering            SimpleQA              Accuracy 75.29          92
Mathematical Reasoning        OlympiadBench Math    Accuracy 99.86          84
Question Answering            SimpleQA Verified     Accuracy 82.5           60
Mathematical Problem Solving  AIME 2024             Accuracy 100            60
Reasoning                     Humanity's Last Exam  Accuracy 84.61          46
Mathematical Reasoning        AIME 2025             Accuracy 91.67          37
Scientific Reasoning          GPQA Diamond (test)   Accuracy 99.37          32
Code Generation               LiveCodeBench Medium  Accuracy 96.78          23
Code Generation               LiveCodeBench Hard    Pass@1 60.76            21
Code Generation               LiveCodeBench         Overall Accuracy 87.24  15
Showing 10 of 15 rows
