Hearing Between the Lines: Unlocking the Reasoning Power of LLMs for Speech Evaluation

About

Large Language Model (LLM) judges exhibit strong reasoning capabilities but are limited to textual content. This leaves current automatic Speech-to-Speech (S2S) evaluation methods reliant on opaque and expensive Audio Language Models (ALMs). In this work, we propose TRACE (Textual Reasoning over Audio Cues for Evaluation), a novel framework that enables LLM judges to reason over audio cues to achieve cost-efficient and human-aligned S2S evaluation. To demonstrate the strength of the framework, we first introduce a Human Chain-of-Thought (HCoT) annotation protocol to improve the diagnostic capability of existing judge benchmarks by separating evaluation into explicit dimensions: content (C), voice quality (VQ), and paralinguistics (P). Using this data, TRACE constructs a textual blueprint of inexpensive audio signals and prompts an LLM to render dimension-wise judgments, fusing them into an overall rating via a deterministic policy. TRACE achieves higher agreement with human raters than ALMs and transcript-only LLM judges while being significantly more cost-effective. We will release the HCoT annotations and the TRACE framework to enable scalable and human-aligned S2S evaluation.

Arjun Chandra, Kevin Miller, Venkatesh Ravichandran, Constantinos Papayiannis, Venkatesh Saligrama• 2026

Related benchmarks

Task	Dataset	Result	Rank
Speech Quality Assessment	SPEAKBENCH	Content Score63.2		5
Speech Quality Assessment	S2S-ARENA	Content Score58		5

Showing 2 of 2 rows

Other info

Follow for update

@wizwand_team Discord