PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction

About

Large Language Models (LLMs) are widely used in real-time voice chat applications, typically in combination with text-to-speech (TTS) systems to generate audio responses. However, their large size often leads to noticeable latency between the end of user input and the start of audio output, resulting in suboptimal user experiences. This latency is particularly evident when LLMs are deployed as single-user voice assistants on consumer-grade hardware with limited computing capacity. We find that this latency is dominated by the time it takes the LLM to generate the first sentence, which the TTS system requires as input because it synthesizes audio responses sentence by sentence. To address this bottleneck, we propose Predictive Generation (PredGen), a novel framework that mitigates, or even eliminates, this delay through speculative decoding at input time. PredGen generates candidate responses while the user is still speaking, enabling the system to begin TTS processing with minimal delay. Simulated experiments on the Lmsys and MT-Bench datasets show that the proposed method reduces latency by around 2x across a wide range of use cases, while incurring only minimal additional computation cost at input time, computation that would otherwise go unused.
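The input-time speculation idea can be illustrated with a toy sketch. This is not the authors' implementation: `toy_llm`, `baseline_steps`, and `speculative_steps` are hypothetical names, the "LLM" is a hard-coded stub, and the number of sequential decoding steps after the user stops speaking is used as a rough proxy for latency. The sketch also assumes that verifying a drafted first sentence costs one parallel pass, as in standard speculative decoding.

```python
from typing import Callable, List

LLM = Callable[[str], List[str]]


def toy_llm(prompt: str) -> List[str]:
    """Hypothetical stand-in for an LLM: returns first-sentence tokens."""
    if "weather" in prompt:
        return ["It", "looks", "sunny", "today."]
    return ["Could", "you", "say", "more?"]


def baseline_steps(full_input: str, llm: LLM = toy_llm) -> int:
    """No speculation: every first-sentence token is decoded after input ends."""
    return len(llm(full_input))


def speculative_steps(
    partial_inputs: List[str], full_input: str, llm: LLM = toy_llm
) -> int:
    """Draft while the user is still speaking, then verify once input ends."""
    # Input time: refresh the draft from each partial transcript as it arrives.
    draft: List[str] = []
    for partial in partial_inputs:
        draft = llm(partial)

    # After input ends: in this toy, the full-input output acts as the oracle
    # for what a verification pass would accept.
    target = llm(full_input)
    accepted = 0
    for drafted, expected in zip(draft, target):
        if drafted != expected:
            break
        accepted += 1

    # One parallel verification pass is always needed; it also yields the
    # first corrected token, so the cost is never above the baseline.
    return max(1, len(target) - accepted)


if __name__ == "__main__":
    full = "what is the weather in Paris"
    partials = ["what is the", "what is the weather"]
    print(baseline_steps(full))               # 4
    print(speculative_steps(partials, full))  # 1
    print(speculative_steps(["tell me"], "tell me the weather"))  # 4
```

When the draft produced from the partial transcript matches what the full input would yield, only the verification pass remains after the user stops speaking. When the draft misses entirely, the step count falls back to the baseline.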

Shufan Li, Aditya Grover • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Spoken Question Answering | VERA | Accuracy 8.69 | 8 |
| Spoken Question Answering | Spoken-MQA | Accuracy 83.52 | 8 |
| Audio-based Reasoning | BigBench Audio | Accuracy 71.6 | 8 |
| Spoken Question Answering | Our Bench | Accuracy 70.51 | 8 |
| Streaming Voice-Agent Interaction Efficiency | VERA | NFE 176.8 | 5 |
| Streaming Voice-Agent Interaction Efficiency | BigBench Audio | NFE 80.73 | 5 |
| Streaming Voice-Agent Interaction Efficiency | Spoken-MQA | NFE 59.39 | 5 |
| Streaming Voice-Agent Interaction Efficiency | Pause-and-Repair | NFE 172.5 | 5 |