TAC: Timestamped Audio Captioning

About

Large Audio Language Models struggle to disentangle overlapping events in complex acoustic scenes, yielding temporally inconsistent captions and frequent hallucinations. We introduce Timestamped Audio Captioner (TAC), a model that produces temporally grounded audio descriptions at varying degrees of detail and resolution. TAC is trained with a synthetic data pipeline that constructs challenging and dynamic mixtures from real-world audio sources, enabling robust learning under realistic polyphonic conditions. Across event detection and dense captioning, TAC outperforms all competing methods, with a low hallucination rate and accurate temporal grounding. We also introduce TAC-V, an audio-visual pipeline to generate semantically rich audio-visual descriptions. We then show that TAC and TAC-V serves as a "semantic bridge" for a text-only reasoner: a simple TAC$\rightarrow$LLM and TAC-V$\rightarrow$LLM cascade achieves state-of-the-art scores on benchmarks for both audio (MMAU-Pro, MMSU, MMAR) and audio-visual (DailyOmni, VideoHolmes) understanding and reasoning respectively.

Sonal Kumar, Prem Seetharaman, Ke Chen, Oriol Nieto, Jiaqi Su, Zhepei Wang, Rithesh Kumar, Dinesh Manocha, Nicholas J. Bryan, Zeyu Jin, Justin Salamon• 2026

Related benchmarks

Task	Dataset	Result
Audiovisual Understanding & Reasoning	Daily-Omni	Score77.9	15
Audiovisual Understanding & Reasoning	World-Sense	Score58.6	14
Audio Understanding & Reasoning	MMSU	Score0.724	9
Audio Understanding & Reasoning	MMAU	MMAU Score73.9	9
Audio-visual understanding	AVHBench	Overall Score81.7	8
Audiovisual Understanding & Reasoning	Video-Holmes	Score59.2	4
Audiovisual Understanding & Reasoning	AVHBench AVM	Score61.6	4
Audiovisual Understanding & Reasoning	AVHBench AVC	Score22.6	4
Audio Understanding & Reasoning	MMAU Sound	Score79.7	3
Audio Understanding & Reasoning	MMAU Speech	Score79.3	3

Showing 10 of 13 rows

Other info

Follow for update

@wizwand_team Discord