Do Modern Video-LLMs Need to Listen? A Benchmark Audit and Scalable Remedy

About

Speech and audio encoders developed over years of community effort are routinely excluded from video understanding pipelines -- not because they fail, but because benchmarks never required listening. We audit 10 video benchmarks and find items largely solvable from visual cues alone: a single-frame probe answers ~76% of AVQA without audio, suggesting poor measurement of audio-visual reasoning. Building on LLaVA-OneVision, we attach a speech/audio encoder and compare five compressor architectures under 25x token reduction (25 Hz to 1 Hz). Across 10 benchmarks -- with and without filtering -- audio yields clear gains on tasks requiring speech comprehension or cross-modal grounding, while vision-centric suites remain largely unaffected. Our results show that speech encoders play a larger role in video understanding than current benchmarks suggest. We will fully open-source our work at https://github.com/naver-ai/LLaVA-AV-SSM.
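For intuition on the "25x token reduction (25 Hz to 1 Hz)" step, below is a minimal sketch of the simplest compressor one could attach between an audio encoder and the LLM: mean pooling over one-second windows followed by a linear projection into the LLM embedding space. This is an illustrative assumption, not the paper's released implementation; the class name, feature dimensions, and pooling choice are placeholders (the paper compares five compressor designs).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AvgPoolAudioCompressor(nn.Module):
    """Pools 25 Hz audio-encoder features into 1 Hz tokens and projects them to the LLM width."""

    def __init__(self, audio_dim: int = 1024, llm_dim: int = 4096, window: int = 25):
        super().__init__()
        self.window = window                       # 25 audio frames -> 1 token (25x reduction)
        self.proj = nn.Linear(audio_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, T, audio_dim), one feature vector per 40 ms (25 Hz)
        b, t, d = audio_feats.shape
        pad = (-t) % self.window
        if pad:                                    # pad the time axis so it divides evenly
            audio_feats = F.pad(audio_feats, (0, 0, 0, pad))
        # (batch, T/25, 25, dim): average each one-second window into a single token
        pooled = audio_feats.view(b, -1, self.window, d).mean(dim=2)
        return self.proj(pooled)                   # (batch, T/25, llm_dim)

# A 10 s clip: 250 audio frames in, 10 audio tokens out
tokens = AvgPoolAudioCompressor()(torch.randn(1, 250, 1024))
print(tokens.shape)  # torch.Size([1, 10, 4096])
```

More elaborate compressors (e.g., query-based or convolutional ones) would replace the pooling step but keep the same interface: a 25 Hz feature sequence in, a 1 Hz token sequence out.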

Geewook Kim, Minjoon Seo • 2025

Related benchmarks

Task | Dataset | Result | Rank
Temporal Video Understanding | TempCompass | Average Score 63.7 | 68
Video Question Answering | ActivityNet | -- | 22
Video Multimodal Evaluation | Video-MME | Original Score 65.6 | 8
Audio-Visual Question Answering | MUSIC-AVQA | Original Score 79.5 | 6
Audio-Visual Speaker Identification | AV-Speaker | Original Score 46.6 | 6
Video Grounded Reasoning | WorldSense | Original Score 44.7 | 6
Long-form Video Understanding | LongVideoBench | Original Score 53.9 | 6
Multimodal Video Understanding | MMMU Video | Original Score 36.7 | 6
