Do Modern Video-LLMs Need to Listen? A Benchmark Audit and Scalable Remedy
About
Speech and audio encoders developed over years of community effort are routinely excluded from video understanding pipelines -- not because they fail, but because benchmarks never required listening. We audit 10 video benchmarks and find that their items are largely solvable from visual cues alone: a single-frame probe answers ~76% of AVQA without audio, suggesting these benchmarks poorly measure audio-visual reasoning. Building on LLaVA-OneVision, we attach a speech/audio encoder and compare five compressor architectures under a 25x token reduction (25 Hz to 1 Hz). Across 10 benchmarks -- with and without filtering -- audio yields clear gains on tasks requiring speech comprehension or cross-modal grounding, while vision-centric suites remain largely unaffected. Our results show that speech encoders play a larger role in video understanding than current benchmarks suggest. We will fully open-source our work at https://github.com/naver-ai/LLaVA-AV-SSM.
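
The sketch below illustrates the kind of 25x token reduction described above: a simple pooling-based compressor that maps a 25 Hz audio-token stream to 1 Hz before projecting it into the LLM embedding space. This is a minimal example for intuition only, not the repository's implementation (the paper compares five compressor architectures); the dimensions `audio_dim`, `llm_dim`, and the average-pooling strategy are assumptions.

```python
# Minimal sketch of one possible 25x (25 Hz -> 1 Hz) audio-token compressor.
# NOT the repository's implementation; dimensions and pooling are assumptions.
import torch
import torch.nn as nn


class PoolingAudioCompressor(nn.Module):
    def __init__(self, audio_dim: int = 1280, llm_dim: int = 4096, reduction: int = 25):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=reduction, stride=reduction)  # 25 tokens -> 1
        self.proj = nn.Linear(audio_dim, llm_dim)  # map into the LLM token space

    def forward(self, audio_tokens: torch.Tensor) -> torch.Tensor:
        # audio_tokens: (batch, T, audio_dim), with T = seconds * 25
        x = audio_tokens.transpose(1, 2)   # (batch, audio_dim, T) for 1D pooling
        x = self.pool(x)                   # (batch, audio_dim, T // 25)
        x = x.transpose(1, 2)              # (batch, T // 25, audio_dim)
        return self.proj(x)                # (batch, T // 25, llm_dim)


if __name__ == "__main__":
    # 30 s of audio at 25 tokens/s -> 750 tokens in, 30 tokens out.
    tokens = torch.randn(1, 750, 1280)
    compressed = PoolingAudioCompressor()(tokens)
    print(compressed.shape)  # torch.Size([1, 30, 4096])
```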
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Temporal Video Understanding | TempCompass | Average Score | 63.7 | 68 |
| Video Question Answering | ActivityNet | -- | -- | 22 |
| Video Multimodal Evaluation | Video-MME | Original Score | 65.6 | 8 |
| Audio-Visual Question Answering | MUSIC-AVQA | Original Score | 79.5 | 6 |
| Audio-Visual Speaker Identification | AV-Speaker | Original Score | 46.6 | 6 |
| Video Grounded Reasoning | WorldSense | Original Score | 44.7 | 6 |
| Long-form Video Understanding | LongVideoBench | Original Score | 53.9 | 6 |
| Multimodal Video Understanding | MMMU Video | Original Score | 36.7 | 6 |