HumanOmni-Speaker: Identifying Who said What and When
About
While omni-modal large language models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversational dynamics to accurately answer "who said what and when." Current models suffer from an "illusion of competence": they exploit visual biases in conventional benchmarks to bypass genuine cross-modal alignment, while relying on sparse, low-frame-rate visual sampling that destroys crucial high-frequency dynamics such as lip movements. To shatter this illusion, we introduce Visual-Registered Speaker Diarization and Recognition (VR-SDR) and the HumanOmni-Speaker Benchmark. By strictly eliminating visual shortcuts, this rigorous paradigm demands true end-to-end spatio-temporal identity binding using only natural language queries. To overcome the underlying architectural perception gap, we propose HumanOmni-Speaker, powered by a Visual Delta Encoder. By sampling raw video at 25 fps and explicitly compressing inter-frame motion residuals into just 6 tokens per frame, it captures fine-grained visemes and speaker trajectories without triggering a catastrophic token explosion. Ultimately, HumanOmni-Speaker demonstrates strong multimodal synergy, natively enabling end-to-end lip reading and high-precision spatial localization without intrusive cropping, and achieving superior performance across a wide spectrum of speaker-centric tasks.
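The token-budget arithmetic behind the Visual Delta Encoder can be sketched as follows. This is a minimal illustration of the stated idea (25 fps sampling, inter-frame residuals compressed to 6 tokens per frame), not the paper's implementation: the feature dimensions and the linear compression layer are assumptions chosen only to make the shapes concrete.

```python
import numpy as np

FPS = 25                # raw video sampling rate stated in the abstract
TOKENS_PER_FRAME = 6    # compressed delta tokens per frame, per the abstract
PATCH_DIM = 1024        # assumed flattened per-frame visual feature size
TOKEN_DIM = 128         # assumed token embedding size

rng = np.random.default_rng(0)
# Hypothetical learned compression, stood in for here by a random projection.
proj = rng.standard_normal((PATCH_DIM, TOKENS_PER_FRAME * TOKEN_DIM)) * 0.01


def encode_deltas(frame_feats: np.ndarray) -> np.ndarray:
    """Compress inter-frame motion residuals into 6 tokens per frame step.

    frame_feats: (T, PATCH_DIM) per-frame features sampled at 25 fps.
    Returns: (T - 1, TOKENS_PER_FRAME, TOKEN_DIM) delta tokens.
    """
    deltas = np.diff(frame_feats, axis=0)   # (T-1, PATCH_DIM) motion residuals
    tokens = deltas @ proj                  # linear compression (assumed form)
    return tokens.reshape(-1, TOKENS_PER_FRAME, TOKEN_DIM)


# One second of video: 25 frames -> 24 delta steps -> 24 * 6 = 144 tokens,
# versus 25 * PATCH_DIM-sized dense frames if every frame were kept whole.
feats = rng.standard_normal((FPS, PATCH_DIM)).astype(np.float32)
tokens = encode_deltas(feats)
print(tokens.shape)  # (24, 6, 128)
```

The point of the sketch is the budget: encoding residuals rather than full frames keeps the sequence length linear in time with a small constant (6 tokens per frame step), which is what lets the model retain 25 fps lip dynamics without a token explosion.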
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-clean) | WER 1.88 | 1156 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER 4.56 | 1151 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER 4.16 | 462 |
| Automatic Speech Recognition | LibriSpeech (dev-clean) | WER (%) 1.91 | 340 |
| Visual Speech Recognition | LRS3 | WER 0.334 | 63 |
| Visual Speech Recognition | LRS2 | Mean WER 29.8 | 49 |
| Audio Speech Recognition | LRS3 | WER 3.63 | 18 |
| Audio-Visual Speech Recognition | LRS3 | WER 76 | 14 |
| Speaker-centric multimodal understanding | HumanOmni-Speaker | Speech Recognition (Easy) 1.9 | 8 |
| Audio-Visual Speech Recognition | LRS2 | -- | 4 |