HumanOmni-Speaker: Identifying Who said What and When
About
While omni-modal large language models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversational dynamics to accurately answer "who said what and when." Current models suffer from an "illusion of competence": they exploit visual biases in conventional benchmarks to bypass genuine cross-modal alignment, while relying on sparse, low-frame-rate visual sampling that destroys crucial high-frequency dynamics such as lip movements. To shatter this illusion, we introduce Visual-Registered Speaker Diarization and Recognition (VR-SDR) and the HumanOmni-Speaker Benchmark. By strictly eliminating visual shortcuts, this rigorous paradigm demands true end-to-end spatio-temporal identity binding using only natural language queries. To overcome the underlying architectural perception gap, we propose HumanOmni-Speaker, powered by a Visual Delta Encoder. By sampling raw video at 25 fps and explicitly compressing inter-frame motion residuals into just 6 tokens per frame, it captures fine-grained visemes and speaker trajectories without triggering a catastrophic token explosion. Ultimately, HumanOmni-Speaker demonstrates strong multimodal synergy, natively enabling end-to-end lip reading and high-precision spatial localization without intrusive cropping, and achieving superior performance across a wide spectrum of speaker-centric tasks.
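The token-budget arithmetic behind the Visual Delta Encoder can be sketched as follows. This is a minimal illustration of the stated idea (25 fps sampling, inter-frame residuals compressed to 6 tokens per frame), not the paper's implementation: the feature dimensions and the linear compression layer are assumptions chosen only to make the shapes concrete.

```python
import numpy as np

FPS = 25                # raw video sampling rate stated in the abstract
TOKENS_PER_FRAME = 6    # compressed delta tokens per frame, per the abstract
PATCH_DIM = 1024        # assumed flattened per-frame visual feature size
TOKEN_DIM = 128         # assumed token embedding size

rng = np.random.default_rng(0)
# Hypothetical learned compression, stood in for here by a random projection.
proj = rng.standard_normal((PATCH_DIM, TOKENS_PER_FRAME * TOKEN_DIM)) * 0.01


def encode_deltas(frame_feats: np.ndarray) -> np.ndarray:
    """Compress inter-frame motion residuals into 6 tokens per frame step.

    frame_feats: (T, PATCH_DIM) per-frame features sampled at 25 fps.
    Returns: (T - 1, TOKENS_PER_FRAME, TOKEN_DIM) delta tokens.
    """
    deltas = np.diff(frame_feats, axis=0)   # (T-1, PATCH_DIM) motion residuals
    tokens = deltas @ proj                  # linear compression (assumed form)
    return tokens.reshape(-1, TOKENS_PER_FRAME, TOKEN_DIM)


# One second of video: 25 frames -> 24 delta steps -> 24 * 6 = 144 tokens,
# versus 25 * PATCH_DIM-sized dense frames if every frame were kept whole.
feats = rng.standard_normal((FPS, PATCH_DIM)).astype(np.float32)
tokens = encode_deltas(feats)
print(tokens.shape)  # (24, 6, 128)
```

The point of the sketch is the budget: encoding residuals rather than full frames keeps the sequence length linear in time with a small constant (6 tokens per frame step), which is what lets the model retain 25 fps lip dynamics without a token explosion.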
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-clean) | WER 1.88 | 1156 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER 4.56 | 1151 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER 4.16 | 462 |
| Automatic Speech Recognition | LibriSpeech (dev-clean) | WER (%) 1.91 | 340 |
| Visual Speech Recognition | LRS3 | WER 0.334 | 63 |
| Visual Speech Recognition | LRS2 | Mean WER 29.8 | 49 |
| Audio Speech Recognition | LRS3 | WER 3.63 | 18 |
| Audio-Visual Speech Recognition | LRS3 | WER 76 | 14 |
| Speaker-centric multimodal understanding | HumanOmni-Speaker | Speech Recognition (Easy) 1.9 | 8 |
| Audio-Visual Speech Recognition | LRS2 | -- | 4 |