ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding

About

Recent Omni-multimodal Large Language Models show promise in unified audio, vision, and text modeling. However, streaming audio-video understanding remains challenging, as existing approaches suffer from disjointed capabilities: they typically exhibit incomplete modality support or lack autonomous proactive monitoring. To address this, we present ROMA, a real-time omni-multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision-making, we introduce a lightweight speak head that decouples response initiation from generation to ensure precise triggering without task conflict. We train ROMA with a curated streaming dataset and a two-stage curriculum that progressively optimizes for streaming format adaptation and proactive responsiveness. To standardize the fragmented evaluation landscape, we reorganize diverse benchmarks into a unified suite covering both proactive (alert, narration) and reactive (QA) settings. Extensive experiments across 12 benchmarks demonstrate that ROMA achieves state-of-the-art performance on proactive tasks while remaining competitive in reactive settings, validating its robustness in unified real-time omni-multimodal understanding.
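
To make the two mechanisms named in the abstract more concrete, here is a minimal Python sketch of (a) grouping dense audio chunks and sparse video frames into time-aligned units and (b) a separate gate that decides when to start a response. It is an illustration under stated assumptions, not ROMA's implementation: the abstract does not give the unit length, the data layout, or the speak head's form, so the one-second window, the `Unit` class, `build_units`, `should_speak`, `run_stream`, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Unit:
    """One time-aligned slice of the stream (hypothetical structure)."""
    start: float
    end: float
    frames: List[str] = field(default_factory=list)
    audio: List[str] = field(default_factory=list)


def build_units(frames: List[Tuple[float, str]],
                audio: List[Tuple[float, str]],
                unit_sec: float = 1.0) -> List[Unit]:
    """Bucket (timestamp, payload) items into consecutive unit_sec windows.

    Assumption: a fixed window length; the paper may align modalities differently.
    """
    items = [(t, "frames", p) for t, p in frames]
    items += [(t, "audio", p) for t, p in audio]
    if not items:
        return []
    n_units = int(max(t for t, _, _ in items) // unit_sec) + 1
    units = [Unit(i * unit_sec, (i + 1) * unit_sec) for i in range(n_units)]
    # Sort by timestamp only, so payloads stay in arrival order within a unit.
    for t, kind, payload in sorted(items, key=lambda item: item[0]):
        getattr(units[int(t // unit_sec)], kind).append(payload)
    return units


def should_speak(score: float, threshold: float = 0.5) -> bool:
    """Stand-in for a speak head: a thresholded trigger score (assumed form)."""
    return score >= threshold


def run_stream(units: List[Unit],
               score_fn: Callable[[Unit], float],
               respond_fn: Callable[[Unit], str],
               threshold: float = 0.5) -> List[Tuple[int, str]]:
    """Call respond_fn only on units where the gate fires.

    Triggering (score_fn) and generation (respond_fn) are separate callables,
    mirroring the decoupling the abstract describes.
    """
    responses = []
    for index, unit in enumerate(units):
        if should_speak(score_fn(unit), threshold):
            responses.append((index, respond_fn(unit)))
    return responses
```

With one frame per second and four audio chunks per second, each one-second unit holds one frame and four audio chunks, which is the kind of granularity mismatch the abstract refers to. Keeping the gate separate from the response function means the trigger can run on every unit while the more expensive generation step runs only when it fires.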

Xueyun Tian, Wei Li, Bingbing Xu, Heng Dong, Yuanzhuo Wang, Huawei Shen • 2026

Related benchmarks

Task	Dataset	Result
Video Question Answering	Video-MME without subtitles	Accuracy (Overall): 34.56	81
Temporal Grounding	Charades-STA (test)	--	68
Question Answering	EgoSchema	Accuracy: 55.4	22
Online Visual-Only Question Answering	OVO-Bench	OCR: 63.1	13
Visual-Only Question Answering	StreamingBench Visual-Only QA 1.0 (test)	OP: 77	11
Backward Tracing	OVO-Bench Reactive QA 1.0 (test)	EPM: 56.57	10
Real-time Visual Perception	OVO-Bench Reactive QA 1.0 (test)	OCR: 65.1	10
Streaming Narration	Youcook2 (test)	F1: 35.55	10
Reactive Question Answering	StreamingBench excluding PO 1.0 (test)	Overall Performance (OP): 76.96	9
Recurring alert	OVO-Bench	Recall: 33.81	9

Showing 10 of 16 rows
