ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding
About
Recent Omni-multimodal Large Language Models show promise in unified audio, vision, and text modeling. However, streaming audio-video understanding remains challenging, as existing approaches suffer from disjointed capabilities: they typically exhibit incomplete modality support or lack autonomous proactive monitoring. To address this, we present ROMA, a real-time omni-multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision-making, we introduce a lightweight speak head that decouples response initiation from generation to ensure precise triggering without task conflict. We train ROMA with a curated streaming dataset and a two-stage curriculum that progressively optimizes for streaming format adaptation and proactive responsiveness. To standardize the fragmented evaluation landscape, we reorganize diverse benchmarks into a unified suite covering both proactive (alert, narration) and reactive (QA) settings. Extensive experiments across 12 benchmarks demonstrate that ROMA achieves state-of-the-art performance on proactive tasks while remaining competitive in reactive settings, validating its robustness in unified real-time omni-multimodal understanding.
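The abstract's two mechanisms, synchronized multimodal units and a speak head that decides when to respond separately from what to say, can be illustrated with a small sketch. This is not ROMA's implementation: `MultimodalUnit`, `build_units`, `stream`, the one-frame-per-unit pairing, the stateless per-unit scoring, and the 0.5 threshold are all assumptions made for illustration, and the `speak_score` and `generate` callables stand in for model components the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Sequence, Tuple


@dataclass
class MultimodalUnit:
    """Hypothetical time-aligned chunk: one video frame plus the audio spanning it."""
    start_s: float
    frame: object        # placeholder for a decoded video frame
    audio: List[float]   # audio samples covering this frame's time window


def build_units(frames: Sequence[object], audio: Sequence[float],
                fps: float, sample_rate: int) -> List[MultimodalUnit]:
    """Pair each sparse video frame with the dense audio samples in its window."""
    units = []
    for i, frame in enumerate(frames):
        lo = int(i * sample_rate / fps)        # first sample of frame i's window
        hi = int((i + 1) * sample_rate / fps)  # first sample of the next window
        units.append(MultimodalUnit(i / fps, frame, list(audio[lo:hi])))
    return units


def stream(units: Sequence[MultimodalUnit],
           speak_score: Callable[[MultimodalUnit], float],
           generate: Callable[[MultimodalUnit], str],
           threshold: float = 0.5) -> Iterator[Tuple[float, str]]:
    """Run the (assumed) speak head on every unit; generate only when it fires."""
    for unit in units:
        if speak_score(unit) >= threshold:     # trigger decision, kept separate...
            yield unit.start_s, generate(unit)  # ...from response generation


# Toy usage: 3 frames at 1 fps, audio at 4 samples per second.
units = build_units(["f0", "f1", "f2"], list(range(12)), fps=1.0, sample_rate=4)
responses = list(stream(units,
                        speak_score=lambda u: 1.0 if u.frame == "f1" else 0.0,
                        generate=lambda u: "alert"))
```

Keeping the trigger as a cheap scalar check means generation is skipped on silent units, which is one plausible reading of "decouples response initiation from generation"; a real system would also carry context across units rather than scoring each one in isolation.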
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Temporal Grounding | Charades-STA (test) | -- | -- | 68 |
| Video Question Answering | Video-MME without subtitles | Accuracy (Overall) | 34.56 | 28 |
| Backward Tracing | OVO-Bench Reactive QA 1.0 (test) | EPM | 56.57 | 10 |
| Real-time Visual Perception | OVO-Bench Reactive QA 1.0 (test) | OCR | 65.1 | 10 |
| Streaming Narration | Youcook2 (test) | F1 | 35.55 | 10 |
| Reactive Question Answering | StreamingBench excluding PO 1.0 (test) | Overall Performance (OP) | 76.96 | 9 |
| Recurring alert | OVO-Bench | Recall | 33.81 | 9 |
| Single-alert | OVO-Bench | PA | 37.5 | 9 |
| Question Answering | EgoSchema | Accuracy | 55.4 | 9 |
| Static temporal grounding | QVHighlights (test) | mAP | 53.7 | 8 |