Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding
About
Recent advances in Streaming Video Understanding have enabled a new interaction paradigm in which models respond proactively to user queries. Current proactive VideoLLMs decide whether to trigger a response on a per-frame basis, an approach that suffers from an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating Em-Garde as an effective solution for proactive video understanding under strict computational constraints.
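The propose-match split described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that visual proposals and incoming frames have already been embedded into a shared vector space by some encoder, and it stands in cosine similarity with a fixed threshold for the matching module. The names `cosine_similarity`, `match_frame`, `stream_triggers`, and the default `threshold=0.8` are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_frame(
    frame_emb: Vector,
    proposal_embs: Sequence[Vector],
    threshold: float = 0.8,  # hypothetical value, not taken from the paper
) -> Optional[int]:
    """Return the index of the best-matching proposal, or None if no
    proposal reaches the similarity threshold."""
    best_idx: Optional[int] = None
    best_sim = float("-inf")
    for i, proposal in enumerate(proposal_embs):
        sim = cosine_similarity(frame_emb, proposal)
        if sim > best_sim:
            best_idx, best_sim = i, sim
    return best_idx if best_sim >= threshold else None


def stream_triggers(
    frame_embs: Sequence[Vector],
    proposal_embs: Sequence[Vector],
    threshold: float = 0.8,
) -> List[Tuple[int, int]]:
    """Scan frames in order and collect (frame_index, proposal_index)
    pairs at which a response would be triggered."""
    triggers: List[Tuple[int, int]] = []
    for t, frame_emb in enumerate(frame_embs):
        idx = match_frame(frame_emb, proposal_embs, threshold)
        if idx is not None:
            triggers.append((t, idx))
    return triggers


# Toy example: two proposals computed once at query time, three streamed frames.
proposals = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
frames = [[0.0, 0.0, 1.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
print(stream_triggers(frames, proposals))
```

In this sketch the proposals are computed once, so the per-frame work is a handful of dot products rather than a full model call on every frame, which is the kind of saving the decoupling is meant to provide.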
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Streaming Video Understanding | StreamingBench | -- | -- | 158 |
| Online Video Understanding | OVO-Bench | Backward Tracing Avg. | 52.2 | 48 |
| Real-time Visual Perception | OVO-Bench | OCR | 76.5 | 27 |
| Backward Tracing | OVO-Bench | EPM | 48.2 | 27 |
| Proactive Video Question Answering | ProactiveVideoQA EGO | PAUC (ω=0.5) | 52.3 | 8 |
| Proactive Response | StreamingBench | Accuracy | 38 | 7 |
| Online Video Understanding | StreamingBench | Real-time VU Score | 76.7 | 6 |
| Proactive Response Timing | OVO-Bench Future Active Responding | CRR Recall | 47.92 | 5 |
| Proactive Video Question Answering | ProactiveVideoQA WEB (test) | PAUC | 44.3 | 4 |
| Proactive Video Question Answering | ProactiveVideoQA VAD (test) | PAUC | 27.4 | 4 |