O-MARC: Omni Memory-Augmented Compression Distillation for Efficient Video Understanding
About
Omnimodal large language models enable unified audio-video understanding, but long joint token sequences make inference costly, and existing benchmarks do not fully isolate audio-visual association in noisy user-generated videos. We introduce UGC-AVQA, a public user-generated content (UGC) benchmark with 1,000 videos and 4,816 QA pairs, where an audio-removal test ensures that benchmark questions require both acoustic and visual evidence. To reduce inference cost, we propose OMAC, a training-free, plug-in compression method that preserves salient visual memory and temporally grounded audio anchors. To make compact models more robust to compressed inputs, we introduce O-MARC, a compression distillation framework for learning with memory-compressed multimodal contexts. On Qwen2.5-Omni-3B, O-MARC raises the average score across four benchmarks to 45.8, outperforming full-token inference (44.1) and OmniZip (41.0). OMAC also keeps inference efficient, reducing latency by 34.6% (a 1.53× speedup) and memory by 34.7% compared with full-token inference.
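The reported latency reduction and speedup describe the same measurement, and the conversion between them can be checked with a short calculation. The sketch below is illustrative only: the helper name `speedup_from_latency_reduction` is hypothetical and not part of any released O-MARC or OMAC code, and it assumes the speedup is defined as baseline latency divided by compressed latency.

```python
def speedup_from_latency_reduction(reduction_pct: float) -> float:
    """Convert a percentage latency reduction into a speedup factor.

    Assumes speedup = baseline_latency / new_latency, where
    new_latency = baseline_latency * (1 - reduction_pct / 100).
    """
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


# Latency reduction quoted in the abstract: 34.6%.
speedup = speedup_from_latency_reduction(34.6)
print(f"{speedup:.2f}x")  # prints "1.53x"
```

Under that assumption, a 34.6% reduction corresponds to 1 / 0.654 ≈ 1.53, which agrees with the speedup stated above.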
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audiovisual Association | UGC-AVQA | AVEP | 65.7 | 13 |
| Long Video Reasoning | WorldSense | Overall Accuracy | 48 | 13 |
| Omnimodal Reasoning | DailyOmni | Overall Accuracy | 63.2 | 13 |
| Video Reasoning | OmniVideo | Average Score | 35.5 | 13 |
| Audio-Visual Question Answering | UGC-AVQA | -- | -- | 9 |