O-MARC: Omni Memory-Augmented Compression Distillation for Efficient Video Understanding

About

Omnimodal large language models enable unified audio-video understanding, but long joint token sequences make inference costly, and existing benchmarks do not fully isolate audio-visual association in noisy user-generated videos. We introduce UGC-AVQA, a public UGC benchmark with 1,000 videos and 4,816 QA pairs, where an audio-removal test ensures that benchmark questions require both acoustic and visual evidence. To reduce inference cost, we propose OMAC, a training-free, plug-in compression method that preserves salient visual memory and temporally grounded audio anchors. To further make compact models robust to compressed inputs, we introduce O-MARC, a compression distillation framework for learning with memory-compressed multimodal contexts. On Qwen2.5-Omni-3B, O-MARC improves the average score across four benchmarks to 45.8, outperforming full-token inference at 44.1 and OmniZip at 41.0. OMAC also keeps inference efficient, reducing latency by 34.6% (1.53× speedup) and memory by 34.7% compared with full-token inference.
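The reported speedup follows from the latency reduction by simple arithmetic: if latency drops by a fraction r, the remaining latency is (1 - r) of the original, so the speedup is 1 / (1 - r). The short sketch below checks this for the 34.6% figure quoted above. The function name is hypothetical and is used only for illustration; it is not part of any released O-MARC or OMAC code.

```python
def speedup_from_latency_reduction(reduction_pct: float) -> float:
    """Return the speedup factor implied by a percentage latency reduction.

    Hypothetical helper for illustration only (not from the paper's code).
    A reduction of r percent leaves (1 - r/100) of the original latency,
    so the speedup is the reciprocal of that remaining fraction.
    """
    if not 0.0 <= reduction_pct < 100.0:
        raise ValueError("reduction_pct must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


# Latency reduction quoted in the abstract: 34.6%
speedup = speedup_from_latency_reduction(34.6)
print(f"{speedup:.2f}x")  # prints "1.53x"
```

The same relation applied to the 34.7% memory reduction means the compressed setting uses roughly 65.3% of the full-token memory footprint.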

Peiran Wu, Yunze Liu, Chi-Hao Wu, Chen Chen, Junxiao Shen • 2026

Related benchmarks

Task	Dataset	Result
Audiovisual Association	UGC-AVQA	AVEP 65.7	13
Long Video Reasoning	WorldSense	Overall Accuracy 48	13
Omnimodal Reasoning	DailyOmni	Overall Accuracy 63.2	13
Video Reasoning	OmniVideo	Average Score 35.5	13
Audio-Visual Question Answering	UGC-AVQA	--	9
