
Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding

About

Traditional black-box distillation for Large Vision-Language Models (LVLMs) typically relies on a single teacher response per input, which often yields high-variance supervision and format inconsistencies in multimodal or temporal scenarios. To mitigate this unreliable supervision, we propose R-MSD (Reliable Multi-Sample Distillation), a framework that explicitly models teacher sampling variance to enhance distillation stability. Instead of a single teacher response, our approach leverages a task-adaptive teacher pool to provide robust supervision tailored to both closed-ended and open-ended reasoning. By integrating quality-aware signal matching with an adversarial distillation objective, our approach effectively filters teacher noise while maximizing knowledge transfer. Extensive evaluations across comprehensive video understanding benchmarks demonstrate that R-MSD consistently outperforms single-sample distillation methods. We additionally train an SFT+RL 4B baseline under the same training budget; it shows only marginal gains, whereas our method achieves significant improvements. With a 4B student model, our approach delivers gains on VideoMME (+1.5%), Video-MMMU (+3.2%), and MathVerse (+3.6%).
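The abstract does not specify R-MSD's exact objective, but the core multi-sample idea can be illustrated with a minimal sketch: draw several teacher responses per input, score each one, and build the distillation target from a quality-weighted aggregate rather than a single sample. The function name `multi_sample_target`, the `top_k` filtering, and the quality-weighted averaging below are all illustrative assumptions, not the paper's method.

```python
import numpy as np

def multi_sample_target(teacher_logits_samples, quality_scores, top_k=3):
    """Aggregate K independent teacher samples into one distillation target.

    teacher_logits_samples: (K, V) logits from K separate teacher decodes.
    quality_scores: (K,) reliability score per sample (higher = better).
    Keeps only the top_k highest-quality samples (noise filtering) and
    averages their softmax distributions, weighted by quality.
    NOTE: a hypothetical sketch of multi-sample distillation, not R-MSD.
    """
    idx = np.argsort(quality_scores)[-top_k:]           # quality-aware filtering
    logits = teacher_logits_samples[idx]
    w = quality_scores[idx] / quality_scores[idx].sum() # normalized weights
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)           # per-sample softmax
    return (w[:, None] * probs).sum(axis=0)             # weighted-average target

# Toy usage: 5 teacher samples over a 4-way answer space.
rng = np.random.default_rng(0)
samples = rng.normal(size=(5, 4))
quality = rng.uniform(size=5)
target = multi_sample_target(samples, quality)          # a valid distribution
```

The student would then be trained against `target` (e.g. with a KL-divergence loss) instead of against one sampled teacher response, which is what reduces the variance the abstract describes.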

Songlin Li, Xin Zhu, Zechao Guan, Peipeng Chen, Jian Yao • 2026

Related benchmarks

Task                                     Dataset         Metric                 Result  Rank
Video Question Answering                 VideoMME        Accuracy               65.3    210
Video Question Answering                 LongVideoBench  Accuracy               58.8    180
Mathematical Visual Question Answering   MathVista       Accuracy               72.1    47
Spatio-Temporal Reasoning                V-Star          Chain1 (When) m tIoU   25.2    44
Mathematical Visual Question Answering   MathVerse       Accuracy               55.3    37
Video Question Answering                 MLVU MCQ        Accuracy               73.2    17
Video Question Answering                 MMMU Video      Delta Knowledge        58.6    9
Video Question Answering                 WorldSense      Accuracy               49.2    5
