Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding
About
Traditional black-box distillation for Large Vision-Language Models (LVLMs) typically relies on a single teacher response per input, which often yields high-variance supervision and format inconsistencies in multimodal or temporal scenarios. To mitigate this unreliable supervision, we propose R-MSD (Reliable Multi-Sample Distillation), a framework that explicitly models teacher sampling variance to enhance distillation stability. Rather than relying on a single teacher response, our approach leverages a task-adaptive teacher pool to provide robust supervision tailored to both closed-ended and open-ended reasoning. By integrating quality-aware signal matching with an adversarial distillation objective, our approach effectively filters teacher noise while maximizing knowledge transfer. Extensive evaluations across comprehensive video understanding benchmarks demonstrate that R-MSD consistently outperforms single-sample distillation methods. We additionally include an SFT+RL baseline with the same 4B student under the same training budget, which shows only marginal gains, whereas our method achieves significant improvements. With a 4B student model, our approach delivers gains on VideoMME (+1.5%), Video-MMMU (+3.2%), and MathVerse (+3.6%).
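To make the multi-sample idea concrete, here is a minimal sketch of how multiple teacher samples could be filtered and aggregated into a single supervision target. All names (`quality_score`, `select_supervision`) and the self-consistency quality proxy are illustrative assumptions; the paper's actual quality-aware signal matching and adversarial distillation objective are not reproduced here.

```python
import statistics

def quality_score(response, responses):
    # Hypothetical quality proxy: agreement of this sample with the
    # other sampled teacher responses (self-consistency). R-MSD's
    # learned quality-aware matching would replace this heuristic.
    return sum(r == response for r in responses) / len(responses)

def select_supervision(responses, threshold=0.3):
    """Filter low-quality teacher samples, then return a consensus target."""
    scored = [(quality_score(r, responses), r) for r in responses]
    kept = [r for score, r in scored if score >= threshold]
    if not kept:
        # Fall back to the single highest-scoring sample.
        kept = [max(scored)[1]]
    # Majority vote over the retained samples as the distillation target.
    return statistics.mode(kept)

# Five hypothetical teacher samples for one video question:
samples = ["B", "B", "A", "B", "C"]
target = select_supervision(samples)
```

With these toy samples, the noisy minority answers fall below the agreement threshold and the consensus answer is used as the student's supervision signal, in contrast to single-sample distillation, which would commit to whichever response happened to be drawn.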
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | VideoMME | Accuracy | 65.3 | 210 |
| Video Question Answering | LongVideoBench | Accuracy | 58.8 | 180 |
| Mathematical Visual Question Answering | MathVista | Accuracy | 72.1 | 47 |
| Spatio-Temporal Reasoning | V-Star | Chain1 (When) mtIoU | 25.2 | 44 |
| Mathematical Visual Question Answering | MathVerse | Accuracy | 55.3 | 37 |
| Video Question Answering | MLVU MCQ | Accuracy | 73.2 | 17 |
| Video Question Answering | MMMU Video | Delta Knowledge | 58.6 | 9 |
| Video Question Answering | WorldSense | Accuracy | 49.2 | 5 |