Structure Over Scale: Learning Visual Reasoning from Pedagogical Video

About

State-of-the-art vision-language models (VLMs) score impressively on video benchmarks yet stumble on basic visual reasoning tasks involving spatial relations, navigation, and object selection that a preschooler solves easily. We hypothesize that the explicit pedagogical structure, specifically the context-question-pause-answer cycles embedded in children's educational video, provides naturally co-aligned reasoning traces: temporally synchronized visual cues, questions, and answers that emerge only from deliberate pedagogical authoring and cannot be practically reconstructed through manual annotation at scale. To test this, we introduce SoSVQA (Structure over Scale Visual Question Answering), a unified benchmark of 10K question-answer pairs automatically extracted from Dora the Explorer (DoraVQA) and Mickey Mouse Clubhouse (ClubHVQA) with precise timestamp alignment, and fine-tune Qwen2-VL and Qwen3-VL using Group Relative Policy Optimization (GRPO) to leverage the clear correctness signals and structured reasoning traces inherent in educational content. Despite training on just 10K QA pairs from 78 hours of children's television, orders of magnitude less data than GPT and Gemini, our approach delivers generalizable performance gains for Qwen-based VLMs, yielding consistent improvements on NExT-QA (+19.7), Video-MME (+10.6), and MotionBench (+4.9), matching the performance of leading proprietary systems and demonstrating that content structure can compensate for content scale.

Bishoy Galoaa, Xiangyu Bai, Sarah Ostadabbas• 2026

Related benchmarks

Task	Dataset	Result
Video Question Answering	Video-MME	Accuracy76.78	14
Video Question Answering	DoraVQA (test)	Overall Accuracy67.98	13
Vision-Language Reasoning	CVBench	Accuracy86.16	12
Video Question Answering	DoraVQA	Accuracy67.98	12

Showing 4 of 4 rows

Other info

Follow for update

@wizwand_team Discord