Process-of-Thought Reasoning for Videos

About

Video understanding requires not only recognizing visual content but also performing temporally grounded, multi-step reasoning over long and noisy observations. We propose Process-of-Thought (PoT) Reasoning for Videos, a framework that makes the reasoning process explicit by structuring video inference into a sequence of lightweight, verifiable steps. PoT interleaves (i) temporal evidence selection, (ii) step-wise state updates, and (iii) constrained answer synthesis, enabling the model to progressively refine hypotheses while maintaining traceability to video evidence. The framework is designed to be model-agnostic and can be plugged into existing vision-language backbones, supporting both closed-book reasoning and evidence-augmented reasoning with external tools. We further introduce a unified representation for PoT traces that aligns intermediate decisions with temporal segments, which improves robustness to distractors and reduces hallucinated explanations. Extensive experiments on standard video reasoning tasks demonstrate that PoT consistently improves factual correctness and temporal grounding, while providing interpretable reasoning traces for diagnosis and downstream use.
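The three interleaved stages named in the abstract can be illustrated with a toy loop. This is a minimal sketch and not the authors' implementation: the names `Segment`, `TraceStep`, `Trace`, `select_evidence`, `update_state`, `synthesize_answer`, and `run_pot` are all hypothetical. Text captions stand in for visual features, word overlap stands in for a learned relevance model, and "constrained" is read here as "the answer must come from a fixed candidate set and be supported by selected evidence", which is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Segment:
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    caption: str   # toy stand-in for the visual content of the segment


@dataclass
class TraceStep:
    segment: Segment   # temporal segment this step is grounded in
    note: str          # human-readable record of the intermediate decision


@dataclass
class Trace:
    steps: List[TraceStep] = field(default_factory=list)
    answer: Optional[str] = None


def _words(text: str) -> Set[str]:
    return set(text.lower().split())


def select_evidence(question: str, segments: List[Segment], used: Set[int]) -> Optional[int]:
    """(i) Temporal evidence selection: pick the unused segment whose
    caption shares the most words with the question (hypothetical scorer)."""
    best, best_score = None, 0
    for i, seg in enumerate(segments):
        if i in used:
            continue
        score = len(_words(question) & _words(seg.caption))
        if score > best_score:
            best, best_score = i, score
    return best


def update_state(state: Set[str], seg: Segment) -> Set[str]:
    """(ii) Step-wise state update: accumulate observed words."""
    return state | _words(seg.caption)


def synthesize_answer(state: Set[str], candidates: List[str]) -> Optional[str]:
    """(iii) Constrained answer synthesis: return only a candidate that the
    accumulated evidence supports, otherwise abstain with None."""
    for cand in candidates:
        if cand.lower() in state:
            return cand
    return None


def run_pot(question: str, segments: List[Segment], candidates: List[str],
            max_steps: int = 3) -> Trace:
    trace, state, used = Trace(), set(), set()
    for _ in range(max_steps):
        idx = select_evidence(question, segments, used)
        if idx is None:   # no remaining segment overlaps the question
            break
        used.add(idx)
        seg = segments[idx]
        state = update_state(state, seg)
        trace.steps.append(TraceStep(seg, f"used evidence in [{seg.start}s, {seg.end}s]"))
    trace.answer = synthesize_answer(state, candidates)
    return trace


if __name__ == "__main__":
    video = [
        Segment(0.0, 5.0, "a dog runs in the park"),
        Segment(5.0, 10.0, "the man picks up a red ball"),
        Segment(10.0, 15.0, "clouds drift over the city"),
    ]
    result = run_pot("what does the man pick up", video, ["ball", "stick"], max_steps=2)
    for step in result.steps:
        print(step.note)
    print("answer:", result.answer)
```

Because every `TraceStep` carries the segment it was derived from, the final answer can be traced back to specific time ranges, which is the kind of alignment between intermediate decisions and temporal segments the abstract describes.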

Jusheng Zhang, Kaitong Cai, Jian Wang, Yongsen Zheng, Kwok-Yan Lam, Keze Wang • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Narrative Reasoning	VIST (test)	BLEURT	0.456	14
Narrative Reasoning	Ego4D (test)	BLEURT	0.48	14
Narrative Reasoning	MMIU (test)	BLEURT Score	0.306	14
Narrative Reasoning	Pororo (test)	BLEURT Score	45	14
Narrative Reasoning	WebQA (test)	BLEURT	0.623	14
Narrative Reasoning	MSR-VTT (test)	Accuracy Score	3.67	14

