
Process-of-Thought Reasoning for Videos

About

Video understanding requires not only recognizing visual content but also performing temporally grounded, multi-step reasoning over long and noisy observations. We propose Process-of-Thought (PoT) Reasoning for Videos, a framework that makes the reasoning process explicit by structuring video inference into a sequence of lightweight, verifiable steps. PoT interleaves (i) temporal evidence selection, (ii) step-wise state updates, and (iii) constrained answer synthesis, enabling the model to progressively refine hypotheses while maintaining traceability to video evidence. The framework is designed to be model-agnostic and can be plugged into existing vision-language backbones, supporting both closed-book reasoning and evidence-augmented reasoning with external tools. We further introduce a unified representation for PoT traces that aligns intermediate decisions with temporal segments, which improves robustness to distractors and reduces hallucinated explanations. Extensive experiments on standard video reasoning tasks demonstrate that PoT consistently improves factual correctness and temporal grounding, while providing interpretable reasoning traces for diagnosis and downstream use.
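The three interleaved stages named in the abstract can be pictured with a toy loop. The sketch below is not the authors' implementation: it assumes each temporal segment comes with a text caption standing in for visual features, and it uses plain word overlap where a real system would query a vision-language backbone. All names (`Segment`, `TraceStep`, `Trace`, `run_pot`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    caption: str   # assumption: a text caption stands in for visual features


@dataclass
class TraceStep:
    segment: Segment   # the temporal evidence this step is grounded in
    added: set         # tokens this step contributed to the running state


@dataclass
class Trace:
    steps: list = field(default_factory=list)
    state: set = field(default_factory=set)


def _tokens(text):
    return set(text.lower().split())


def run_pot(segments, question, options, max_steps=2):
    """Toy loop: select evidence, update state, then pick among fixed options."""
    trace = Trace()
    query = _tokens(question)
    remaining = list(segments)
    for _ in range(max_steps):
        if not remaining:
            break
        context = query | trace.state
        # (i) Temporal evidence selection: the unseen segment that overlaps
        # most with the question plus the state accumulated so far.
        best = max(remaining, key=lambda s: len(_tokens(s.caption) & context))
        if not _tokens(best.caption) & context:
            break  # nothing relevant left
        remaining.remove(best)
        # (ii) Step-wise state update, recorded against the segment used.
        added = _tokens(best.caption) - trace.state
        trace.state |= added
        trace.steps.append(TraceStep(best, added))
    # (iii) Constrained answer synthesis: the answer must be one of `options`.
    answer = max(options, key=lambda o: len(_tokens(o) & trace.state))
    return answer, trace


segments = [
    Segment(0, 5, "a man opens the fridge"),
    Segment(5, 10, "a dog sleeps on the sofa"),
    Segment(10, 15, "the man pours milk into a glass"),
]
answer, trace = run_pot(segments, "what does the man pour", ["milk", "juice", "water"])
for step in trace.steps:
    print(f"[{step.segment.start}s-{step.segment.end}s] {step.segment.caption}")
print("answer:", answer)
```

Because every `TraceStep` keeps a reference to the segment it used, each intermediate decision stays tied to a time span. That is the kind of traceability the abstract describes, shown here in a deliberately simplified form.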

Jusheng Zhang, Kaitong Cai, Jian Wang, Yongsen Zheng, Kwok-Yan Lam, Keze Wang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Narrative Reasoning | VIST (test) | BLEURT 0.456 | 14 |
| Narrative Reasoning | Ego4D (test) | BLEURT 0.48 | 14 |
| Narrative Reasoning | MMIU (test) | BLEURT Score 0.306 | 14 |
| Narrative Reasoning | Pororo (test) | BLEURT Score 45 | 14 |
| Narrative Reasoning | WebQA (test) | BLEURT 0.623 | 14 |
| Narrative Reasoning | MSR-VTT (test) | Accuracy Score 3.67 | 14 |
