Process-of-Thought Reasoning for Videos
About
Video understanding requires not only recognizing visual content but also performing temporally grounded, multi-step reasoning over long and noisy observations. We propose Process-of-Thought (PoT) Reasoning for Videos, a framework that makes the reasoning process explicit by structuring video inference into a sequence of lightweight, verifiable steps. PoT interleaves (i) temporal evidence selection, (ii) step-wise state updates, and (iii) constrained answer synthesis, enabling the model to progressively refine hypotheses while maintaining traceability to video evidence. The framework is designed to be model-agnostic and can be plugged into existing vision-language backbones, supporting both closed-book reasoning and evidence-augmented reasoning with external tools. We further introduce a unified representation for PoT traces that aligns intermediate decisions with temporal segments, which improves robustness to distractors and reduces hallucinated explanations. Extensive experiments on standard video reasoning tasks demonstrate that PoT consistently improves factual correctness and temporal grounding, while providing interpretable reasoning traces for diagnosis and downstream use.
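The three interleaved stages described above can be illustrated with a toy loop. This is only a minimal sketch under strong assumptions, not the authors' implementation: the names `Segment`, `TraceStep`, `Trace`, and `run_pot` are hypothetical, each video segment is represented by a text caption standing in for visual features, and evidence is scored by simple word overlap rather than by a vision-language backbone. The sketch shows how each intermediate decision can be stored alongside the temporal segment that supports it.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Segment:
    """Hypothetical temporal segment; the caption stands in for visual features."""
    start: float  # seconds
    end: float    # seconds
    caption: str


@dataclass
class TraceStep:
    """One trace entry: a decision plus the segment that supports it (if any)."""
    kind: str  # "select", "update", or "answer"
    decision: str
    segment: Optional[Segment] = None


@dataclass
class Trace:
    steps: List[TraceStep] = field(default_factory=list)


def _overlap(a: str, b: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def run_pot(question: str, segments: List[Segment], options: List[str],
            max_steps: int = 2) -> Tuple[str, Trace]:
    trace = Trace()
    state: List[str] = []  # evidence accumulated so far
    remaining = list(segments)
    for _ in range(max_steps):
        if not remaining:
            break
        # (i) Temporal evidence selection: pick the segment most relevant
        # to the question and the current state.
        query = " ".join([question] + state)
        best = max(remaining, key=lambda s: _overlap(query, s.caption))
        if _overlap(query, best.caption) == 0:
            break  # nothing relevant is left
        remaining.remove(best)
        trace.steps.append(
            TraceStep("select", f"use [{best.start}, {best.end}]", best))
        # (ii) Step-wise state update: fold the new evidence into the state.
        state.append(best.caption)
        trace.steps.append(TraceStep("update", best.caption, best))
    # (iii) Constrained answer synthesis: the answer must be one of `options`.
    evidence = " ".join(state)
    answer = max(options, key=lambda o: _overlap(evidence, o))
    trace.steps.append(TraceStep("answer", answer))
    return answer, trace


if __name__ == "__main__":
    segs = [
        Segment(0, 5, "a man opens the fridge"),
        Segment(5, 10, "a dog runs in the park"),
        Segment(10, 15, "the man pours milk into a glass"),
    ]
    answer, trace = run_pot("what does the man pour", segs,
                            ["milk", "juice", "water"])
    print(answer)
    for step in trace.steps:
        print(step.kind, step.decision)
```

Because every `select` and `update` step carries its `Segment`, a reader of the trace can check which time span justified each intermediate decision. In a real system the overlap score and the caption-based state would be replaced by model calls.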
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Narrative Reasoning | VIST (test) | BLEURT | 0.456 | 14 |
| Narrative Reasoning | Ego4D (test) | BLEURT | 0.48 | 14 |
| Narrative Reasoning | MMIU (test) | BLEURT Score | 0.306 | 14 |
| Narrative Reasoning | Pororo (test) | BLEURT Score | 45 | 14 |
| Narrative Reasoning | WebQA (test) | BLEURT | 0.623 | 14 |
| Narrative Reasoning | MSR-VTT (test) | Accuracy Score | 3.67 | 14 |