Cognitive Chain-of-Thought (CoCoT): Structured Multimodal Reasoning about Social Situations
About
Chain-of-Thought (CoT) prompting helps models think step by step. But naive CoT breaks down in visually grounded social tasks, where models must perceive, understand, and judge all at once, bridging perception with norm-grounded reasoning. Recent work has introduced structured reasoning for multi-turn agent planning and visual QA, decomposing tasks into sequential sub-goals. To extend this to single-shot multimodal social reasoning, we introduce Cognitive Chain-of-Thought (CoCoT), a framework that structures vision-language model (VLM) reasoning through three cognitively inspired stages: Perception (extract grounded facts), Situation (infer the situation), and Norm (apply social norms). Evaluation across multiple distinct tasks, including multimodal intent disambiguation, multimodal theory of mind, social commonsense reasoning, and safety instruction following, shows consistent improvements (5.9% to 4.6% on average). We further explore the utility of CoCoT for improving models' reasoning through training and show that supervised fine-tuning on CoCoT-structured traces yields 5-6% improvements without explicit CoCoT prompting at inference, demonstrating that models internalize the structured reasoning pattern rather than merely following instructions. Structuring model reasoning through cognitively grounded stages thus enhances interpretability and social alignment, laying the groundwork for more reliable multimodal systems. All code and data will be released publicly.
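The three stages named in the abstract can be illustrated as a single-shot prompt template. The following is a minimal sketch, not the authors' released prompt: the function name `build_cocot_prompt`, the instruction wording for each stage, and the multiple-choice layout are all assumptions made for illustration. Only the stage names and their order (Perception, Situation, Norm) come from the abstract.

```python
# Hypothetical sketch of a CoCoT-style prompt builder.
# Stage names and order follow the abstract; the instruction text is assumed.
STAGES = [
    ("Perception", "List only the facts directly observable in the image and text."),
    ("Situation", "Infer what is going on in the scene from those facts."),
    ("Norm", "Apply relevant social norms to choose the most appropriate answer."),
]


def build_cocot_prompt(question: str, options: list[str]) -> str:
    """Return a text prompt that asks a VLM to reason through the three stages."""
    lines = [f"Question: {question}"]
    if options:
        lines.append("Options:")
        # Label options (A), (B), (C), ... in the order given.
        lines += [f"  ({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options)]
    lines.append("Reason in three stages before answering:")
    for i, (name, instruction) in enumerate(STAGES, start=1):
        lines.append(f"{i}. {name}: {instruction}")
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_cocot_prompt(
        "What does the speaker want the listener to do?",
        ["Open the window", "Close the door"],
    ))
```

The returned string would be sent to a VLM together with the image; how the image is attached depends on the model's API and is left out here.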
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Multi-modal Reasoning | M3CoT | Accuracy | 82.68 | 90 |
| Multi-modal Reasoning | MoMentS | Accuracy | 72.26 | 48 |
| Social Video Reasoning | VAGUE (test) | Accuracy | 77.4 | 48 |
| Multimodal Commonsense Reasoning | M3CoT social-science and social-commonsense sub-topics | Accuracy Change | 11.99 | 12 |
| Multimodal Intent Disambiguation | VAGUE | Direct Accuracy | 73.05 | 9 |
| Visual Social Reasoning | MoMentS | Direct | 70.68 | 9 |
| Safety Robustness | VLGuard Safe_Unsafe | Attack Success Rate | 14.9 | 4 |
| Safety Robustness | VLGuard Unsafe | Attack Success Rate | 13.4 | 4 |