Cognitive Chain-of-Thought (CoCoT): Structured Multimodal Reasoning about Social Situations

About

Chain-of-Thought (CoT) prompting helps models think step by step. But naive CoT breaks down in visually grounded social tasks, where models must perceive, understand, and judge all at once, bridging perception with norm-grounded reasoning. Recent work has introduced structured reasoning for multi-turn agent planning and visual QA, decomposing tasks into sequential sub-goals. To extend this to single-shot multimodal social reasoning, we introduce Cognitive Chain-of-Thought (CoCoT), a reasoning framework that structures vision-language model (VLM) reasoning through three cognitively inspired stages: Perception (extract grounded facts), Situation (infer situations), and Norm (apply social norms). Evaluation across multiple distinct tasks, including multimodal intent disambiguation, multimodal theory of mind, social commonsense reasoning, and safety instruction following, shows consistent improvements (5.9% to 4.6% on average). We further explore the utility of CoCoT for improving models' reasoning through training and show that supervised fine-tuning on CoCoT-structured traces yields 5-6% improvements without explicit CoCoT prompting at inference, demonstrating that models internalize the structured reasoning pattern rather than merely following instructions. We show that structuring model reasoning through cognitively grounded stages enhances interpretability and social alignment, laying the groundwork for more reliable multimodal systems. All code and data will be released publicly.
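As a rough illustration of the three-stage structure described above, the sketch below assembles a staged text prompt in Python. The stage names (Perception, Situation, Norm) come from the abstract; the instruction wording, the `Stage` class, and the `build_cocot_prompt` helper are hypothetical and are not the authors' released template. The call to an actual VLM is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    instruction: str


# Stage names follow the abstract; the instruction text is an
# illustrative assumption, not the wording used in the paper.
COCOT_STAGES = (
    Stage("Perception", "List the facts that are directly observable in the image."),
    Stage("Situation", "Infer what is going on, using only the facts listed above."),
    Stage("Norm", "Apply the relevant social norms to reach a judgment."),
)


def build_cocot_prompt(question: str, stages=COCOT_STAGES) -> str:
    """Return a text prompt that asks for reasoning in the given stage order."""
    lines = [
        f"Question: {question}",
        "",
        "Reason through the following stages in order:",
    ]
    for number, stage in enumerate(stages, start=1):
        lines.append(f"{number}. {stage.name}: {stage.instruction}")
    lines.append("")
    lines.append("Final answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_cocot_prompt("What does the speaker want the listener to do?"))
```

Keeping the stages as data rather than hard-coded text makes it easy to compare against a plain CoT prompt by swapping in a different stage list, which is one plausible way to set up the kind of comparison the abstract reports.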

Eunkyu Park, Wesley Hanwen Deng, Gunhee Kim, Motahhare Eslami, Maarten Sap • 2025

Related benchmarks

Task	Dataset	Result
Multi-modal Reasoning	M3CoT	Accuracy 82.68	90
Multi-modal Reasoning	MoMentS	Accuracy 72.26	48
Social Video Reasoning	VAGUE (test)	Accuracy 77.4	48
Multimodal Commonsense Reasoning	M3CoT social-science and social-commonsense sub-topics	Accuracy Change 11.99	12
Multimodal Intent Disambiguation	VAGUE	Direct Accuracy 73.05	9
Visual Social Reasoning	MoMentS	Direct 70.68	9
Safety Robustness	VLGuard Safe_Unsafe	Attack Success Rate 14.9	4
Safety Robustness	VLGuard Unsafe	Attack Success Rate 13.4	4
