
Cognitive Chain-of-Thought (CoCoT): Structured Multimodal Reasoning about Social Situations

About

Chain-of-Thought (CoT) prompting helps models think step by step. But naive CoT breaks down in visually grounded social tasks, where models must perceive, understand, and judge all at once, bridging perception with norm-grounded reasoning. Recent work has introduced structured reasoning for multi-turn agent planning and visual QA, decomposing tasks into sequential sub-goals. To extend this to single-shot multimodal social reasoning, we introduce Cognitive Chain-of-Thought (CoCoT), a framework that structures vision-language model (VLM) reasoning through three cognitively inspired stages: Perception (extract grounded facts), Situation (infer situations), and Norm (apply social norms). Evaluation across multiple distinct tasks, including multimodal intent disambiguation, multimodal theory of mind, social commonsense reasoning, and safety instruction following, shows consistent improvements (5.9% to 4.6% on average). We further explore the utility of CoCoT for improving models' reasoning through training, and show that supervised fine-tuning on CoCoT-structured traces yields 5-6% improvements without explicit CoCoT prompting at inference, demonstrating that models internalize the structured reasoning pattern rather than merely following instructions. We show that structuring model reasoning through cognitively grounded stages enhances interpretability and social alignment, laying the groundwork for more reliable multimodal systems. All code and data will be released publicly.
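
The three-stage structure described above can be illustrated with a small prompt-building sketch. This is a minimal illustration under stated assumptions, not the authors' released code: the stage names and their short descriptions come from the abstract, while the prompt wording, the `build_cocot_prompt` and `parse_stages` helpers, and the `Name: text` response format are hypothetical.

```python
# Stage names and descriptions follow the abstract; all wording is hypothetical.
STAGES = (
    ("Perception", "extract grounded facts that are directly observable in the image"),
    ("Situation", "infer the situation implied by those facts"),
    ("Norm", "apply the relevant social norms to reach a judgment"),
)


def build_cocot_prompt(question: str) -> str:
    """Build a text prompt that asks a VLM to reason in three ordered stages."""
    lines = ["Answer the question about the image by reasoning in three stages."]
    for i, (name, desc) in enumerate(STAGES, start=1):
        lines.append(f"{i}. {name}: {desc}.")
    lines.append(f"Question: {question}")
    lines.append("Finish with a line of the form 'Answer: <your answer>'.")
    return "\n".join(lines)


def parse_stages(response: str) -> dict[str, str]:
    """Split a response into {stage: text}, assuming one 'Name: text' line per stage."""
    known = {name for name, _ in STAGES} | {"Answer"}
    parsed = {}
    for line in response.splitlines():
        label, sep, text = line.partition(":")
        # Drop a leading list number such as "1. " before matching the label.
        label = label.strip().lstrip("0123456789. ")
        if sep and label in known:
            parsed[label] = text.strip()
    return parsed
```

Keeping the stages as separately labeled lines makes each intermediate step inspectable, which is one way to read the abstract's interpretability claim. The same parsed traces could in principle serve as supervised fine-tuning targets, though the actual trace format used in the paper is not specified here.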

Eunkyu Park, Wesley Hanwen Deng, Gunhee Kim, Motahhare Eslami, Maarten Sap • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Multi-modal Reasoning | M3CoT | Accuracy | 82.68 | 90 |
| Multi-modal Reasoning | MoMentS | Accuracy | 72.26 | 48 |
| Social Video Reasoning | VAGUE (test) | Accuracy | 77.4 | 48 |
| Multimodal Commonsense Reasoning | M3CoT social-science and social-commonsense sub-topics | Accuracy Change | 11.99 | 12 |
| Multimodal Intent Disambiguation | VAGUE | Direct Accuracy | 73.05 | 9 |
| Visual Social Reasoning | MoMentS | Direct | 70.68 | 9 |
| Safety Robustness | VLGuard Safe_Unsafe | Attack Success Rate | 14.9 | 4 |
| Safety Robustness | VLGuard Unsafe | Attack Success Rate | 13.4 | 4 |
