GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

About

Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.
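To give a rough sense of scale, the headline figures in the abstract can be turned into per-video averages. The sketch below uses only the rounded numbers stated above (290K QA pairs, 46K videos, 749 hours), so the results are approximate; the function and variable names are illustrative and do not come from the GRASP release.

```python
# Figures as stated in the abstract; they are rounded, so results are approximate.
NUM_QA_PAIRS = 290_000
NUM_VIDEOS = 46_000
TOTAL_HOURS = 749


def dataset_averages(num_qa, num_videos, total_hours):
    """Return (QA pairs per video, average video length in seconds)."""
    qa_per_video = num_qa / num_videos
    avg_video_seconds = total_hours * 3600 / num_videos
    return qa_per_video, avg_video_seconds


qa_per_video, avg_video_seconds = dataset_averages(
    NUM_QA_PAIRS, NUM_VIDEOS, TOTAL_HOURS
)
print(f"about {qa_per_video:.1f} QA pairs per video")
print(f"about {avg_video_seconds:.0f} seconds per video on average")
```

Under these rounded inputs, each video carries roughly six questions and runs a little under a minute on average.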

Junho Kim, Xu Cao, Houze Yang, Bikram Boote, Ana Jojic, Fiona Ryan, Bolin Lai, Sangmin Lee, James M. Rehg • 2026

Related benchmarks

Task	Dataset	Result
Social Reasoning	GRASP-Bench (test)	T1 Accuracy 37.1	18
Social Reasoning	MMSI	STI 71.2	15
Social Reasoning	TVQA+	Accuracy 73.2	15
Social Reasoning	Online-MMSI	STI 60.6	15

