GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

About

Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.

Junho Kim, Xu Cao, Houze Yang, Bikram Boote, Ana Jojic, Fiona Ryan, Bolin Lai, Sangmin Lee, James M. Rehg • 2026

Related benchmarks

Task              Dataset             Result            Rank
Social Reasoning  GRASP-Bench (test)  T1 Accuracy 37.1  18
Social Reasoning  MMSI                STI 71.2          15
Social Reasoning  TVQA+               Accuracy 73.2     15
Social Reasoning  Online-MMSI         STI 60.6          15
