MotionLLM: Understanding Human Behaviors from Human Motions and Videos

About

This study explores multi-modal human behavior understanding, covering both video and motion modalities, by leveraging the capabilities of Large Language Models (LLMs). Unlike recent LLMs designed for video-only or motion-only understanding, we argue that understanding human behavior requires joint modeling of videos and motion sequences (e.g., SMPL sequences) to capture nuanced body-part dynamics and semantics effectively. To this end, we present MotionLLM, a straightforward yet effective framework for human motion understanding, captioning, and reasoning. Specifically, MotionLLM adopts a unified video-motion training strategy that leverages the complementary advantages of existing coarse video-text data and fine-grained motion-text data to extract rich spatial-temporal insights. Furthermore, we collect a substantial dataset, MoVid, comprising diverse videos, motions, captions, and instructions. Additionally, we propose MoVid-Bench, with careful manual annotations, for better evaluation of human behavior understanding on video and motion. Extensive experiments show the superiority of MotionLLM in captioning, spatial-temporal comprehension, and reasoning.

Ling-Hao Chen, Shunlin Lu, Ailing Zeng, Hao Zhang, Benyou Wang, Ruimao Zhang, Lei Zhang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-motion generation | HumanML3D (test) | FID | 0.491 | 481 |
| Action Recognition | NTU-60 48/12 split | Top-1 Acc | 26.98 | 103 |
| Action Recognition | NTU-120 96/24 split | Top-1 Acc | 33.62 | 84 |
| Action Recognition | NTU 60 (55/5 split) | Top-1 Acc | 50.24 | 57 |
| Action Recognition | NTU-120 110/10 split | Top-1 Acc | 49.8 | 56 |
| Text-driven Motion Generation | HumanML3D (test) | R-Precision@1 | 51.5 | 54 |
| Action Recognition | PKU-MMD (XSub) | Top-1 Acc | 46.4 | 43 |
| Action Recognition | NTU 60 (40-20 seen-unseen) | Top-1 Acc | 21.58 | 18 |
| Action Recognition | PKU-MMD cross-subject (39/12) | Top-1 Accuracy | 27.8 | 12 |
| Action Recognition | PKU-MMD cross-view Xview (39/12) | Top-1 Accuracy | 20.9 | 12 |
Showing 10 of 24 rows
