Human Motion Instruction Tuning

About

This paper presents LLaMo (Large Language and Human Motion Assistant), a multimodal framework for human motion instruction tuning. In contrast to conventional instruction-tuning approaches that convert non-linguistic inputs, such as video or motion sequences, into language tokens, LLaMo retains motion in its native form for instruction tuning. This method preserves motion-specific details that are often diminished in tokenization, thereby improving the model's ability to interpret complex human behaviors. By processing both video and motion data alongside textual inputs, LLaMo enables a flexible, human-centric analysis. Experimental evaluations across high-complexity domains, including human behaviors and professional activities, indicate that LLaMo effectively captures domain-specific knowledge, enhancing comprehension and prediction in motion-intensive scenarios. We hope LLaMo offers a foundation for future multimodal AI systems with broad applications, from sports analytics to behavioral prediction. Our code and models are available on the project website: https://github.com/ILGLJ/LLaMo.

Lei Li, Sen Jia, Jianhao Wang, Zhongyu Jiang, Feng Zhou, Ju Dai, Tianfang Zhang, Zongkai Wu, Jenq-Neng Hwang • 2024

Related benchmarks

Task	Dataset	Result
Motion behavior comprehension	profession-swing dataset	Reasonableness Acc 21.1	5
Motion Reasoning	MoVid-Bench Motion 1.0	Body Acc. 0.593	5
Video Reasoning	MoVid-Bench Video Expected Comparison 1.0	Body Accuracy 33.83	5
Repetition Counting	Mo-RepCount	OBO 0.389	5
Motion Question Answering	BABEL-QA	Overall Accuracy 45.8	4
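
The page does not define the OBO metric reported for Mo-RepCount. In repetition-counting work, OBO usually means off-by-one accuracy: the fraction of samples whose predicted count is within one of the true count. The sketch below assumes that reading; the function name and the sample counts are illustrative and not taken from the paper or its code.

```python
def off_by_one_accuracy(predicted, actual):
    """Fraction of samples whose predicted count is within 1 of the true count.

    Assumes OBO means off-by-one accuracy; the source page does not define it.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one sample is required")
    # A sample counts as a hit when |prediction - ground truth| <= 1.
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= 1)
    return hits / len(predicted)


# Made-up counts for illustration only; not data from Mo-RepCount.
predicted_counts = [4, 7, 10, 3]
actual_counts = [5, 7, 13, 6]
score = off_by_one_accuracy(predicted_counts, actual_counts)
```

Under this definition, a score of 0.389 would mean that roughly 39% of the evaluated clips received a count within one repetition of the ground truth.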
