LLMs are Good Action Recognizers

About

Skeleton-based action recognition has attracted considerable research attention, and a variety of works have recently been proposed to build accurate skeleton-based action recognizers. Among them, some use large model architectures as the backbones of their recognizers to strengthen skeleton data representation, while others pre-train their recognizers on external data to enrich their knowledge. In this work, we observe that large language models, which have been used extensively in natural language processing tasks, generally offer both large model architectures and rich implicit knowledge. Motivated by this, we propose a novel LLM-AR framework, in which we investigate treating the Large Language Model as an Action Recognizer. In our framework, we propose a linguistic projection process that projects each input action signal (i.e., each skeleton sequence) into its "sentence format" (i.e., an "action sentence"). We also incorporate several designs into the framework to further facilitate this linguistic projection process. Extensive experiments demonstrate the efficacy of the proposed framework.
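As a rough illustration of what projecting a skeleton sequence into an "action sentence" could look like, the sketch below maps each frame to the nearest entry of a small codebook and emits one discrete token per frame. This is only an assumption for illustration: the abstract does not say how the linguistic projection is implemented, and the nearest-codebook lookup, the function name `project_to_action_sentence`, the `<act_k>` token format, and the toy data are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def project_to_action_sentence(frames: List[Vector], codebook: List[Vector]) -> List[str]:
    """Map each skeleton frame to the token of its nearest codebook entry.

    Hypothetical stand-in for a linguistic projection: the real framework
    may learn this mapping in a different way.
    """
    if not codebook:
        raise ValueError("codebook must contain at least one entry")
    sentence = []
    for frame in frames:
        # Index of the codebook vector closest to this frame.
        nearest = min(
            range(len(codebook)),
            key=lambda i: squared_distance(frame, codebook[i]),
        )
        sentence.append(f"<act_{nearest}>")
    return sentence


# Toy data: three 2-D "frames" (real skeleton frames would hold many joint coordinates).
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
toy_frames = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]

action_sentence = project_to_action_sentence(toy_frames, toy_codebook)
print(" ".join(action_sentence))
```

The resulting token list is the kind of discrete sequence a language model could consume once such tokens are part of its vocabulary; how the tokens are actually defined and learned is left to the paper.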

Haoxuan Qu, Yujun Cai, Jun Liu • 2024

Related benchmarks

Task	Dataset	Result
Action Recognition	NTU RGB+D 120 (X-set)	Accuracy 91.5	779
Action Recognition	NTU RGB+D (Cross-View)	Accuracy 98.4	663
Action Recognition	NTU RGB+D 60 (X-sub)	Accuracy 95	496
Action Recognition	NTU RGB+D X-sub 120	Accuracy 88.7	482
Action Recognition	NTU RGB+D X-View 60	Accuracy 98.4	218
Action Recognition	NTU-RGB+D (X-Sub)	--	101
Action Recognition	Toyota SmartHome (TSH) (CV1)	Accuracy 36.1	68
Action Recognition	NTU RGB+D Xsub 60 (Cross-Subject 55/5)	Accuracy 95	66
Action Recognition	NTU-RGBD 120 (xsub)	Accuracy 88.7	24
Human Action Recognition	UAV-Human X-Sub	Accuracy 46.3	15

Showing 10 of 12 rows
