Action Quality Assessment with Temporal Parsing Transformer

About

Action Quality Assessment (AQA) is important for action understanding, and the task poses unique challenges due to subtle visual differences. Existing state-of-the-art methods typically rely on holistic video representations for score regression or ranking, which limits their ability to capture fine-grained intra-class variation. To overcome this limitation, we propose a temporal parsing transformer that decomposes the holistic feature into temporal part-level representations. Specifically, we use a set of learnable queries to represent the atomic temporal patterns of a specific action. Our decoding process converts the frame representations into a fixed number of temporally ordered part representations. To obtain the quality score, we adopt state-of-the-art contrastive regression based on the part representations. Since existing AQA datasets do not provide temporal part-level labels or partitions, we propose two novel loss functions on the cross-attention responses of the decoder: a ranking loss that ensures the learnable queries satisfy the temporal order in cross attention, and a sparsity loss that encourages the part representations to be more discriminative. Extensive experiments show that our proposed method outperforms prior work on three public AQA benchmarks by a considerable margin.
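To make the two attention-based losses more concrete, here is a minimal sketch in plain Python. The abstract does not give the exact formulas, so everything below is an assumption for illustration: each query's cross-attention row is summarized by an "attention center" (the attention-weighted mean of normalized frame positions), the ranking loss is a hinge penalty on adjacent centers that are out of order, and the sparsity loss is the attention-weighted spread around each center. The function names `attention_centers`, `ranking_loss`, and `sparsity_loss` and the `margin` parameter are hypothetical.

```python
def attention_centers(attn):
    """Attention-weighted mean frame position for each query.

    attn: list of K rows (one per learnable query), each a list of T >= 2
    non-negative weights over frames that sum to 1.
    Frame positions are normalized to [0, 1].
    """
    centers = []
    for row in attn:
        last = len(row) - 1
        centers.append(sum(w * (t / last) for t, w in enumerate(row)))
    return centers


def ranking_loss(centers, margin=0.0):
    """Hinge penalty whenever query k attends later than query k + 1.

    Assumed form: sum_k max(0, c_k - c_{k+1} + margin).
    """
    return sum(
        max(0.0, centers[k] - centers[k + 1] + margin)
        for k in range(len(centers) - 1)
    )


def sparsity_loss(attn, centers):
    """Attention-weighted distance of each frame from its query's center.

    Assumed form: sum_k sum_t a_k[t] * |t_norm - c_k|. It is zero when
    each query attends to a single frame.
    """
    total = 0.0
    for row, c in zip(attn, centers):
        last = len(row) - 1
        total += sum(w * abs(t / last - c) for t, w in enumerate(row))
    return total


# Three queries over three frames, already in temporal order and peaked.
ordered = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ordered_centers = attention_centers(ordered)

# The same rows in reverse order violate the temporal ordering.
reversed_centers = attention_centers(ordered[::-1])
```

Under these assumptions, the ordered example incurs no ranking or sparsity penalty, while the reversed example is penalized by the ranking term. In training, such terms would be added to the regression objective with weights chosen by the practitioner.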

Yang Bai, Desen Zhou, Songyang Zhang, Jian Wang, Errui Ding, Yu Guan, Yang Long, Jingdong Wang • 2022

Related benchmarks

Task	Dataset	Metric	Result	
Action Quality Assessment	MTL-AQA (test)	Spearman Correlation	0.9607	29
Action Quality Assessment	Fis-V	TES Spearman Correlation	0.57	22
Action Quality Assessment	JIGSAWS 11 (test)	SRCC (Suturing)	0.88	11
Action Quality Assessment	FineDiving (test)	SRCC	93.33	9
Action Quality Assessment	MTL-NAE	Spearman's Rho	0.961	7
Action Quality Assessment	FineGym NAE	Spearman Correlation (ρ)	0.764	7
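
All of the metrics above are variants of Spearman's rank correlation between predicted and ground-truth quality scores. As a reference point, here is a minimal pure-Python sketch of that statistic, computed as the Pearson correlation of average ranks (so ties are handled). The helper names `average_ranks` and `spearman` are illustrative, and the score lists are made-up toy values, not data from any benchmark.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(pred, truth):
    """Spearman's rho: Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rp, rt = average_ranks(pred), average_ranks(truth)
    n = len(rp)
    mp, mt = sum(rp) / n, sum(rt) / n
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_t = sum((b - mt) ** 2 for b in rt)
    return cov / (var_p * var_t) ** 0.5


# Toy example: predictions swap the order of the two middle samples.
rho = spearman([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
```

Because only ranks matter, the statistic is insensitive to the scale of the predicted scores, which is why it is commonly used to compare AQA models trained with different score normalizations.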
