A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification

About

Video-based person re-identification (Re-ID) aims to retrieve video sequences of the same person across non-overlapping cameras. Previous methods usually focus on a limited set of views, such as the spatial, temporal, or spatial-temporal view, and therefore lack observations across different feature domains. To capture richer perceptions and extract more comprehensive video representations, in this paper we propose a novel framework named Trigeminal Transformers (TMT) for video-based person Re-ID. More specifically, we design a trigeminal feature extractor to jointly transform raw video data into the spatial, temporal, and spatial-temporal domains. Besides, inspired by the great success of vision transformers, we introduce the transformer structure for video-based person Re-ID. In our work, three self-view transformers are proposed to exploit the relationships between local features for information enhancement in the spatial, temporal, and spatial-temporal domains. Moreover, a cross-view transformer is proposed to aggregate the multi-view features into comprehensive video representations. The experimental results indicate that our approach achieves better performance than other state-of-the-art approaches on public Re-ID benchmarks. We will release the code for model reproduction.
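
To make the data flow in the abstract concrete, here is a toy NumPy sketch of the three-view idea. It is not the authors' implementation: the function names (`three_views`, `video_descriptor`), the use of mean pooling to form the spatial and temporal views, the single-head attention without learned weights, and the final concatenation are all assumptions made for illustration. The abstract only states that a trigeminal extractor produces three views, that self-view transformers enhance each view, and that a cross-view transformer aggregates them.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys_values):
    """Single-head scaled dot-product attention with no learned projections (assumption)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def three_views(feat):
    """feat: (T, H, W, C) frame-level feature maps from a hypothetical backbone."""
    T, H, W, C = feat.shape
    spatial = feat.mean(axis=0).reshape(H * W, C)   # pool over time: H*W tokens
    temporal = feat.mean(axis=(1, 2))               # pool over space: T tokens
    spatio_temporal = feat.reshape(T * H * W, C)    # keep every token
    return spatial, temporal, spatio_temporal

def video_descriptor(feat):
    views = three_views(feat)
    # Self-view step: tokens attend to other tokens of the same view (residual).
    enhanced = [v + attention(v, v) for v in views]
    # Cross-view step: each view's tokens attend to the tokens of the other two views.
    fused = []
    for i, v in enumerate(enhanced):
        others = np.concatenate(
            [u for j, u in enumerate(enhanced) if j != i], axis=0
        )
        fused.append((v + attention(v, others)).mean(axis=0))
    # Concatenate the three pooled view vectors into one video-level descriptor.
    return np.concatenate(fused)

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 2, 3, 8))  # 4 frames, 2x3 feature map, 8 channels
desc = video_descriptor(feat)
print(desc.shape)
```

In this sketch the descriptor has three times the channel dimension because the three views are concatenated; a trained model would use learned projections, multiple heads, and its own aggregation scheme.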

Xuehu Liu, Pingping Zhang, Chenyang Yu, Huchuan Lu, Xuesheng Qian, Xiaoyun Yang • 2021

Related benchmarks

Task	Dataset	Result
Video Person Re-ID	MARS	Rank-1 Acc 91.2	106
Video Person Re-ID	iLIDS-VID	Rank-1 91.3	80
Video Person Re-Identification	MARS v1 (test)	mAP 86.5	41
Video Person Re-Identification	G2A-VReID Ground to Aerial	mAP 55.9	25
Video Person Re-Identification	AG-VPReID Aerial to Ground	mAP 60.8	20
Video-based Person Re-identification	iLIDS-VID v1 (test)	Rank-1 Accuracy 91.3	18
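
The Result column reports Rank-1 accuracy and mean average precision (mAP). As a rough illustration of how these two retrieval metrics are commonly computed from a query-to-gallery distance matrix, here is a minimal pure-Python sketch. The function name `rank1_and_map` and the toy data are hypothetical, and the sketch deliberately omits the same-camera filtering that standard Re-ID evaluation protocols usually apply, so it is not the exact protocol behind the numbers above.

```python
def rank1_and_map(dist, query_ids, gallery_ids):
    """dist[i][j] is the distance between query i and gallery item j (smaller = more similar)."""
    hits, aps = 0, []
    for i, qid in enumerate(query_ids):
        # Rank gallery items from most to least similar to this query.
        order = sorted(range(len(gallery_ids)), key=lambda j: dist[i][j])
        matches = [gallery_ids[j] == qid for j in order]
        if not any(matches):
            continue  # a query with no true match in the gallery is skipped
        hits += int(matches[0])  # Rank-1: is the top-ranked item correct?
        found, precisions = 0, []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                found += 1
                precisions.append(found / rank)  # precision at each correct hit
        aps.append(sum(precisions) / len(precisions))
    if not aps:
        raise ValueError("no query has a matching gallery item")
    return hits / len(aps), sum(aps) / len(aps)

# Toy example: two queries, three gallery items.
query_ids = [1, 2]
gallery_ids = [1, 2, 1]
dist = [
    [0.1, 0.5, 0.3],  # query 0 (id 1): both id-1 items rank first
    [0.2, 0.4, 0.1],  # query 1 (id 2): its only match ranks last
]
rank1, mean_ap = rank1_and_map(dist, query_ids, gallery_ids)
print(rank1, mean_ap)
```

In the toy example the first query has an average precision of 1.0 and the second 1/3, so Rank-1 is 0.5 and mAP is 2/3.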
