DRFormer: A Dual-Regularized Bidirectional Transformer for Person Re-identification
About
Both fine-grained discriminative details and global semantic features can contribute to solving person re-identification challenges such as occlusion and pose variation. Vision foundation models (e.g., DINO) excel at mining local textures, while vision-language models (e.g., CLIP) capture strong global semantic differences. Existing methods predominantly rely on a single paradigm, neglecting the potential benefits of their integration. In this paper, we analyze the complementary roles of these two architectures and propose a framework that synergizes their strengths via a **D**ual-**R**egularized Bidirectional **Transformer** (**DRFormer**). The dual-regularization mechanism encourages diverse feature extraction and balances the contributions of the two models. Extensive experiments on five benchmarks show that our method effectively harmonizes local and global representations, achieving competitive performance against state-of-the-art methods.
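The abstract describes bidirectional fusion of local (DINO-style) and global (CLIP-style) features with two regularizers. Below is a minimal, hypothetical numpy sketch of this idea, not the paper's actual implementation: each feature stream attends to the other via single-head cross-attention, a diversity term penalizes redundancy between the pooled streams, and a balance term penalizes magnitude imbalance. All function and variable names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv, d):
    # Single-head cross-attention: queries from one stream,
    # keys/values from the other (illustrative, not the paper's code).
    scores = q @ kv.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ kv

# Hypothetical feature streams: local patch tokens and global semantic tokens.
d = 64
local_feats = rng.standard_normal((16, d))   # 16 local (DINO-style) tokens
global_feats = rng.standard_normal((4, d))   # 4 global (CLIP-style) tokens

# Bidirectional fusion: each stream attends to the other.
local_enriched = cross_attention(local_feats, global_feats, d)
global_enriched = cross_attention(global_feats, local_feats, d)

def diversity_reg(a, b):
    # Redundancy penalty: squared cosine similarity of the pooled streams.
    pa, pb = a.mean(axis=0), b.mean(axis=0)
    cos = pa @ pb / (np.linalg.norm(pa) * np.linalg.norm(pb))
    return cos ** 2

def balance_reg(a, b):
    # Imbalance penalty: difference in mean per-token feature norms.
    na = np.linalg.norm(a, axis=1).mean()
    nb = np.linalg.norm(b, axis=1).mean()
    return (na - nb) ** 2

loss_reg = (diversity_reg(local_enriched, global_enriched)
            + balance_reg(local_enriched, global_enriched))
print(local_enriched.shape, global_enriched.shape, loss_reg >= 0.0)
```

Both regularizers are non-negative, so adding them to a re-identification loss only discourages (never rewards) redundant or imbalanced streams; the exact regularizers used by DRFormer may differ.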
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Person Re-Identification | Market-1501 | mAP 92.9 | 999 |
| Person Re-Identification | MSMT17 | mAP 78.7 | 404 |
| Person Re-Identification | DukeMTMC | Rank-1 92.5 | 120 |
| Person Re-Identification | Occluded-Duke | mAP 65.3 | 97 |
| Person Re-Identification | CUHK03-NP | Rank-1 89.6 | 64 |