From Global to Local: Rethinking CLIP Feature Aggregation for Person Re-Identification
About
CLIP-based person re-identification (ReID) methods aggregate spatial features into a single global [CLS] token that is optimized for image-text alignment rather than spatial selectivity, which makes the resulting representations fragile under occlusion and cross-camera variation. We propose SAGA-ReID, which reconstructs identity representations by aligning intermediate patch tokens with anchor vectors parameterized in CLIP's text embedding space. This alignment emphasizes spatially stable evidence while suppressing corrupted or absent regions, and it does not require textual descriptions of individual images. Controlled experiments isolate the aggregation mechanism under two qualitatively distinct conditions: synthetic masking, where identity signal is absent, and realistic human distractors, where an overlapping person introduces semantically confusing signal. In both conditions, SAGA's advantage over global pooling grows substantially as occlusion increases. Benchmark evaluations confirm consistent gains over CLIP-ReID across standard and occluded settings, with the largest improvements where global pooling is most unreliable: up to +10.6 Rank-1 on occluded benchmarks. SAGA's aggregation outperforms dedicated sequential patch aggregation on a stronger backbone, confirming that structured reconstruction addresses a bottleneck that backbone quality and architectural complexity alone cannot resolve. Code is available at https://github.com/ipl-uw/Structured-Anchor-Guided-Aggregation-for-ReID.
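To make the contrast between global pooling and anchor-guided aggregation concrete, the following is a minimal sketch, not the SAGA-ReID implementation. It assumes that each anchor scores every patch token by cosine similarity, that a softmax over patches turns the scores into weights, and that the pooled vector is the weighted sum of patches. The function names (`anchor_guided_pool`, `global_mean_pool`), the temperature value, and the toy 2-D vectors are all hypothetical; the released code may differ in every one of these choices.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def anchor_guided_pool(patches, anchors, temperature=0.07):
    """Return one pooled vector per anchor (illustrative assumption only).

    Each anchor weights the patches by softmax(cosine similarity / temperature),
    so patches that do not match the anchor contribute very little.
    """
    patches_n = [l2_normalize(p) for p in patches]
    anchors_n = [l2_normalize(a) for a in anchors]
    dim = len(patches[0])
    pooled = []
    for a in anchors_n:
        scores = [sum(x * y for x, y in zip(a, p)) / temperature for p in patches_n]
        weights = softmax(scores)
        pooled.append(
            [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)]
        )
    return pooled


def global_mean_pool(patches):
    """Baseline: average all patches with equal weight."""
    dim = len(patches[0])
    return [sum(p[d] for p in patches) / len(patches) for d in range(dim)]


# Toy input: two patches carrying identity signal and one occluded patch.
patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
anchors = [[1.0, 0.0]]  # hypothetical anchor aligned with the identity direction

anchored = anchor_guided_pool(patches, anchors)[0]
averaged = global_mean_pool(patches)
```

In this toy case the occluded patch receives a near-zero weight under the anchor, while uniform averaging lets it contribute a third of the pooled vector. That is the fragility of global pooling described above, reproduced at a very small scale.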
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Person Re-Identification | MSMT17 | mAP | 0.784 | 546 |
| Person Re-Identification | DukeMTMC | R1 Accuracy | 92.3 | 206 |
| Person Re-Identification | Market1501 | mAP | 0.927 | 143 |
| Person Re-Identification | Occluded-Duke | mAP | 0.708 | 131 |
| Person Re-Identification | Occluded-reID | R-1 | 94.5 | 104 |
| Person Re-Identification | P-DukeMTMC | Rank-1 Acc | 94.4 | 23 |
| Person Re-Identification | Occluded-Market | Rank-1 Accuracy | 90.1 | 17 |
| Classification | DomainNet (held-out target) | Average Accuracy | 61 | 3 |