Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation

About

Open-Vocabulary Part Segmentation (OVPS) is an emerging field for recognizing fine-grained parts in unseen categories. We identify two primary challenges in OVPS: (1) the difficulty in aligning part-level image-text correspondence, and (2) the lack of structural understanding in segmenting object parts. To address these issues, we propose PartCATSeg, a novel framework that integrates object-aware part-level cost aggregation, compositional loss, and structural guidance from DINO. Our approach employs a disentangled cost aggregation strategy that handles object and part-level costs separately, enhancing the precision of part-level segmentation. We also introduce a compositional loss to better capture part-object relationships, compensating for the limited part annotations. Additionally, structural guidance from DINO features improves boundary delineation and inter-part understanding. Extensive experiments on Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets demonstrate that our method significantly outperforms state-of-the-art approaches, setting a new baseline for robust generalization to unseen part categories.
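
To make the idea of keeping object-level and part-level costs separate more concrete, here is a minimal sketch. It assumes, as is common in cost-aggregation segmentation methods, that a "cost" is the cosine similarity between a pixel embedding and a text embedding. The helper names, toy embeddings, and prompts are hypothetical, and this is not PartCATSeg's actual implementation: the aggregation modules, compositional loss, and DINO guidance are all omitted.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cost_volume(pixel_embeddings, text_embeddings):
    """Cosine similarity between every pixel and every text embedding.

    Returns a nested list of shape [num_pixels][num_texts].
    """
    pixels = [l2_normalize(p) for p in pixel_embeddings]
    texts = [l2_normalize(t) for t in text_embeddings]
    return [[sum(a * b for a, b in zip(p, t)) for t in texts] for p in pixels]


# Toy 2-D embeddings standing in for real image and text features (assumption).
pixel_embeddings = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
object_text = [[1.0, 0.0]]             # e.g. a prompt for one object class
part_text = [[1.0, 0.0], [0.0, 1.0]]   # e.g. prompts for two parts of that object

# Two separate volumes, so each level could be refined on its own
# before the results are combined.
object_cost = cost_volume(pixel_embeddings, object_text)
part_cost = cost_volume(pixel_embeddings, part_text)
```

In this sketch `object_cost` has one column per object prompt and `part_cost` one column per part prompt; a real system would refine each volume with learned layers rather than use the raw similarities directly.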

Jiho Choi, Seonho Lee, Minhyun Lee, Seungho Lee, Hyunjung Shim • 2025

Related benchmarks

Task                               Dataset                    Result                    Rank
Part Segmentation                  Pascal-Part-116 (test)     mIoU (Unseen): 22.88      18
Open-Vocabulary Part Segmentation  Pascal-Part-116 zero-shot  mIoU (Seen): 57.49        13
Part Segmentation                  PartImageNet               Seen: 73.83               12
Part Segmentation                  ADE20K Part-234            Seen Performance: 0.5313  11
Open-Vocabulary Part Segmentation  ADE20K Part zero-shot 234  Seen Recall: 64.81        10
Part Segmentation                  PartImageNet OOD (test)    mIoU (Unseen): 66.15      8
Open-Vocabulary Part Segmentation  Pascal-Part zero-shot 116  Seen Recall: 67.15        5
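
Several of the results above are reported as mIoU over seen or unseen part categories. As a rough illustration of that metric, the sketch below computes mean intersection-over-union on flat lists of per-pixel labels, restricted to a chosen set of class ids. The function name and toy labels are hypothetical, and actual benchmark protocols may differ, for example by accumulating intersections and unions over a whole dataset rather than a single image.

```python
def mean_iou(pred, target, class_ids):
    """Mean IoU over the given class ids for two equal-length label lists.

    Classes that appear in neither prediction nor ground truth are skipped.
    """
    ious = []
    for c in class_ids:
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy per-pixel labels (assumption): classes 0 and 1 "seen", class 2 "unseen".
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]

seen_miou = mean_iou(pred, target, class_ids=[0, 1])
unseen_miou = mean_iou(pred, target, class_ids=[2])
```

Passing different `class_ids` is one simple way to obtain separate seen and unseen scores from the same predictions.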