DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation
About
Open-vocabulary semantic segmentation requires assigning pixel-level semantic labels while supporting an open and unrestricted set of categories. Training-free CLIP-based approaches preserve strong zero-shot generalization but typically rely on a single inference mechanism, limiting their ability to jointly address unreliable local tokens and insufficient spatial coherence. We propose DouC, a training-free dual-branch CLIP framework that decomposes dense prediction into two complementary components. OG-CLIP improves patch-level reliability via lightweight, inference-time token gating, while FADE-CLIP injects external structural priors through proxy attention guided by frozen vision foundation models. The two branches are fused at the logit level, enabling local token reliability and structure-aware patch interactions to jointly influence final predictions, with optional instance-aware correction applied as post-processing. DouC introduces no additional learnable parameters, requires no retraining, and preserves CLIP's zero-shot generalization. Extensive experiments across eight benchmarks and multiple CLIP backbones demonstrate that DouC consistently outperforms prior training-free methods and scales favorably with model capacity.
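As a rough illustration of what fusing two branches "at the logit level" can look like, the sketch below combines two per-patch class-logit maps and takes the per-patch argmax. This is only a minimal sketch under stated assumptions: the abstract does not give the fusion rule, so the convex combination, the weight `alpha`, the function names `fuse_logits` and `predict_labels`, and the random stand-in logits are all hypothetical and are not DouC's actual implementation.

```python
import numpy as np


def fuse_logits(logits_a, logits_b, alpha=0.5):
    """Blend two (H, W, C) logit maps with a convex combination.

    ASSUMPTION: the weighted sum and `alpha` are illustrative only;
    the abstract does not specify how the two branches are weighted.
    """
    if logits_a.shape != logits_b.shape:
        raise ValueError("branch logits must share the same shape")
    return alpha * logits_a + (1.0 - alpha) * logits_b


def predict_labels(fused_logits):
    """Pick the highest-scoring class for every patch."""
    return fused_logits.argmax(axis=-1)


# Random stand-ins for the outputs of the two branches:
# a 4x4 patch grid scored against 3 candidate class names.
rng = np.random.default_rng(0)
H, W, C = 4, 4, 3
og_logits = rng.normal(size=(H, W, C))    # placeholder for the OG-CLIP branch
fade_logits = rng.normal(size=(H, W, C))  # placeholder for the FADE-CLIP branch

labels = predict_labels(fuse_logits(og_logits, fade_logits, alpha=0.5))
```

Fusing before the argmax lets a confident score from one branch outweigh a weak score from the other, which a vote over each branch's hard labels could not do.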
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K | mIoU | 29.8 | 567 |
| Semantic segmentation | Cityscapes | mIoU | 50.1 | 497 |
| Semantic segmentation | PASCAL VOC with background category VOC21 2012 | mIoU | 74.3 | 51 |
| Semantic segmentation | Pascal Context 60 with background | mIoU | 47.8 | 43 |
| Semantic segmentation | COCO-Stuff without background class | mIoU | 48.6 | 42 |
| Semantic segmentation | Pascal VOC without background 2012 V20 | mIoU | 92.3 | 42 |
| Semantic segmentation | COCO-60 with background | mIoU | 43.3 | 23 |
| Semantic segmentation | Pascal Context Stuff without background | mIoU | 33.4 | 23 |
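For readers unfamiliar with the metric in the table, the sketch below shows one common way to compute mean intersection-over-union (mIoU) from predicted and ground-truth label maps. It is a generic illustration, not the evaluation code behind these numbers: the function name `mean_iou`, the `ignore_index=255` convention, and the choice to skip classes that appear in neither map are assumptions, and benchmark protocols usually accumulate the confusion matrix over a whole dataset rather than per image.

```python
import numpy as np


def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that occur in the prediction or the target.

    `pred` and `target` are integer label arrays of the same shape.
    ASSUMPTION: pixels labeled `ignore_index` in `target` are excluded,
    and every remaining label lies in [0, num_classes).
    """
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]
    # Confusion matrix: rows index ground truth, columns index predictions.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    present = union > 0  # skip classes absent from both maps
    return float((intersection[present] / union[present]).mean())


# Tiny example: class 0 has IoU 1/2 and class 1 has IoU 2/3.
pred = np.array([0, 0, 1, 1])
target = np.array([0, 1, 1, 1])
score = mean_iou(pred, target, num_classes=2)
```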