DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation

About

Open-vocabulary semantic segmentation requires assigning pixel-level semantic labels while supporting an open and unrestricted set of categories. Training-free CLIP-based approaches preserve strong zero-shot generalization but typically rely on a single inference mechanism, limiting their ability to jointly address unreliable local tokens and insufficient spatial coherence. We propose DouC, a training-free dual-branch CLIP framework that decomposes dense prediction into two complementary components. OG-CLIP improves patch-level reliability via lightweight, inference-time token gating, while FADE-CLIP injects external structural priors through proxy attention guided by frozen vision foundation models. The two branches are fused at the logit level, enabling local token reliability and structure-aware patch interactions to jointly influence final predictions, with optional instance-aware correction applied as post-processing. DouC introduces no additional learnable parameters, requires no retraining, and preserves CLIP's zero-shot generalization. Extensive experiments across eight benchmarks and multiple CLIP backbones demonstrate that DouC consistently outperforms prior training-free methods and scales favorably with model capacity.
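
As a rough illustration of what fusing two branches "at the logit level" can look like, the sketch below combines two per-patch logit maps before taking the per-patch argmax. The abstract does not state the fusion rule, so the convex combination, the `alpha` weight, and the function names `fuse_logits` and `predict_labels` are all assumptions made for this example, not the authors' implementation.

```python
from typing import List

# Logits for one image: one row per patch, one column per candidate class.
Logits = List[List[float]]


def fuse_logits(branch_a: Logits, branch_b: Logits, alpha: float = 0.5) -> Logits:
    """Blend two logit maps with a convex combination.

    ASSUMPTION: the weighted sum and `alpha` are illustrative only;
    the paper's actual fusion rule is not given in the abstract.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if len(branch_a) != len(branch_b):
        raise ValueError("branches must cover the same number of patches")

    fused: Logits = []
    for row_a, row_b in zip(branch_a, branch_b):
        if len(row_a) != len(row_b):
            raise ValueError("branches must score the same set of classes")
        fused.append([alpha * a + (1.0 - alpha) * b for a, b in zip(row_a, row_b)])
    return fused


def predict_labels(logits: Logits) -> List[int]:
    """Return the index of the highest-scoring class for each patch."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


if __name__ == "__main__":
    # Two patches, two classes; the values are made up for the demo.
    gated_branch = [[2.0, 0.0], [0.0, 1.0]]
    structure_branch = [[0.0, 1.0], [0.0, 3.0]]
    fused = fuse_logits(gated_branch, structure_branch, alpha=0.5)
    print(predict_labels(fused))
```

Fusing before the argmax lets a confident score from one branch outweigh a weak score from the other, which a vote over hard labels could not do.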

Mohamad Zamini, Diksha Shukla • 2026

Related benchmarks

Task                   Dataset                                         Result     Rank
Semantic segmentation  ADE20K                                          mIoU 29.8  567
Semantic segmentation  Cityscapes                                      mIoU 50.1  497
Semantic segmentation  PASCAL VOC with background category VOC21 2012  mIoU 74.3  51
Semantic segmentation  Pascal Context 60 with background               mIoU 47.8  43
Semantic segmentation  COCO-Stuff without background class             mIoU 48.6  42
Semantic segmentation  Pascal VOC without background 2012 V20          mIoU 92.3  42
Semantic segmentation  COCO-60 with background                         mIoU 43.3  23
Semantic segmentation  Pascal Context Stuff without background         mIoU 33.4  23
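
For readers unfamiliar with the metric reported above, the following is a minimal sketch of mean intersection-over-union (mIoU) on flattened label sequences. It assumes the standard definition (per-class IoU averaged over classes that occur in the prediction or the ground truth). The function name `mean_iou` and the `ignore_index` parameter are illustrative choices, and benchmark toolkits usually accumulate counts over a whole dataset rather than one image, so this is not the evaluation code behind the numbers listed here.

```python
from typing import Optional, Sequence


def mean_iou(
    pred: Sequence[int],
    target: Sequence[int],
    num_classes: int,
    ignore_index: Optional[int] = None,
) -> float:
    """Mean IoU over classes that appear in `pred` or `target`.

    Labels are expected to lie in [0, num_classes). Pixels whose target
    equals `ignore_index` are skipped. Returns a fraction in [0, 1];
    multiply by 100 to compare with percentage-style scores.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if ignore_index is not None and t == ignore_index:
            continue
        if p == t:
            # A correct pixel counts once toward both sets of its class.
            intersection[t] += 1
            union[t] += 1
        else:
            # A wrong pixel enlarges the union of both classes involved.
            union[t] += 1
            union[p] += 1

    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    if not ious:
        raise ValueError("no valid pixels to evaluate")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    # Class 0: IoU 1/2, class 1: IoU 2/3, so the mean is 7/12.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```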