DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation

About

Open-vocabulary semantic segmentation requires assigning pixel-level semantic labels while supporting an open and unrestricted set of categories. Training-free CLIP-based approaches preserve strong zero-shot generalization but typically rely on a single inference mechanism, limiting their ability to jointly address unreliable local tokens and insufficient spatial coherence. We propose DouC, a training-free dual-branch CLIP framework that decomposes dense prediction into two complementary components. OG-CLIP improves patch-level reliability via lightweight, inference-time token gating, while FADE-CLIP injects external structural priors through proxy attention guided by frozen vision foundation models. The two branches are fused at the logit level, enabling local token reliability and structure-aware patch interactions to jointly influence final predictions, with optional instance-aware correction applied as post-processing. DouC introduces no additional learnable parameters, requires no retraining, and preserves CLIP's zero-shot generalization. Extensive experiments across eight benchmarks and multiple CLIP backbones demonstrate that DouC consistently outperforms prior training-free methods and scales favorably with model capacity.
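
The abstract says only that the two branches are "fused at the logit level". As a rough illustration, the sketch below assumes the fusion is a convex combination of per-patch class logits followed by an argmax. The function names, the weight alpha, and the nested-list layout are hypothetical and are not taken from the paper.

```python
from typing import List

# Assumed layout: logits[patch_index][class_index]
Logits = List[List[float]]


def fuse_logits(og_logits: Logits, fade_logits: Logits, alpha: float = 0.5) -> Logits:
    """Blend two branches' patch-by-class logits with a convex weight.

    alpha is a hypothetical mixing weight: 1.0 keeps only the first
    branch, 0.0 keeps only the second. The actual DouC fusion rule may differ.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(og_logits) != len(fade_logits):
        raise ValueError("branches must cover the same number of patches")

    fused: Logits = []
    for og_row, fade_row in zip(og_logits, fade_logits):
        if len(og_row) != len(fade_row):
            raise ValueError("branches must score the same number of classes")
        fused.append([alpha * a + (1.0 - alpha) * b for a, b in zip(og_row, fade_row)])
    return fused


def predict_labels(logits: Logits) -> List[int]:
    """Return the index of the highest-scoring class for each patch."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Toy example: 2 patches, 2 classes.
og = [[2.0, 0.0], [0.0, 1.0]]    # stand-in for OG-CLIP logits
fade = [[0.0, 1.0], [0.0, 3.0]]  # stand-in for FADE-CLIP logits
fused = fuse_logits(og, fade, alpha=0.5)
labels = predict_labels(fused)
```

With equal weighting, the first patch keeps the class favored by the first branch, while the second patch is assigned the class both branches agree on. Any instance-aware correction would be applied to the labels afterwards as a separate post-processing step.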

Mohamad Zamini, Diksha Shukla • 2026

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K	mIoU 29.8	699
Semantic segmentation	Cityscapes	mIoU 50.1	526
Semantic segmentation	PASCAL VOC with background category VOC21 2012	mIoU 74.3	51
Semantic segmentation	Pascal Context 60 with background	mIoU 47.8	43
Semantic segmentation	COCO-Stuff without background class	mIoU 48.6	42
Semantic segmentation	Pascal VOC without background 2012 V20	mIoU 92.3	42
Semantic segmentation	COCO-60 with background	mIoU 43.3	23
Semantic segmentation	Pascal Context Stuff without background	mIoU 33.4	23
