Vision Harnessing Agent for Open Ad-hoc Segmentation

About

Segmentation has become easy when the concept is known, requiring only the retrieval of a learned visual grounding from text. It remains hard for open ad-hoc concepts, where the grounding may not exist as one learned mask and must often be constructed from image evidence through parts, relations, exclusions, and collections. We propose a Vision-guided Ad-hoc Segmentation Agent (VASA), the first vision harnessing agent for open ad-hoc segmentation. VASA is training-free and couples a VLM agent, a segmentation foundation model, and a visually grounded workflow. Rather than revising text prompts alone, VASA uses a persistent working mask to reason about, construct, and validate a solution. It plans visual operations, invokes segmentation tools, inspects results, edits the mask, and recovers from errors. We construct PARS, a new benchmark that turns part-level labels in PartImageNet into open ad-hoc concepts through long-form definition queries. On PARS, VASA outperforms open-vocabulary, reasoning-based, and agentic baselines, surpassing SAM3 Agent by 14-25%. On RefCOCOm, a standard multi-granularity referring segmentation benchmark, VASA improves over SAM3 Agent by 5-9% and over other agentic baselines by up to 20%. These results validate agentic visual construction for open ad-hoc segmentation. Our work points to a path for AI agents beyond wrapping foundation models as tools: programming them with task knowledge, VLM behavior, visual routines, working memory, and failure-aware workflows.
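To make the idea of a persistent working mask concrete, here is a minimal Python sketch. It is not VASA's implementation: the `WorkingMask` class, its method names, and the `rect` helper are hypothetical, and masks are modeled as plain sets of (row, col) pixels rather than outputs of a real segmentation model. It only illustrates how a mask might be built up through additions and exclusions, with a snapshot history that allows recovery from a bad edit.

```python
from dataclasses import dataclass, field

Pixel = tuple[int, int]  # (row, col); a toy stand-in for a real mask array


@dataclass
class WorkingMask:
    """Hypothetical persistent mask with an edit history for error recovery."""
    pixels: set[Pixel] = field(default_factory=set)
    history: list[tuple[str, set[Pixel]]] = field(default_factory=list)

    def _commit(self, op: str, new_pixels: set[Pixel]) -> None:
        # Snapshot the state before the edit so it can be rolled back.
        self.history.append((op, set(self.pixels)))
        self.pixels = new_pixels

    def add(self, region: set[Pixel]) -> None:
        """Collection: union a region into the mask."""
        self._commit("add", self.pixels | region)

    def exclude(self, region: set[Pixel]) -> None:
        """Exclusion: remove a region from the mask."""
        self._commit("exclude", self.pixels - region)

    def restrict(self, region: set[Pixel]) -> None:
        """Relation: keep only the part of the mask inside a region."""
        self._commit("restrict", self.pixels & region)

    def undo(self) -> None:
        """Recover from a bad edit by restoring the previous snapshot."""
        if self.history:
            _, self.pixels = self.history.pop()


def rect(r0: int, c0: int, r1: int, c1: int) -> set[Pixel]:
    """Hypothetical helper: pixels of the half-open box [r0, r1) x [c0, c1)."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


# Toy query: "the animal, excluding its head".
animal = rect(0, 0, 4, 4)  # pretend output of a segmentation tool
head = rect(0, 0, 2, 2)    # pretend output for the part to exclude

mask = WorkingMask()
mask.add(animal)
mask.exclude(head)
size_after_exclusion = len(mask.pixels)

mask.undo()                # roll back the exclusion
size_after_undo = len(mask.pixels)
```

In this toy run the mask holds 12 pixels after the exclusion and 16 again after `undo()`. In a real agent the regions would presumably come from a segmentation tool and the choice of operation from the VLM; both are outside the scope of this sketch.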

Zilin Wang, Stella X. Yu • 2026

Related benchmarks

Task                                                  Dataset                 Metric        Result  Rank
Open-Vocabulary Part Segmentation                     PARS Ad-hoc Concepts    gIoU          56.9    28
Multi-granularity Referring Expression Segmentation   RefCOCOm (val)          gIoU (Part)   45      14
Multi-granularity Referring Expression Segmentation   RefCOCOm (testA)        gIoU (Part)   43.2    14
Multi-granularity Referring Expression Segmentation   RefCOCOm (testB)        gIoU (Part)   47.6    14
Open-Vocabulary Part Segmentation                     PARS Common Concepts    gIoU          60.8    14
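The results above are reported as gIoU. In referring segmentation work, gIoU commonly denotes the mean of per-sample intersection-over-union scores; the sketch below assumes that definition, which this page does not itself confirm. The function names and the set-of-pixels mask representation are illustrative only.

```python
Pixel = tuple[int, int]  # (row, col)


def iou(pred: set[Pixel], gt: set[Pixel]) -> float:
    """Intersection-over-union of two masks given as pixel sets."""
    union = pred | gt
    if not union:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    return len(pred & gt) / len(union)


def giou(pairs: list[tuple[set[Pixel], set[Pixel]]]) -> float:
    """Mean per-sample IoU over (prediction, ground truth) pairs."""
    if not pairs:
        raise ValueError("giou needs at least one (prediction, ground truth) pair")
    return sum(iou(pred, gt) for pred, gt in pairs) / len(pairs)


# Two toy samples: one half-correct prediction and one exact match.
samples = [
    ({(0, 0), (0, 1)}, {(0, 0)}),  # IoU = 1 / 2
    ({(1, 1)}, {(1, 1)}),          # IoU = 1 / 1
]
score = giou(samples)  # (0.5 + 1.0) / 2 = 0.75
```

Under this definition every sample contributes equally regardless of object size. The table values appear to be on a 0-100 scale, so a score like the 0.75 above would correspond to 75.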