What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging

About

State-of-the-art vision-language models (VLMs) suffer from a critical failure in understanding negation, often referred to as affirmative bias. This limitation is particularly severe in described object detection (DOD) tasks. To address this, we propose two primary contributions: (1) a new dataset pipeline and (2) a novel, lightweight adaptation recipe. First, we introduce CoVAND, a dataset constructed with a systematic chain-of-thought (CoT) and VQA-based pipeline to generate high-quality, instance-grounded negation data. Second, we propose NegToMe, a novel text token merging module that directly tackles the architectural cause of affirmative bias. NegToMe fundamentally addresses the structural loss of negation cues in tokenization, grouping them with attributes into coherent semantic phrases. It maintains correct polarity at the input level, enabling robust negation understanding even with limited data. For instance, to prevent a model from treating the fragmented tokens "not" and "girl" as simply "girl", NegToMe binds them into a single token whose meaning is correctly distinguished from that of "girl" alone. This module is integrated with a parameter-efficient and strategic LoRA fine-tuning approach. Our method significantly improves performance on challenging negation benchmarks with a lowered false positive rate, boosting NMS-AP by up to +10.8 points on OVDEval and demonstrating generalization to SoTA VLMs. This work marks a crucial step forward in addressing negation understanding for real-world detection applications.
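The token-binding idea in the abstract can be illustrated with a small sketch. This is not the authors' NegToMe implementation: the cue list, the `merge_negation_tokens` name, the `cue_weight` parameter, and the weighted-average merge rule are all assumptions chosen for illustration. It only shows how a negation cue and the token that follows it might be collapsed into one phrase-level token, so that "not girl" no longer shares a representation with "girl".

```python
from typing import List, Sequence, Tuple

# Assumed cue list, for illustration only (not taken from the paper).
NEGATION_CUES = {"not", "no", "without"}


def merge_negation_tokens(
    tokens: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    cue_weight: float = 0.5,  # hypothetical mixing weight for the cue embedding
) -> Tuple[List[str], List[List[float]]]:
    """Bind each negation cue to the token that follows it.

    The merged token gets a weighted average of the two embeddings.
    This merge rule is an assumption, not the method from the paper.
    """
    if len(tokens) != len(embeddings):
        raise ValueError("tokens and embeddings must have the same length")

    merged_tokens: List[str] = []
    merged_embeddings: List[List[float]] = []
    i = 0
    while i < len(tokens):
        is_cue = tokens[i].lower() in NEGATION_CUES
        if is_cue and i + 1 < len(tokens):
            cue_vec, attr_vec = embeddings[i], embeddings[i + 1]
            merged_tokens.append(f"{tokens[i]}_{tokens[i + 1]}")
            merged_embeddings.append(
                [cue_weight * c + (1.0 - cue_weight) * a
                 for c, a in zip(cue_vec, attr_vec)]
            )
            i += 2  # both the cue and the following token were consumed
        else:
            merged_tokens.append(tokens[i])
            merged_embeddings.append(list(embeddings[i]))
            i += 1
    return merged_tokens, merged_embeddings


# Toy two-dimensional embeddings, made up for the example.
example_tokens = ["person", "not", "girl"]
example_embeddings = [[0.2, 0.2], [1.0, 0.0], [0.0, 1.0]]
phrase_tokens, phrase_embeddings = merge_negation_tokens(
    example_tokens, example_embeddings
)
```

In this toy run the three input tokens become two, and the merged `not_girl` entry carries an embedding that differs from the one for `girl` alone. In a real model the merge would presumably act on the text encoder's token representations rather than on hand-written vectors.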

Inha Kang, Youngsun Lim, Seonho Lee, Jiho Choi, Junsuk Choe, Hyunjung Shim • 2025

Related benchmarks

Task                          Dataset                    Result                   Rank
Described Object Detection    D3 (Full)                  mAP 32.5                 16
Described Object Detection    D3 (Pres)                  mAP 32.9                 16
Described Object Detection    D3 (Abs)                   mAP 31.5                 16
Described Object Detection    D3 (M)                     mAP 35.3                 14
Described Object Detection    D3 L                       mAP 31.3                 14
Described Object Detection    D3 XL                      mAP 25.4                 14
Described Object Detection    D3 (S)                     mAP 33.2                 14
Object Detection              OVDEval Negation (test)    AP 57.2                  7
Binary Discrimination         FG-CXR                     Accuracy 62.55           4
Multiple Choice Question      NegBench COCO subset       Overall Accuracy 32.55   4
Showing 10 of 12 rows
