Reasoning with Pixel-level Precision: QVLM Architecture and SQuID Dataset for Quantitative Geospatial Analytics

About

Current Vision-Language Models (VLMs) fail at quantitative spatial reasoning because their architectures destroy the pixel-level information required for counting and measurement. Vision encoders compress images into patch embeddings, which coarsens spatial indexing and loses the precise pixel-level tracking that accurate counting requires. We present two contributions to address this fundamental limitation. First, we introduce SQuID (Satellite Quantitative Intelligence Dataset), a benchmark of 2,000 satellite image question-answer pairs with both numerical-range and categorical answers, designed to evaluate quantitative spatial reasoning. The dataset spans three difficulty tiers, with annotations automatically generated from human labels and their learned variability. Second, we propose QVLM (Quantitative Vision-Language Model), a code-generation architecture that maintains pixel precision by decoupling language understanding from visual analysis. Instead of encoding images into embeddings, QVLM generates executable code that first calls a segmentation model to obtain pixel-level masks and then operates directly on these masks, preserving spatial indexing throughout the reasoning process. Our experiments show that QVLM using GPT-5 as the coder achieves 42.0% accuracy on SQuID, compared to 28.1% for a VLM prompted with image-question pairs. Our work suggests that, for quantitative spatial reasoning, decoupling language understanding from visual analysis yields better accuracy.
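To make the mask-based step concrete, here is a minimal sketch of the kind of code such a pipeline might generate for a counting question. It is an illustration under stated assumptions, not the authors' implementation: the `count_regions` helper and the hard-coded `building_mask` (standing in for a segmentation model's output) are hypothetical, and 4-connectivity is an arbitrary choice.

```python
from collections import deque


def count_regions(mask):
    """Return the pixel area of each 4-connected foreground region in a 2D mask."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    areas = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    in_bounds = 0 <= ny < rows and 0 <= nx < cols
                    if in_bounds and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            areas.append(area)
    return areas


# Hypothetical stand-in for a segmentation model's binary mask (1 = building pixel).
building_mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
]

areas = count_regions(building_mask)
print(len(areas), areas)  # 3 [4, 2, 1]
```

Because the computation runs on the mask itself, the count and per-region areas are exact with respect to the mask; any error comes from the segmentation step rather than from a lossy embedding.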

Peter A. Massih, Eric Cosatto • 2026

Related benchmarks

Task                                     Dataset                  Metric     Result   Rank
Spatial reasoning on satellite imagery   SQuID Tier 1             Accuracy   53.52    6
Spatial reasoning on satellite imagery   SQuID Tier 2             Accuracy   54.06    6
Spatial reasoning on satellite imagery   SQuID (Tier 3)           Accuracy   18.84    6
Spatial reasoning on satellite imagery   SQuID Overall            Accuracy   42       6
Spatial reasoning on satellite imagery   SQuID (fragmentation)    Accuracy   81.63    6
Spatial reasoning on satellite imagery   SQuID (connectivity)     Accuracy   74.04    6
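As a rough illustration of how accuracy figures like these could be computed for a benchmark with numerical-range and categorical answers, consider the sketch below. The scoring rule is an assumption, not the benchmark's documented protocol: a numerical prediction counts as correct when it falls inside the inclusive answer range, and categorical answers are compared case-insensitively. The names `is_correct` and `accuracy` are hypothetical.

```python
def is_correct(prediction, answer):
    """Score one prediction. `answer` is a (low, high) tuple or a category string."""
    if isinstance(answer, tuple):
        low, high = answer
        try:
            value = float(prediction)
        except (TypeError, ValueError):
            return False  # Non-numeric output for a numerical question.
        # Assumed rule: inclusive bounds on the answer range.
        return low <= value <= high
    # Assumed rule: case-insensitive exact match for categorical answers.
    return str(prediction).strip().lower() == answer.strip().lower()


def accuracy(predictions, answers):
    """Return the percentage of predictions scored as correct."""
    if not answers:
        return 0.0
    correct = sum(is_correct(p, a) for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Made-up example data, not drawn from SQuID.
predictions = [12, "yes", 40, "no"]
answers = [(10, 15), "Yes", (20, 30), "yes"]
print(accuracy(predictions, answers))  # 50.0
```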