MolmoPoint: Better Pointing for VLMs with Grounding Tokens

About

Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make this model more fine-grained, we follow these pointing tokens with an additional special token that selects a fine-grained subpatch within the initially selected region, and then a third token that specifies a location within that subpatch. We further show that performance improves by generating points sequentially in a consistent order, encoding the relative position of the previously selected point, and including a special no-more-points class when selecting visual tokens. Using this method, we set a new state-of-the-art on image pointing (70.7% on PointBench), set a new state-of-the-art among fully open models on GUI pointing (61.1% on ScreenSpotPro), and improve video pointing (59.1% human preference win rate vs. a text coordinate baseline) and tracking (+6.3% gain on Molmo2Track). We additionally show that our method achieves much higher sample efficiency and discuss the qualitative differences that emerge from this design change.
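The coarse-to-fine selection described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a scaled dot-product score between the pointing token and each visual token, and regular grids for the patch, subpatch, and in-subpatch levels. The function names (`select_visual_token`, `to_pixel`) and all parameters (`stop_key`, `grid_w`, `patch_px`, `sub`, `loc`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def select_visual_token(query, token_keys, stop_key):
    """Pick the visual token that best matches the pointing-token query.

    Hypothetical scoring: scaled dot product. Returns None when the extra
    "no more points" class scores highest.
    """
    # Append the "no more points" key as one extra class after the visual tokens.
    candidates = np.vstack([token_keys, stop_key[None, :]])
    # Score every candidate against the pointing-token query.
    scores = candidates @ query / np.sqrt(query.shape[0])
    best = int(np.argmax(scores))
    # Index len(token_keys) is the stop class: no further point is emitted.
    return None if best == len(token_keys) else best

def to_pixel(patch_idx, subpatch_idx, loc_idx, grid_w, patch_px, sub, loc):
    """Convert three hierarchical indices to a pixel coordinate (x, y).

    Assumed layout: patches in a row-major grid of width grid_w, each patch
    split into sub x sub subpatches, each subpatch into loc x loc cells.
    The returned point is the center of the selected cell.
    """
    sub_px = patch_px / sub   # side length of one subpatch in pixels
    cell_px = sub_px / loc    # side length of one location cell in pixels
    x = ((patch_idx % grid_w) * patch_px
         + (subpatch_idx % sub) * sub_px
         + (loc_idx % loc + 0.5) * cell_px)
    y = ((patch_idx // grid_w) * patch_px
         + (subpatch_idx // sub) * sub_px
         + (loc_idx // loc + 0.5) * cell_px)
    return x, y

# Toy usage: three visual tokens with one-hot keys and a zero stop key.
keys = np.eye(3)
query = np.array([0.0, 1.0, 0.0])
picked = select_visual_token(query, keys, stop_key=np.zeros(3))
```

Under these assumptions a point costs three selection tokens, however large the image is, whereas a text-coordinate output spends tokens on each digit of x and y.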

Christopher Clark, Yue Yang, Jae Sung Park, Zixian Ma, Jieyu Zhang, Rohun Tripathi, Mohammadreza Salehi, Sangho Lee, Taira Anderson, Winson Han, Ranjay Krishna • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
GUI Grounding | ScreenSpot v2 | Avg Accuracy | 93.9 | 283
Text-based Visual Question Answering | TextVQA (val) | Accuracy | 86 | 262
Referring Video Object Segmentation | Ref-YouTube-VOS (val) | J&F Score | 70.5 | 244
Long Video Understanding | LongVideoBench (val) | Accuracy | 68 | 210
GUI Grounding | ScreenSpot Pro | Accuracy | 61.1 | 163
Referring Video Object Segmentation | MeViS (val) | J&F Score | 0.635 | 161
Video Understanding | MVBench (test) | Accuracy | 75.9 | 151
Visual Question Answering | VQA v2 (val) | Accuracy | 87.2 | 144
GUI Grounding | OSWorld-G | -- | -- | 107
Mathematical Reasoning | MathVista (testmini) | Accuracy | 59.4 | 103
Showing 10 of 45 rows
