MolmoPoint: Better Pointing for VLMs with Grounding Tokens
About
Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make this model more fine-grained, we follow these pointing tokens with an additional special token that selects a fine-grained subpatch within the initially selected region, and then a third token that specifies a location within that subpatch. We further show that performance improves by generating points sequentially in a consistent order, encoding the relative position of the previously selected point, and including a special no-more-points class when selecting visual tokens. Using this method, we set a new state-of-the-art on image pointing (70.7% on PointBench), set a new state-of-the-art among fully open models on GUI pointing (61.1% on ScreenSpotPro), and improve video pointing (59.1% human preference win rate vs. a text coordinate baseline) and tracking (+6.3% gain on Molmo2Track). We additionally show that our method achieves much higher sample efficiency and discuss the qualitative differences that emerge from this design change.
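The selection mechanism described above can be illustrated with a minimal sketch. It is not the authors' implementation: the scaled dot-product scoring, the function names, the 2x2 subpatch grid, and the fractional offset encoding are all assumptions made for illustration. It shows a pointing query choosing among visual tokens plus an extra no-more-points class, followed by coarse-to-fine refinement into pixel coordinates.

```python
import numpy as np

def select_visual_token(point_query, visual_keys, no_more_points_key):
    """Pick the visual token that best matches the pointing query.

    Returns the selected token index, or None when the extra
    no-more-points class scores highest. (Hypothetical helper.)
    """
    # Treat the no-more-points key as one additional candidate class.
    candidates = np.vstack([visual_keys, no_more_points_key[None, :]])
    # Scaled dot-product scores, assumed here by analogy with attention.
    scores = candidates @ point_query / np.sqrt(point_query.shape[0])
    best = int(np.argmax(scores))
    return None if best == len(visual_keys) else best

def refine_point(token_index, subpatch_index, offset,
                 grid_width, patch_size, subgrid=2):
    """Turn (token, subpatch, in-subpatch offset) into pixel (x, y).

    `offset` is a fractional (dx, dy) in [0, 1) inside the subpatch;
    `subgrid` is the assumed number of subpatches per patch side.
    """
    row, col = divmod(token_index, grid_width)
    sub_row, sub_col = divmod(subpatch_index, subgrid)
    sub_size = patch_size / subgrid
    x = col * patch_size + (sub_col + offset[0]) * sub_size
    y = row * patch_size + (sub_row + offset[1]) * sub_size
    return (x, y)

# Toy setup: 6 visual tokens on a 3-wide grid, 8-dim orthonormal keys.
visual_keys = np.eye(6, 8)
no_more_points_key = np.eye(8)[7]

chosen = select_visual_token(visual_keys[4], visual_keys, no_more_points_key)
stop = select_visual_token(no_more_points_key, visual_keys, no_more_points_key)
point = refine_point(chosen, subpatch_index=3, offset=(0.5, 0.5),
                     grid_width=3, patch_size=14)
```

In this toy setup each stage narrows the location: the first selection fixes a patch, the second a subpatch within it, and the third a position inside that subpatch, so one point costs three tokens regardless of image resolution.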
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| GUI Grounding | ScreenSpot v2 | Avg Accuracy | 93.9 | 283 |
| Text-based Visual Question Answering | TextVQA (val) | Accuracy | 86 | 262 |
| Referring Video Object Segmentation | Ref-YouTube-VOS (val) | J&F Score | 70.5 | 244 |
| Long Video Understanding | LongVideoBench (val) | Accuracy | 68 | 210 |
| GUI Grounding | ScreenSpot Pro | Accuracy | 61.1 | 163 |
| Referring Video Object Segmentation | MeViS (val) | J&F Score | 0.635 | 161 |
| Video Understanding | MVBench (test) | Accuracy | 75.9 | 151 |
| Visual Question Answering | VQA v2 (val) | Accuracy | 87.2 | 144 |
| GUI Grounding | OSWorld-G | -- | -- | 107 |
| Mathematical Reasoning | MathVista (testmini) | Accuracy | 59.4 | 103 |