Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation
About
With the rising importance of spatially grounded tasks such as Vision-Language Navigation/Action, allocentric perception in Vision-Language Models (VLMs) is receiving growing attention. However, VLMs remain brittle on allocentric spatial queries that require explicit perspective shifts, where the answer depends on reasoning in a target-centric frame rather than the observed camera view. We therefore introduce Allocentric Perceiver, a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts and then instantiates a query-conditioned allocentric reference frame aligned with the instruction's semantic intent. By deterministically transforming the reconstructed geometry into the target frame and prompting the backbone VLM with structured, geometry-grounded representations, Allocentric Perceiver offloads mental rotation from implicit reasoning to explicit computation. We evaluate Allocentric Perceiver across multiple backbone families on spatial reasoning benchmarks, observing consistent and substantial gains ($\sim$10%) on allocentric tasks while maintaining strong egocentric performance, and surpassing both spatial-perception-finetuned models and state-of-the-art open-source and proprietary models.
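At its core, the perspective shift is a deterministic change of basis: once a geometric expert provides a target object's 3D position and facing direction in the camera frame, every reconstructed point can be re-expressed in a frame centered on that target. The sketch below illustrates this idea under assumed conventions (x = forward, y = left, z = up for the target frame); the function names and numeric values are illustrative and not taken from the released implementation.

```python
import numpy as np

def build_allocentric_frame(target_position, target_forward, up=np.array([0.0, 0.0, 1.0])):
    """Build a rotation whose rows are the target-frame axes expressed in camera
    coordinates (assumed convention: x = forward, y = left, z = up).
    Degenerate if the forward direction is parallel to `up`."""
    f = target_forward / np.linalg.norm(target_forward)
    left = np.cross(up, f)
    left /= np.linalg.norm(left)
    u = np.cross(f, left)
    R = np.stack([f, left, u], axis=0)
    return R, target_position

def to_allocentric(points_cam, R, origin):
    """Express camera-frame 3D points in the target-centric frame."""
    return (points_cam - origin) @ R.T

# Illustrative query: is object B to the left of object A, from A's point of view?
A_pos = np.array([0.5, 0.2, 3.0])    # A's reconstructed position (camera frame)
A_fwd = np.array([0.0, -0.1, -1.0])  # A's facing direction from an orientation expert
B_pos = np.array([-0.4, 0.1, 2.5])   # B's reconstructed position (camera frame)

R, origin = build_allocentric_frame(A_pos, A_fwd)
B_in_A = to_allocentric(B_pos[None, :], R, origin)[0]
print("B is", "left" if B_in_A[1] > 0 else "right", "of A in A's frame")
```

Once geometry is rewritten in the target frame, relations such as "left of" or "behind" reduce to sign checks on coordinates, which is what allows the backbone VLM to answer from a structured, geometry-grounded representation instead of performing the mental rotation implicitly.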
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Egocentric Spatial Reasoning | 3DSRBench Egocentric (test) | Orientation Accuracy (Cam.V) | 57.43 | 24 |
| Allocentric Spatial Reasoning | Viewspatial-Bench Allocentric (test) | Psn.V Orientation | 0.4762 | 15 |
| Allocentric Spatial Reasoning | Viewspatial-Bench and 3DSRBench Allocentric Tasks (test) | Psn.V Orient. | 47.62 | 9 |