Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation
About
With the rising importance of spatially grounded tasks such as Vision-Language Navigation/Action, allocentric perception in Vision-Language Models (VLMs) is receiving growing attention. However, VLMs remain brittle on allocentric spatial queries that require explicit perspective shifts, where the answer depends on reasoning in a target-centric frame rather than the observed camera view. We therefore introduce Allocentric Perceiver, a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts and then instantiates a query-conditioned allocentric reference frame aligned with the instruction's semantic intent. By deterministically transforming the reconstructed geometry into the target frame and prompting the backbone VLM with structured, geometry-grounded representations, Allocentric Perceiver offloads mental rotation from implicit reasoning to explicit computation. We evaluate Allocentric Perceiver across multiple backbone families on spatial reasoning benchmarks, observing consistent and substantial gains ($\sim$10%) on allocentric tasks while maintaining strong egocentric performance, and surpassing both spatial-perception-finetuned models and state-of-the-art open-source and proprietary models.
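At its core, the perspective shift is a deterministic change of basis: once a geometric expert provides a target object's 3D position and facing direction in the camera frame, every reconstructed point can be re-expressed in a frame centered on that target. The sketch below illustrates this idea under assumed conventions (x = forward, y = left, z = up for the target frame); the function names and numeric values are illustrative and not taken from the released implementation.

```python
import numpy as np

def build_allocentric_frame(target_position, target_forward, up=np.array([0.0, 0.0, 1.0])):
    """Build a rotation whose rows are the target-frame axes expressed in camera
    coordinates (assumed convention: x = forward, y = left, z = up).
    Degenerate if the forward direction is parallel to `up`."""
    f = target_forward / np.linalg.norm(target_forward)
    left = np.cross(up, f)
    left /= np.linalg.norm(left)
    u = np.cross(f, left)
    R = np.stack([f, left, u], axis=0)
    return R, target_position

def to_allocentric(points_cam, R, origin):
    """Express camera-frame 3D points in the target-centric frame."""
    return (points_cam - origin) @ R.T

# Illustrative query: is object B to the left of object A, from A's point of view?
A_pos = np.array([0.5, 0.2, 3.0])    # A's reconstructed position (camera frame)
A_fwd = np.array([0.0, -0.1, -1.0])  # A's facing direction from an orientation expert
B_pos = np.array([-0.4, 0.1, 2.5])   # B's reconstructed position (camera frame)

R, origin = build_allocentric_frame(A_pos, A_fwd)
B_in_A = to_allocentric(B_pos[None, :], R, origin)[0]
print("B is", "left" if B_in_A[1] > 0 else "right", "of A in A's frame")
```

Once geometry is rewritten in the target frame, relations such as "left of" or "behind" reduce to sign checks on coordinates, which is what allows the backbone VLM to answer from a structured, geometry-grounded representation instead of performing the mental rotation implicitly.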
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Egocentric Spatial Reasoning | 3DSRBench Egocentric (test) | Orientation Accuracy (Cam.V) | 57.43 | 24 |
| Allocentric Spatial Reasoning | Viewspatial-Bench Allocentric (test) | Psn.V Orientation | 0.4762 | 15 |
| Allocentric Spatial Reasoning | Viewspatial-Bench and 3DSRBench Allocentric Tasks (test) | Psn.V Orient. | 47.62 | 9 |