
Emergent Outlier View Rejection in Visual Geometry Grounded Transformers

About

Reliable 3D reconstruction from in-the-wild image collections is often hindered by "noisy" images: irrelevant inputs with little or no view overlap with the others. While traditional Structure-from-Motion pipelines handle such cases through geometric verification and outlier rejection, feed-forward 3D reconstruction models lack these explicit mechanisms, leading to degraded performance under in-the-wild conditions. In this paper, we discover that existing feed-forward reconstruction models such as VGGT can inherently distinguish distractor images, despite lacking explicit outlier-rejection mechanisms or noise-aware training. Through an in-depth analysis under varying proportions of synthetic distractors, we identify a specific layer that naturally exhibits outlier-suppressing behavior. Further probing reveals that this layer encodes discriminative internal representations that enable effective noise filtering, which we leverage directly to perform outlier-view rejection in feed-forward 3D reconstruction without any additional fine-tuning or supervision. Extensive experiments on both controlled and in-the-wild datasets demonstrate that this implicit filtering mechanism is consistent and generalizes well across diverse scenarios.
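To make the idea of filtering views from internal representations concrete, here is a minimal sketch. It is not the authors' procedure: it assumes one pooled feature vector per view has already been extracted from some intermediate layer, scores each view by cosine similarity to a reference view (index 0), and drops views below a threshold. The function name `reject_outlier_views`, the pooling, the choice of reference view, and the threshold value are all hypothetical.

```python
import numpy as np


def reject_outlier_views(features, threshold=0.5):
    """Return (keep_mask, similarity) for a set of views.

    features:  (N, D) array, one pooled feature vector per view.
               ASSUMPTION: these come from an intermediate layer of the
               reconstruction model; how they are extracted is not shown.
    threshold: hypothetical cosine-similarity cutoff.
    """
    feats = np.asarray(features, dtype=np.float64)
    # L2-normalise each row so that dot products are cosine similarities.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.clip(norms, 1e-12, None)
    # Similarity of every view to the reference view (index 0, assumed).
    similarity = unit @ unit[0]
    keep = similarity >= threshold
    keep[0] = True  # the reference view is always kept
    return keep, similarity


if __name__ == "__main__":
    # Toy features: view 1 resembles the reference, view 2 does not.
    toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
    keep, sim = reject_outlier_views(toy, threshold=0.5)
    print(keep, sim)
```

Only the views where `keep` is true would then be passed on to reconstruction. A fixed threshold is the simplest choice; the paper's actual scoring and selection rule may differ.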

Jisang Han, Sunghwan Hong, Jaewoo Jung, Wooseok Jang, Honggyu An, Qianqian Wang, Seungryong Kim, Chen Feng • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Multi-view Depth Estimation | ETH3D | Relative Error (rel) | 3.01 | 12 |
| Camera pose estimation | Phototourism Small noise | ATE | 0.2641 | 8 |
| Camera pose estimation | Phototourism Medium noise | ATE | 0.2645 | 8 |
| Camera pose estimation | Phototourism Large noise | ATE | 0.2664 | 8 |
| Camera pose estimation | On-the-Go Small noise | ATE | 0.0521 | 8 |
| Camera pose estimation | On-the-Go Medium noise | ATE | 0.0568 | 8 |
| Camera pose estimation | ETH3D Small noise | ATE | 0.6224 | 8 |
| Camera pose estimation | ETH3D Medium noise | ATE | 0.7673 | 8 |
| Camera pose estimation | ETH3D Large noise | ATE | 0.6874 | 8 |
| Multi-view Depth Estimation | ETH3D Small noise level | AbsRel | 0.0288 | 8 |
Showing 10 of 16 rows
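
For readers unfamiliar with the metrics listed above, the following sketch shows how AbsRel and ATE are commonly computed. It rests on assumptions that the page does not confirm: AbsRel is taken as the mean of |predicted − ground truth| / ground truth over pixels with valid depth, and ATE as the RMSE of camera positions after a similarity (scale, rotation, translation) alignment in the style of Umeyama. The exact evaluation protocol behind the numbers above may differ, and the function names are illustrative.

```python
import numpy as np


def abs_rel(pred_depth, gt_depth):
    """Mean absolute relative depth error over pixels with gt > 0."""
    pred = np.asarray(pred_depth, dtype=np.float64)
    gt = np.asarray(gt_depth, dtype=np.float64)
    valid = gt > 0  # ASSUMPTION: non-positive depth marks invalid pixels
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))


def ate_rmse(pred_xyz, gt_xyz):
    """RMSE of camera positions after similarity alignment of pred to gt.

    pred_xyz, gt_xyz: (N, 3) arrays of corresponding camera centres.
    """
    P = np.asarray(pred_xyz, dtype=np.float64)
    G = np.asarray(gt_xyz, dtype=np.float64)
    mu_p, mu_g = P.mean(axis=0), G.mean(axis=0)
    Pc, Gc = P - mu_p, G - mu_g
    # Cross-covariance between ground-truth and predicted positions.
    cov = Gc.T @ Pc / len(P)
    U, S, Vt = np.linalg.svd(cov)
    # Guard against reflections so that R is a proper rotation.
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0
    R = U @ D @ Vt
    var_p = (Pc ** 2).sum() / len(P)
    scale = np.trace(np.diag(S) @ D) / var_p
    t = mu_g - scale * (R @ mu_p)
    aligned = scale * (P @ R.T) + t
    return float(np.sqrt(np.mean(np.sum((aligned - G) ** 2, axis=1))))


if __name__ == "__main__":
    print(abs_rel([1.1, 2.0, 5.0], [1.0, 2.0, 0.0]))
    P = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
    Rz = np.array([[0.0, -1, 0], [1, 0, 0], [0, 0, 1]])
    G = 2.0 * (P @ Rz.T) + np.array([1.0, 2.0, 3.0])
    print(ate_rmse(P, G))
```

For both metrics, lower values are better.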
