LongStream: Long-Sequence Streaming Autoregressive Visual Geometry

About

Long-sequence streaming 3D reconstruction remains a significant open challenge. Existing autoregressive models often fail when processing long sequences. They typically anchor poses to the first frame, which leads to attention decay, scale drift, and extrapolation errors. We introduce LongStream, a novel gauge-decoupled streaming visual geometry model for metric-scale scene reconstruction across thousands of frames. Our approach is threefold. First, we discard the first-frame anchor and predict keyframe-relative poses. This reformulates long-range extrapolation into a constant-difficulty local task. Second, we introduce orthogonal scale learning. This method fully disentangles geometry from scale estimation to suppress drift. Finally, we solve Transformer cache issues such as attention-sink reliance and long-term KV-cache contamination. We propose cache-consistent training combined with periodic cache refresh. This approach suppresses attention degradation over ultra-long sequences and reduces the gap between training and inference. Experiments show LongStream achieves state-of-the-art performance. It delivers stable, metric-scale reconstruction over kilometer-scale sequences at 18 FPS. Project Page: https://3dagentworld.github.io/longstream/
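The first idea in the abstract, predicting keyframe-relative poses instead of poses anchored to the first frame, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the fixed `keyframe_interval` rule for choosing keyframes, and the translation-only toy poses are all assumptions made for illustration. It only shows how per-frame poses expressed relative to the most recent keyframe can be chained back into a common world frame.

```python
from typing import List

Mat4 = List[List[float]]  # 4x4 homogeneous transform stored as nested lists


def matmul4(a: Mat4, b: Mat4) -> Mat4:
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(x: float, y: float, z: float) -> Mat4:
    """Build a pure-translation transform (rotation left as identity)."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


def compose_keyframe_relative(rel_poses: List[Mat4],
                              keyframe_interval: int) -> List[Mat4]:
    """Chain keyframe-relative poses into world-frame poses.

    rel_poses[i] is the pose of frame i expressed in the frame of the
    current keyframe. Assumption (hypothetical rule): every
    keyframe_interval-th frame becomes the keyframe for later frames.
    """
    world_from_key = translation(0.0, 0.0, 0.0)  # first keyframe defines the world
    world_poses = []
    for i, key_from_frame in enumerate(rel_poses):
        world_from_frame = matmul4(world_from_key, key_from_frame)
        world_poses.append(world_from_frame)
        if i % keyframe_interval == 0:
            # Frame i becomes the reference for the frames that follow.
            world_from_key = world_from_frame
    return world_poses


# Toy trajectory: a camera moving one unit along x per frame,
# with a new keyframe every 2 frames.
rel = [translation(0, 0, 0),   # frame 0 (keyframe)
       translation(1, 0, 0),   # frame 1, relative to frame 0
       translation(2, 0, 0),   # frame 2, relative to frame 0 (new keyframe)
       translation(1, 0, 0),   # frame 3, relative to frame 2
       translation(2, 0, 0),   # frame 4, relative to frame 2 (new keyframe)
       translation(1, 0, 0)]   # frame 5, relative to frame 4
world = compose_keyframe_relative(rel, keyframe_interval=2)
x_positions = [pose[0][3] for pose in world]
```

In this toy setup every relative pose stays within two units of its keyframe no matter how long the sequence grows, which is the sense in which the prediction target remains local while the world-frame trajectory keeps extending.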

Chong Cheng, Xianda Chen, Tao Xie, Wei Yin, Weiqiang Ren, Qian Zhang, Xiaoyuang Guo, Hao Wang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| 3D Reconstruction | 7 Scenes | -- | 32 |
| Camera pose estimation | Oxford Spires | ATE 19.815 | 8 |
| Camera pose estimation | Waymo (held-out) | ATE 0.737 | 8 |
| Camera pose estimation | vKITTI | ATE (Scene 01) 1.422 | 8 |
| Pose Estimation | TUM-RGBD | ATE 0.076 | 8 |
| Pose Estimation | KITTI (Sequences 00-10) | KITTI Seq 01 Result 46.01 | 8 |
| 3D Reconstruction | TUM | CD 0.225 | 8 |
