S2GS: Streaming Semantic Gaussian Splatting for Online Scene Understanding and Reconstruction
About
Existing offline feed-forward methods for joint scene understanding and reconstruction on long image streams often repeatedly perform global computation over an ever-growing set of past observations, causing runtime and GPU memory to increase rapidly with sequence length and limiting scalability. We propose Streaming Semantic Gaussian Splatting (S2GS), a strictly causal, incremental 3D Gaussian semantic field framework: it does not leverage future frames and continuously updates scene geometry, appearance, and instance-level semantics without reprocessing historical frames, enabling scalable online joint reconstruction and understanding. S2GS adopts a geometry-semantic decoupled dual-backbone design: the geometry branch performs causal modeling to drive incremental Gaussian updates, while the semantic branch leverages a 2D foundation vision model and a query-driven decoder to predict segmentation masks and identity embeddings, further stabilized by query-level contrastive alignment and lightweight online association with an instance memory. Experiments show that S2GS matches or outperforms strong offline baselines on joint reconstruction-and-understanding benchmarks, while significantly improving long-horizon scalability: it processes 1,000+ frames with much slower growth in runtime and GPU memory, whereas offline global-processing baselines typically run out of memory at around 80 frames under the same setting.
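To make the "lightweight online association with an instance memory" step more concrete, here is a minimal sketch of one way such a memory could work. It is not the authors' implementation: the class name `InstanceMemory`, the greedy cosine-similarity matching, the `sim_threshold` and `momentum` values, and the moving-average prototype update are all assumptions chosen for illustration. The abstract only states that per-frame identity embeddings are associated online with an instance memory.

```python
import math
from dataclasses import dataclass, field


def _normalize(v):
    """Return v scaled to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def _cosine(a, b):
    """Cosine similarity of two unit-norm vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


@dataclass
class InstanceMemory:
    """Hypothetical instance memory: one prototype embedding per global ID."""

    sim_threshold: float = 0.7  # assumed minimum similarity for a match
    momentum: float = 0.9       # assumed moving-average weight for updates
    prototypes: dict = field(default_factory=dict)  # global_id -> unit vector
    _next_id: int = 0

    def associate(self, embeddings):
        """Assign a global instance ID to each identity embedding of a frame.

        Greedy matching: the most similar (detection, prototype) pairs are
        taken first, and each prototype is used at most once per frame.
        Only the current frame and the memory are touched, so the update
        is causal and never revisits past frames.
        """
        dets = [_normalize(e) for e in embeddings]
        pairs = sorted(
            (
                (_cosine(d, p), i, gid)
                for i, d in enumerate(dets)
                for gid, p in self.prototypes.items()
            ),
            reverse=True,
        )
        assigned, used = {}, set()
        for sim, i, gid in pairs:
            if sim < self.sim_threshold:
                break  # remaining pairs are even less similar
            if i in assigned or gid in used:
                continue
            assigned[i] = gid
            used.add(gid)

        for i, d in enumerate(dets):
            if i in assigned:
                # Matched: blend the new embedding into the stored prototype.
                old = self.prototypes[assigned[i]]
                mixed = [
                    self.momentum * o + (1 - self.momentum) * x
                    for o, x in zip(old, d)
                ]
                self.prototypes[assigned[i]] = _normalize(mixed)
            else:
                # Unmatched: register a new instance in the memory.
                assigned[i] = self._next_id
                self.prototypes[self._next_id] = d
                self._next_id += 1
        return [assigned[i] for i in range(len(dets))]
```

Calling `associate` once per incoming frame keeps the per-frame cost proportional to the number of detections times the number of stored instances, independent of how many frames have already been processed. That is the property the abstract contrasts with offline methods that recompute over all past observations.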
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Novel View Synthesis | ScanNet | PSNR 18.71 | 130 |
| Novel View Synthesis | Replica | PSNR 15.66 | 69 |
| Novel View Synthesis | ScanNet++ | PSNR 15.33 | 67 |
| Semantic Segmentation | ScanNet short-sequence | mIoU 52.35 | 21 |
| Novel View Synthesis | ScanNet short-sequence | PSNR 24.9 | 16 |
| Semantic Segmentation | Replica | -- | 16 |
| Semantic Segmentation | ScanNet++ | mIoU 41.67 | 15 |
| Temporal Instance Consistency | ScanNet short-sequence | T-mIoU 44.89 | 12 |
| Online Scene Understanding and Reconstruction | ScanNet 2017 | Processing Time (s) 0.1 | 7 |
| Cross-frame Instance Consistency | ScanNet | T-mIoU 26.71 | 3 |
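
For readers unfamiliar with the two most common metrics in the table, the sketch below computes PSNR and mIoU using their conventional definitions on flattened pixel lists. It assumes the leaderboard follows these standard definitions, with PSNR in dB and mIoU reported as a percentage; the listing itself does not confirm the exact evaluation protocol. The function names and the `max_val` parameter are illustrative. T-mIoU is not covered because its definition is not given here.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over the classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both label maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")
```

Multiplying the `mean_iou` result by 100 gives a percentage comparable in scale to the mIoU entries above.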