Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners

About

One of the most exciting applications of vision models involves pixel-level reasoning. Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio-temporal properties of visual scenes at the pixel level. Existing frameworks train either on image-based pretext tasks, which do not account for dynamic elements, or on video sequences for action-level reasoning, an approach that does not scale to dense pixel-level prediction. We present LILA, a framework that learns pixel-accurate feature descriptors from videos. The core element of our training framework is linear in-context learning. LILA leverages spatio-temporal cue maps (depth and motion) estimated with off-the-shelf networks. Despite the noisy nature of those cues, LILA trains effectively on uncurated video datasets, embedding semantic and geometric properties in a temporally consistent manner. We demonstrate compelling empirical benefits of the learned representation across a diverse suite of vision tasks: video object segmentation, surface normal estimation and semantic segmentation.
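The abstract does not spell out how linear in-context learning is implemented, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that a linear readout is fitted in closed form (ridge regression) from the per-pixel features of a context frame to that frame's cue map, and that the same readout is then scored on a query frame. The function names, the `ridge` parameter, and the toy shapes are all hypothetical. In an actual training setup the query loss would presumably be backpropagated into the feature encoder, which this NumPy toy does not do.

```python
import numpy as np


def fit_linear_readout(feats, targets, ridge=1e-3):
    """Closed-form ridge regression: W = (F^T F + ridge * I)^{-1} F^T Y.

    feats:   (n_pixels, feat_dim) per-pixel features of the context frame
    targets: (n_pixels, cue_dim) cue values (e.g. depth, motion) at those pixels
    """
    feat_dim = feats.shape[1]
    gram = feats.T @ feats + ridge * np.eye(feat_dim)
    return np.linalg.solve(gram, feats.T @ targets)


def in_context_loss(ctx_feats, ctx_cues, qry_feats, qry_cues, ridge=1e-3):
    """Fit the readout on the context frame, then score it on the query frame."""
    readout = fit_linear_readout(ctx_feats, ctx_cues, ridge)
    prediction = qry_feats @ readout
    return float(np.mean((prediction - qry_cues) ** 2))


# Toy data (assumption): both frames share one underlying linear feature-to-cue
# relation, observed with a little noise, mimicking noisy off-the-shelf cues.
rng = np.random.default_rng(0)
n_pixels, feat_dim, cue_dim = 256, 16, 3  # e.g. 1 depth + 2 motion channels
true_map = rng.normal(size=(feat_dim, cue_dim))

ctx_feats = rng.normal(size=(n_pixels, feat_dim))
qry_feats = rng.normal(size=(n_pixels, feat_dim))
ctx_cues = ctx_feats @ true_map + 0.01 * rng.normal(size=(n_pixels, cue_dim))
qry_cues = qry_feats @ true_map + 0.01 * rng.normal(size=(n_pixels, cue_dim))

loss = in_context_loss(ctx_feats, ctx_cues, qry_feats, qry_cues)
print(f"query-frame loss: {loss:.6f}")
```

In this sketch, a low query-frame loss indicates that a single linear map is enough to read the cues out of the features in both frames, which is one way temporally consistent features could be encouraged.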

Nikita Araslanov, Martin Sundermeyer, Hidenobu Matsuki, David Joseph Tan, Federico Tombari • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video Object Segmentation | DAVIS 2017 (val) | J mean: 72.5 | 1226 |
| Semantic segmentation | ADE20K | mIoU: 47.5 | 1028 |
| Semantic segmentation | COCO Stuff (val) | mIoU: 63.3 | 167 |
| Surface Normal Estimation | NYUv2 (val) | -- | 19 |
| Semantic segmentation | COCO-Stuff coarse set annotation (C=27) (held-out set (seen and 15 unseen categories)) | mIoU (Seen): 34.6 | 4 |