Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities
About
Unsupervised video-based object-centric learning is a promising avenue to learn structured representations from large, unlabeled video collections, but previous approaches have only managed to scale to real-world datasets in restricted domains. Recently, it was shown that the reconstruction of pre-trained self-supervised features leads to object-centric representations on unconstrained real-world image datasets. Building on this approach, we propose a novel way to use such pre-trained features in the form of a temporal feature similarity loss. This loss encodes semantic and temporal correlations between image patches and is a natural way to introduce a motion bias for object discovery. We demonstrate that this loss leads to state-of-the-art performance on the challenging synthetic MOVi datasets. When used in combination with the feature reconstruction loss, our model is the first object-centric video model that scales to unconstrained video datasets such as YouTube-VIS.
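To make the idea of a temporal feature similarity loss concrete, here is a minimal NumPy sketch. This summary does not specify the formulation, so the choices below are assumptions for illustration: cosine similarity between patch features of two frames, a softmax with a hypothetical `temperature` parameter to turn each row into a target distribution, and a cross-entropy between those targets and the model's predicted logits. The function names are hypothetical and not taken from the authors' code.

```python
import numpy as np


def temporal_similarity_targets(feats_t, feats_tk, temperature=0.1):
    """Build per-patch target distributions from two frames' patch features.

    feats_t, feats_tk: arrays of shape (num_patches, dim), e.g. features from
    a frozen self-supervised encoder at time t and at a later time t + k.
    The cosine similarity, softmax, and temperature are assumed choices.
    """
    a = feats_t / np.linalg.norm(feats_t, axis=-1, keepdims=True)
    b = feats_tk / np.linalg.norm(feats_tk, axis=-1, keepdims=True)
    sim = a @ b.T  # (num_patches, num_patches) cosine similarities
    logits = sim / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)


def similarity_loss(pred_logits, targets):
    """Mean cross-entropy between predicted logits and the target distributions."""
    shifted = pred_logits - pred_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=-1).mean())


# Usage with random stand-in features (real features would come from an encoder).
rng = np.random.default_rng(0)
frame_t = rng.normal(size=(6, 8))
frame_tk = frame_t + 0.05 * rng.normal(size=(6, 8))  # slightly changed next frame
targets = temporal_similarity_targets(frame_t, frame_tk)
loss = similarity_loss(np.zeros((6, 6)), targets)  # loss of a uniform prediction
```

Because each target row describes where a patch's content is likely to appear in the later frame, a model trained to predict it is pushed to group patches that look alike and move together, which is the motion bias the abstract refers to.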
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video segmentation | DAVIS | -- | -- | 14 |
| Unsupervised Video Object Segmentation | DAVIS U17 (val) | J&F Mean Score | 29 | 11 |
| Object dynamics prediction | YouTube VIS 2021 (test) | FG-ARI | 28.9 | 9 |
| Unsupervised object-centric learning | Abdominal surgical dataset (test) | mBO-V | 46.3 | 8 |
| Unsupervised object-centric learning | Cholecystectomy surgical dataset (test) | mBO-V | 30.1 | 8 |
| Unsupervised object-centric learning | Thoracic surgical dataset (test) | mBO-V | 21.9 | 8 |
| Object Discovery | MOVi-E v1 (test) | FG-ARI | 73.9 | 7 |
| Unsupervised image segmentation | MOVi-E individual frames | -- | -- | 7 |
| Object Discovery | MOVi-C v1 (test) | FG-ARI | 64.8 | 6 |
| Unsupervised image segmentation | MOVi-C individual frames | -- | -- | 6 |