Advancing Complex Video Object Segmentation via Progressive Concept Construction
About
We propose Segment Concept (SeC), a concept-driven video object segmentation (VOS) framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. To balance semantic reasoning with computational overhead, SeC runs the LVLM forward pass only when a new scene appears, injecting concept-level features at those points. To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. Empirical evaluations demonstrate that SeC substantially outperforms state-of-the-art approaches, including SAM 2 and its advanced variants, on both SeCVOS and standard VOS benchmarks. In particular, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state-of-the-art in concept-aware VOS.
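The idea of running the expensive model only at scene changes can be illustrated with a minimal sketch. It is not the SeC implementation: the abstract does not say how a new scene is detected, so the histogram-distance test, the threshold, the call on the first frame, and every name below (`histogram`, `is_new_scene`, `gated_concept_calls`, `concept_model`) are assumptions made for illustration only.

```python
from typing import Callable, List, Sequence

# Flattened grayscale pixels in [0, 255]; a stand-in for real video frames.
Frame = Sequence[int]


def histogram(frame: Frame, bins: int = 8) -> List[float]:
    """Normalized intensity histogram of one frame."""
    counts = [0] * bins
    for px in frame:
        counts[min(px * bins // 256, bins - 1)] += 1
    total = len(frame) or 1
    return [c / total for c in counts]


def is_new_scene(prev: Frame, cur: Frame, threshold: float = 0.5) -> bool:
    """Hypothetical scene-change test: L1 distance between histograms."""
    h_prev, h_cur = histogram(prev), histogram(cur)
    return sum(abs(a - b) for a, b in zip(h_prev, h_cur)) > threshold


def gated_concept_calls(
    frames: Sequence[Frame],
    concept_model: Callable[[Frame], None],
) -> List[int]:
    """Call concept_model on frame 0 (an assumption) and on each frame that
    opens a new scene; return the indices where it was called."""
    called: List[int] = []
    for i, frame in enumerate(frames):
        if i == 0 or is_new_scene(frames[i - 1], frame):
            concept_model(frame)  # placeholder for the costly LVLM pass
            called.append(i)
    return called


if __name__ == "__main__":
    # Two dark frames followed by two bright frames: one scene change.
    video = [[10] * 16, [12] * 16, [200] * 16, [205] * 16]
    print(gated_concept_calls(video, lambda frame: None))
```

In this toy setup the cost of the concept model scales with the number of scene changes rather than the number of frames, which is the trade-off the abstract describes.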
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Object Segmentation | DAVIS 2017 (val) | -- | -- | 1193 |
| Video Object Segmentation | YouTube-VOS 2019 (val) | -- | -- | 231 |
| Video Object Segmentation | SA-V (val) | J&F | 82.7 | 114 |
| Video Object Segmentation | SA-V (test) | J&F | 81.7 | 110 |
| Video Object Segmentation | LVOS v2 (val) | J&F | 86.5 | 54 |
| Video Object Segmentation | MOSE v2 (val) | J&F | 53.8 | 17 |
| Video Object Segmentation | MOSE v1 (val) | J&F | 75.3 | 17 |
| Video Object Segmentation | M³-VOS core | Jaccard (J) | 67.2 | 12 |
| Video Object Segmentation | SeCVOS No Scene Change | J&F | 84.2 | 7 |
| Video Object Segmentation | SeCVOS Single Scene Change | J&F | 69.6 | 7 |
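
For readers unfamiliar with the metrics in the table: J is the Jaccard index (intersection over union) between predicted and ground-truth masks, and J&F is the mean of J and the contour accuracy F. The sketch below computes J for one frame and combines it with an F value supplied by the caller. The function names are illustrative, and the boundary F-measure itself is not implemented here. Scoring two empty masks as 1.0 is an assumed convention that individual benchmarks may handle differently.

```python
from typing import Sequence

# Binary mask as rows of 0/1 values.
Mask = Sequence[Sequence[int]]


def jaccard(pred: Mask, gt: Mask) -> float:
    """Region similarity J: |pred AND gt| / |pred OR gt| for one frame."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j: float, f: float) -> float:
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2.0


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [0, 0]]
    j = jaccard(pred, gt)
    # The F value here is a made-up input, not a measured result.
    print(j, j_and_f(j, 0.7))
```

Leaderboards typically report these scores multiplied by 100 and averaged over all objects and frames.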