SegDAC: Visual Generalization in Reinforcement Learning via Dynamic Object Tokens
About
Visual reinforcement learning policies trained on pixel observations often struggle to generalize when visual conditions change at test time. Object-centric representations are a promising alternative, but most approaches use fixed-size slot representations, require image reconstruction, or need auxiliary losses to learn object decompositions. As a result, it remains unclear how to learn RL policies directly from object-level inputs without these constraints. We propose SegDAC, a Segmentation-Driven Actor-Critic that operates on a variable-length set of object token embeddings. At each timestep, text-grounded segmentation produces object masks from which spatially aware token embeddings are extracted. A transformer-based actor-critic processes these dynamic tokens, using segment positional encoding to preserve spatial information across objects. We ablate these design choices and show that both segment positional encoding and variable-length processing are individually necessary for strong performance. We evaluate SegDAC on 8 ManiSkill3 manipulation tasks under 12 visual perturbation types across 3 difficulty levels. SegDAC improves over prior visual generalization methods by 15% on easy, 66% on medium, and 88% on hard settings. SegDAC matches the sample efficiency of state-of-the-art visual RL methods while achieving improved generalization under visual changes. Project Page: https://segdac.github.io/
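To make the idea of a variable-length set of object tokens concrete, here is a minimal NumPy sketch of one common way to batch such sets: pad them to a shared length and carry a boolean mask so padded slots are ignored by attention. This is an illustration under stated assumptions, not the authors' implementation. The function names `pad_token_sets` and `masked_attention_pool`, the single-query pooling step, and the embedding size are all hypothetical, and the sketch does not reproduce SegDAC's segmentation stage or its segment positional encoding.

```python
import numpy as np

def pad_token_sets(token_sets, embed_dim):
    """Stack variable-length token sets into one padded batch plus a validity mask.

    token_sets: list of arrays, each of shape (num_objects_i, embed_dim).
    Assumes every set holds at least one token (hypothetical simplification).
    """
    batch = len(token_sets)
    max_len = max(len(t) for t in token_sets)
    tokens = np.zeros((batch, max_len, embed_dim), dtype=np.float32)
    mask = np.zeros((batch, max_len), dtype=bool)  # True marks a real object token
    for i, t in enumerate(token_sets):
        n = len(t)
        tokens[i, :n] = t
        mask[i, :n] = True
    return tokens, mask

def masked_attention_pool(tokens, mask, query):
    """Single-query softmax attention over tokens that skips padded positions."""
    d = tokens.shape[-1]
    scores = tokens @ query / np.sqrt(d)            # shape (batch, max_len)
    scores = np.where(mask, scores, -np.inf)        # padded slots get zero weight
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return np.einsum("bl,bld->bd", weights, tokens)  # shape (batch, embed_dim)
```

With this masking, the pooled vector for an observation depends only on its own object tokens, so it does not change when the observation is batched with others that contain more objects.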
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| LiftPegUpright | ManiSkill3 Medium Lighting Direction v1 (test) | Success Rate | 42 | 7 |
| LiftPegUpright | ManiSkill3 Hard Mo Texture v1 (test) | Return | 28 | 7 |
| LiftPegUpright | ManiSkill3 (Hard Ground Color Test) | Success Rate | 38 | 7 |
| LiftPegUpright | ManiSkill3 Easy Mo Color v1 (test) | Success Rate | 40 | 7 |
| LiftPegUpright | ManiSkill Medium Mo Texture 3 (test) | Success Rate | 30 | 7 |
| LiftPegUpright | ManiSkill3 Medium Lighting Color v1 (test) | Success Rate | 43 | 7 |
| LiftPegUpright | ManiSkill3 Hard Mo Color (test) | Success Rate | 39 | 7 |
| LiftPegUpright | ManiSkill Easy Camera Fov v3 (test) | Success Rate | 27 | 7 |
| LiftPegUpright | ManiSkill3 Hard Ground Texture v1 (test) | Success Rate | 40 | 7 |
| PickCube | ManiSkill3 Medium Table Color (test) | Success Rate | 17 | 7 |