SegDAC: Visual Generalization in Reinforcement Learning via Dynamic Object Tokens

About

Visual reinforcement learning policies trained on pixel observations often struggle to generalize when visual conditions change at test time. Object-centric representations are a promising alternative, but most approaches use fixed-size slot representations, require image reconstruction, or need auxiliary losses to learn object decompositions. As a result, it remains unclear how to learn RL policies directly from object-level inputs without these constraints. We propose SegDAC, a Segmentation-Driven Actor-Critic that operates on a variable-length set of object token embeddings. At each timestep, text-grounded segmentation produces object masks from which spatially aware token embeddings are extracted. A transformer-based actor-critic processes these dynamic tokens, using segment positional encoding to preserve spatial information across objects. We ablate these design choices and show that both segment positional encoding and variable-length processing are individually necessary for strong performance. We evaluate SegDAC on 8 ManiSkill3 manipulation tasks under 12 visual perturbation types across 3 difficulty levels. SegDAC improves over prior visual generalization methods by 15% on the easy, 66% on the medium, and 88% on the hardest settings. SegDAC matches the sample efficiency of state-of-the-art visual RL methods while generalizing better under visual changes. Project Page: https://segdac.github.io/
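The idea of processing a variable-length set of object tokens can be illustrated with a small, library-free sketch. This is not the SegDAC implementation: the helper names `pad_token_sets` and `masked_mean` are hypothetical, the embeddings are toy values, and the masked mean pooling stands in for the transformer, which in practice would typically receive the same mask as a key padding mask. The sketch only shows how observations with different numbers of segmented objects can share one batch without the padding influencing the result.

```python
from typing import List, Tuple

Token = List[float]  # one object-token embedding (toy values, assumption)


def pad_token_sets(
    batch: List[List[Token]], dim: int
) -> Tuple[List[List[Token]], List[List[bool]]]:
    """Pad each observation's token set to the longest set in the batch.

    Returns the padded tokens and a mask in which True marks a real token.
    """
    max_len = max((len(tokens) for tokens in batch), default=0)
    padded, mask = [], []
    for tokens in batch:
        n_pad = max_len - len(tokens)
        padded.append([list(t) for t in tokens] + [[0.0] * dim for _ in range(n_pad)])
        mask.append([True] * len(tokens) + [False] * n_pad)
    return padded, mask


def masked_mean(tokens: List[Token], mask: List[bool], dim: int) -> Token:
    """Average only the real tokens, so padding never affects the result."""
    real = [t for t, keep in zip(tokens, mask) if keep]
    if not real:
        return [0.0] * dim
    return [sum(t[i] for t in real) / len(real) for i in range(dim)]


# Two timesteps with different numbers of segmented objects (3 vs. 1).
batch = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[2.0, 4.0]],
]
padded, mask = pad_token_sets(batch, dim=2)
pooled = [masked_mean(t, m, dim=2) for t, m in zip(padded, mask)]
```

Here the second observation is padded from one token to three, and its pooled vector still equals its single real token because the mask excludes the padding.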

Alexandre Brown, Glen Berseth • 2025

Related benchmarks

Task	Dataset	Result
LiftPegUpright	ManiSkill3 Medium Lighting Direction v1 (test)	Success Rate 42	7
LiftPegUpright	ManiSkill3 Hard Mo Texture v1 (test)	Return 28	7
LiftPegUpright	ManiSkill3 (Hard Ground Color Test)	Success Rate 38	7
LiftPegUpright	ManiSkill3 Easy Mo Color v1 (test)	Success Rate 40	7
LiftPegUpright	ManiSkill Medium Mo Texture 3 (test)	Success Rate 30	7
LiftPegUpright	ManiSkill3 Medium Lighting Color v1 (test)	Success Rate 43	7
LiftPegUpright	ManiSkill3 Hard Mo Color (test)	Success Rate 39	7
LiftPegUpright	ManiSkill Easy Camera Fov v3 (test)	Success Rate 27	7
LiftPegUpright	ManiSkill3 Hard Ground Texture v1 (test)	Success Rate 40	7
PickCube	ManiSkill3 Medium Table Color (test)	Success Rate 17	7

Showing 10 of 232 rows
