Do-Undo: Generating and Reversing Physical Actions in Vision-Language Models
About
We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating physically plausible scene transformations driven by real-world actions. Unlike prior work focused on object-level edits, Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world. We curate a large-scale dataset of reversible actions from real-world videos and design a training strategy that enforces do-undo consistency for robust action grounding. Our experiments reveal that current models struggle with physical reversibility, underscoring the importance of this task for embodied AI, robotics, and physics-aware generative modeling. Do-Undo establishes an intuitive testbed for evaluating and advancing physical reasoning in multimodal systems.
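The enforced consistency can be pictured as a do-then-undo round trip that should land back on the original image. Below is a minimal sketch of such a cycle-consistency objective, assuming a generic differentiable action-conditioned editor `edit_model(image, action_text)`; the function names, the dummy editor, and the L1 reconstruction term are illustrative assumptions, not the paper's actual loss.

```python
# Illustrative sketch only: a do-undo cycle-consistency loss, assuming the
# editor is a differentiable function edit_model(image, action_text) -> image.
import torch
import torch.nn.functional as F


def do_undo_consistency_loss(edit_model, image, do_action, undo_action):
    """Apply the forward action, then its reverse, and penalise the
    distance between the reconstruction and the original image."""
    done = edit_model(image, do_action)      # simulate the physical action
    undone = edit_model(done, undo_action)   # attempt to reverse it
    # L1 reconstruction term; a perceptual or feature-space distance
    # could be substituted here.
    return F.l1_loss(undone, image)


if __name__ == "__main__":
    # Dummy editor for illustration: ignores the action text entirely.
    def dummy_editor(img, action):
        return img + 0.01 * torch.randn_like(img)

    x = torch.rand(1, 3, 64, 64)
    loss = do_undo_consistency_loss(
        dummy_editor, x, "crumple the paper", "flatten the paper"
    )
    print(f"consistency loss: {loss.item():.4f}")
```

The same objective could in principle be applied in the opposite order (undo-then-do) to symmetrise training, though the details depend on the model being trained.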
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action-conditioned image generation evaluation | Do-Undo zero-shot 1.0 | Instruction Adherence (IF) | 8.05 | 10 |
| Action-based Image Editing | Do-Undo FULL (test) | DINO-R | 0.81 | 10 |
| Action Generation | Do-Undo (evaluation) | DINO-R | 81.9 | 6 |
| Physics-aware Action Generation | Do-Undo Forward | Instruction Following (IF) | 7.81 | 5 |
| Action Generation | Human Evaluation, 48 instances (test) | Preference Rate | 58.3 | 3 |
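Several rows above report a DINO-R score. Its exact definition is not given here, so the sketch below simply assumes it is a cosine similarity between DINO ViT embeddings of the original image and the image recovered after the undo step; the backbone choice (DINO ViT-S/16 via torch.hub) and the preprocessing are assumptions.

```python
# Hypothetical DINO-feature reconstruction score: cosine similarity between
# DINO embeddings of the original and the undo-reconstructed image.
import torch
from torchvision import transforms
from PIL import Image

# Pretrained DINO ViT-S/16 backbone from the official torch.hub entry point.
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
dino.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])


@torch.no_grad()
def dino_similarity(original: Image.Image, reconstructed: Image.Image) -> float:
    """Cosine similarity between DINO CLS embeddings of two images."""
    batch = torch.stack([preprocess(original), preprocess(reconstructed)])
    feats = dino(batch)                                   # (2, 384) embeddings
    feats = torch.nn.functional.normalize(feats, dim=-1)  # unit-norm features
    return float((feats[0] * feats[1]).sum())
```

A score near 1.0 would indicate that the undo step restored the scene almost exactly; the benchmark's own DINO-R computation may aggregate or scale this differently (e.g. the 81.9 entry above suggests a percentage-style report).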