Do-Undo Bench: Reversibility for Action Understanding in Image Generation
About
We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions. Unlike prior work that relies on prompt-based image generation and editing to perform action-conditioned image manipulation, the Do-Undo task requires models to simulate the outcome of a real-world action and then reverse it, restoring the original state. This forward-reverse requirement tests genuine cause-and-effect understanding rather than stylistic or semantic editing ability. We curate a high-quality benchmark of reversible actions from real-world scenarios to enable robust action grounding. Our experiments reveal that current models struggle with action reversibility, highlighting the need to evaluate action understanding directly. Do-Undo provides an intuitive testbed for evaluating and advancing action-aware generation in multimodal systems that must reason about real-world dynamics.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Action-conditioned image generation evaluation | Do-Undo zero-shot 1.0 | Instruction Adherence (IF) | 8.05 | 10 |
| Action-based Image Editing | Do-Undo FULL (test) | DINO-R | 0.81 | 10 |
| Action Generation | Do-Undo (evaluation) | DINO-R | 81.9 | 6 |
| Physics-aware Action Generation | Do-Undo Forward | Instruction Following (IF) | 7.81 | 5 |
| Action Generation | Human Evaluation 48 instances (test) | Preference Rate | 58.3 | 3 |