Do-Undo: Generating and Reversing Physical Actions in Vision-Language Models
About
We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating physically plausible scene transformations driven by real-world actions. Unlike prior work focused on object-level edits, Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world. We curate a large-scale dataset of reversible actions from real-world videos and design a training strategy that enforces do-undo consistency for robust action grounding. Our experiments reveal that current models struggle with physical reversibility, underscoring the importance of this task for embodied AI, robotics, and physics-aware generative modeling. Do-Undo establishes an intuitive testbed for evaluating and advancing physical reasoning in multimodal systems.
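The enforced consistency can be pictured as a do-then-undo round trip that should land back on the original image. Below is a minimal sketch of such a cycle-consistency objective, assuming a generic differentiable action-conditioned editor `edit_model(image, action_text)`; the function names, the dummy editor, and the L1 reconstruction term are illustrative assumptions, not the paper's actual loss.

```python
# Illustrative sketch only: a do-undo cycle-consistency loss, assuming the
# editor is a differentiable function edit_model(image, action_text) -> image.
import torch
import torch.nn.functional as F


def do_undo_consistency_loss(edit_model, image, do_action, undo_action):
    """Apply the forward action, then its reverse, and penalise the
    distance between the reconstruction and the original image."""
    done = edit_model(image, do_action)      # simulate the physical action
    undone = edit_model(done, undo_action)   # attempt to reverse it
    # L1 reconstruction term; a perceptual or feature-space distance
    # could be substituted here.
    return F.l1_loss(undone, image)


if __name__ == "__main__":
    # Dummy editor for illustration: ignores the action text entirely.
    def dummy_editor(img, action):
        return img + 0.01 * torch.randn_like(img)

    x = torch.rand(1, 3, 64, 64)
    loss = do_undo_consistency_loss(
        dummy_editor, x, "crumple the paper", "flatten the paper"
    )
    print(f"consistency loss: {loss.item():.4f}")
```

The same objective could in principle be applied in the opposite order (undo-then-do) to symmetrise training, though the details depend on the model being trained.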
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action-conditioned image generation evaluation | Do-Undo zero-shot 1.0 | Instruction Adherence (IF) | 8.05 | 10 |
| Action-based Image Editing | Do-Undo FULL (test) | DINO-R | 0.81 | 10 |
| Action Generation | Do-Undo (evaluation) | DINO-R | 81.9 | 6 |
| Physics-aware Action Generation | Do-Undo Forward | Instruction Following (IF) | 7.81 | 5 |
| Action Generation | Human Evaluation, 48 instances (test) | Preference Rate | 58.3 | 3 |
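Several rows above report a DINO-R score. Its exact definition is not given here, so the sketch below simply assumes it is a cosine similarity between DINO ViT embeddings of the original image and the image recovered after the undo step; the backbone choice (DINO ViT-S/16 via torch.hub) and the preprocessing are assumptions.

```python
# Hypothetical DINO-feature reconstruction score: cosine similarity between
# DINO embeddings of the original and the undo-reconstructed image.
import torch
from torchvision import transforms
from PIL import Image

# Pretrained DINO ViT-S/16 backbone from the official torch.hub entry point.
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
dino.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])


@torch.no_grad()
def dino_similarity(original: Image.Image, reconstructed: Image.Image) -> float:
    """Cosine similarity between DINO CLS embeddings of two images."""
    batch = torch.stack([preprocess(original), preprocess(reconstructed)])
    feats = dino(batch)                                   # (2, 384) embeddings
    feats = torch.nn.functional.normalize(feats, dim=-1)  # unit-norm features
    return float((feats[0] * feats[1]).sum())
```

A score near 1.0 would indicate that the undo step restored the scene almost exactly; the benchmark's own DINO-R computation may aggregate or scale this differently (e.g. the 81.9 entry above suggests a percentage-style report).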