
Do-Undo Bench: Reversibility for Action Understanding in Image Generation

About

We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions. Unlike prior work that relies on prompt-based image generation and editing to perform action-conditioned image manipulation, Do-Undo requires models to simulate the outcome of a real-world action and then reverse it, restoring the original state. This forward-reverse requirement tests genuine cause-and-effect understanding rather than stylistic or semantic edits. We curate a high-quality benchmark of reversible actions from real-world scenarios to enable robust action grounding. Our experiments reveal that current models struggle with action reversibility, highlighting the need to evaluate action understanding. Do-Undo provides an intuitive testbed for evaluating and advancing action-aware generation in multimodal systems that must reason about real-world dynamics.
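
The forward-reverse requirement described above can be illustrated with a minimal round-trip check. This is only a sketch under stated assumptions, not the benchmark's actual evaluation code: `do_undo_score`, `do_action`, `undo_action`, and `embed` are hypothetical names, and plain cosine similarity between feature vectors stands in for whatever image-similarity measure the benchmark uses.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as maximally dissimilar
    return dot / (norm_a * norm_b)


def do_undo_score(
    original,
    do_action: Callable,    # hypothetical: applies the forward action
    undo_action: Callable,  # hypothetical: applies the reverse action
    embed: Callable,        # hypothetical: maps an image to a feature vector
) -> float:
    """Apply an action, reverse it, and compare the result to the original."""
    after_do = do_action(original)
    restored = undo_action(after_do)
    return cosine_similarity(embed(original), embed(restored))
```

With toy stand-ins, a model whose reverse step exactly restores the input scores close to 1, while one that leaves the forward change in place scores lower. In practice the two callables would wrap an image generation model and `embed` a pretrained visual encoder.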

Shweta Mahajan, Shreya Kadambi, Hoang Le, Rajeev Yasarla, Apratim Bhattacharyya, Munawar Hayat, Fatih Porikli • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action-conditioned image generation evaluation | Do-Undo zero-shot 1.0 | Instruction Adherence (IF) | 8.05 | 10 |
| Action-based Image Editing | Do-Undo FULL (test) | DINO-R | 0.81 | 10 |
| Action Generation | Do-Undo (evaluation) | DINO-R | 81.9 | 6 |
| Physics-aware Action Generation | Do-Undo Forward | Instruction Following (IF) | 7.81 | 5 |
| Action Generation | Human Evaluation 48 instances (test) | Preference Rate | 58.3 | 3 |