FiVE: A Fine-grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models

About

Numerous text-to-video (T2V) editing methods have emerged recently, but the lack of a standardized benchmark for fair evaluation has led to inconsistent claims and an inability to assess model sensitivity to hyperparameters. Fine-grained video editing is crucial for enabling precise, object-level modifications while maintaining context and temporal consistency. To address this, we introduce FiVE, a Fine-grained Video Editing Benchmark for evaluating emerging diffusion and rectified flow models. Our benchmark includes 74 real-world videos and 26 generated videos, featuring 6 fine-grained editing types, 420 object-level editing prompt pairs, and their corresponding masks. Additionally, we adapt the latest rectified flow (RF) T2V generation models, Pyramid-Flow and Wan2.1, by introducing FlowEdit, resulting in the training-free and inversion-free video editing models Pyramid-Edit and Wan-Edit. We evaluate five diffusion-based and two RF-based editing methods on our FiVE benchmark using 15 metrics, covering background preservation, text-video similarity, temporal consistency, video quality, and runtime. To further enhance object-level evaluation, we introduce FiVE-Acc, a novel metric leveraging Vision-Language Models (VLMs) to assess the success of fine-grained video editing. Experimental results demonstrate that RF-based editing significantly outperforms diffusion-based methods, with Wan-Edit achieving the best overall performance and exhibiting the least sensitivity to hyperparameters. More video demos are available on the anonymous website: https://sites.google.com/view/five-benchmark
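The abstract mentions per-prompt masks and a background-preservation metric family without naming the individual metrics. As a rough illustration only, the sketch below computes a PSNR restricted to pixels outside the edit mask, one common way to score background preservation. The choice of PSNR, the function name `masked_background_psnr`, and the flat-list pixel layout are all assumptions made for this example, not details taken from the benchmark.

```python
import math
from typing import Sequence


def masked_background_psnr(
    source: Sequence[float],
    edited: Sequence[float],
    edit_mask: Sequence[int],
    max_value: float = 255.0,
) -> float:
    """PSNR over background pixels only (hypothetical helper, not the benchmark's code).

    `source` and `edited` are flattened pixel intensities of one frame;
    `edit_mask` is 1 where the edit is allowed and 0 for background.
    """
    if not (len(source) == len(edited) == len(edit_mask)):
        raise ValueError("source, edited, and edit_mask must have equal length")

    squared_error = 0.0
    background_pixels = 0
    for src, out, inside_edit in zip(source, edited, edit_mask):
        if not inside_edit:  # only score pixels the edit should leave untouched
            squared_error += (src - out) ** 2
            background_pixels += 1

    if background_pixels == 0:
        raise ValueError("mask covers the whole frame; no background to score")

    mse = squared_error / background_pixels
    if mse == 0:
        return math.inf  # background is pixel-identical
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A per-video score under this sketch would average the per-frame values; changes inside the mask do not affect the result, so a successful object edit is not penalized.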

Minghan Li, Chenxi Xie, Yichen Wu, Lei Zhang, Mengyu Wang • 2025

Related benchmarks

Task	Dataset	Metric	Value
Video Editing	NRVBench V1 (full)	Distortion (x10^3)	17.66	14
Video Editing	FiVE-Bench (test)	Structural Distance	12.53	11
Instructional Video Editing	FiVE (test)	FiVE YN	41.41	9
Video Editing	FiVE (test)	Distance (x1000)	12.53	8
Video Editing	FiVE-Bench	CLIP-T	27.96	8
Video Editing	Anchor-Bench	CLIP Temporal Score	23.07	8
Video Editing	NRVBench V0 (pilot)	Distortion (x1000)	18.04	7
Video Editing	Dataset 15 × 3 × 150 frames V0	Distance (Scaled by 1e3)	18.04	7
Video Editing	NRVBench	S_phy	73.22	6
Video Editing	V1	Sphy	73.22	6

