
CutClaw: Agentic Hours-Long Video Editing via Music Synchronization

About

Editing video in sync with audio has become a form of human-made digital art on today's social media. However, the time-consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework that edits hours-long raw footage into meaningful short videos by leveraging multiple Multimodal Language Models (MLLMs) as an agent system. It produces videos that are synchronized with music, follow the given instructions, and are visually appealing. Specifically, our approach begins with a hierarchical multimodal decomposition that captures both fine-grained details and global structure across the visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the overall storytelling flow and structures the long-term narrative, anchoring visual scenes to musical shifts. Finally, to construct the short edited video, Editor and Reviewer Agents collaboratively optimize the final cut by selecting fine-grained visual content according to rigorous aesthetic and semantic criteria. Detailed experiments show that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos. The code is available at: https://github.com/GVCLab/CutClaw.
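The idea of anchoring visual scenes to musical shifts, with an editing step followed by a review step, can be illustrated with a small sketch. This is not the authors' implementation: the names `Shot`, `Clip`, `plan_cut`, the `score` field, and the `min_score` threshold are all hypothetical, and the agent reasoning performed by the MLLMs is replaced here by a plain numeric score. It only shows how clips might be placed so that every cut lands on a music segment boundary.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Shot:
    """A candidate span of raw footage (hypothetical structure)."""
    start: float  # seconds into the raw footage
    end: float
    score: float  # assumed aesthetic/semantic score in [0, 1]

    @property
    def duration(self) -> float:
        return self.end - self.start


@dataclass(frozen=True)
class Clip:
    """A trimmed shot placed on the output timeline."""
    source_start: float
    source_end: float
    timeline_start: float
    timeline_end: float


def plan_cut(music_boundaries: Sequence[float],
             shots: Sequence[Shot],
             min_score: float = 0.5) -> List[Clip]:
    """Fill each music segment with the best approved, unused shot.

    music_boundaries are ascending timestamps (seconds) where the music
    changes; consecutive pairs define the segments to fill.
    """
    timeline: List[Clip] = []
    used = set()
    for seg_start, seg_end in zip(music_boundaries, music_boundaries[1:]):
        seg_len = seg_end - seg_start
        # "Editor" step: propose unused shots long enough to cover the segment.
        candidates = [s for s in shots
                      if s not in used and s.duration >= seg_len]
        # "Reviewer" step: reject proposals below the quality threshold.
        approved = [s for s in candidates if s.score >= min_score]
        if not approved:
            continue  # leave this segment empty rather than use a weak shot
        best = max(approved, key=lambda s: s.score)
        used.add(best)
        # Trim the shot so the cut falls exactly on the music boundary.
        timeline.append(Clip(best.start, best.start + seg_len,
                             seg_start, seg_end))
    return timeline
```

In this sketch a shot that is too short for a segment, or that scores below the threshold, is never used, and each shot appears at most once. A real system would also need to handle empty segments and choose which part of a long shot to keep.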

Shifang Zhao, Yihan Hu, Ying Shan, Yunchao Wei, Xiaodong Cun • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Music-synchronized Video Editing | Film | Visual Quality | 79.2 | 4 |
| Music-synchronized Video Editing | Object-centric Instruction (Obj) | Instruction Follow | 66.6 | 4 |
| Music-synchronized Video Editing | Narrative-driven Instruction (Nar) | Instruction Follow | 73.4 | 4 |
| Video Editing | User Study Narrative | Visual Quality | 45.2 | 4 |
| Video Editing | User Study Object | Visual Quality | 54.4 | 4 |
| Video Editing | User Study Film | Visual Quality | 47.3 | 4 |
| Video Editing | User Study Average | Visual Quality | 49.8 | 4 |
| Music-synchronized Video Editing | Vlog | Visual Quality | 76 | 3 |
| Video Editing | User Study Vlog | Visual Quality | 54 | 3 |
