
FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation

About

Generating realistic talking-head videos remains challenging due to persistent issues such as imperfect lip synchronization, unnatural motion, and evaluation metrics that correlate poorly with human perception. We propose FlowPortrait, a reinforcement-learning framework for audio-driven portrait animation built on a multimodal backbone for autoregressive audio-to-video generation. FlowPortrait introduces a human-aligned evaluation system based on Multimodal Large Language Models (MLLMs) to assess lip-sync accuracy, expressiveness, and motion quality. These signals are combined with perceptual and temporal consistency regularizers to form a stable composite reward, which is used to post-train the generator via Group Relative Policy Optimization (GRPO). Extensive experiments, including both automatic evaluations and human preference studies, demonstrate that FlowPortrait consistently produces higher-quality talking-head videos, highlighting the effectiveness of reinforcement learning for portrait animation.
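The abstract describes combining MLLM-based scores with regularizers into a composite reward, then post-training with GRPO. A minimal sketch of that idea follows, assuming a simple weighted sum and the standard group-relative advantage (each reward standardized within its group of samples). The function names, score keys, and weights are hypothetical illustrations; the abstract does not specify how FlowPortrait weights or normalizes these terms.

```python
from statistics import mean, pstdev

# Hypothetical weights: the source does not state how the terms are combined.
WEIGHTS = {
    "lip_sync": 1.0,        # MLLM-rated lip-sync accuracy
    "expressiveness": 1.0,  # MLLM-rated expressiveness
    "motion": 1.0,          # MLLM-rated motion quality
    "perceptual": 0.5,      # perceptual regularizer (assumed higher is better)
    "temporal": 0.5,        # temporal-consistency regularizer (assumed higher is better)
}

def composite_reward(scores, weights=WEIGHTS):
    """Weighted sum of the per-video scores into one scalar reward."""
    return sum(weights[name] * scores[name] for name in weights)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: three candidate videos generated for the same audio clip.
group = [
    {"lip_sync": 4.0, "expressiveness": 3.5, "motion": 4.0, "perceptual": 0.8, "temporal": 0.9},
    {"lip_sync": 3.0, "expressiveness": 3.0, "motion": 3.5, "perceptual": 0.7, "temporal": 0.8},
    {"lip_sync": 4.5, "expressiveness": 4.0, "motion: placeholder".split(":")[0]: 4.5, "perceptual": 0.9, "temporal": 0.9},
]
rewards = [composite_reward(s) for s in group]
advantages = group_relative_advantages(rewards)
```

Samples that score above their group's mean receive a positive advantage and those below receive a negative one, so the policy update needs no separately learned value function.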

Weiting Tan, Andy T. Liu, Ming Tu, Xinghua Qu, Philipp Koehn, Lu Lu • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Talking Head Generation | Internal talking-head dataset, In-domain (test) | Lip-sync Score | 4.42 | 6
Talking Head Generation | Internal talking-head dataset, Out-domain (test) | Lip-sync Score | 4.5 | 6
Portrait Animation | Human evaluation (test) | Lip-sync Score | 4.16 | 4
