# FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation

## About
Generating realistic talking-head videos remains challenging due to persistent issues such as imperfect lip synchronization, unnatural motion, and evaluation metrics that correlate poorly with human perception. We propose FlowPortrait, a reinforcement-learning framework for audio-driven portrait animation built on an autoregressive multimodal audio-to-video backbone. FlowPortrait introduces a human-aligned evaluation system based on Multimodal Large Language Models (MLLMs) to assess lip-sync accuracy, expressiveness, and motion quality. These MLLM scores are combined with perceptual and temporal-consistency regularizers to form a stable composite reward, which is used to post-train the generator via Group Relative Policy Optimization (GRPO). Extensive experiments, including both automatic evaluations and human preference studies, show that FlowPortrait consistently produces higher-quality talking-head videos, highlighting the effectiveness of reinforcement learning for portrait animation.
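To make the reward construction concrete, here is a minimal sketch of how MLLM-derived scores could be blended with the regularizers into a single scalar reward per video. Everything below is illustrative: the function name, mixing weights, and score scales are assumptions, not FlowPortrait's actual implementation.

```python
import torch

def composite_reward(
    lip_sync: torch.Tensor,        # MLLM lip-sync score per video (assumed scale, e.g. 1-5)
    expressiveness: torch.Tensor,  # MLLM expressiveness score per video
    motion_quality: torch.Tensor,  # MLLM motion-quality score per video
    perceptual_penalty: torch.Tensor,  # e.g. an LPIPS-style distance to reference frames
    temporal_penalty: torch.Tensor,    # e.g. mean frame-to-frame feature distance
    weights=(1.0, 0.5, 0.5, 0.25, 0.25),  # hypothetical mixing coefficients
) -> torch.Tensor:
    """Blend MLLM judgments with regularizers into one scalar reward per video."""
    w_sync, w_expr, w_motion, w_perc, w_temp = weights
    return (
        w_sync * lip_sync
        + w_expr * expressiveness
        + w_motion * motion_quality
        - w_perc * perceptual_penalty   # regularizers enter as penalties
        - w_temp * temporal_penalty
    )
```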
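GRPO post-trains the generator by sampling a group of candidate videos for the same audio/portrait conditioning and scoring each candidate relative to its group, so no learned value function is needed. The sketch below shows the group-relative advantage computation and the standard PPO-style clipped surrogate it feeds; the tensor shapes and hyperparameters (group size, clipping range) are assumptions.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize each reward within its group.

    rewards: (num_groups, group_size) composite rewards for candidate videos
             generated from the same audio/portrait conditioning.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_loss(
    log_probs: torch.Tensor,      # (num_groups, group_size) under the current policy
    old_log_probs: torch.Tensor,  # same shape, under the policy that sampled the videos
    advantages: torch.Tensor,     # output of grpo_advantages
    clip_eps: float = 0.2,        # assumed clipping range
) -> torch.Tensor:
    """PPO-style clipped surrogate objective with group-relative advantages."""
    ratio = (log_probs - old_log_probs).exp()
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

Many GRPO setups also add a KL penalty against a frozen reference policy to keep the post-trained generator close to the pretrained backbone; that term is omitted here for brevity.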
## Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Talking Head Generation | Internal talking-head dataset, in-domain (test) | Lip-sync Score | 4.42 | 6 |
| Talking Head Generation | Internal talking-head dataset, out-of-domain (test) | Lip-sync Score | 4.5 | 6 |
| Portrait Animation | Human evaluation (test) | Lip-sync Score | 4.16 | 4 |