Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking

About

Improving vision-language models (VLMs) in the post-training stage typically relies on supervised fine-tuning or reinforcement learning, methods that necessitate costly, human-annotated data. While self-supervised techniques have proven effective for enhancing reasoning capabilities, their application to perceptual domains such as image quality assessment (IQA) remains largely unexplored. In this work, we introduce EvoQuality, a novel framework that enables a VLM to autonomously refine its quality perception capabilities without any ground-truth labels. EvoQuality adapts the principle of self-consistency to the ranking-based nature of IQA. It generates pseudo-labels by performing pairwise majority voting on the VLM's own outputs to establish a consensus on relative quality. These pseudo-rankings are then formulated into a fidelity reward that guides the model's iterative evolution through group relative policy optimization (GRPO). By iteratively leveraging its own predictions, EvoQuality progressively refines the VLM's perceptual capability. Extensive experiments show that EvoQuality boosts the base VLM's zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks. Remarkably, despite being entirely self-supervised, EvoQuality achieves performance that is competitive with, or even surpasses, state-of-the-art supervised VLM-based IQA models, outperforming these models on 5 out of 7 IQA benchmarks. Furthermore, the framework demonstrates significant flexibility, allowing it to be stacked with pre-trained IQA models to bolster generalization on unseen datasets. Code and checkpoints will be available at https://github.com/bytedance/EvoQuality.
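
The voting and reward steps described above can be illustrated with a minimal sketch. The abstract does not give the exact voting scheme or reward formula, so everything here is an assumption: the pseudo-label comes from comparing every sampled score of one image against every sampled score of the other, and the reward uses the common fidelity measure between two Bernoulli distributions. The names majority_vote_label, fidelity_reward, scores_a, scores_b, and p_pred are hypothetical and not taken from the EvoQuality code.

```python
import math
from itertools import product


def majority_vote_label(scores_a, scores_b):
    """Pseudo-label for the pair (a, b) from the model's own sampled scores.

    Assumption: every sampled score of image a is compared with every sampled
    score of image b. Returns 1.0 if a wins the vote, 0.0 if b wins, and 0.5
    when the vote is tied.
    """
    wins_a = sum(sa > sb for sa, sb in product(scores_a, scores_b))
    wins_b = sum(sb > sa for sa, sb in product(scores_a, scores_b))
    if wins_a > wins_b:
        return 1.0
    if wins_b > wins_a:
        return 0.0
    return 0.5


def fidelity_reward(p_pred, p_label):
    """Fidelity between two Bernoulli distributions (assumed reward form).

    p_pred is the model's probability that a is better than b; p_label is the
    pseudo-label. The value is 1.0 when they agree and 0.0 when they are
    opposite and fully confident.
    """
    return math.sqrt(p_pred * p_label) + math.sqrt((1.0 - p_pred) * (1.0 - p_label))


# Hypothetical quality scores sampled from the VLM for two images.
scores_a = [4.1, 3.8, 4.4]
scores_b = [3.0, 4.0, 2.9]

label = majority_vote_label(scores_a, scores_b)
reward = fidelity_reward(0.9, label)
```

In a GRPO-style loop, a reward of this kind would be computed for each sampled response in a group and then normalized within the group. That step is omitted here.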

Wen Wen, Tianwu Zhi, Kanglong Fan, Yang Li, Xinge Peng, Yabin Zhang, Yiting Liao, Junlin Li, Li Zhang • 2025

Related benchmarks

Task	Dataset	Result
No-Reference Image Quality Assessment	KADID-10K	SROCC 0.807	189
Image Quality Assessment	KonIQ	SRCC 0.731	167
No-Reference Image Quality Assessment	CSIQ	SROCC 0.839	144
No-Reference Image Quality Assessment	KonIQ-10k	SROCC 0.794	144
No-Reference Image Quality Assessment	SPAQ	SROCC 0.9	136
No-Reference Image Quality Assessment	TID 2013	SRCC 0.611	136
No-Reference Image Quality Assessment	LiveW	PLCC 84.7	50
No-Reference Image Quality Assessment	PIPAL	PLCC 0.649	35
No-Reference Image Quality Assessment	AG-IQA	PLCC 0.839	17
No-Reference Image Quality Assessment	Weighted average (SPAQ, AGIQA, LIVEW, KADID, PIPAL, TID2013, CSIQ)	PLCC (WAVG) 0.762	17

Showing 10 of 21 rows
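
For readers unfamiliar with the two metrics in the table, the following is a minimal sketch of how PLCC (Pearson linear correlation) and SRCC/SROCC (Spearman rank-order correlation) can be computed between predicted quality scores and human opinion scores. It is illustrative only and is not the evaluation code behind the numbers above. The function names and sample values are hypothetical, and the nonlinear mapping that some IQA protocols fit before computing PLCC is omitted.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical predicted scores and mean opinion scores for four images.
predicted = [1.0, 2.0, 3.0, 4.0]
opinion = [1.0, 4.0, 9.0, 16.0]

plcc = pearson(predicted, opinion)
srcc = spearman(predicted, opinion)
```

Because the hypothetical opinion scores increase monotonically but not linearly with the predictions, the rank correlation is perfect while the linear correlation is slightly lower.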
