LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models

About

Large-scale generative models have demonstrated remarkable capabilities across image generation and editing tasks. However, their performance on low-level vision tasks, which require pixel-wise control, remains insufficiently studied. To address this gap, we introduce LL-Bench, a comprehensive benchmark for evaluating the capabilities of large-scale generative models on low-level vision tasks. The benchmark comprises 2,469 real-world degraded images covering 16 low-level degradation tasks, and 28,919 restored images produced by 10 state-of-the-art large-scale generative models and 21 conventional restoration models. The restored images are annotated with 152,020 expert-level pairwise human preferences and 28,334 quality scores. Built upon LL-Bench, we present a systematic diagnosis that reveals the performance boundaries and unique failure modes of large-scale generative models across diverse low-level vision tasks, compared with representative conventional restoration approaches. Moreover, we investigate the effectiveness of current quality evaluation metrics on LL-Bench and find that they diverge significantly from human ratings. To better align restored-image quality assessment with human preferences, we further propose LL-Score, an MLLM-based evaluator that captures both restoration quality and the presence of hallucinations. Extensive experiments demonstrate that LL-Score not only outperforms existing image quality assessment metrics, but also serves as a promising reward model for training generative models on low-level vision tasks.

Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu, Haoyun Jiang, Liu Yang, Qiang Hu, Guangtao Zhai, Xiaoyun Zhang • 2026

Related benchmarks

Task	Dataset	Result
Flare Removal Quality Assessment	LL-Bench (test)	SRCC 0.806	36
Defocus Deblurring Quality Assessment	LL-Bench (test)	SRCC 0.8576	18
Dehazing Quality Assessment	LL-Bench (test)	SRCC 0.7404	18
Deraining Quality Assessment	LL-Bench (test)	SRCC 0.7309	18
Desnowing Quality Assessment	LL-Bench (test)	SRCC 0.7908	18
HDR Enhancement Quality Assessment	LL-Bench (test)	SRCC 0.9005	18
Low-level vision quality assessment (Overall)	LL-Bench (test)	SRCC (Avg) 0.66	18
Low-Light Enhancement Quality Assessment	LL-Bench (test)	SRCC 0.836	18
Motion Deblurring Quality Assessment	LL-Bench (test)	SRCC 0.7893	18
Raindrop Removal Quality Assessment	LL-Bench (test)	SRCC 0.7618	18

Showing 10 of 17 rows
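
The results above are reported as SRCC, the Spearman rank correlation coefficient between an evaluator's predicted scores and human quality ratings. As a minimal sketch of how such a number can be computed (this is not the authors' evaluation code; the function names and the sample scores are hypothetical), the values are converted to ranks, with ties sharing their average rank, and the Pearson correlation of the two rank lists is taken:

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def srcc(predicted: Sequence[float], human: Sequence[float]) -> float:
    """Spearman rank correlation between predicted and human scores."""
    if len(predicted) != len(human) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(predicted), average_ranks(human))


# Hypothetical scores for four restored images (illustration only).
metric_scores = [0.2, 0.5, 0.4, 0.9]
human_scores = [1.0, 2.5, 3.0, 4.5]
example_srcc = srcc(metric_scores, human_scores)
```

Because only the ordering matters, an evaluator can reach a high SRCC even if its raw scores are on a different scale from the human ratings.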
