LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models

About

Large-scale generative models have demonstrated remarkable capabilities across image generation and editing tasks. However, their performance on low-level vision tasks, which require pixel-wise control, remains insufficiently studied. To address this gap, we introduce LL-Bench, a comprehensive benchmark for evaluating the capabilities of large-scale generative models on low-level vision tasks. The benchmark comprises 2,469 real-world degraded images covering 16 low-level degradation tasks, and 28,919 restored images produced by 10 state-of-the-art large-scale generative models and 21 conventional restoration models. These restored images are annotated with 152,020 expert-level pairwise human preferences and 28,334 quality scores. Building on LL-Bench, we present a systematic diagnosis that reveals the performance boundaries and distinctive failure modes of large-scale generative models across diverse low-level vision tasks, compared with representative conventional restoration approaches. Moreover, we investigate the effectiveness of current quality evaluation metrics on LL-Bench and find that they diverge significantly from human ratings. To better align restored-image quality assessment with human preferences, we further propose LL-Score, an MLLM-based evaluator that captures both restoration quality and the presence of hallucinations. Extensive experiments demonstrate that LL-Score not only outperforms existing image quality assessment metrics, but also serves as a promising reward model for training generative models on low-level vision tasks.

Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu, Haoyun Jiang, Liu Yang, Qiang Hu, Guangtao Zhai, Xiaoyun Zhang • 2026

Related benchmarks

Task                                           Dataset          Metric      Result  Rank
Flare Removal Quality Assessment               LL-Bench (test)  SRCC        0.806   36
Defocus Deblurring Quality Assessment          LL-Bench (test)  SRCC        0.8576  18
Dehazing Quality Assessment                    LL-Bench (test)  SRCC        0.7404  18
Deraining Quality Assessment                   LL-Bench (test)  SRCC        0.7309  18
Desnowing Quality Assessment                   LL-Bench (test)  SRCC        0.7908  18
HDR Enhancement Quality Assessment             LL-Bench (test)  SRCC        0.9005  18
Low-level vision quality assessment (Overall)  LL-Bench (test)  SRCC (Avg)  0.66    18
Low-Light Enhancement Quality Assessment       LL-Bench (test)  SRCC        0.836   18
Motion Deblurring Quality Assessment           LL-Bench (test)  SRCC        0.7893  18
Raindrop Removal Quality Assessment            LL-Bench (test)  SRCC        0.7618  18
Showing 10 of 17 rows
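
The SRCC values above are Spearman rank correlation coefficients between a metric's predicted scores and human ratings. The page does not say how they were computed, so the following is only a minimal sketch of the standard definition (the Pearson correlation of the two rank vectors, with tied values sharing an average rank). The function names `average_ranks` and `srcc` and the sample scores are hypothetical and are not taken from LL-Bench.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def srcc(predicted, human):
    """Spearman rank correlation between predicted scores and human ratings."""
    if len(predicted) != len(human) or len(predicted) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(predicted), average_ranks(human)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical example: metric scores and human ratings for four restored images.
metric_scores = [0.10, 0.40, 0.35, 0.80]
human_ratings = [1.0, 3.0, 2.0, 4.0]
print(srcc(metric_scores, human_ratings))
```

Because SRCC depends only on rank order, a metric can score highly even if its raw values sit on a different scale from the human ratings, which is why it is a common choice for comparing quality assessment metrics.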
