IQA-Spider: Unifying Multi-Granularity Image Quality Assessment with Reasoning, Grounding and Referring
About
We present IQA-Spider, the first image quality assessment (IQA) framework that unifies reasoning, grounding, and referring in a single large multimodal model (LMM) for multi-granularity quality understanding. Existing LMM-based IQA methods typically support only some perception dimensions, such as quality description and question answering (i.e., reasoning) or pixel-level grounding. This limitation largely stems from the absence of (i) a unified task and data formulation and (ii) effective optimization paradigms for multi-granularity learning. To address both gaps, we formulate a rigorous four-task paradigm covering global and local quality description, pixel-level grounding, and region-level referring. Based on this formulation, we construct a corresponding IQA dataset with a scalable, automatic annotation pipeline, providing a solid foundation for unified multi-granularity learning. To enable unified perception, we adopt a conflict-free two-stage design that progressively extends text-level multi-granularity understanding to pixel-level grounding: (i) the first stage equips the model with fine-grained text-level reasoning across multiple IQA tasks, and (ii) the second stage introduces a training-free text-to-point grounding paradigm, which bridges textual semantics and pixel-level perception by mapping token logits to spatial coordinates. Together, these components yield IQA-Spider, a unified framework for explainable, multi-granularity image quality assessment. Extensive experiments across multiple benchmarks demonstrate strong performance, validating the effectiveness and versatility of the proposed formulation and framework.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Quality Assessment | KADID-10K | SRCC | 0.815 | 62 |
| Vision Question Answering | Q-Bench LLVisionQA 1.0 (dev) | Overall Score | 74.45 | 29 |
| Global Image Quality Description | IQA-Spider | Global Description Score (GPT-4V) | 7.12 | 9 |
| Local Image Quality Description | IQA-Spider | Local Description Score | 7.1 | 9 |
| Quality Referring | IQA-Spider short | Accuracy (Short Ref) | 59.4 | 9 |
| Quality Referring | IQA-Spider long | Ref_long Accuracy | 48.4 | 9 |
| Visual quality grounding | IQA-Spider | GPT-4V Grounding Score | 2.41 | 9 |
| Visual quality grounding | Q-Ground (test) | mIoU | 33.8 | 6 |
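
For readers unfamiliar with two of the metrics in the table above, the following is a minimal sketch of how SRCC (Spearman rank correlation between predicted scores and mean opinion scores) and mIoU are commonly computed. It is not the benchmarks' official evaluation code. The function names (`average_ranks`, `srcc`, `iou`, `mean_iou`) are hypothetical, and the mIoU variant shown, a plain average of IoU over pairs of binary masks, is an assumption; the actual Q-Ground protocol may average per distortion class instead.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def srcc(predicted, mos):
    """Spearman rank correlation: Pearson correlation of the two rank vectors.

    Assumes each input contains at least two distinct values.
    """
    rp, rm = average_ranks(predicted), average_ranks(mos)
    n = len(rp)
    mean_p, mean_m = sum(rp) / n, sum(rm) / n
    cov = sum((a - mean_p) * (b - mean_m) for a, b in zip(rp, rm))
    var_p = sum((a - mean_p) ** 2 for a in rp)
    var_m = sum((b - mean_m) ** 2 for b in rm)
    return cov / sqrt(var_p * var_m)


def iou(pred_mask, gt_mask):
    """Intersection over union of two binary masks given as nested lists."""
    inter = union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(mask_pairs):
    """Average IoU over (prediction, ground truth) mask pairs."""
    return sum(iou(p, g) for p, g in mask_pairs) / len(mask_pairs)
```

Note that `mean_iou` returns a fraction in [0, 1], whereas the mIoU value in the table appears to be reported as a percentage.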