VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining
About
Assessing the aesthetics of an image is challenging, as it is influenced by multiple factors including composition, color, style, and high-level semantics. Existing image aesthetic assessment (IAA) methods primarily rely on human-labeled rating scores, which oversimplify the visual aesthetic information that humans perceive. Conversely, user comments offer more comprehensive information and are a more natural way to express human opinions and preferences regarding image aesthetics. In light of this, we propose learning image aesthetics from user comments, and exploring vision-language pretraining methods to learn multimodal aesthetic representations. Specifically, we pretrain an image-text encoder-decoder model with image-comment pairs, using contrastive and generative objectives to learn rich and generic aesthetic semantics without human labels. To efficiently adapt the pretrained model for downstream IAA tasks, we further propose a lightweight rank-based adapter that employs text as an anchor to learn the aesthetic ranking concept. Our results show that our pretrained aesthetic vision-language model outperforms prior works on image aesthetic captioning over the AVA-Captions dataset, and it has powerful zero-shot capability for aesthetic tasks such as zero-shot style classification and zero-shot IAA, surpassing many supervised baselines. With only minimal finetuning parameters using the proposed adapter module, our model achieves state-of-the-art IAA performance over the AVA dataset.
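As a rough illustration of the zero-shot IAA and text-anchored ranking ideas described above, the sketch below scores an image by comparing its embedding with the embeddings of two text prompts, and defines a pairwise hinge loss that pulls the better-rated image closer to a text anchor. This is a minimal sketch, not the authors' implementation. The function names, the two-prompt setup, the temperature, the margin, and the toy 2-D embeddings are all assumptions for illustration. In practice the embeddings would come from the pretrained image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_score(image_emb, good_text_emb, bad_text_emb, temperature=0.07):
    """Softmax over similarities to two prompts; returns the weight on 'good'.

    The prompts and temperature are hypothetical choices for this sketch.
    """
    s_good = cosine(image_emb, good_text_emb) / temperature
    s_bad = cosine(image_emb, bad_text_emb) / temperature
    m = max(s_good, s_bad)  # subtract the max for numerical stability
    e_good = math.exp(s_good - m)
    e_bad = math.exp(s_bad - m)
    return e_good / (e_good + e_bad)


def pairwise_rank_loss(emb_higher, emb_lower, anchor_emb, margin=0.1):
    """Hinge loss: the higher-rated image should be closer to the text anchor.

    The margin value is an assumption, not a setting from the paper.
    """
    gap = cosine(emb_higher, anchor_emb) - cosine(emb_lower, anchor_emb)
    return max(0.0, margin - gap)


# Toy 2-D embeddings standing in for real encoder outputs.
good_anchor = [1.0, 0.0]
bad_anchor = [0.0, 1.0]
image = [0.9, 0.1]

score = zero_shot_score(image, good_anchor, bad_anchor)
loss = pairwise_rank_loss([1.0, 0.0], [0.0, 1.0], good_anchor)
```

An image whose embedding lies nearer the "good" prompt receives a score above 0.5, and the ranking loss is zero once the better-rated image leads by at least the margin.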
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Quality Assessment | KonIQ-10k (test) | SRCC | 0.919 | 91 |
| Image Aesthetic Assessment | AVA | SRCC | 0.774 | 68 |
| Aesthetic Assessment | AVA (test) | SRCC | 0.774 | 53 |
| Visual Rating (Image Aesthetic Assessment) | TAD66K | SRCC | 0.413 | 40 |
| Fine-Grained Aesthetic Assessment (Series-level) | FGAesthetics AIGC | Series-level Accuracy | 50.4 | 15 |
| Fine-Grained Aesthetic Assessment (Pair-level) | FGAesthetics AIGC | Accuracy | 65 | 15 |
| Fine-Grained Aesthetic Assessment (Series-level) | FGAesthetics Cropping | Series Accuracy | 42.3 | 15 |
| Fine-Grained Aesthetic Assessment (Pair-level) | FGAesthetics Natural | Accuracy | 69.3 | 15 |
| Fine-Grained Aesthetic Assessment (Series-level) | FGAesthetics Natural | s-Acc | 64.3 | 15 |
| Fine-Grained Aesthetic Assessment (Pair-level) | FGAesthetics Cropping | Accuracy | 72.6 | 15 |
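
Several of the results above are reported as SRCC (Spearman rank correlation coefficient), which measures how well predicted scores preserve the ordering of ground-truth ratings. The sketch below computes it in plain Python by ranking both lists (averaging ranks for ties) and taking the Pearson correlation of the ranks. The function names and the sample scores are hypothetical and are not taken from any of the benchmarks listed.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def srcc(predicted, target):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    rp = average_ranks(predicted)
    rt = average_ranks(target)
    n = len(rp)
    mean_p = sum(rp) / n
    mean_t = sum(rt) / n
    cov = sum((a - mean_p) * (b - mean_t) for a, b in zip(rp, rt))
    var_p = sum((a - mean_p) ** 2 for a in rp)
    var_t = sum((b - mean_t) ** 2 for b in rt)
    return cov / math.sqrt(var_p * var_t)


# Hypothetical predicted scores and ground-truth mean ratings for four images.
predicted_scores = [0.2, 0.7, 0.5, 0.9]
mean_ratings = [3.1, 5.0, 5.4, 6.8]
rho = srcc(predicted_scores, mean_ratings)
```

Because only ranks matter, any monotonic rescaling of the predicted scores leaves the SRCC unchanged, which is why it suits models whose raw outputs are not calibrated to the rating scale.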