
OmniFashion: Towards Generalist Fashion Intelligence via Multi-Task Vision-Language Learning

About

Fashion intelligence spans multiple tasks, including retrieval, recommendation, recognition, and dialogue, yet progress remains hindered by fragmented supervision and incomplete fashion annotations. These limitations jointly restrict the formation of consistent visual-semantic structures, preventing recent vision-language models (VLMs) from serving as a generalist fashion brain that unifies understanding and reasoning across tasks. We therefore construct FashionX, a million-scale dataset that exhaustively annotates the visible fashion items within an outfit and organizes attributes from the global to the part level. Built upon this foundation, we propose OmniFashion, a vision-language framework that bridges diverse fashion tasks under a unified fashion dialogue paradigm, enabling both multi-task reasoning and interactive dialogue. Experiments on multiple subtasks and retrieval benchmarks show that OmniFashion achieves strong task-level accuracy and cross-task generalization, offering a scalable path toward universal, dialogue-oriented fashion intelligence.

Zhengwei Yang, Andi Long, Hao Li, Zechao Hu, Kui Jiang, Zheng Wang• 2026

Related benchmarks

Task                          Dataset                     Metric          Result  Rank
Overall Style Recognition     FashionX (test)             Top-1 Accuracy  89.7    20
Color Recognition             FashionX (test)             A@1             79.1    10
Composed Image Retrieval      FashionX (test)             Accuracy        75.4    10
Image-to-Image Retrieval      FashionX (test)             Accuracy        89.4    10
Part Style Recognition        FashionX (test)             Top-1 Accuracy  66.2    10
Text-to-Image Retrieval       FashionX (test)             Accuracy        98.7    10
Image-to-Text Retrieval       FashionX (test)             Accuracy        99.5    10
Cross-Domain Image Retrieval  Deepfashion Consumer2Shop   R@1             43.9    6
Image Retrieval               Deepfashion InShop (test)   R@1             95.2    6
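The metrics in the table are standard: Top-1 accuracy (the share of samples whose highest-scoring prediction matches the label) and Recall@1 (the share of queries whose single top-ranked gallery item is the ground-truth match). A minimal sketch of both, assuming a precomputed query-gallery similarity matrix (function and variable names here are illustrative, not from the paper's code):

```python
import numpy as np

def top1_accuracy(pred_labels, true_labels):
    """Fraction of samples whose top-ranked prediction equals the label."""
    pred = np.asarray(pred_labels)
    true = np.asarray(true_labels)
    return float((pred == true).mean())

def recall_at_k(sim, gt_index, k=1):
    """Recall@k for retrieval.

    sim[i, j] scores query i against gallery item j;
    gt_index[i] is the gallery index of query i's ground-truth match.
    """
    sim = np.asarray(sim)
    # indices of the k highest-scoring gallery items per query
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = (topk == np.asarray(gt_index)[:, None]).any(axis=1)
    return float(hits.mean())
```

For example, with two queries where only the first ranks its match on top, `recall_at_k` with `k=1` returns 0.5.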
