Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset

About

We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset, comprising 1,000,000 aligned image-text pairs and designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution images of real-world defects spanning over 60 material categories and more than 400 defect types. Each image is accompanied by expert-verified annotations and a fine-grained textual description detailing defect location, severity, and contextual attributes. The dataset supports a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning. With less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance. This result highlights the potential of data-efficient foundation model adaptation for industrial inspection and generation, and paves the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence. Additional details and resources are available at https://ninaneon.github.io/projectpage/

TsaiChing Ni, ZhenQi Chen, YuanFu Yang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Anomaly Detection | MVTec-AD (test) | P-AUROC 96.1 | 132 |
| Anomaly Detection | VisA (test) | -- | 91 |
| Defect Classification | MVTec AD | Accuracy 98.3 | 19 |
| Defect Classification | VisA | Accuracy 97.7 | 19 |
| Defect Classification | Magnetic Tile | Accuracy 96.2 | 1 |
| Defect Classification | Steel Surface | Accuracy 94.5 | 1 |
| Industrial Anomaly Detection | MVTec-AD (test) | Acc 91 | 1 |
| Pixel-level Segmentation | MVTec AD bottle | Accuracy 92.25 | 1 |
| Pixel-level Segmentation | MVTec AD cable | Accuracy 89.7 | 1 |
| Pixel-level Segmentation | VisA candle | Accuracy 90.3 | 1 |

Showing 10 of 11 rows
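
For readers unfamiliar with the P-AUROC metric listed for MVTec-AD, the following is a minimal sketch of how a pixel-level AUROC is conventionally computed: all per-pixel anomaly scores and ground-truth mask labels are pooled across images, and the AUROC is obtained from the rank-sum (Mann-Whitney U) statistic. The function name `pixel_auroc` and the nested-list input format are assumptions made for illustration; the evaluation protocol actually used by the authors is not described on this page and may differ.

```python
from typing import List


def pixel_auroc(score_maps: List[List[List[float]]],
                masks: List[List[List[int]]]) -> float:
    """Pixel-level AUROC over a set of images (hypothetical helper).

    score_maps: one 2-D list of anomaly scores per image.
    masks:      one 2-D list of 0/1 ground-truth labels per image.
    """
    # Pool every pixel from every image into flat lists.
    scores = [s for img in score_maps for row in img for s in row]
    labels = [int(v) for img in masks for row in img for v in row]
    if len(scores) != len(labels):
        raise ValueError("score maps and masks must have the same number of pixels")

    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one defective and one normal pixel")

    # Assign 1-based ranks by ascending score, averaging ranks within ties.
    order = sorted(range(len(scores)), key=scores.__getitem__)
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # Mann-Whitney U statistic for the positive (defective) pixels.
    pos_rank_sum = sum(r for r, lab in zip(ranks, labels) if lab == 1)
    u = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


if __name__ == "__main__":
    toy_scores = [[[0.1, 0.2], [0.8, 0.9]]]
    toy_masks = [[[0, 0], [1, 1]]]
    print(pixel_auroc(toy_scores, toy_masks))
```

The value is a fraction in [0, 1]; benchmark tables such as the one above typically report it multiplied by 100.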
