
SPROUT: A Scalable Diffusion Foundation Model for Agricultural Vision

About

Vision Foundation Models (VFMs) pre-trained on large-scale unlabeled data have achieved remarkable success on general computer vision tasks, yet typically suffer from significant domain gaps when applied to agriculture. In this context, we introduce $SPROUT$ ($S$calable $P$lant $R$epresentation model via $O$pen-field $U$nsupervised $T$raining), a multi-crop, multi-task agricultural foundation model trained via diffusion denoising. SPROUT leverages a VAE-free Pixel-space Diffusion Transformer to learn rich, structure-aware representations through denoising, which also enables efficient end-to-end training. We pre-train SPROUT on a curated dataset of 2.6 million high-quality agricultural images spanning diverse crops, growth stages, and environments. Extensive experiments demonstrate that SPROUT consistently outperforms state-of-the-art web-pretrained and agricultural foundation models across a wide range of downstream tasks, while requiring substantially lower pre-training cost. The code and model are available at https://github.com/UTokyo-FieldPhenomics-Lab/SPROUT.
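To make "diffusion denoising in pixel space" concrete, the toy sketch below corrupts raw pixel values with Gaussian noise and scores a denoiser by mean squared error on the noise. It is only an illustration of a generic DDPM-style objective: the abstract does not state SPROUT's noise schedule or prediction target, so the noise-prediction loss, the `alpha_bar` parameter, and the function names `add_noise` and `denoising_loss` are all assumptions, and no Transformer is involved.

```python
import math
import random


def add_noise(pixels, alpha_bar, rng):
    """Mix clean pixel values with Gaussian noise.

    alpha_bar is the fraction of signal variance kept (1.0 = no noise).
    Assumed DDPM-style corruption, not taken from the SPROUT paper.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
        for x, e in zip(pixels, noise)
    ]
    return noisy, noise


def denoising_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and true noise."""
    squared = [(p - e) ** 2 for p, e in zip(predicted_noise, true_noise)]
    return sum(squared) / len(squared)


rng = random.Random(0)
clean = [0.2, -0.5, 0.9, 0.0]  # toy image flattened to pixel values in [-1, 1]
noisy, noise = add_noise(clean, alpha_bar=0.5, rng=rng)

# A perfect denoiser recovers the noise exactly and gets zero loss;
# a trivial predictor that always outputs zeros does not.
perfect = denoising_loss(noise, noise)
trivial = denoising_loss([0.0] * len(noise), noise)
```

Because the corruption acts directly on pixels, no separate autoencoder has to be trained or frozen first, which is the sense in which a VAE-free design can be trained end to end.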

Shuai Xiang, Wei Guo, James Burridge, Shouyang Liu, Hao Lu, Tokihiro Fukatsu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Image Classification | Cassava Leaf Disease | Accuracy | 91.61 | 24 |
| Organ-level semantic segmentation | Apple Flower | mIoU | 65.92 | 18 |
| Organ-level semantic segmentation | Apple Fruit | IoU | 74.28 | 18 |
| Organ-level semantic segmentation | Peach Flower | mIoU | 62.09 | 18 |
| Organ-level semantic segmentation | Pear Flower | mIoU | 73.03 | 18 |
| Organ-level semantic segmentation | Grape Fruit | IoU | 90.63 | 18 |
| Organ-level semantic segmentation | Wheat | Spike IoU | 85.54 | 18 |
| Organ-level semantic segmentation | Rice | Green Veg IoU | 87.11 | 18 |
| Classification | Banana | Average Accuracy | 90.77 | 16 |
| Object Counting | Wheat Spikes (val) | MAE | 3.62 | 11 |
Showing 10 of 28 rows
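For readers unfamiliar with the metrics reported in these benchmarks, the sketch below computes accuracy, per-class IoU, mIoU, and MAE from their standard definitions on small hand-made inputs. The function names and toy data are illustrative only, and reporting accuracy and IoU on a 0-100 scale is an assumption inferred from the magnitude of the listed results; the benchmarks' actual evaluation code may differ (for example in how absent classes are handled).

```python
def accuracy(pred, target):
    """Percentage of predictions that match the labels."""
    return 100.0 * sum(p == t for p, t in zip(pred, target)) / len(target)


def iou(pred, target, cls):
    """Intersection over union for one class, as a percentage."""
    inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
    return 100.0 * inter / union if union else float("nan")


def mean_iou(pred, target, classes):
    """Average IoU over classes, skipping classes absent from both inputs."""
    scores = [iou(pred, target, c) for c in classes]
    valid = [s for s in scores if s == s]  # NaN is the only value != itself
    return sum(valid) / len(valid) if valid else float("nan")


def mae(pred_counts, true_counts):
    """Mean absolute error between predicted and true object counts."""
    errors = [abs(p - t) for p, t in zip(pred_counts, true_counts)]
    return sum(errors) / len(errors)


# Toy flattened segmentation masks: 1 = organ pixel, 0 = background.
pred_mask = [0, 1, 1, 0, 1, 1]
true_mask = [0, 1, 0, 0, 1, 1]

organ_iou = iou(pred_mask, true_mask, 1)            # 3 shared / 4 in union
miou = mean_iou(pred_mask, true_mask, [0, 1])
pixel_acc = accuracy(pred_mask, true_mask)
count_mae = mae([10, 12, 7], [11, 12, 10])          # toy per-image counts
```

Higher is better for accuracy and IoU, while lower is better for MAE, so the counting result is not comparable in direction to the other rows.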
