Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment
About
General-purpose vision-language models demonstrate strong performance in everyday domains but struggle with specialized technical fields requiring precise terminology, structured reasoning, and adherence to engineering standards. This work addresses whether domain-specific instruction tuning can enable comprehensive pavement condition assessment through vision-language models. PaveInstruct, a dataset containing 278,889 image-instruction-response pairs spanning 32 task types, was created by unifying annotations from nine heterogeneous pavement datasets. PaveGPT, a pavement foundation model trained on this dataset, was evaluated against state-of-the-art vision-language models across perception, understanding, and reasoning tasks. Instruction tuning transformed model capabilities, achieving improvements exceeding 20% in spatial grounding, reasoning, and generation tasks while producing ASTM D6433-compliant outputs. These results enable transportation agencies to deploy unified conversational assessment tools that replace multiple specialized systems, simplifying workflows and reducing technical expertise requirements. The approach establishes a pathway for developing instruction-driven AI systems across infrastructure domains including bridge inspection, railway maintenance, and building condition assessment.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Spatial Grounding | PaveInstruct | mIoU | 32.68 | 13 |
| Captioning | PaveInstruct | BLEU-4 | 6.21 | 13 |
| Reasoning | PaveInstruct | Judge Score | 6.14 | 13 |
| PCI Prediction | PaveInstruct | MAE | 19.32 | 10 |
| Region Analysis | PaveInstruct | Distress Score | 22.54 | 7 |
| VQA | PaveInstruct | Exact Accuracy | 15.07 | 7 |
| Classification | PaveInstruct | Severity | 64.3 | 7 |
| Inference Efficiency | Inference Efficiency Benchmark | TTFT (ms) | 236 | 6 |
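For readers unfamiliar with two of the metrics in the table, the following is a minimal sketch of how mIoU (spatial grounding) and MAE (PCI prediction) are commonly computed. It is not the authors' evaluation code. The axis-aligned box format `(x_min, y_min, x_max, y_max)`, the one-to-one pairing of predictions with targets, the 0-100 scaling of mIoU, and all function names are assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Overlap width and height, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], targets: Sequence[Box]) -> float:
    """Mean IoU over paired boxes, scaled to 0-100 (scaling is an assumption)."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and the same length")
    return 100.0 * sum(box_iou(p, t) for p, t in zip(preds, targets)) / len(preds)


def mean_absolute_error(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean absolute error, e.g. between predicted and reference PCI values."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and the same length")
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)


if __name__ == "__main__":
    # Toy values for illustration only; they are not taken from PaveInstruct.
    print(mean_iou([(0, 0, 2, 2)], [(1, 1, 3, 3)]))
    print(mean_absolute_error([70.0, 50.0], [80.0, 40.0]))
```

Higher is better for mIoU and lower is better for MAE, so the two columns should not be compared directly across rows.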