Are Large Vision-Language Models Ready to Guide Blind and Low-Vision Individuals?

About

Large Vision-Language Models (LVLMs) demonstrate a promising direction for assisting individuals with blindness or low vision (BLV). Yet, measuring their true utility in real-world scenarios is challenging because evaluating whether their descriptions are BLV-informative requires a fundamentally different approach from assessing standard scene descriptions. While the "VLM-as-a-metric" or "LVLM-as-a-judge" paradigm has emerged, existing evaluators still fall short of capturing the unique requirements of BLV-centric evaluation, lacking at least one of the following key properties: (1) high correlation with human judgments, (2) long instruction understanding, (3) score generation efficiency, and (4) multi-dimensional assessment. To this end, we propose a unified framework to bridge the gap between automated evaluation and actual BLV needs. First, we conduct an in-depth user study with BLV participants to understand and quantify their navigational preferences, curating VL-GUIDEDATA, a large-scale BLV user-simulated preference dataset containing image-request-response-score pairs. We then leverage the dataset to develop an accessibility-aware evaluator, VL-GUIDE-S, which outperforms existing (L)VLM judges in both human alignment and inference efficiency. Notably, its effectiveness extends beyond a single domain, demonstrating strong performance across multiple fine-grained, BLV-critical dimensions. We hope our work lays a foundation for automatic AI judges that advance safe, barrier-free navigation for BLV users.

Eunki Kim, Na Min An, Wan Ju Kang, Sangryul Kim, James Thorne, Hyunjung Shim • 2025

Related benchmarks

Task	Dataset	Result
Multimodal Preference Evaluation	FOIL R1	Preference Accuracy 95	10
Multimodal Preference Evaluation	FOIL R4	P-Acc 95	10
Multimodal Preference Evaluation	Polaris	tau_c 53.9	10
Multimodal Preference Evaluation	Pascal	P-Acc 82.3	10
Multimodal Preference Evaluation	FlickrExp	tau_c 51.7	10
Multimodal Preference Evaluation	FlickrCF	Tau-b Score 35.8	10
Multimodal Preference Evaluation	VL-GUIDEDATA-B	Kendall's Tau 10.28	8
Multimodal Preference Evaluation	OID	P-Acc 59.3	7
Multimodal Preference Evaluation	ImgREW	P-Acc 57.8	7
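
The results above are reported as preference accuracy and Kendall's tau variants, which measure how well an automatic evaluator agrees with human judgments. As a rough illustration only, the sketch below computes Kendall's tau-b and a pairwise preference accuracy in plain Python. The function names, the input layout, and the choice to count evaluator ties as incorrect are assumptions made for this example, not details taken from the paper. The tau_c variant listed for some datasets normalizes differently and is not covered here. The functions return raw fractions, not percentages.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (tie-aware)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from every count
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt(
        (concordant + discordant + ties_x_only)
        * (concordant + discordant + ties_y_only)
    )
    # Degenerate case (one list is constant): report 0.0 instead of dividing by zero.
    return (concordant - discordant) / denom if denom else 0.0


def preference_accuracy(scores_a, scores_b, human_prefers_a):
    """Fraction of pairs where the evaluator's higher score matches the human choice.

    Assumption for this sketch: an evaluator tie (equal scores) counts as incorrect.
    """
    if not (len(scores_a) == len(scores_b) == len(human_prefers_a)):
        raise ValueError("all inputs must have the same length")
    if not human_prefers_a:
        raise ValueError("inputs must be non-empty")
    correct = 0
    for sa, sb, prefers_a in zip(scores_a, scores_b, human_prefers_a):
        if sa == sb:
            continue  # tie: counted as incorrect
        if (sa > sb) == prefers_a:
            correct += 1
    return correct / len(human_prefers_a)


if __name__ == "__main__":
    # Hypothetical toy data, not drawn from any dataset in the table.
    human = [5, 3, 4, 1]
    evaluator = [0.9, 0.4, 0.7, 0.1]
    print(kendall_tau_b(human, evaluator))
    print(preference_accuracy([0.8, 0.2], [0.3, 0.6], [True, False]))
```

For real evaluations, an established implementation such as `scipy.stats.kendalltau` is usually a safer choice than a hand-rolled loop.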

