Are Large Vision-Language Models Ready to Guide Blind and Low-Vision Individuals?
About
Large Vision-Language Models (LVLMs) offer a promising direction for assisting individuals with blindness or low vision (BLV). Yet measuring their true utility in real-world scenarios is challenging, because evaluating whether their descriptions are BLV-informative requires a fundamentally different approach from assessing standard scene descriptions. While the "VLM-as-a-metric" or "LVLM-as-a-judge" paradigm has emerged, existing evaluators still fall short of capturing the unique requirements of BLV-centric evaluation, lacking at least one of the following key properties: (1) high correlation with human judgments, (2) long instruction understanding, (3) score generation efficiency, and (4) multi-dimensional assessment. To this end, we propose a unified framework to bridge the gap between automated evaluation and actual BLV needs. First, we conduct an in-depth user study with BLV participants to understand and quantify their navigational preferences, curating VL-GUIDEDATA, a large-scale BLV user-simulated preference dataset containing image-request-response-score tuples. We then leverage the dataset to develop an accessibility-aware evaluator, VL-GUIDE-S, which outperforms existing (L)VLM judges in both human alignment and inference efficiency. Notably, its effectiveness extends beyond a single domain, demonstrating strong performance across multiple fine-grained, BLV-critical dimensions. We hope our work lays a foundation for automatic AI judges that advance safe, barrier-free navigation for BLV users.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Preference Evaluation | FOIL R1 | Preference Accuracy | 95 | 10 |
| Multimodal Preference Evaluation | FOIL R4 | P-Acc | 95 | 10 |
| Multimodal Preference Evaluation | Polaris | tau_c | 53.9 | 10 |
| Multimodal Preference Evaluation | Pascal | P-Acc | 82.3 | 10 |
| Multimodal Preference Evaluation | FlickrExp | tau_c | 51.7 | 10 |
| Multimodal Preference Evaluation | FlickrCF | Tau-b Score | 35.8 | 10 |
| Multimodal Preference Evaluation | VL-GUIDEDATA-B | Kendall's Tau | 10.28 | 8 |
| Multimodal Preference Evaluation | OID | P-Acc | 59.3 | 7 |
| Multimodal Preference Evaluation | ImgREW | P-Acc | 57.8 | 7 |
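
The table reports two families of human-alignment metrics: rank correlation (Kendall's tau) and pairwise preference accuracy. As a rough illustration of what these measure, the following is a minimal pure-Python sketch, not the authors' evaluation code. The function names `kendall_tau_b` and `preference_accuracy` and the toy scores are hypothetical, and only the tau-b variant is shown (tau_c uses a different normalization).

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(human, model):
    """Kendall's tau-b between two equal-length score lists (tie-aware)."""
    if len(human) != len(model):
        raise ValueError("score lists must have the same length")
    concordant = discordant = ties_human = ties_model = 0
    for (h1, m1), (h2, m2) in combinations(zip(human, model), 2):
        dh, dm = h1 - h2, m1 - m2
        if dh == 0 and dm == 0:
            continue  # tied in both rankings: excluded from every count
        if dh == 0:
            ties_human += 1  # tied only in the human scores
        elif dm == 0:
            ties_model += 1  # tied only in the model scores
        elif dh * dm > 0:
            concordant += 1  # both rankings order the pair the same way
        else:
            discordant += 1  # the rankings disagree on this pair
    untied = concordant + discordant
    denom = sqrt((untied + ties_human) * (untied + ties_model))
    return (concordant - discordant) / denom if denom else 0.0


def preference_accuracy(pairs):
    """Fraction of (preferred_score, rejected_score) pairs where the
    evaluator scores the human-preferred response strictly higher."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(p > r for p, r in pairs) / len(pairs)


# Toy data (assumed, for illustration only).
human_scores = [1, 2, 3, 4, 5]
model_scores = [1, 3, 2, 4, 5]
tau = kendall_tau_b(human_scores, model_scores)

judged_pairs = [(0.9, 0.1), (0.4, 0.6), (0.7, 0.2), (0.5, 0.3)]
p_acc = preference_accuracy(judged_pairs)
```

On the toy data, one of the ten score pairs is ordered differently by the two rankings, and three of the four preference pairs are ranked correctly.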