
Seeing Clearly without Training: Mitigating Hallucinations in Multimodal LLMs for Remote Sensing

About

Multimodal large language models (MLLMs) suffer from pronounced hallucinations in remote sensing visual question-answering (RS-VQA), primarily caused by visual grounding failures in large-scale scenes or misinterpretation of fine-grained small targets. To systematically analyze these issues, we introduce RSHBench, a protocol-based benchmark for fine-grained diagnosis of factual and logical hallucinations. To mitigate grounding-induced factual hallucinations, we further propose Relative Attention-Driven Actively Reasoning (RADAR), a training-free inference method that leverages intrinsic attention in MLLMs to guide progressive localization and fine-grained local reasoning at test time. Extensive experiments across diverse MLLMs demonstrate that RADAR consistently improves RS-VQA performance and reduces both factual and logical hallucinations. Code and data will be publicly available at: https://github.com/MiliLab/RADAR

Yi Liu, Jing Zhang, Di Wang, Xiaoyu Tian, Haonan Guo, Bo Du • 2026

Related benchmarks

Task                                      Dataset           Result                       Rank
Remote Sensing Visual Question Answering  LRS-VQA           FAIR 31.21                   11
Remote Sensing Visual Question Answering  MME-RealWorld-RS  Position Score 58.15         11
Remote Sensing Visual Question Answering  LHRS-Bench        Accuracy 67.47               11
Hallucination Diagnosis                   RSHBench          Object Accuracy (OBJ) 28.03  9