
Bridging Coarse and Fine Recognition: A Hybrid Approach for Open-Ended Multi-Granularity Object Recognition in Interactive Educational Games

About

Recent advances in Multimodal Large Language Models (MLLMs) have enabled open-ended object recognition, yet they struggle with fine-grained tasks. In contrast, CLIP-style models excel at fine-grained recognition but lack broad coverage of general object categories. To bridge this gap, we propose HyMOR, a Hybrid Multi-granularity open-ended Object Recognition framework that integrates an MLLM with a CLIP model. In HyMOR, the MLLM performs open-ended and coarse-grained object recognition, while the CLIP model specializes in fine-grained identification of domain-specific objects such as animals and plants. This hybrid design enables accurate object understanding across multiple semantic granularities, serving as a robust perceptual foundation for downstream multi-modal content generation and interactive gameplay. To support evaluation in content-rich and educational scenarios, we introduce TBO (TextBook Objects), a dataset containing 20,942 images annotated with 8,816 object categories extracted from textbooks. Extensive experiments demonstrate that HyMOR narrows the fine-grained recognition gap with CLIP to 0.2% while improving general object recognition by 2.5% over a baseline MLLM, measured by average Sentence-BERT (SBert) similarity. Overall, HyMOR achieves a 23.2% improvement in average SBert across all evaluated datasets, highlighting its effectiveness in enabling accurate perception for multi-modal game content generation and interactive learning applications.
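The division of labor described above (an MLLM for open-ended coarse labels, a CLIP-style model for fine-grained labels in specific domains) can be illustrated with a minimal routing sketch. This is not the authors' implementation: the abstract does not say how HyMOR decides when to hand off to the CLIP model, so the routing table below is an assumption, and `coarse_recognize`, `fine_dog_classifier`, and `FINE_GRAINED_ROUTES` are hypothetical stand-ins that return canned labels instead of calling real models.

```python
from typing import Callable, Dict


def coarse_recognize(image: str) -> str:
    """Hypothetical stand-in for the MLLM: returns an open-ended coarse label."""
    canned = {"img_dog.jpg": "dog", "img_chair.jpg": "chair"}
    return canned.get(image, "object")


def fine_dog_classifier(image: str) -> str:
    """Hypothetical stand-in for a CLIP-style classifier over dog breeds."""
    return "border collie"


# Assumed routing table: coarse label -> fine-grained specialist.
# How HyMOR actually selects the fine-grained branch is not stated in the abstract.
FINE_GRAINED_ROUTES: Dict[str, Callable[[str], str]] = {
    "dog": fine_dog_classifier,
}


def recognize(image: str) -> str:
    """Return a fine-grained label when a specialist exists, else the coarse label."""
    coarse = coarse_recognize(image)
    specialist = FINE_GRAINED_ROUTES.get(coarse)
    return specialist(image) if specialist else coarse
```

In this sketch, an image whose coarse label has no registered specialist keeps the open-ended label, which is one simple way to retain broad coverage while still refining domain-specific objects.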

Hanling Yi, Feng Lin, Mao Luo, Yifan Yang, Xiaotian Yu, Rong Xiao • 2026

Related benchmarks

Task                             Dataset               Exact Match (EM)   Rank
General Object Recognition       ImageNet 1k (test)    26.9               9
General Object Recognition       ObjectNet 313 (test)  24.4               9
General Object Recognition       TBO-8k (test)         19.8               9
Fine-grained Visual Recognition  Dog-120 (test)        85.7               9
Fine-grained Visual Recognition  Pet-37 (test)         84.4               9
Fine-grained Visual Recognition  Bird-200 (test)       89                 9
Fine-grained Visual Recognition  Flower-102 (test)     96                 9
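For readers unfamiliar with the Exact Match (EM) metric reported in the benchmarks, the following is a minimal sketch of how such a score is commonly computed. It assumes EM is the percentage of predictions that equal the reference label after lowercasing and whitespace normalization; the exact normalization used for these benchmarks is not given here, and the helper names `normalize` and `exact_match` are hypothetical.

```python
from typing import Sequence


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(label.lower().split())


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

Under this definition a near-synonym such as "puppy" for "dog" scores zero, which is one reason open-ended recognition is also evaluated with a semantic measure such as SBert similarity.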