Bridging Coarse and Fine Recognition: A Hybrid Approach for Open-Ended Multi-Granularity Object Recognition in Interactive Educational Games

About

Recent advances in Multimodal Large Language Models (MLLMs) have enabled open-ended object recognition, yet they struggle with fine-grained tasks. In contrast, CLIP-style models excel at fine-grained recognition but lack broad coverage of general object categories. To bridge this gap, we propose HyMOR, a Hybrid Multi-granularity open-ended Object Recognition framework that integrates an MLLM with a CLIP model. In HyMOR, the MLLM performs open-ended and coarse-grained object recognition, while the CLIP model specializes in fine-grained identification of domain-specific objects such as animals and plants. This hybrid design enables accurate object understanding across multiple semantic granularities, serving as a robust perceptual foundation for downstream multi-modal content generation and interactive gameplay. To support evaluation in content-rich and educational scenarios, we introduce TBO (TextBook Objects), a dataset containing 20,942 images annotated with 8,816 object categories extracted from textbooks. Extensive experiments demonstrate that HyMOR narrows the fine-grained recognition gap with CLIP to 0.2% while improving general object recognition by 2.5% over a baseline MLLM, measured by average Sentence-BERT (SBert) similarity. Overall, HyMOR achieves a 23.2% improvement in average SBert across all evaluated datasets, highlighting its effectiveness in enabling accurate perception for multi-modal game content generation and interactive learning applications.
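The division of labour described above (an MLLM for open-ended coarse labels, a CLIP-style model for fine-grained labels within specific domains) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `hybrid_recognize`, the `fine_labels` lookup, and the two stub models are hypothetical names introduced only for illustration, and the routing rule (refine whenever the coarse label has a known candidate list) is an assumption.

```python
from typing import Callable, Dict, List


def hybrid_recognize(
    image: object,
    coarse_model: Callable[[object], str],
    fine_model: Callable[[object, List[str]], str],
    fine_labels: Dict[str, List[str]],
) -> str:
    """Return a fine-grained label when the coarse label has known candidates.

    coarse_model stands in for an MLLM that names the object in free text.
    fine_model stands in for a CLIP-style classifier over a candidate list.
    fine_labels maps a coarse category to its fine-grained candidates
    (a hypothetical lookup, not something specified by the abstract).
    """
    coarse = coarse_model(image).strip().lower()
    candidates = fine_labels.get(coarse)
    if candidates:
        # The coarse label falls in a domain with fine-grained classes: refine it.
        return fine_model(image, candidates)
    # Otherwise keep the open-ended coarse answer.
    return coarse


# Toy stand-ins so the sketch runs without any model weights.
def stub_mllm(image: dict) -> str:
    return image["coarse"]


def stub_clip(image: dict, candidates: List[str]) -> str:
    # Pick the candidate with the highest toy similarity score.
    return max(candidates, key=lambda c: image["scores"].get(c, 0.0))


if __name__ == "__main__":
    labels = {"dog": ["beagle", "pug"]}
    dog = {"coarse": "Dog", "scores": {"beagle": 0.9, "pug": 0.1}}
    chair = {"coarse": "chair", "scores": {}}
    print(hybrid_recognize(dog, stub_mllm, stub_clip, labels))
    print(hybrid_recognize(chair, stub_mllm, stub_clip, labels))
```

In a real system the two callables would wrap actual model inference; the sketch only shows where the hand-off between coarse and fine recognition could sit.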

Hanling Yi, Feng Lin, Mao Luo, Yifan Yang, Xiaotian Yu, Rong Xiao • 2026

Related benchmarks

Task	Dataset	Result (Exact Match, EM)
General Object Recognition	ImageNet 1k (test)	26.9	9
General Object Recognition	ObjectNet 313 (test)	24.4	9
General Object Recognition	TBO-8k (test)	19.8	9
Fine-grained Visual Recognition	Dog-120 (test)	85.7	9
Fine-grained Visual Recognition	Pet-37 (test)	84.4	9
Fine-grained Visual Recognition	Bird-200 (test)	89	9
Fine-grained Visual Recognition	Flower-102 (test)	96	9
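
The table reports Exact Match (EM) scores. As a rough illustration of how such a score can be computed, the Python sketch below counts predictions that equal the reference label after trimming whitespace and lowercasing. The function name `exact_match` and the normalization step are assumptions; the listing does not state how the benchmark normalizes labels.

```python
from typing import Sequence


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions equal to their reference label.

    Labels are compared after trimming whitespace and lowercasing, which is
    an assumed normalization rather than the benchmark's documented one.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return 100.0 * hits / len(references)


if __name__ == "__main__":
    preds = ["Beagle", "pug", "cat"]
    refs = ["beagle", "poodle", "cat"]
    print(round(exact_match(preds, refs), 1))
```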

