Bridging Coarse and Fine Recognition: A Hybrid Approach for Open-Ended Multi-Granularity Object Recognition in Interactive Educational Games
About
Recent advances in Multimodal Large Language Models (MLLMs) have enabled open-ended object recognition, yet they struggle with fine-grained tasks. In contrast, CLIP-style models excel at fine-grained recognition but lack broad coverage of general object categories. To bridge this gap, we propose HyMOR, a Hybrid Multi-granularity open-ended Object Recognition framework that integrates an MLLM with a CLIP model. In HyMOR, the MLLM performs open-ended and coarse-grained object recognition, while the CLIP model specializes in fine-grained identification of domain-specific objects such as animals and plants. This hybrid design enables accurate object understanding across multiple semantic granularities, serving as a robust perceptual foundation for downstream multi-modal content generation and interactive gameplay. To support evaluation in content-rich and educational scenarios, we introduce TBO (TextBook Objects), a dataset containing 20,942 images annotated with 8,816 object categories extracted from textbooks. Extensive experiments demonstrate that HyMOR narrows the fine-grained recognition gap with CLIP to 0.2% while improving general object recognition by 2.5% over a baseline MLLM, measured by average Sentence-BERT (SBert) similarity. Overall, HyMOR achieves a 23.2% improvement in average SBert across all evaluated datasets, highlighting its effectiveness in enabling accurate perception for multi-modal game content generation and interactive learning applications.
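The abstract does not spell out how the two models are combined, so the following is only a minimal sketch of one plausible routing rule, not the authors' implementation. It assumes the coarse label has already been produced by an MLLM, that image and label embeddings come from a CLIP-style encoder, and that fine-grained candidates are grouped under coarse domains. The names `hybrid_recognize`, `FINE_LABELS`, and the toy three-dimensional vectors are hypothetical and chosen purely for illustration.

```python
from math import sqrt
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_recognize(
    image_embedding: Vector,
    coarse_label: str,
    fine_labels: Dict[str, Dict[str, Vector]],
) -> str:
    """Refine a coarse label with a fine-grained one when the domain is covered.

    coarse_label stands in for the MLLM output; fine_labels maps a coarse
    domain (e.g. "dog") to candidate fine-grained names and their text
    embeddings, standing in for a CLIP-style text encoder.
    """
    candidates = fine_labels.get(coarse_label)
    if not candidates:
        # No fine-grained specialist for this domain: keep the coarse answer.
        return coarse_label
    # Pick the candidate whose text embedding is closest to the image embedding.
    return max(candidates, key=lambda name: cosine(image_embedding, candidates[name]))


# Toy data (hypothetical): one fine-grained domain with two breeds.
FINE_LABELS = {
    "dog": {
        "beagle": [0.9, 0.1, 0.0],
        "pug": [0.1, 0.9, 0.0],
    }
}

image = [0.8, 0.2, 0.1]
print(hybrid_recognize(image, "dog", FINE_LABELS))
print(hybrid_recognize(image, "chair", FINE_LABELS))
```

In this sketch a general object such as "chair" passes through unchanged, while an image labeled "dog" is handed to the fine-grained matcher. A real system would replace the toy vectors with encoder outputs and would need its own rule for deciding which coarse labels count as fine-grained domains.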
Related benchmarks
| Task | Dataset | Exact Match (EM) | Rank |
|---|---|---|---|
| General Object Recognition | ImageNet 1k (test) | 26.9 | 9 |
| General Object Recognition | ObjectNet 313 (test) | 24.4 | 9 |
| General Object Recognition | TBO-8k (test) | 19.8 | 9 |
| Fine-grained Visual Recognition | Dog-120 (test) | 85.7 | 9 |
| Fine-grained Visual Recognition | Pet-37 (test) | 84.4 | 9 |
| Fine-grained Visual Recognition | Bird-200 (test) | 89 | 9 |
| Fine-grained Visual Recognition | Flower-102 (test) | 96 | 9 |
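The table reports Exact Match (EM) scores. As a rough illustration of how such a score is commonly computed, the sketch below counts predictions that equal the reference label after light normalization and reports the result as a percentage. The normalization (lowercasing and collapsing whitespace) and the percentage scale are assumptions, not details confirmed by the source, and `exact_match` is a hypothetical helper name.

```python
from typing import Sequence


def _normalize(label: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(label.lower().split())


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(_normalize(p) == _normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# Two of the three predictions match their reference after normalization.
score = exact_match(["Beagle", "pug ", "cat"], ["beagle", "pug", "dog"])
print(round(score, 1))
```

Exact match is strict: a near-synonym of the reference counts as a miss, which is one reason the abstract also reports a similarity-based measure (average SBert).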