FTibSuite: A Comprehensive Resource Suite for Tibetan Vision-Language Modeling
About
Vision-language models have progressed rapidly, but Tibetan remains a severely underserved low-resource language due to the lack of reproducible training and evaluation infrastructure. To fill this gap, we introduce FTibSuite, a comprehensive resource suite for Tibetan vision-language research, consisting of FTibData (human-verified multimodal training corpora spanning continual pretraining, image-text alignment, and instruction tuning data), FTibBench (Tibetan adaptations of five mainstream multimodal benchmarks with a hierarchical quality-control workflow to reduce translation noise), and FTibVLM, a reproducible baseline built on Qwen3-VL-8B-Instruct via a three-stage adaptation pipeline. Experiments on FTibBench show FTibVLM delivers consistent performance gains across all tasks, such as improving MMBench accuracy from 42.97 to 67.78 and POPE-random accuracy from 47.53 to 80.56, while retaining the backbone's original Chinese capabilities with minimal degradation, providing the first standardized foundation for Tibetan multimodal research.
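The two headline improvements quoted above can be restated as absolute and relative gains. The snippet below is only an arithmetic sketch over the four numbers given in the abstract. The helper name `gain` and the dictionary layout are illustrative assumptions and not part of FTibSuite.

```python
def gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 1)


# Accuracy figures quoted in the abstract, as (before, after) adaptation.
reported = {
    "MMBench": (42.97, 67.78),
    "POPE-random": (47.53, 80.56),
}

for name, (before, after) in reported.items():
    absolute, relative = gain(before, after)
    print(f"{name}: +{absolute:.2f} points ({relative:.1f}% relative)")
```

Under these figures the gains come to roughly 24.8 points on MMBench and 33.0 points on POPE-random.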
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination | POPE Adversarial | Accuracy | 78.63 | 353 |
| Object Hallucination Evaluation | POPE Popular | Accuracy | 81.7 | 96 |
| Hallucination Robustness | POPE (Random) | Accuracy | 80.56 | 2 |
| Multimodal capability profiling | MME Tibetan | Existence Accuracy (Acc) | 88.33 | 2 |
| Multimodal Understanding and Reasoning | MMBench Tibetan | Overall Score | 67.78 | 2 |
| Visual Entailment | SNLI-VE | Accuracy | 54.32 | 2 |
| Visual Question Answering | BinaryVQA | Accuracy | 76.01 | 2 |
| Visual Question Answering | COREVQA | Accuracy | 50.85 | 2 |