FTibSuite: A Comprehensive Resource Suite for Tibetan Vision-Language Modeling

About

Vision-language models have progressed rapidly, but Tibetan remains a severely underserved low-resource language due to the lack of reproducible training and evaluation infrastructure. To fill this gap, we introduce FTibSuite, a comprehensive resource suite for Tibetan vision-language research, consisting of FTibData (human-verified multimodal training corpora spanning continual pretraining, image-text alignment, and instruction tuning data), FTibBench (Tibetan adaptations of five mainstream multimodal benchmarks with a hierarchical quality-control workflow to reduce translation noise), and FTibVLM, a reproducible baseline built on Qwen3-VL-8B-Instruct via a three-stage adaptation pipeline. Experiments on FTibBench show FTibVLM delivers consistent performance gains across all tasks, such as improving MMBench accuracy from 42.97 to 67.78 and POPE-random accuracy from 47.53 to 80.56, while retaining the backbone's original Chinese capabilities with minimal degradation, providing the first standardized foundation for Tibetan multimodal research.

Guixian Xu, Yide Liang, Zeli Su, Xuexian Song, Ziyin Zhang, Yushuang Dong, Ting Zhang, Xu Han • 2026

Related benchmarks

Task                                    Dataset           Metric                    Result  Rank
Object Hallucination                    POPE Adversarial  Accuracy                  78.63   353
Object Hallucination Evaluation         POPE Popular      Accuracy                  81.7    96
Hallucination Robustness                POPE (Random)     Accuracy                  80.56   2
Multimodal capability profiling         MME Tibetan       Existence Accuracy (Acc)  88.33   2
Multimodal Understanding and Reasoning  MMBench Tibetan   Overall Score             67.78   2
Visual Entailment                       SNLI-VE           Accuracy                  54.32   2
Visual Question Answering               BinaryVQA         Accuracy                  76.01   2
Visual Question Answering               COREVQA           Accuracy                  50.85   2