FTibSuite: A Comprehensive Resource Suite for Tibetan Vision-Language Modeling

About

Vision-language models have progressed rapidly, but Tibetan remains a severely underserved low-resource language due to the lack of reproducible training and evaluation infrastructure. To fill this gap, we introduce FTibSuite, a comprehensive resource suite for Tibetan vision-language research, consisting of FTibData (human-verified multimodal training corpora spanning continual pretraining, image-text alignment, and instruction tuning data), FTibBench (Tibetan adaptations of five mainstream multimodal benchmarks with a hierarchical quality-control workflow to reduce translation noise), and FTibVLM, a reproducible baseline built on Qwen3-VL-8B-Instruct via a three-stage adaptation pipeline. Experiments on FTibBench show FTibVLM delivers consistent performance gains across all tasks, such as improving MMBench accuracy from 42.97 to 67.78 and POPE-random accuracy from 47.53 to 80.56, while retaining the backbone's original Chinese capabilities with minimal degradation, providing the first standardized foundation for Tibetan multimodal research.
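
The two headline improvements quoted above can be restated as absolute and relative gains. The following is a minimal sketch using only the four numbers given in the abstract. The helper name `gains` and the dictionary layout are illustrative, and reading the "from" value as the score before adaptation is an assumption.

```python
# Scores quoted in the abstract: (before, after), in accuracy points.
# Treating "before" as the pre-adaptation score is an assumption.
reported = {
    "MMBench": (42.97, 67.78),
    "POPE-random": (47.53, 80.56),
}


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 1)


for name, (before, after) in reported.items():
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

Absolute gains in points are the safer figure to compare across benchmarks, since relative gains depend strongly on how low the starting score was.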

Guixian Xu, Yide Liang, Zeli Su, Xuexian Song, Ziyin Zhang, Yushuang Dong, Ting Zhang, Xu Han • 2026

Related benchmarks

Task	Dataset	Metric	Result
Object Hallucination	POPE Adversarial	Accuracy	78.63	367
Object Hallucination Evaluation	POPE Popular	Accuracy	81.7	100
Hallucination Robustness	POPE (Random)	Accuracy	80.56	2
Multimodal capability profiling	MME Tibetan	Existence Accuracy (Acc)	88.33	2
Multimodal Understanding and Reasoning	MMBench Tibetan	Overall Score	67.78	2
Visual Entailment	SNLI-VE	Accuracy	54.32	2
Visual Question Answering	BinaryVQA	Accuracy	76.01	2
Visual Question Answering	COREVQA	Accuracy	50.85	2
