Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems
About
In recent years, multimodal large models have continued to improve on general benchmarks. However, in real-world content moderation and adversarial settings, mainstream models still suffer from degraded generalization and catastrophic forgetting because of limited fine-grained visual perception and insufficient modeling of long-tail noise. In this paper, we present Xuanwu VL-2B as a case study of how general multimodal models can be developed into an industrial-grade foundation model for content ecosystems. The model adopts a compact InternViT-300M + MLP + Qwen3 1.7B architecture, balancing fine-grained visual perception, language-semantic alignment, and deployment cost within an approximately 2B-parameter budget. To balance business specialization with the retention of general capabilities, we develop a data iteration and curation mechanism and train the model through a progressive three-stage pipeline: pre-training, mid-training, and post-training. Ablation studies and offline business evaluations show that Xuanwu VL-2B achieves an average score of 67.90 across seven OpenCompass multimodal benchmarks (vs. 64.27 for InternVL 3.5 2B), an average recall of 94.38% over seven independent business moderation tasks, and a weighted overall recall of 82.82% on policy-violating text in challenging adversarial OCR scenarios, outperforming Gemini-2.5-Pro (76.72%). These results show that, under a limited parameter budget, Xuanwu VL-2B achieves a practical balance among business alignment, visual perception, general capability retention, and deployment cost.
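The architecture described above follows the common pattern of a vision encoder whose token features are mapped into the language model's embedding space by a lightweight MLP connector. A minimal NumPy sketch of that projector stage is shown below; the dimensions and the two-layer GELU design are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions for illustration only (not specified in the abstract):
D_VIT, D_LLM, N_TOKENS = 1024, 2048, 256

# Two-layer MLP projector mapping vision tokens into the LLM embedding width.
W1 = rng.standard_normal((D_VIT, D_LLM)) * 0.02
b1 = np.zeros(D_LLM)
W2 = rng.standard_normal((D_LLM, D_LLM)) * 0.02
b2 = np.zeros(D_LLM)

def project(vision_tokens: np.ndarray) -> np.ndarray:
    """Hypothetical MLP connector: linear -> GELU -> linear."""
    h = vision_tokens @ W1 + b1
    # tanh approximation of GELU
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))
    return h @ W2 + b2

vit_out = rng.standard_normal((N_TOKENS, D_VIT))  # stand-in for InternViT features
llm_in = project(vit_out)                         # tokens now match the LLM width
print(llm_in.shape)  # (256, 2048)
```

In such designs the projected vision tokens are simply prepended (or interleaved) with text token embeddings before the language model, which keeps the connector cheap relative to the ~2B-parameter budget.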
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Code Generation | HumanEval | -- | 1036 |
| Reasoning | BBH | -- | 672 |
| Instruction Following | IFEval | -- | 625 |
| Multimodal Understanding | MMStar | Accuracy: 60.47 | 324 |
| Multi-task Language Understanding | MMLU | Accuracy: 62.94 | 321 |
| Diagram Understanding | AI2D | Accuracy: 82.19 | 247 |
| Optical Character Recognition | OCRBench | -- | 232 |
| Mathematical Multimodal Reasoning | MathVista | Accuracy: 68.4 | 218 |
| Multi-discipline Multimodal Understanding | MMMU (val) | Accuracy: 48.11 | 204 |
| Coding | MBPP | Pass@1: 50.6 | 30 |