DataArc-SynData-Toolkit: A Unified Closed-Loop Framework for Multi-Path, Multimodal, and Multilingual Data Synthesis

About

Synthetic data has emerged as a crucial solution to the data scarcity bottleneck in large language models (LLMs), particularly for specialized domains and low-resource languages. However, the broader adoption of existing synthetic data tools is severely hindered by convoluted workflows, fragmented data standards, and limited scalability across modalities. To address these limitations, we develop DataArc-SynData-Toolkit, an open-source framework featuring: (1) a configuration-driven, end-to-end pipeline equipped with an intuitive visual interface and simplified CLI for exceptional usability; (2) a unified, quality-controllable synthesis paradigm that standardizes multi-source data generation to ensure high reusability; and (3) a highly modular architecture designed for seamless multimodal, multilingual, and multi-task adaptation. We apply the toolkit in multiple application scenarios. Experimental results demonstrate that our toolkit achieves an optimal balance between generation efficiency and data quality. By offering an end-to-end and visually interactive pipeline, DataArc-SynData-Toolkit significantly lowers the technical barrier to synthetic data generation and subsequent model training, accelerating its practical deployment in real-world applications.

Zhichao Shi, Cehao Yang, Hao Zhou, Xiaojun Wu, Huajie Li, Xuhui Jiang, Chengjin Xu, Yuanzhuo Wang, Jian Guo • 2026

Related benchmarks

Task                                    Dataset            Result (Accuracy)   Rank
Medical Question Answering              MedQA              70.62               124
Financial Multimodal Reasoning          FinMME             36.96               18
Legal Evaluation                        LexEval            48                  14
Financial Language Analysis             Flare CFA          75.8                8
Multilingual Evaluation                 ABB                7.12                8
Multimodal Medical Question Answering   MedQA Multimodal   66.54               8