DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models

About

Recent advancements in large language models (LLMs) have significantly enhanced their knowledge and generative capabilities, leading to a surge of interest in leveraging LLMs for high-quality data synthesis. However, synthetic data generation via prompting LLMs remains challenging due to LLMs' limited understanding of target data distributions and the complexity of prompt engineering, especially for structured formatted data. To address these issues, we introduce DiffLM, a controllable data synthesis framework based on a variational autoencoder (VAE), which further (1) leverages diffusion models to preserve more information about the original distribution and format structure in the learned latent distribution and (2) decouples the learning of target distribution knowledge from the LLM's generative objectives via a plug-and-play latent feature injection module. Because we observed significant discrepancies between the VAE's latent representations and the real data distribution, we introduce a latent diffusion module into the framework to learn a fully expressive latent distribution. Evaluations on seven real-world datasets with structured formatted data (i.e., Tabular, Code, and Tool data) demonstrate that DiffLM generates high-quality data, with performance on downstream tasks surpassing that of real data by 2%-7% in certain cases. Data and code are available at https://github.com/bytedance/DiffLM.

Ying Zhou, Xinyao Wang, Yulei Niu, Yaojie Shen, Lexin Tang, Fan Chen, Ben He, Le Sun, Longyin Wen • 2024
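
The abstract names three components: a VAE latent, a diffusion model over that latent, and a module that injects the latent into the LLM. The following is a minimal, dependency-free sketch of how such pieces could fit together, not the authors' implementation. The linear noise schedule, the function names, and in particular the soft-prompt style of injection (projecting the latent to a few prefix vectors prepended to the token embeddings) are assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """VAE reparameterization: z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bar, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar


def add_noise(z0, t, alpha_bar, rng):
    """Forward diffusion on the latent: z_t = sqrt(a_t) z_0 + sqrt(1 - a_t) eps."""
    a = alpha_bar[t]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    return zt, eps


def inject_as_soft_prompt(z, proj, token_embeddings):
    """Hypothetical injection: map z to k prefix vectors and prepend them.

    proj is a list of k matrices, each of shape [d_model][d_latent].
    """
    prefix = [[sum(w * zi for w, zi in zip(row, z)) for row in mat]
              for mat in proj]
    return prefix + token_embeddings


rng = random.Random(0)
d_latent, d_model, k = 4, 6, 2

# Stand-in for an encoder output; a real encoder would predict these values.
mu, logvar = [0.0] * d_latent, [0.0] * d_latent
z0 = reparameterize(mu, logvar, rng)

# A denoiser would be trained to predict eps from (zt, t); it is omitted here.
alpha_bar = linear_alpha_bar(100)
zt, eps = add_noise(z0, 50, alpha_bar, rng)

# Random projection and zero token embeddings stand in for learned weights.
proj = [[[rng.uniform(-0.1, 0.1) for _ in range(d_latent)]
         for _ in range(d_model)] for _ in range(k)]
tokens = [[0.0] * d_model for _ in range(3)]
decoder_input = inject_as_soft_prompt(z0, proj, tokens)
print(len(decoder_input), len(decoder_input[0]))  # 5 6
```

In this sketch the decoder sees `k` extra vectors in front of its usual input, so the language model itself could stay frozen while only the projection is trained; whether DiffLM does this is not stated in the abstract.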

Related benchmarks

Task	Dataset	Metric	Result
Code Generation	HumanEval (test)	Pass@1	42.24	612
Code Generation	MBPP (test)	Pass@1	44.42	405
Tabular Data Generation	magic	Shape Fidelity	7.53	16
Tabular Data Generation	DEFAULT	Shape Fidelity	9.06	16
Tabular Data Generation	Adult	Shape Similarity	9.74	16
Tabular Data Generation	Shoppers	Shape Score	10.07	16
Tabular Data Generation	Beijing	Shape Score	6.35	16
Structured JSON Generation	MultiWOZ, Super-NaturalInstructions, TruthfulQA, and Self-Instruct Averaged	Similarity Score	0.74	16
Tabular Data Generation	Magic (test)	MLE	0.917	12
Tabular Data Generation	Beijing (test)	MLE	0.696	12

Showing 10 of 15 rows
