Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data

About

Synthetic data generation is a critical capability for data sharing, privacy compliance, system benchmarking and test data provisioning. Existing methods assume dense, fixed-schema tabular data, yet this assumption is increasingly at odds with modern data systems, from document databases and REST APIs to data lakes, which store and exchange data in sparse, semi-structured formats like JSON. Applying existing tabular methods to such data requires flattening nested data into wide, sparse tables, an approach that scales poorly. We present Origami, an autoregressive transformer-based architecture that tokenizes data records, including nested objects and variable-length arrays, into sequences of key, value and structural tokens. This representation natively handles sparsity, mixed types and hierarchical structure without flattening or imputation. Origami outperforms baselines spanning GAN, VAE, diffusion and autoregressive architectures on fidelity, utility and detection metrics across nearly all settings, while maintaining high privacy scores. On semi-structured datasets with up to 38% sparsity, baseline synthesizers either fail to scale or degrade substantially, while Origami maintains high-fidelity synthesis that is harder to distinguish from real data. To the best of our knowledge, Origami is the first architecture capable of natively modeling and generating semi-structured data end-to-end.
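
The tokenization idea described above can be illustrated with a minimal sketch. This is not Origami's actual tokenizer: the structural marker names (`<obj>`, `</obj>`, `<arr>`, `</arr>`), the `("key", ...)` and `("value", ...)` tuple encoding, and the `tokenize` function are all hypothetical, chosen only to show how a nested record could become one flat sequence of key, value and structural tokens without flattening into a wide table.

```python
import json

# Hypothetical structural markers; the paper's real vocabulary is not given here.
OBJ_START, OBJ_END = "<obj>", "</obj>"
ARR_START, ARR_END = "<arr>", "</arr>"


def tokenize(value):
    """Recursively turn a JSON-like value into a flat list of tokens."""
    if isinstance(value, dict):
        tokens = [OBJ_START]
        for key, child in value.items():
            tokens.append(("key", key))      # key token
            tokens.extend(tokenize(child))   # tokens for the nested value
        tokens.append(OBJ_END)
        return tokens
    if isinstance(value, list):
        tokens = [ARR_START]
        for child in value:                  # arrays may have any length
            tokens.extend(tokenize(child))
        tokens.append(ARR_END)
        return tokens
    return [("value", value)]                # scalar leaf: str, number, bool, None


record = json.loads('{"name": "Ada", "tags": ["a", "b"], "address": {"zip": 12345}}')
tokens = tokenize(record)

# A sparse record simply yields a shorter sequence; no placeholder is imputed.
sparse_tokens = tokenize({"name": "Ada"})
```

In this sketch a missing field produces no tokens at all, which is one way a sequence representation can accommodate sparsity without imputation; an autoregressive model would then be trained to predict such sequences token by token.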

Thomas Rückstieß, Robin Vujanic • 2026

Related benchmarks

Task                      Dataset            Result                  Rank
Tabular Data Synthesis    Adult              Shape Similarity 0.996  17
Tabular Data Synthesis    Diabetes           Shapes 0.996            15
Privacy Evaluation        Adult              --                      10
Privacy Evaluation        Diabetes           --                      9
Synthetic Data Detection  Adult              Overall Score 0.96      7
Synthetic Data Utility    Adult              Overall Score 99.8      7
Synthetic Data Detection  Diabetes           Overall Score 100       6
Synthetic Data Utility    Diabetes           Overall Score 98.4      6
Privacy Evaluation        Electric Vehicles  Overall Score 1         4
Synthetic Data Detection  Yelp               Overall Score 72.5      4
Showing 10 of 20 rows
