Self-Improving Tabular Language Models via Iterative Reward-Guided Post-Training

About

Tabular language models can generate synthetic tables by modeling rows as token sequences, but they are typically trained once with supervised fine-tuning and then used as static synthesizers. This is limiting because next-token likelihood does not directly optimize the distributional, utility, and indistinguishability properties used to evaluate synthetic data. We study iterative reward-guided post-training for tabular language models through a generate-score-align protocol, where a generator samples synthetic rows, a task-specified reward ranks them, and the model is updated relative to a fixed supervised reference. Within this protocol, we propose TabGRAA (Tabular Group-Relative Advantage Alignment), a group-relative alignment method that compares high- and low-reward generated groups using group-averaged policy/reference log-ratios rather than one-to-one preference pairs. Across five mixed-type benchmarks, TabGRAA improves a GReaT backbone beyond additional supervised fine-tuning and achieves the strongest average trade-off among adapted DPO, KTO, and NPO baselines on fidelity and downstream utility, while maintaining empirical privacy diagnostics near the supervised baseline. Ablations show that the gains depend on meaningful reward ranking and stable group-level updates rather than extra training alone. Reward-substitution and scorer-separation studies further show that the post-training loop can use both classifier-based and classifier-free rewards, and that proper scorer separation is important for preserving the fidelity-utility-privacy trade-off. These results position TabGRAA as a self-improving post-training method for tabular language-model generators, complementary to strong static tabular synthesizers.
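
The group-relative comparison described in the abstract can be illustrated with a minimal sketch. This is not the paper's objective: the abstract only states that high- and low-reward groups are compared through group-averaged policy/reference log-ratios. The logistic loss form, the `beta` scale, the `top_frac` split, and the function name `group_relative_loss` are all assumptions introduced here for illustration, loosely modeled on DPO-style losses.

```python
import math
from statistics import mean


def group_relative_loss(policy_logps, ref_logps, rewards, beta=0.1, top_frac=0.5):
    """Hypothetical group-relative loss over one batch of generated rows.

    policy_logps / ref_logps: per-row sequence log-probabilities under the
    current generator and the fixed supervised reference.
    rewards: per-row scores from the task-specified reward.
    beta, top_frac: assumed hyperparameters, not taken from the paper.
    """
    n = len(rewards)
    if n < 2 or not (len(policy_logps) == len(ref_logps) == n):
        raise ValueError("need at least two rows and equal-length inputs")

    # Score step: rank rows by reward and split into high and low groups.
    k = max(1, min(n - 1, int(round(n * top_frac))))
    order = sorted(range(n), key=lambda i: rewards[i], reverse=True)
    high, low = order[:k], order[k:]

    # Align step: compare group-averaged policy/reference log-ratios
    # instead of forming one-to-one preference pairs.
    log_ratio = [p - r for p, r in zip(policy_logps, ref_logps)]
    margin = mean(log_ratio[i] for i in high) - mean(log_ratio[i] for i in low)

    # Assumed logistic form: -log(sigmoid(beta * margin)).
    loss = math.log1p(math.exp(-beta * margin))
    return loss, margin
```

Under these assumptions, the loss equals log 2 when the generator matches the reference, and it decreases as the generator shifts probability mass toward the high-reward group relative to the low-reward group. In an iterative loop, the generate and score steps would be repeated after each update, with the reference kept fixed.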

Yunbo Long, Tejumade Afonja, Guangya Hao, Alexandra Brintrup, Mario Fritz • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Tabular Synthetic Data Generation | DEFAULT | C2ST | 13.46 | 43
Tabular Data Synthesis | Beijing | C2ST | 0.9674 | 26
Tabular Data Synthesis | magic | C2ST | 0.9823 | 26
Tabular Data Generation | DEFAULT | C2ST | 0.9731 | 21
Tabular Data Alignment | Beijing dataset | CDE | 98.98 | 14
Tabular Data Generation | Magic original (test) | CDE | 95.58 | 14
Tabular Data Generation | Beijing original (test) | CDE | 98.98 | 14
Tabular Data Synthesis | Average of 5 Datasets (Adult, Shoppers, Beijing, and two others) | CDE | 95.47 | 14
Tabular Data Generation | Shoppers | C2ST | 97.83 | 14
Tabular Data Generation | Adult | C2ST | 0.9627 | 14
Showing 10 of 36 rows
