Dual-objective Language Models: Training Efficiency Without Overfitting
About
This paper combines autoregressive and masked-diffusion training objectives without any architectural modifications, resulting in flexible language models that outperform single-objective models. Autoregressive modeling has been a popular approach, partly because of its training efficiency; however, that comes at the cost of sensitivity to overfitting. On the other hand, masked-diffusion models are less efficient to train while being more resilient to overfitting. In this work, we demonstrate that dual-objective training achieves the best of both worlds. To derive the optimal balance between both objectives, we train and evaluate 50 language models under varying levels of data repetition. We show that it is optimal to combine both objectives under all evaluated settings and that the optimal balance is similar whether targeting autoregressive or masked-diffusion downstream performance.
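As a rough illustration of what combining the two objectives can look like, the sketch below blends an autoregressive loss and a masked-diffusion loss with a single mixing weight. It is not the authors' implementation. The names `ar_weight`, `mask_rate`, and all function names are hypothetical, the `1 / mask_rate` scaling is one common weighting from the masked-diffusion literature, and the abstract does not say whether the balance is applied as a loss weight (as assumed here) or as a split of training data between objectives.

```python
import math


def autoregressive_loss(next_token_logprobs):
    """Mean negative log-likelihood over next-token predictions."""
    return -sum(next_token_logprobs) / len(next_token_logprobs)


def masked_diffusion_loss(token_logprobs, is_masked, mask_rate):
    """Negative log-likelihood over masked positions only.

    Assumption: the sum is scaled by 1 / mask_rate and averaged over the
    sequence length, a weighting commonly used for masked diffusion.
    """
    masked_total = sum(lp for lp, m in zip(token_logprobs, is_masked) if m)
    return -masked_total / (mask_rate * len(token_logprobs))


def dual_objective_loss(ar_logprobs, md_logprobs, is_masked, mask_rate, ar_weight):
    """Blend both losses; ar_weight is a hypothetical mixing weight in [0, 1]."""
    if not 0.0 <= ar_weight <= 1.0:
        raise ValueError("ar_weight must be in [0, 1]")
    if not 0.0 < mask_rate <= 1.0:
        raise ValueError("mask_rate must be in (0, 1]")
    ar = autoregressive_loss(ar_logprobs)
    md = masked_diffusion_loss(md_logprobs, is_masked, mask_rate)
    return ar_weight * ar + (1.0 - ar_weight) * md


# Toy example: the model assigns probability 0.5 to every target token.
logprobs = [math.log(0.5)] * 4
is_masked = [True, False, True, False]  # half of the positions are masked
loss = dual_objective_loss(logprobs, logprobs, is_masked, mask_rate=0.5, ar_weight=0.25)
print(round(loss, 4))  # 0.6931
```

Setting `ar_weight` to 1.0 or 0.0 recovers the corresponding single-objective loss, so a sweep over this one value covers both baselines and every mixture in between.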
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 31.1 | 1460 |
| Question Answering | ARC Easy | Normalized Accuracy | 28.6 | 385 |
| Question Answering | OpenBookQA | Normalized Accuracy | 17.6 | 35 |
| Question Answering | ARC Challenge | Normalized Accuracy | 5.7 | 17 |
| Multitask Knowledge | MMLU | Accuracy | 4.9 | 15 |
| Linguistic Probing | BLiMP | Performance | 63.7 | 10 |
| Physical Reasoning | PIQA | PIQA Normalized Performance | 40.9 | 6 |
| Social Reasoning | SIQA | Performance (%) | 14.6 | 6 |
| Aggregate Zero-shot NLU Performance | 9-Task Suite Aggregate | Avg Normalized PLL Score | 25.3 | 4 |
| Commonsense Reasoning | HSWAG | Normalized PLL Score | 27.8 | 4 |