Fixed-Point Masked Generative Modeling
About
Masked Generative Models (MGMs) enable parallel decoding and achieve strong performance across modalities, but require full-sequence bidirectional transformers at every step, making training costly and degrading quality under low sampling budgets. Existing work improves efficiency via better samplers or cheaper fixed-depth denoisers, but they still allocate a fixed amount of denoiser computation to each refinement step. We introduce Fixed-Point Masked Generative Models (FP-MGMs), which replace part of the denoiser with a fixed-point solver over shared attention layers to enable adaptive depth with fewer parameters. To make it more effective for masked generation, we first introduce a cross-step consistency loss, which aligns hidden representations at neighboring denoising steps and, second, three-state reuse (3SR) which warm-starts the solver using the previous solution by treating differently unchanged, still-masked, and newly revealed tokens respectively. Together, these components define our complete training-to-inference framework for fixed-point masked generation, \emph{CoFRe}. We also show that pre-trained MGMs can be converted into FP-MGMs with short fine-tuning, avoiding full retraining. Across modalities, CoFRe improves the quality and cost trade-off. On OpenWebText, CoFRe reduces parameters by 38.8\%, training time by 11.5\%, and VRAM by 16.9\%, while improving generative perplexity from 830.8 to 101.8 at a budget of $96$ transformer-block forward passes, compared to MDLM. In ImageNette, CoFRe reduces training time by 48.6\% and VRAM by 50.7\%, while improving FID in all sample budgets tested. Overall, CoFRe offers a practical framework for cheaper training and stronger low-budget masked generation.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Multiple-choice Question Answering | ARC Easy | Accuracy36.7 | 257 | |
| Social Interaction Question Answering | SIQA | Accuracy35.82 | 157 | |
| Multiple-choice Question Answering | OBQA | Accuracy25.2 | 79 | |
| Language Modeling | OWT | Gen. PPL37.846 | 78 | |
| Multiple-choice Question Answering | RACE | Accuracy27.27 | 64 | |
| Multiple-choice Question Answering | PIQA | Accuracy55.33 | 63 | |
| Text Generation | OpenWebText | Gen PPL37.846 | 54 | |
| Boolean Question Answering | BoolQ | Accuracy60.21 | 27 | |
| Image Generation | Imagenette | FID22.8381 | 18 | |
| Multiple-choice Question Answering | ARC-C | Accuracy22.87 | 18 |