Evo: Autoregressive-Diffusion Large Language Models with Evolving Balance
About
We introduce **Evo**, a dual latent-trajectory model that bridges autoregressive (AR) and diffusion-based language generation within a continuous evolutionary generative framework. Rather than treating AR decoding and diffusion generation as separate paradigms, Evo reconceptualizes text generation as a latent flow: each token is associated with a vector-valued embedding that evolves over a progression variable $t_i \in [0, 1]$ indicating its semantic maturity. Low $t_i$ values correspond to confident, AR-like refinement, while high values invoke diffusion-style planning, allowing the model to adaptively balance AR and diffusion generation based on uncertainty. Theoretically, we show that both AR and diffusion models emerge as discretizations of a shared probability flow, and we derive Evo's training objective from a unified variational ELBO. The model is implemented as a time-conditioned Transformer governed by a shared vector field, trained end-to-end to jointly infer latent codes and their progression times. During decoding, Evo performs efficient, semantics-aware refinement, producing high-quality outputs without sacrificing speed. Empirically, Evo 8B achieves state-of-the-art or highly competitive results on 15 diverse benchmarks, including reasoning (GSM8K, ARC-C), code generation (HumanEval, MBPP), and general language understanding, while maintaining fast inference. Our results demonstrate that Evo offers a new paradigm for LLM design with strong generation quality, robust symbolic reasoning, and decoding efficiency.
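The latent-flow view above can be illustrated with a minimal numerical sketch. Everything below is an assumption for illustration only, not the paper's actual architecture: the vector field is a tiny fixed random map standing in for the time-conditioned Transformer, and scaling the update step by $t_i$ is a hypothetical way to realize "small AR-like corrections at low $t$, larger diffusion-style updates at high $t$".

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # embedding dimension (illustrative)

# Toy stand-in for the shared vector field f(z, t): a fixed random map
# conditioned on the progression time t (the paper uses a Transformer).
W = rng.normal(scale=0.1, size=(D + 1, D))

def vector_field(z, t):
    """Velocity of a token embedding at progression time t."""
    zt = np.concatenate([z, [t]])
    return np.tanh(zt @ W)

def refine(z, t, dt=0.1):
    """One Euler step of the latent flow.

    The step is scaled by t (an assumption): low-t tokens get small,
    AR-like corrections; high-t tokens get larger, diffusion-style updates.
    """
    z_new = z + dt * t * vector_field(z, t)
    return z_new, min(t + dt, 1.0)

# Two tokens with different "semantic maturity":
# t = 0.1 (confident) vs. t = 0.9 (uncertain, still being planned).
tokens = [(rng.normal(size=D), 0.1), (rng.normal(size=D), 0.9)]
moved = []
for z, t in tokens:
    z_new, _ = refine(z, t)
    moved.append(np.linalg.norm(z_new - z))

# The uncertain (high-t) token receives the larger update.
print(moved[0] < moved[1])
```

Under these assumptions, the update magnitude grows with $t_i$, so uncertain tokens are revised aggressively while confident tokens are only lightly refined, which is the adaptive AR/diffusion balance the abstract describes.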
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 82.1 | 1891 |
| Commonsense Reasoning | WinoGrande | Accuracy | 76.3 | 1085 |
| Code Generation | HumanEval | -- | -- | 1036 |
| Question Answering | ARC Challenge | Accuracy | 65.6 | 906 |
| Language Understanding | MMLU | Accuracy | 78.6 | 825 |
| Reasoning | BBH | Accuracy | 68.4 | 672 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 81.2 | 572 |
| Commonsense Reasoning | HellaSwag | Accuracy | 86.4 | 213 |
| Scientific Reasoning | GPQA | Accuracy | 39.1 | 75 |
| Question Answering | MMLU | Accuracy | 76.8 | 46 |