Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search
About
We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.
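As a rough illustration of the starting point described above, the toy sketch below freezes every MLP and then decides which layers keep full attention. It is not the PostNAS implementation: the `Block` class, the helper names, and the per-layer `scores` are all hypothetical, and the top-k rule is a simple stand-in for the learned placement the abstract refers to.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for one transformer block."""
    index: int
    attention: str = "full"      # "full" or "linear"
    mlp_trainable: bool = True


def freeze_mlps(blocks):
    """Mark every MLP as frozen, mirroring the frozen-MLP starting point."""
    for block in blocks:
        block.mlp_trainable = False
    return blocks


def place_full_attention(blocks, scores, keep):
    """Keep full attention in the `keep` highest-scoring layers and
    switch all other layers to linear attention (assumed top-k rule)."""
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:keep])
    for block in blocks:
        block.attention = "full" if block.index in kept else "linear"
    return blocks


# Made-up importance scores for a 6-layer toy model (assumption, not paper data).
scores = [0.1, 0.7, 0.2, 0.9, 0.3, 0.05]
blocks = freeze_mlps([Block(i) for i in range(6)])
blocks = place_full_attention(blocks, scores, keep=2)
layout = [block.attention for block in blocks]
print(layout)
```

In this toy setup only the attention choice varies per layer while the MLPs stay fixed, which is what keeps the search over attention designs cheap relative to training each candidate from scratch.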
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | WinoGrande | Accuracy | 68.51 | 1442 |
| Commonsense Reasoning | HellaSwag | HellaSwag Accuracy | 52.38 | 711 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 77.97 | 696 |
| Multi-task Language Understanding | MMLU | MMLU Accuracy | 25.44 | 442 |
| Question Answering | OpenBookQA | Accuracy | 23.8 | 305 |
| Science Question Answering | ARC-E | Accuracy | 79.25 | 240 |
| Reasoning | ARC Easy | -- | -- | 233 |
| Multiple-choice Question Answering | MMLU | Accuracy | 65.24 | 210 |
| Reasoning | ARC Challenge | Accuracy | 39.76 | 81 |
| Truthfulness | TruthfulQA | Truthfulness Accuracy | 47.36 | 51 |