Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models
About
As inference-time scaling becomes critical for enhanced reasoning capabilities, building models that are efficient to infer is increasingly important. We introduce Nemotron-H, a family of 8B and 56B/47B hybrid Mamba-Transformer models designed to reduce inference cost for a given accuracy level. To achieve this goal, we replace the majority of self-attention layers in the common Transformer model architecture with Mamba layers that perform constant computation and require constant memory per generated token. We show that Nemotron-H models offer better or on-par accuracy compared to other similarly-sized state-of-the-art open-source Transformer models (e.g., Qwen-2.5-7B/72B and Llama-3.1-8B/70B), while being up to 3× faster at inference. To further increase inference speed and reduce the memory required at inference time, we created Nemotron-H-47B-Base from the 56B model using MiniPuzzle, a new compression technique based on pruning and distillation. Nemotron-H-47B-Base achieves accuracy similar to the 56B model but is 20% faster to infer. In addition, we introduce an FP8-based training recipe and show that it achieves results on par with BF16-based training; this recipe is used to train the 56B model. We are releasing Nemotron-H base model checkpoints with support in Hugging Face and NeMo.
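The efficiency argument above can be illustrated with a toy sketch (assumed dimensions, not the Nemotron-H implementation): during decoding, a self-attention layer must cache keys and values for every past token, so its memory grows linearly with sequence length, while a Mamba-style state-space layer carries only a fixed-size recurrent state, so per-token compute and memory stay constant.

```python
import numpy as np

# Hypothetical toy sizes, chosen only for illustration.
d_model, d_state = 8, 4
rng = np.random.default_rng(0)
A = -np.abs(rng.standard_normal((d_model, d_state)))  # decay rates (kept negative)
B = rng.standard_normal((d_model, d_state))           # input projection
C = rng.standard_normal((d_model, d_state))           # output projection

def ssm_step(state, x):
    """One decoding step of a diagonal linear SSM: the state has fixed size."""
    state = np.exp(A) * state + B * x[:, None]  # (d_model, d_state), O(1) per token
    y = (C * state).sum(axis=-1)                # (d_model,) output for this token
    return state, y

state = np.zeros((d_model, d_state))  # constant-size recurrent state
kv_cache = []                         # attention analogue: one K/V entry per token
for t in range(1000):
    x = rng.standard_normal(d_model)
    state, y = ssm_step(state, x)     # memory footprint unchanged across steps
    kv_cache.append((x, x))           # grows linearly with generated length

print(state.size)     # stays 32 regardless of sequence length
print(len(kv_cache))  # grows to 1000 after 1000 tokens
```

This is why replacing most attention layers with Mamba layers cuts long-generation inference cost: only the remaining attention layers pay the growing KV-cache price.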
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | WinoGrande | Accuracy | 76.7 | 1085 |
| Question Answering | ARC Challenge | Accuracy | 56.5 | 906 |
| Language Understanding | MMLU | Accuracy | 71.7 | 825 |
| Question Answering | BoolQ | -- | -- | 317 |
| Recognizing Textual Entailment | RTE | Accuracy | 71.8 | 47 |
| Generative Question Answering | Bolmo Evaluation Suite GenQA 7B | GenQA Average | 71.2 | 39 |
| Code Generation | OlmoBaseEval Code (BigCodeBench, HumanEval, DeepSeek LeetCode, DS 1000, MBPP, MultiPL) | OlmoBaseEval Code Score | 37.1 | 34 |
| Mathematical Reasoning | OlmoBaseEval Math (GSM8k, GSM Symbolic, MATH) | Math Aggregate Score | 54.6 | 34 |
| Multiple Choice Non-STEM Question Answering | OlmoBaseEval MC Non-STEM (MMLU Humanities/Social Sci, CSQA, PiQA, SocialIQA, CoQA, DROP, Jeopardy, NaturalQs, SQuAD) | Aggregate Score | 80.7 | 34 |
| Long-context Retrieval | RULER | Retrieval Accuracy (8K) | 82 | 34 |