Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery
About
This technical report presents quantization-aware distillation (QAD) and our best practices for recovering the accuracy of NVFP4-quantized large language models (LLMs) and vision-language models (VLMs). QAD distills a full-precision teacher model into a quantized student model using a KL divergence loss. While applying distillation to quantized models is not a new idea, we observe two key advantages of QAD for today's LLMs: (1) it is effective and stable for models trained through multi-stage post-training pipelines, including supervised fine-tuning (SFT), reinforcement learning (RL), and model merging, where traditional quantization-aware training (QAT) suffers from engineering complexity and training instability; (2) it is robust to data quality and coverage, enabling accuracy recovery without the full training data. We evaluate QAD across multiple post-trained models, including AceReason Nemotron, Nemotron 3 Nano, Nemotron Nano V2, Nemotron Nano V2 VL (a VLM), and Llama Nemotron Super v1, and show consistent recovery to near-BF16 accuracy.
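As a rough illustration of the KL divergence loss mentioned above, the following minimal sketch computes a per-token KL term between teacher and student logits in plain Python. It is not the report's implementation: the function names (`log_softmax`, `kl_distillation_loss`), the KL(teacher || student) direction, the averaging over tokens, and the toy logits are all assumptions made for illustration. In an actual QAD setup the student logits would come from the NVFP4-quantized model and the teacher logits from the full-precision model.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_distillation_loss(teacher_logits, student_logits):
    """Mean over tokens of KL(teacher || student).

    Both arguments are lists of rows, one row of vocabulary logits per token.
    The KL direction and the token-level mean are assumptions of this sketch.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row)
        s_logp = log_softmax(s_row)
        # sum_v p_teacher(v) * (log p_teacher(v) - log p_student(v))
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    return total / len(teacher_logits)


# Toy logits (hypothetical): 2 tokens, vocabulary of 3.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[1.8, 0.7, -0.9], [0.0, 0.4, 2.5]]

loss_same = kl_distillation_loss(teacher, teacher)  # identical distributions
loss_diff = kl_distillation_loss(teacher, student)  # student deviates from teacher
```

The loss is zero when the student matches the teacher's output distribution and positive otherwise, so minimizing it pulls the quantized student toward the teacher's predictions without needing ground-truth labels.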
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy | 85.2 | 1117 |
| OCR Evaluation | OCRBench | Score | 858 | 296 |
| Instruction Following | IFEval | -- | -- | 292 |
| Visual Question Answering | ChartQA | Accuracy | 89.4 | 239 |
| Visual Question Answering | AI2D | Accuracy | 86.7 | 174 |
| Document Visual Question Answering | DocVQA | Accuracy | 93.9 | 81 |
| Mathematical Reasoning | MATH 500 | Accuracy | 97.2 | 26 |
| Code Generation | LiveCodeBench v6 | Accuracy | 53.3 | 23 |
| Information Visual Question Answering | InfoVQA | Accuracy | 78.4 | 18 |
| Mathematics | AIME25 | Accuracy | 87.9 | 16 |