LLM Pruning and Distillation in Practice: The Minitron Approach

About

We present a comprehensive report on compressing the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation. We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning, and evaluate the results on common benchmarks from the LM Evaluation Harness. The models are then aligned with NeMo Aligner and tested in instruct-tuned versions. This approach produces a compelling 4B model from Llama 3.1 8B and a state-of-the-art Mistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo 12B. We found that with no access to the original data, it is beneficial to slightly fine-tune teacher models on the distillation dataset. We open-source our base model weights on Hugging Face with a permissive license.
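
As a rough illustration of the two ingredients named in the abstract, the sketch below shows depth pruning as the removal of a contiguous block of layers, and distillation as a loss that pulls the student's output distribution toward the teacher's. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say which distillation loss is used, so a forward KL divergence on logits is assumed here, and the names `log_softmax`, `distillation_kl`, and `depth_prune` are hypothetical.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def distillation_kl(teacher_logits, student_logits, temperature=1.0):
    """Forward KL divergence KL(teacher || student) for one token position.

    Assumption: the choice of forward KL on logits is illustrative;
    the abstract does not state the loss.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def depth_prune(layers, start, count):
    """Drop `count` contiguous layers beginning at index `start`.

    Assumption: which layers to drop would be chosen by an importance
    estimate that is not shown here.
    """
    return layers[:start] + layers[start + count:]


if __name__ == "__main__":
    # The loss is zero when the student matches the teacher, positive otherwise.
    print(distillation_kl([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    print(distillation_kl([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
    # An 8-layer stack with 3 layers removed keeps 5 layers.
    print(depth_prune(list(range(8)), start=2, count=3))
```

Width pruning would instead shrink the hidden, attention, and MLP dimensions inside each layer while keeping the layer count, which is why the abstract treats it as a separate strategy.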

Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Gerald Shen, Jiaqi Zeng, Zijia Chen, Yoshi Suhara, Shizhe Diao, Chenhan Yu, Wei-Chun Chen, Hayley Ross, Oluwatobi Olabiyi, Ashwath Aithal, Oleksii Kuchaiev, Daniel Korzekwa, Pavlo Molchanov, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro • 2024

Related benchmarks

Task	Dataset	Result
Abstract Reasoning	ARC-AGI 1	Accuracy 1.12	18
Abstract Reasoning	ARC-AGI 2	Accuracy 0.00	18
Abstract Reasoning	ConceptARC	Accuracy 5.83	16
Reasoning	ARC Mini	Accuracy 6.04	16
Abstraction and Reasoning	1D-ARC	Accuracy 3.55	13
Abstract Reasoning	ARC Community	Accuracy 6.52	9
Numeral sequence completion	Numeral sequences	Alignment 97.78	4
Question Answering	SimpleQA (200 random samples)	Alignment Score 0.9812	1

