NEFTune: Noisy Embeddings Improve Instruction Finetuning
About
We show that language model finetuning can be improved, sometimes dramatically, with a simple augmentation. NEFTune adds noise to the embedding vectors during training. Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings. NEFTune also improves over strong baselines on modern instruction datasets. Models trained with Evol-Instruct see a 10% improvement, with ShareGPT an 8% improvement, and with OpenPlatypus an 8% improvement. Even powerful models further refined with RLHF such as LLaMA-2-Chat benefit from additional training with NEFTune.
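The core idea can be sketched in a few lines. The paper samples noise uniformly from [-1, 1] and scales it by α/√(Ld), where L is sequence length and d is the embedding dimension; the snippet below is a minimal NumPy sketch of that perturbation (function name and the default α are illustrative, not from an official implementation):

```python
import numpy as np

def neftune_noise(embeddings: np.ndarray, alpha: float = 5.0, rng=None) -> np.ndarray:
    """Add NEFTune-style uniform noise to a (seq_len, dim) embedding matrix.

    Noise is drawn from Uniform(-1, 1) and scaled by alpha / sqrt(L * d),
    so the perturbation magnitude shrinks as sequence length L and
    embedding dimension d grow. Applied only during training.
    """
    rng = np.random.default_rng() if rng is None else rng
    L, d = embeddings.shape
    scale = alpha / np.sqrt(L * d)
    noise = rng.uniform(-1.0, 1.0, size=embeddings.shape) * scale
    return embeddings + noise
```

In a real finetuning loop this would be applied to the output of the model's embedding layer on each training step, and disabled at inference time.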
Neel Jain, Ping-yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein · 2023
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval | -- | -- | 850 |
| Language Understanding | MMLU | Accuracy | 49.8 | 756 |
| Multi-turn Dialogue Evaluation | MT-Bench | Overall Score | 5.05 | 331 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 67.6 | 329 |
| Instruction Following | IFEval | Accuracy (0-100) | 42.7 | 292 |
| Science Question Answering | ARC-C | Accuracy | 55.9 | 127 |
| Code Generation | MBPP | Accuracy | 29.2 | 120 |
| Open-ended generation | AlpacaEval 2.0 | Win Rate | 287 | 43 |
| General Natural Language Processing | 18 Canonical NLP Tasks | Understanding & Knowledge | 65.9 | 23 |
| Open-ended generation | AlpacaEval 1.0 | Win Rate | 3.98e+3 | 23 |
Showing 10 of 14 rows.