Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2
About
Since the release of TÜLU [Wang et al., 2023b], open resources for instruction tuning have developed quickly, from better base models to new finetuning techniques. We test and incorporate a number of these advances into TÜLU, resulting in TÜLU 2, a suite of improved TÜLU models for advancing the understanding and best practices of adapting pretrained language models to downstream tasks and user preferences. Concretely, we release: (1) TÜLU-V2-mix, an improved collection of high-quality instruction datasets; (2) TÜLU 2, LLAMA-2 models finetuned on the V2 mixture; (3) TÜLU 2+DPO, TÜLU 2 models trained with direct preference optimization (DPO), including the largest DPO-trained model to date (TÜLU 2+DPO 70B); (4) CODE TÜLU 2, CODE LLAMA models finetuned on our V2 mix that outperform CODE LLAMA and its instruction-tuned variant, CODE LLAMA-Instruct. Our evaluation from multiple perspectives shows that the TÜLU 2 suite achieves state-of-the-art performance among open models and matches or exceeds the performance of GPT-3.5-turbo-0301 on several benchmarks. We release all the checkpoints, data, and training and evaluation code to facilitate future open efforts on adapting large language models.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval | Pass@1 | 6.95e+3 | 850 |
| Multi-task Language Understanding | MMLU | Accuracy | 67.8 | 842 |
| Instruction Following | IFEval | -- | -- | 292 |
| Instruction Following | AlpacaEval 2.0 | -- | -- | 281 |
| Mathematical Reasoning | GSM8K | Accuracy | 52.5 | 212 |
| Instruction Following | MT-Bench | MT-Bench Score | 7.89 | 189 |
| Instruction Following | AlpacaEval | Win Rate | 85.1 | 125 |
| Mathematical Reasoning | MATH | Pass@1 | 65.2 | 112 |
| Multitask Language Understanding | MMLU-Pro | Accuracy | 40.5 | 99 |
| Instruction Following | Arena Hard | Win Rate | 15 | 77 |