
Tuning Language Models by Proxy

About

Despite the general capabilities of large pretrained language models, they consistently benefit from further adaptation to better achieve desired behaviors. However, tuning these models has become increasingly resource-intensive, or impossible when model weights are private. We introduce proxy-tuning, a lightweight decoding-time algorithm that operates on top of a black-box LM to achieve the same end as direct tuning, but by accessing only its predictions over the output vocabulary, not its parameters. Our method tunes a smaller LM, then applies the difference between the predictions of the small tuned and untuned LMs to shift the original predictions of the larger untuned model in the direction of tuning, while retaining the benefits of larger-scale pretraining. In experiments, when we apply proxy-tuning to Llama2-70B using proxies of only 7B size, we close 88% of the gap between Llama2-70B and its truly-tuned chat version when evaluated across knowledge, reasoning, and safety benchmarks. We then demonstrate the generality of proxy-tuning by applying it to domain adaptation on code, and to task-specific finetuning on question answering and math problems. Finally, we show how to proxy-tune a truly black-box LM, GPT-3.5, for temporal adaptation, increasing its knowledge of recent events. Our work demonstrates the promise of using small tuned LMs to efficiently customize large, potentially proprietary LMs through decoding-time guidance.
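The core of the method described above is simple logit arithmetic at each decoding step: the large base model's next-token logits are shifted by the difference between a small tuned ("expert") and small untuned ("anti-expert") model. The sketch below illustrates that arithmetic with NumPy; the function names and the sampling wrapper are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def proxy_tuned_logits(base_logits, expert_logits, anti_expert_logits):
    """Shift the base model's next-token logits by the tuning delta:
    s = s_base + (s_expert - s_anti_expert), per the abstract's description.
    All three arrays must share the same output vocabulary."""
    return base_logits + (expert_logits - anti_expert_logits)

def sample_next_token(base_logits, expert_logits, anti_expert_logits, rng=None):
    """Softmax over the shifted logits, then sample one token id.
    A hypothetical helper for illustration only."""
    rng = rng if rng is not None else np.random.default_rng(0)
    logits = proxy_tuned_logits(base_logits, expert_logits, anti_expert_logits)
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

In practice this shift is applied at every decoding step, so the large model's outputs move in the direction of tuning while its pretraining knowledge still dominates the distribution.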

Alisa Liu, Xiaochuang Han, Yizhong Wang, Yulia Tsvetkov, Yejin Choi, Noah A. Smith • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy | 52.5 | 212 |
| Instruction Following | AlpacaEval | Win Rate | 72.3 | 125 |
| Instruction Following | AlpacaEval, 805 instructions (test) | Win Rate | 10.47 | 14 |
| Biomedical Question Answering | BioASQ | Factoid Acc | 29 | 11 |
| Classification | MTS-Specialty, 200 random samples (test) | Macro F1 Score | 9.5 | 6 |
| Text Generation | MTS-Procedure, 100 samples (test) | MedHelm LLM-Jury Score | 3.466 | 6 |
| Text Generation | MIMIC-RRS, 100 samples (test) | MedHelm LLM-Jury Score | 3.898 | 6 |
| Text Generation | MIMIC-BHC, 100 samples (test) | MedHelm LLM-Jury Score | 3.456 | 6 |
| Classification | MedNLI, 200 random samples (test) | Macro F1 Score | 57.5 | 6 |
| Classification | CLIP, 200 random samples (test) | Macro F1 Score | 0.143 | 6 |
