SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF

About

Model alignment with human preferences is an essential step in making Large Language Models (LLMs) helpful and consistent with human values. It typically consists of supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) stages. However, RLHF faces inherent limitations stemming from a complex training setup and its tendency to align the model with implicit values that end users cannot control at run-time. Moreover, reward models in RLHF stage commonly rely on single-dimensional feedback as opposed to explicit, multifaceted signals that indicate attributes such as helpfulness, humor, and toxicity. To address these limitations, we propose SteerLM, a supervised fine-tuning method that empowers end-users to control responses during inference. SteerLM conditions responses to conform to an explicitly defined multi-dimensional set of attributes, thereby empowering a steerable AI capable of generating helpful and high-quality responses while maintaining customizability. Experiments show that SteerLM trained on open source datasets generates responses that are preferred by human and automatic evaluators to many state-of-the-art baselines trained with RLHF while being much easier to train. Try SteerLM at https://huggingface.co/nvidia/SteerLM-llama2-13B

Yi Dong, Zhilin Wang, Makesh Narsimhan Sreedhar, Xianchao Wu, Oleksii Kuchaiev• 2023

Related benchmarks

Task	Dataset	Result
Instruction Following	AlpacaEval	Win Rate68.8	420
Instruction Following	MT-Bench	MT-Bench Score5.7	287
Reward Modeling	RewardBench	Chat Score94.4	216
Reward Modeling	RM-Bench	--	137
Multimodal Reasoning	MMBench	--	127
Visual Reasoning and Instruction Following	MM-Vet	Overall Score35.2	23
Visual Instruction Following	LLaVA-Bench	Overall Score67	15
Social dilemma resolution	DAILY-DILEMMAS	Maslow Score90.3	8
Open-ended conversation	MIC (in-domain)	ROUGE-L15.03	8
Visual Multi-Choice	POPE	Accuracy87.8	6

Showing 10 of 10 rows

Other info

Follow for update

@wizwand_team Discord