BLEUBERI: BLEU is a surprisingly effective reward for instruction following

About

Reward models are central to aligning LLMs with human preferences, but they are costly to train, requiring large-scale human-labeled preference data and powerful pretrained LLM backbones. Meanwhile, the increasing availability of high-quality synthetic instruction-following datasets raises the question: can simpler, reference-based metrics serve as viable alternatives to reward models during RL-based alignment? In this paper, we show first that BLEU, a basic string-matching metric, surprisingly matches strong reward models in agreement with human preferences on general instruction-following datasets. Based on this insight, we develop BLEUBERI, a method that first identifies challenging instructions and then applies Group Relative Policy Optimization (GRPO) using BLEU directly as the reward function. We demonstrate that BLEUBERI-trained models are competitive with models trained via reward model-guided RL across four challenging instruction-following benchmarks and three different base language models. A human evaluation further supports that the quality of BLEUBERI model outputs is on par with those from reward model-aligned models. Moreover, BLEUBERI models generate outputs that are more factually grounded than competing methods. Overall, we show that given access to high-quality reference outputs (easily obtained via existing instruction-following datasets or synthetic data generation), string matching-based metrics are cheap yet effective proxies for reward models during alignment. We release our code and data at https://github.com/lilakk/BLEUBERI.
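The core idea in the abstract, scoring each sampled response by its BLEU overlap with a reference and then normalizing rewards within a group as GRPO does, can be sketched in a few lines. This is a minimal illustration under stated assumptions and not the authors' implementation: `simple_bleu` is a hypothetical, simplified sentence-level BLEU (whitespace tokenization, a single reference, add-one smoothing), `group_relative_advantages` is a hypothetical helper, and the example strings are made up. The released code at the URL above is the place to look for the actual reward function and training setup.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU in [0, 1] (assumption: add-one smoothing)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: an n-gram is credited at most as often as it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        # Add-one smoothing keeps the logarithm finite when nothing matches.
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty: candidates shorter than the reference are penalized.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_precision)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / std."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up example: one reference answer and three sampled responses.
reference = "Paris is the capital of France"
samples = [
    "Paris is the capital of France",
    "The capital is Paris",
    "I like bananas",
]
rewards = [simple_bleu(s, reference) for s in samples]
advantages = group_relative_advantages(rewards)
for s, r, a in zip(samples, rewards, advantages):
    print(f"{r:.3f}  {a:+.3f}  {s}")
```

Responses closer to the reference receive a higher reward and therefore a positive advantage within the group, while the others are pushed down. In practice one would likely use an established BLEU implementation rather than this hand-rolled version.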

Yapei Chang, Yekyung Kim, Michael Krumdick, Amir Zadeh, Chuan Li, Chris Tanner, Mohit Iyyer • 2025

Related benchmarks

Task                          Dataset         Result                                 Rank
Instruction Following         IFEval          Accuracy (0-100): 69.2                 292
Instruction Following         AlpacaEval 2.0  LC Win Rate: 30.4                      281
General Knowledge             MMLU            MMLU General Knowledge Accuracy: 69.2  170
Mathematical Problem Solving  MATH            Accuracy: 50.2                         166
Code                          HumanEval       HumanEval Accuracy: 72.8               50
Multi-turn conversation       MT-Bench        Conversation Rating (1-10): 8.2        41
Instruction Following         FollowBench     --                                     39
Science Reasoning             ARC             Accuracy: 82                           10
Technical problem-solving     Arena Hard      Win Rate: 39.6                         10