RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

About

Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve the desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks, where we achieve new state-of-the-art results with both small (8B parameters) and large (70B parameters) models while reducing the number of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.
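As a rough illustration of what "leveraging execution feedback over multiple steps" can look like at inference time, the sketch below runs a candidate program against a few public tests and feeds any failure message back into the dialog before the next attempt. It is a minimal sketch under stated assumptions, not the authors' implementation: `run_public_tests`, `solve_with_feedback`, `toy_policy`, the dialog format, and the feedback wording are all hypothetical names chosen for this example, and the reinforcement learning training described in the abstract is not shown.

```python
from typing import Callable, List, Tuple

# Hypothetical dialog format: a list of (role, text) messages.
Dialog = List[Tuple[str, str]]
TestCase = Tuple[tuple, object]  # (positional arguments, expected return value)


def run_public_tests(code: str, tests: List[TestCase], func_name: str) -> str:
    """Execute candidate code; return feedback text ('' means every test passed)."""
    namespace: dict = {}
    try:
        # NOTE: unsandboxed exec is acceptable only in a toy example like this one.
        exec(code, namespace)
        func = namespace[func_name]
        for args, expected in tests:
            got = func(*args)
            if got != expected:
                return f"{func_name}{args} returned {got!r}, expected {expected!r}"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return ""


def solve_with_feedback(
    task: str,
    generate: Callable[[Dialog], str],
    tests: List[TestCase],
    func_name: str,
    max_turns: int = 3,
) -> Tuple[str, int, bool]:
    """Ask `generate` for code, execute it, and append feedback until tests pass."""
    dialog: Dialog = [("user", task)]
    code = ""
    for turn in range(max_turns):
        code = generate(dialog)
        dialog.append(("assistant", code))
        feedback = run_public_tests(code, tests, func_name)
        if not feedback:
            return code, turn + 1, True
        # Ground the next attempt in the observed execution result.
        dialog.append(("user", f"Your code failed: {feedback}. Please fix it."))
    return code, max_turns, False


def toy_policy(dialog: Dialog) -> str:
    """Stand-in for an LLM: wrong on the first attempt, correct afterwards."""
    attempts = sum(1 for role, _ in dialog if role == "assistant")
    if attempts == 0:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


final_code, turns_used, solved = solve_with_feedback(
    "Write add(a, b).", toy_policy, [((1, 2), 3), ((0, 0), 0)], "add"
)
print(solved, turns_used)
```

In this toy run the stand-in policy fails its first attempt, receives the failure message, and succeeds on the second. A real setup would replace `toy_policy` with a model call and run the candidate code in a sandbox.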

Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Quentin Carbonneaux, Taco Cohen, Gabriel Synnaeve • 2024

Related benchmarks

Task	Dataset	Metric	Result	
Automated Program Repair	HumanEval Java (164 tasks)	Pass@1 Rate	74.3	16
Automated Program Repair	SWE-bench Verified (500 instances)	Pass@1 Rate	12.6	16
Automated Program Repair	QuixBugs-Java (40 bugs)	Pass@1 Rate	80	16
Automated Program Repair	Defects4J v2.0 (835 bugs)	Pass@1	8.4	16
Code Generation	DMC	Pass@1	65.3	8
Code Generation	LCB-IO	Pass@1	63.8	8
