RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning
About
Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve the desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks, where we achieve new state-of-the-art results with both small (8B parameters) and large (70B) models while reducing the number of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.
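
To make "leveraging execution feedback over multiple steps" concrete, the following is a minimal sketch of an inference-time refinement loop. It is not the training procedure described above. It assumes feedback comes from a small set of input/output tests, and every name in it (`run_public_tests`, `solve_with_feedback`, the `generate` callback, the `max_turns` limit) is hypothetical and chosen for illustration only.

```python
import contextlib
import io
import sys
from typing import Callable, List, Tuple

Test = Tuple[str, str]  # (stdin text, expected stdout text)


def run_public_tests(code: str, tests: List[Test]) -> List[str]:
    """Execute `code` on each test and return one message per failure.

    NOTE: in-process exec() is for illustration only. Untrusted model
    output should run in a sandbox with time and memory limits.
    """
    failures = []
    for stdin_text, expected in tests:
        out = io.StringIO()
        old_stdin = sys.stdin
        sys.stdin = io.StringIO(stdin_text)  # feed the test input to input()
        try:
            with contextlib.redirect_stdout(out):
                exec(code, {"__name__": "__main__"})
        except Exception as exc:
            failures.append(
                f"input {stdin_text!r}: raised {type(exc).__name__}: {exc}"
            )
            continue
        finally:
            sys.stdin = old_stdin
        got = out.getvalue().strip()
        if got != expected.strip():
            failures.append(
                f"input {stdin_text!r}: expected {expected.strip()!r}, got {got!r}"
            )
    return failures


def solve_with_feedback(
    prompt: str,
    generate: Callable[[List[str]], str],  # hypothetical model call
    tests: List[Test],
    max_turns: int = 3,  # assumed turn budget
) -> Tuple[str, int]:
    """Ask for code, run the tests, and append failures to the dialog."""
    dialog = [prompt]
    code = ""
    for turn in range(max_turns):
        code = generate(dialog)
        failures = run_public_tests(code, tests)
        if not failures:
            return code, turn + 1  # solved within turn + 1 attempts
        # Ground the next attempt in the execution feedback.
        dialog += [code, "Tests failed:\n" + "\n".join(failures)]
    return code, max_turns
```

A stand-in generator that returns a buggy program first and a corrected one after seeing feedback exercises the loop:

```python
def fake_generate(dialog: List[str]) -> str:
    # First turn: wrong program. Later turns: fixed program.
    return "print(int(input()) + 1)" if len(dialog) == 1 else "print(int(input()) * 2)"

code, turns = solve_with_feedback("Double the input.", fake_generate, [("3\n", "6\n")])
```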
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Automated Program Repair | HumanEval Java (164 tasks) | Pass@1 rate | 74.3 | 16 |
| Automated Program Repair | SWE-bench Verified (500 instances) | Pass@1 rate | 12.6 | 16 |
| Automated Program Repair | QuixBugs-Java (40 bugs) | Pass@1 rate | 80 | 16 |
| Automated Program Repair | Defects4J v2.0 (835 bugs) | Pass@1 | 8.4 | 16 |
| Code Generation | DMC | Pass@1 | 65.3 | 8 |
| Code Generation | LCB-IO | Pass@1 | 63.8 | 8 |
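
The table does not say how its Pass@1 figures were computed. As a point of reference only, a widely used unbiased estimator of pass@k draws n samples per problem, counts the c samples that pass all tests, and computes 1 - C(n-c, k) / C(n, k). The sketch below implements that formula; the function name `pass_at_k` is illustrative, and nothing here implies the benchmarks above used this exact procedure.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Returns the probability that at least one of k samples, drawn
    without replacement from the n generated samples, is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 the estimator reduces to the fraction of correct samples, c / n.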