OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement
About
The introduction of large language models has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems like the GPT-4 Code Interpreter. To address this, we introduce OpenCodeInterpreter, a family of open-source code systems designed for generating, executing, and iteratively refining code. Supported by Code-Feedback, a dataset featuring 68K multi-turn interactions, OpenCodeInterpreter integrates execution and human feedback for dynamic code refinement. Our comprehensive evaluation of OpenCodeInterpreter across key benchmarks such as HumanEval, MBPP, and their enhanced versions from EvalPlus reveals its exceptional performance. Notably, OpenCodeInterpreter-33B achieves an average accuracy of 83.2 (76.4 on the plus versions) across HumanEval and MBPP, closely rivaling GPT-4's 84.2 (76.2), and rises further to 91.6 (84.6) with synthesized human feedback from GPT-4. OpenCodeInterpreter bridges the gap between open-source code generation models and proprietary systems like the GPT-4 Code Interpreter.
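The generate–execute–refine loop described above can be sketched as follows. This is a minimal illustration of the idea, not OpenCodeInterpreter's actual implementation: `generate` stands in for any code-generation model call, and `toy_generate` is a hypothetical stub that simulates a model correcting itself after seeing a traceback.

```python
import os
import subprocess
import sys
import tempfile


def run_code(code: str, timeout: int = 5):
    """Execute a candidate snippet in a subprocess; return (success, combined output)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.returncode == 0, proc.stdout + proc.stderr
    finally:
        os.unlink(path)


def refine_loop(generate, task: str, max_turns: int = 3):
    """Multi-turn refinement: feed execution feedback back into the generator.

    `generate(task, feedback)` is any callable that returns a code string;
    on the first turn `feedback` is None, afterwards it is the previous
    run's output (e.g. a traceback).
    """
    feedback = None
    code, output = "", ""
    for _ in range(max_turns):
        code = generate(task, feedback)
        ok, output = run_code(code)
        if ok:
            break
        feedback = output  # execution feedback drives the next turn
    return code, output


def toy_generate(task, feedback):
    """Hypothetical model stub: buggy first attempt, fixed after feedback."""
    if feedback is None:
        return "print(undefined_name)"  # raises NameError on execution
    return "print('fixed')"
```

A real system would replace `toy_generate` with a call to the model and could additionally incorporate human feedback between turns, as the Code-Feedback data does.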
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Function-level Code Generation | HumanEval+ augmented (test) | Pass@1 | 73.8 | 46 |
| Function-level Code Generation | MBPP+ augmented (test) | Pass@1 | 67.7 | 45 |
| Code Generation | HumanEval+ v1 (test) | Pass Rate | 0.897 | 41 |
| Code Reasoning | HumanEval | HumanEval Score | 79.3 | 35 |
| Code Reasoning | MBPP | MBPP Execution Accuracy | 77.2 | 33 |
| Multi-turn Code Generation | MT-Bench Coding | First Turn Score | 6.8 | 15 |
| Code Reasoning | BigCodeBench | BigCodeBench Score | 35.6 | 3 |