
Iterative Reasoning Preference Optimization

About

Iterative preference optimization methods have recently been shown to perform well for general instruction tuning tasks, but typically make little improvement on reasoning tasks (Yuan et al., 2024; Chen et al., 2024). In this work we develop an iterative approach that optimizes the preference between competing generated Chain-of-Thought (CoT) candidates by optimizing for winning vs. losing reasoning steps that lead to the correct answer. We train using a modified DPO loss (Rafailov et al., 2023) with an additional negative log-likelihood term, which we find to be crucial. We show reasoning improves across repeated iterations of this scheme. While only relying on examples in the training set, our approach results in increasing accuracy on GSM8K, MATH, and ARC-Challenge for Llama-2-70B-Chat, outperforming other Llama-2-based models not relying on additionally sourced datasets. For example, we see a large improvement from 55.6% to 81.6% on GSM8K and an accuracy of 88.7% with majority voting out of 32 samples.
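The training objective described above combines the standard DPO preference term with a negative log-likelihood (NLL) term on the winning chain-of-thought. The sketch below illustrates this combination for a single preference pair; it is a minimal illustration, not the authors' implementation, and the inputs (summed per-sequence log-probabilities under the policy and reference models) plus the `alpha` and `beta` coefficients are assumptions for the example.

```python
import math

def dpo_nll_loss(policy_logp_w, policy_logp_l,
                 ref_logp_w, ref_logp_l,
                 winner_len, beta=0.1, alpha=1.0):
    """Sketch of a DPO loss with an added NLL term on the winning
    (correct-answer) CoT. Inputs are summed token log-probabilities
    for one preference pair; alpha and beta are illustrative values."""
    # Standard DPO margin: policy vs. reference log-ratio of the
    # winning sequence minus that of the losing sequence
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    dpo_term = -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
    # Length-normalized NLL on the winning sequence, the extra term
    # the abstract identifies as crucial
    nll_term = -policy_logp_w / winner_len
    return dpo_term + alpha * nll_term
```

When policy and reference agree, the margin is zero and the DPO term reduces to log 2, leaving the NLL term to keep pushing probability mass onto the correct chain-of-thought.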

Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, Jason Weston • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| LLM Alignment Evaluation | AlpacaEval 2.0 (test) | LC Win Rate | 27.43 | 51 |
| Factual Knowledge Evaluation | PopQA | Accuracy | 0.4123 | 32 |
| Code Generation | HumanEval OOD | Pass@1 | 32.31 | 30 |
| Mathematical Reasoning | MATH OOD | Accuracy | 22.68 | 30 |
| Factual Knowledge Evaluation | Wikidata knowledge infusion | Accuracy | 58.92 | 18 |
| Dialogue Generation | Anthropic HH (test) | Average Preference Score | 60.31 | 16 |
| Sentiment Control Language Generation | IMDB | Perplexity | 34.08 | 14 |
| Visual Question Answering | SLAKE, VQA-RAD, and PathVQA Error-specific Subsets | MM | 54.1 | 14 |
| Visual Question Answering | SLAKE, VQA-RAD, and PathVQA Pooled | Accuracy | 39 | 14 |
| Summarization | Reddit TL;DR (test) | Preference vs SFT (%) | 69.37 | 8 |
