ReMoDetect: Reward Models Recognize Aligned LLM's Generations

About

The remarkable capabilities and easy accessibility of large language models (LLMs) have significantly increased societal risks (e.g., fake news generation), necessitating the development of LLM-generated text (LGT) detection methods for safe usage. However, detecting LGTs is challenging because the vast number of LLMs makes it impractical to account for each one individually; hence, it is crucial to identify the common characteristics shared by these models. In this paper, we draw attention to a common feature of recent powerful LLMs, namely alignment training, i.e., training LLMs to generate human-preferable texts. Our key finding is that because these aligned LLMs are trained to maximize human preferences, they generate texts with even higher estimated preferences than human-written texts; thus, such texts are easily detected by using a reward model (i.e., an LLM trained to model the human preference distribution). Based on this finding, we propose two training schemes to further improve the detection ability of the reward model, namely (i) continual preference fine-tuning to make the reward model prefer aligned LGTs even further and (ii) reward modeling of Human/LLM mixed texts (texts rephrased from human-written texts using aligned LLMs), which serve as a median preference text corpus between LGTs and human-written texts to learn the decision boundary better. We provide an extensive evaluation by considering six text domains across twelve aligned LLMs, where our method demonstrates state-of-the-art results. Code is available at https://github.com/hyunseoklee-ai/ReMoDetect.
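The two training schemes can be read as imposing a preference ordering on reward scores: aligned LGT above Human/LLM mixed text, and mixed text above human-written text. The sketch below illustrates that ordering with a Bradley-Terry style pairwise loss, which is a common choice for reward modeling but is an assumption here, since the abstract does not state the loss. The function names and the equal weighting of the three pairs are hypothetical and are not taken from the ReMoDetect code.

```python
import math


def pairwise_preference_loss(r_preferred: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(r_preferred - r_rejected)).

    Assumed loss form; computed as a numerically stable softplus(-margin).
    """
    margin = r_preferred - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def ordering_loss(r_lgt: float, r_mixed: float, r_human: float) -> float:
    """Sum of pairwise losses that encourage r_lgt > r_mixed > r_human.

    Equal weighting of the three pairs is an illustrative assumption.
    """
    return (
        pairwise_preference_loss(r_lgt, r_mixed)
        + pairwise_preference_loss(r_mixed, r_human)
        + pairwise_preference_loss(r_lgt, r_human)
    )


# Scores that respect the ordering give a lower loss than reversed scores.
good = ordering_loss(r_lgt=2.0, r_mixed=1.0, r_human=0.0)
bad = ordering_loss(r_lgt=0.0, r_mixed=1.0, r_human=2.0)
print(good < bad)
```

In an actual training loop the three scores would come from the reward model's scalar head on a batch of texts, and the loss would be minimized by gradient descent; this sketch only shows how the ordering is expressed.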

Hyunseok Lee, Jihoon Tack, Jinwoo Shin • 2024

Related benchmarks

Task	Dataset	Result
Machine-generated text detection	MGT benchmark Essay	AUROC 100	129
LGT Detection	Fast-DetectGPT PubMed (test)	AUROC 0.97	96
LGT Detection	Fast-DetectGPT XSum (test)	AUROC 100	96
LGT Detection	PubMed Fast-DetectGPT benchmark	AUROC 0.97	54
LGT Detection	XSum Fast-DetectGPT benchmark	AUROC 100	54
LGT Detection	WritingPrompts-small Fast-DetectGPT benchmark	AUROC 99.8	54
LGT Detection	WritingPrompts small Fast-DetectGPT benchmark (test)	AUROC 99.8	54
LGT Detection	MGTBench WritingPrompts	AUROC 100	45
Machine-generated text detection	MGT benchmark Reuters	AUROC 100	45
LGT Detection	Fast-DetectGPT WP-s (test)	AUROC 100	42

Showing 10 of 52 rows
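For readers unfamiliar with the metric, AUROC for a score-based detector is the probability that a randomly chosen LGT receives a higher score than a randomly chosen human-written text, with ties counted as one half. The minimal sketch below computes it directly from two lists of scores. The function name and the example scores are hypothetical, and this is not the evaluation code behind the numbers above. It returns a value in [0, 1]; multiply by 100 for the percentage scale.

```python
from typing import Sequence


def auroc(lgt_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """AUROC as the fraction of (LGT, human) pairs ranked correctly.

    A pair is correct when the LGT score is strictly higher; ties count 0.5.
    Quadratic in the number of samples, which is fine for illustration.
    """
    if not lgt_scores or not human_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for s_lgt in lgt_scores:
        for s_human in human_scores:
            if s_lgt > s_human:
                wins += 1.0
            elif s_lgt == s_human:
                wins += 0.5
    return wins / (len(lgt_scores) * len(human_scores))


# Made-up reward scores, for illustration only.
example = auroc(lgt_scores=[0.9, 0.4], human_scores=[0.5, 0.1])
print(example)
```

With a reward model as the detector, `lgt_scores` and `human_scores` would be the reward values assigned to each text in the two classes.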
