
TURINGBENCH: A Benchmark Environment for Turing Test in the Age of Neural Text Generation

About

Recent progress in generative language models has enabled machines to generate astonishingly realistic texts. While there are many legitimate applications of such models, there is also a rising need to distinguish machine-generated texts from human-written ones (e.g., fake news detection). However, to the best of our knowledge, there is currently no benchmark environment with datasets and tasks to systematically study the so-called "Turing Test" problem for neural text generation methods. In this work, we present the TuringBench benchmark environment, which comprises (1) a dataset with 200K human- or machine-generated samples across 20 labels {Human, GPT-1, GPT-2_small, GPT-2_medium, GPT-2_large, GPT-2_xl, GPT-2_PyTorch, GPT-3, GROVER_base, GROVER_large, GROVER_mega, CTRL, XLM, XLNET_base, XLNET_large, FAIR_wmt19, FAIR_wmt20, TRANSFORMER_XL, PPLM_distil, PPLM_gpt2}, (2) two benchmark tasks, Turing Test (TT) and Authorship Attribution (AA), and (3) a website with leaderboards. Our preliminary experimental results using TuringBench show that FAIR_wmt20 and GPT-3 are the current winners, among all language models tested, in generating the most human-like texts: five state-of-the-art TT detection models achieve their lowest F1 scores on text from these two generators. TuringBench is available at: https://turingbench.ist.psu.edu/

Adaku Uchendu, Zeyu Ma, Thai Le, Rui Zhang, Dongwon Lee • 2021
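
To make the relationship between the two tasks concrete, here is a minimal sketch of how the 20 labels listed in the abstract could be collapsed into a binary Turing Test label and scored with F1. Only the label names come from the abstract. The helper names (`to_tt_label`, `binary_f1`), the toy predictions, and the choice of "machine" as the positive class are assumptions made for illustration; the benchmark's actual data splits and scoring protocol may differ.

```python
# The 20 labels exactly as listed in the abstract.
LABELS = [
    "Human", "GPT-1", "GPT-2_small", "GPT-2_medium", "GPT-2_large",
    "GPT-2_xl", "GPT-2_PyTorch", "GPT-3", "GROVER_base", "GROVER_large",
    "GROVER_mega", "CTRL", "XLM", "XLNET_base", "XLNET_large",
    "FAIR_wmt19", "FAIR_wmt20", "TRANSFORMER_XL", "PPLM_distil", "PPLM_gpt2",
]


def to_tt_label(aa_label: str) -> str:
    """Collapse an Authorship Attribution label into a binary TT label.

    Hypothetical helper: AA keeps all 20 classes, TT only asks
    whether the author is human or a machine.
    """
    if aa_label not in LABELS:
        raise ValueError(f"unknown label: {aa_label}")
    return "human" if aa_label == "Human" else "machine"


def binary_f1(y_true, y_pred, positive="machine"):
    """F1 for one positive class (assumed here to be 'machine')."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only.
aa_truth = ["Human", "GPT-3", "Human", "GPT-3"]
tt_truth = [to_tt_label(label) for label in aa_truth]
tt_pred = ["human", "human", "human", "machine"]  # detector misses one GPT-3 text

score = binary_f1(tt_truth, tt_pred)
print(f"TT F1 on toy data: {score:.3f}")
```

In this framing, a generator whose texts are harder to tell apart from human writing pushes the detector's F1 down, which is the sense in which the abstract names FAIR_wmt20 and GPT-3 as the "winners".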

Related benchmarks

| Task | Dataset | Result | Rank |
| Classification | IMDB | Accuracy 95 | 56 |
| Binary Classification | IMDB | Accuracy 90 | 21 |
| Binary Classification (Human vs Assistive) | CNN/DailyMail | AUC 0.99 | 12 |
| Binary Classification (Human vs Creative) | CNN/DailyMail | AUC 99 | 12 |
| Binary Classification (Human vs Assistive) | All Datasets Combined | AUC 95 | 12 |
| Binary Classification (Human vs Assistive) | Wikipedia | AUC 94 | 12 |
| Binary Classification (Human vs Assistive) | arXiv | AUC 92 | 12 |
| Binary Classification (Assistive vs Creative) | CNN/DailyMail | AUC 78 | 12 |
| Binary Classification (Assistive vs Creative) | Wikipedia | AUC 86 | 12 |
| Binary Classification (Assistive vs Creative) | arXiv | AUC 0.81 | 12 |
Showing 10 of 15 rows
