
Alignment for Honesty

About

Recent research has made significant strides in aligning large language models (LLMs) with helpfulness and harmlessness. In this paper, we argue for the importance of alignment for honesty, ensuring that LLMs proactively refuse to answer questions when they lack knowledge, while still not being overly conservative. However, a pivotal aspect of alignment for honesty involves discerning an LLM's knowledge boundaries, which demands comprehensive solutions in terms of metric development, benchmark creation, and training methodologies. We address these challenges by first establishing a precise problem definition and defining "honesty" inspired by the Analects of Confucius. This serves as a cornerstone for developing metrics that effectively measure an LLM's honesty by quantifying its progress post-alignment. Furthermore, we introduce a flexible training framework which is further instantiated by several efficient fine-tuning techniques that emphasize honesty without sacrificing performance on other tasks. Our extensive experiments reveal that these aligned models show a marked increase in honesty, as indicated by our proposed metrics. We open-source all relevant resources to facilitate future research at https://github.com/GAIR-NLP/alignment-for-honesty.
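The idea of measuring honesty by "quantifying its progress post-alignment" can be illustrated with a small sketch. The labels, function name, and the two rates below are assumptions chosen for illustration and are not the paper's exact metric definitions: each question is labeled as answered correctly, answered wrongly, or declined, once before and once after alignment, and the sketch reports how often the aligned model declines questions the original model could not answer versus questions it could.

```python
from collections import Counter

# Hypothetical response labels for illustration; not taken from the paper.
CORRECT, WRONG, IDK = "correct", "wrong", "idk"


def honesty_shift(before, after):
    """Compare per-question labels before and after alignment.

    Returns two illustrative rates (assumed definitions):
      - refused_when_unknown: share of questions the unaligned model got
        wrong or declined that the aligned model declines.
      - refused_when_correct: share of questions the unaligned model got
        right that the aligned model declines (over-conservativeness).
    """
    if len(before) != len(after):
        raise ValueError("before and after must cover the same questions")

    # Count (label_before, label_after) transitions per question.
    transitions = Counter(zip(before, after))

    unknown_total = sum(n for (b, _), n in transitions.items() if b != CORRECT)
    known_total = sum(n for (b, _), n in transitions.items() if b == CORRECT)
    refused_unknown = sum(
        n for (b, a), n in transitions.items() if b != CORRECT and a == IDK
    )
    refused_known = transitions[(CORRECT, IDK)]

    return {
        "refused_when_unknown": refused_unknown / unknown_total if unknown_total else 0.0,
        "refused_when_correct": refused_known / known_total if known_total else 0.0,
    }


if __name__ == "__main__":
    before = [CORRECT, CORRECT, WRONG, WRONG, IDK]
    after = [CORRECT, IDK, IDK, WRONG, IDK]
    print(honesty_shift(before, after))
```

Under these assumed definitions, a higher first rate with a low second rate would correspond to a model that declines more of what it does not know without becoming overly conservative.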

Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, Pengfei Liu • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | A-OKVQA | Acc | 66.5 | 202 |
| Visual Question Answering | A-OKVQA (test) | Accuracy | 68.3 | 88 |
| Visual Question Answering | ScienceVQA | ECE | 44.6 | 36 |
| Honesty Alignment | Natural Questions (NQ) In-Domain | AUROC | 84.89 | 33 |
| Visual Question Answering | CulturalVQA (test) | ECE | 46.3 | 19 |
| Honesty Alignment | HonestyBench OOD | Squad Score | 81.19 | 13 |
| Honesty Alignment | HonestyBench In-Domain | NQ Score | 84.89 | 13 |
| Question Answering | HotpotQA (test) | Accuracy | 25.52 | 12 |
| Question Answering | 2Wiki (test) | EM Accuracy | 4.7 | 12 |
| Question Answering Calibration | OOD Evaluation (Squad, WQ, CWQ, MSQ, PopQA) | Squad Calibration Score | 17 | 11 |
