Vera: A General-Purpose Plausibility Estimation Model for Commonsense Statements

About

Despite the much discussed capabilities of today's language models, they are still prone to silly and unexpected commonsense failures. We consider a retrospective verification approach that reflects on the correctness of LM outputs, and introduce Vera, a general-purpose model that estimates the plausibility of declarative statements based on commonsense knowledge. Trained on ~7M commonsense statements created from 19 QA datasets and two large-scale knowledge bases, and with a combination of three training objectives, Vera is a versatile model that effectively separates correct from incorrect statements across diverse commonsense domains. When applied to solving commonsense problems in the verification format, Vera substantially outperforms existing models that can be repurposed for commonsense verification, and it further exhibits generalization capabilities to unseen tasks and provides well-calibrated outputs. We find that Vera excels at filtering LM-generated commonsense knowledge and is useful in detecting erroneous commonsense statements generated by models like ChatGPT in real-world settings.

Jiacheng Liu, Wenya Wang, Dianzhuo Wang, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi • 2023
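
The "verification format" mentioned in the abstract can be illustrated with a minimal sketch: each candidate answer is written as a declarative statement, a plausibility scorer assigns each statement a score in [0, 1], and the highest-scoring statement is taken as the answer. The same scores can be thresholded to filter generated statements. Everything named below is an assumption made for illustration: `toy_score` is a hypothetical lookup-table stand-in, not Vera or its API, and the 0.5 threshold is an arbitrary choice.

```python
from typing import Callable, List, Sequence

# A scorer maps a declarative statement to a plausibility score in [0, 1].
Scorer = Callable[[str], float]


def pick_most_plausible(statements: Sequence[str], score: Scorer) -> int:
    """Return the index of the statement with the highest plausibility score."""
    if not statements:
        raise ValueError("need at least one candidate statement")
    scores = [score(s) for s in statements]
    return max(range(len(scores)), key=scores.__getitem__)


def filter_plausible(
    statements: Sequence[str], score: Scorer, threshold: float = 0.5
) -> List[str]:
    """Keep only statements whose score reaches the (assumed) threshold."""
    return [s for s in statements if score(s) >= threshold]


# Hypothetical scores for illustration only; a real system would call a
# trained plausibility model here instead of a lookup table.
TOY_SCORES = {
    "A bird has wings.": 0.95,
    "A bird has wheels.": 0.05,
    "A bird has gills.": 0.10,
}


def toy_score(statement: str) -> float:
    """Stand-in scorer: look up a fixed score, defaulting to 0.5 if unknown."""
    return TOY_SCORES.get(statement, 0.5)


candidates = list(TOY_SCORES)
best = pick_most_plausible(candidates, toy_score)
kept = filter_plausible(candidates, toy_score)
print(candidates[best])
print(kept)
```

Passing the scorer in as a plain callable keeps the selection and filtering logic independent of whichever model produces the scores.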

Related benchmarks

Task	Dataset	Result
Commonsense Reasoning	WinoGrande	Accuracy 92.4	1442
Physical Commonsense Reasoning	PIQA	Accuracy 77.2	696
Physical Interaction Question Answering	PIQA	Accuracy 88.5	415
Social Interaction Question Answering	SIQA	Accuracy 80.1	157
Physical Commonsense Reasoning	PIQA (val)	Accuracy 77.2	118
Social Commonsense Reasoning	SIQA	Accuracy 58.2	112
Commonsense Question Answering	CSQA	Accuracy 63	71
Abductive Commonsense Reasoning	ANLI (test)	Accuracy 73.2	53
Compositional Reasoning	SugarCrepe	--	50
Abductive Natural Language Inference	aNLI (leaderboard)	Accuracy 83.9	47

Showing 10 of 25 rows
