SelfExplain: A Self-Explaining Architecture for Neural Text Classifiers

About

We introduce SelfExplain, a novel self-explaining model that explains a text classifier's predictions using phrase-based concepts. SelfExplain augments existing neural classifiers by adding (1) a globally interpretable layer that identifies the most influential concepts in the training set for a given sample and (2) a locally interpretable layer that quantifies the contribution of each local input concept by computing a relevance score relative to the predicted label. Experiments across five text-classification datasets show that SelfExplain facilitates interpretability without sacrificing performance. Most importantly, explanations from SelfExplain show sufficiency for model predictions and are perceived as adequate, trustworthy and understandable by human judges compared to existing widely-used baselines.

Dheeraj Rajagopal, Vidhisha Balachandran, Eduard Hovy, Yulia Tsvetkov• 2021

Related benchmarks

Task	Dataset	Result
Text Classification	AG-News	Accuracy92.5	248
Text Classification	IMDB	Accuracy93.6	141
Text Classification	CEBaB	Acc68.3	7
Text Classification	Hotel	Accuracy97.8	7
Text Classification	Beer	Accuracy87.3	7
Text Classification	SciCite	Accuracy85.6	5
Text Classification	Twitter	Accuracy81.7	5

Showing 7 of 7 rows

Other info

Follow for update

@wizwand_team Discord