JoPA: Explaining Large Language Model's Generation via Joint Prompt Attribution

About

Large Language Models (LLMs) have demonstrated impressive performance in complex text generation tasks. However, the contribution of the input prompt to the generated content remains obscure to humans, underscoring the need to understand the causality between input and output pairs. Existing works that provide prompt-specific explanations often confine the model output to classification or next-word prediction. The few initial attempts to explain the entire language generation often treat input prompt texts independently, ignoring their combinatorial effects on the subsequent generation. In this study, we introduce a counterfactual explanation framework based on Joint Prompt Attribution, JoPA, which aims to explain how a few prompt texts collaboratively influence the LLM's complete generation. In particular, we formulate the task of prompt attribution for generation interpretation as a combinatorial optimization problem, and introduce a probabilistic algorithm to search for the causal input combination in the discrete space. We define and utilize multiple metrics to evaluate the produced explanations, demonstrating both the faithfulness and efficiency of our framework.

Yurui Chang, Bochuan Cao, Yujia Wang, Jinghui Chen, Lu Lin • 2024
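
To make the combinatorial framing in the abstract concrete, here is a minimal sketch of searching for a jointly influential subset of k prompt tokens. It is not the authors' algorithm: it swaps in a generic stochastic local search, and the names `joint_attribution_search`, `score_fn`, `weights`, and `toy_score` are all hypothetical. In a real setting, `score_fn` would presumably mask the chosen tokens, regenerate with the LLM, and measure how much the output changes; here a toy function stands in for that step.

```python
import random
from typing import Callable, FrozenSet, Optional, Sequence


def joint_attribution_search(
    num_tokens: int,
    k: int,
    score_fn: Callable[[FrozenSet[int]], float],
    weights: Optional[Sequence[float]] = None,
    steps: int = 200,
    seed: int = 0,
) -> FrozenSet[int]:
    """Stochastic local search for k token indices that maximize score_fn.

    Assumption: score_fn(mask) is higher when removing the tokens in `mask`
    changes the generation more. `weights` are optional positive per-token
    priors (for example, from a cheap saliency estimate).
    """
    rng = random.Random(seed)
    if weights is None:
        weights = [1.0] * num_tokens

    # Start from a random subset of size k.
    current = set(rng.sample(range(num_tokens), k))
    best_score = score_fn(frozenset(current))

    for _ in range(steps):
        outside = [i for i in range(num_tokens) if i not in current]
        if not outside:
            break  # every token is already selected; nothing to swap
        inside = sorted(current)

        # Propose a swap: high-weight tokens are more likely to enter,
        # low-weight tokens are more likely to leave.
        add = rng.choices(outside, weights=[weights[i] for i in outside])[0]
        drop = rng.choices(inside, weights=[1.0 / weights[i] for i in inside])[0]
        candidate = (current - {drop}) | {add}

        # Keep the swap only if the joint score strictly improves.
        cand_score = score_fn(frozenset(candidate))
        if cand_score > best_score:
            current, best_score = candidate, cand_score

    return frozenset(current)


def toy_score(masked: FrozenSet[int]) -> float:
    """Hypothetical stand-in for 'how much the generation changes'.

    Tokens 1 and 4 matter mostly in combination: masking one of them has a
    small effect, masking both has a large one.
    """
    hits = len({1, 4} & masked)
    return {0: 0.0, 1: 0.2, 2: 1.0}[hits]


result = joint_attribution_search(num_tokens=8, k=2, score_fn=toy_score, steps=500)
print(sorted(result))
```

The toy score illustrates why treating tokens independently can mislead: each of the two key tokens looks only mildly important on its own, while the pair accounts for most of the effect. Each search step costs one call to `score_fn`, which in practice would mean one LLM generation, so the step budget is the main cost knob in this sketch.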

Related benchmarks

Task	Dataset	Result
Faithfulness Measurement	MHC	BLEU 57.9	18
Faithfulness Measurement	Alpaca	BLEU 0.484	12
Faithfulness Measurement	tldr_news	BLEU 69.2	12
Faithfulness Evaluation	Alpaca 800 samples	BLEU 47.9	5
Faithfulness Evaluation	tldr_news 800 samples	BLEU 68.7	5
Explanation Generation	Alpaca avg prompt instance	Inference Time (s) 15.225	2
Explanation Generation	tldr_news avg prompt instance	Latency (s) 15.397	2
Explanation Generation	MHC avg prompt instance	Time (s) 14.473	2
Malicious Prompt Detection	GCG attacks on Llama-2 7B-Chat	Detection Accuracy 100	1
Malicious Prompt Detection	Llama-2 Prompt with Random Search 7B-Chat	Detection Accuracy 91	1
