Prompts

Benchmarks

Task Name	Dataset Name	Metric	SOTA Result
Text-to-Image Generation	prompts 500 sampled	HPSv2	0.3345	36
Speculative Decoding	20 Prompts across 4 Task Categories	Mean Expected Tokens per Speculation Step	6.55	20
Jailbreak Prompt Quality Evaluation	500 randomly sampled prompts	Similarity	81	16
Distribution-distance evaluation	Prompts 100 (evaluation)	Distinct-N (WM)	94.1	14
Creative Plot Generation	160 prompts NQD (test)	Character Development	8.67	13
Over-generation attack	1000 prompts (test)	Succ. @≥ 1	88.2	8
Text-to-Image Generation	prompts 10 randomly sampled	Inference Time (s)	2.2322	6
Property-based retrieval	Prompts (test)	MAP	0.48	6
Platform identification	1000 prompts (held-out)	CPRSD l(f)	1.1	5
Text-to-Video	1,024 prompts (held-out)	VQ	4.81	5
Panorama Generation	14 prompts 1000 panoramas of dimensions 512x4608	Intra-LPIPS	0.58	4
Text-to-Image Generation	400 prompts (test)	HPSv2	29.0533	4
Human Preference Evaluation (Harmlessness)	1,172 Prompts (test)	Win Count (CS)	677	3
Human Preference Evaluation (Helpfulness)	1,172 prompts (test)	CS Wins	695	3
Steering LLM states	50 prompts	LogFreq (d)	1.6666	3
3D Scene Editing	15 distinct single-task prompts	LLM Time	10.63	3
LLM agent alignment evaluation	1000 prompts (test)	Usefulness Score	1	2
Word Count Adherence	180 held-out prompts Length Far OOD 8–12k	Length Adherence Ratio	72	1
Word Count Adherence	180 held-out prompts Length Near OOD 4–8k	Length Adherence Ratio	87	1
Word Count Adherence	180 Prompts Length ID 1–4k (test)	Length Adherence Ratio	99	1
Story Generation Quality	180 held-out prompts Far OOD 8–12k	Story Quality Score	44.1	1
Story Generation Quality	180 prompts Near OOD 4–8k (held-out)	Story Quality Score	48.2	1