| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Instruction Following | Vicuna | Rouge-L 20.93 | 101 |
| Watermark Detection | Vicuna-7b 16k 50 samples v1.5 | AUROC (Overall) 0.986 | 94 |
| Adversarial Jailbreak Attack | Vicuna 7B | Attack Success Rate (ASR) 98.46 | 58 |
| Adversarial Jailbreak Attack | Vicuna 13B | Attack Success Rate (ASR) 98.65 | 55 |
| Instruction Following | Vicuna Eval (test) | ROUGE-L 20.37 | 36 |
| Watermark Attack Robustness | Vicuna 7b 16k v1.5 (test) | ASR 62 | 30 |
| Instruction Following | Vicuna | SBERT Similarity 73.6 | 24 |
| Instruction Following | Vicuna benchmark zero-shot | Pairwise Score (ChatGPT vs Sys) 119.4 | 21 |
| LLM-as-a-Judge Evaluation | Vicuna Benchmark | Pearson Correlation (r) 65.1 | 20 |
| Instruction Tuning | Vicuna | RougeL Score 18.73 | 19 |
| Instruction Following | Vicuna Eval | Win Rate (A) 66.3 | 19 |
| Open-ended generation | Vicuna | Skywork Reward V2 Score 99.1 | 18 |
| Hallucination Detection | SC-Vicuna | AUROC 71.4 | 18 |
| Instruction Following | Vicuna benchmark | GPT-4 Evaluation Score 8.09 | 18 |
| Instruction Following | Vicuna | Score 58.2 | 18 |
| Instruction Following Evaluation | Vicuna Out-of-Distribution | GPT-4o Score 51.9 | 17 |
| Dialogue Generation | Vicuna | Rouge-L 15.05 | 16 |
| Human alignment evaluation | Vicuna Evaluation Benchmark | Accuracy 76.3 | 16 |
| Response generation | Vicuna 80 prompts (test) | Elo 1,348 | 16 |
| Watermark Evasion | vicuna-7b 50 samples, UMD watermarking v1.5-16k (test) | ASR (0 Unattacked) 58 | 15 |
| Language Generation | Vicuna (test) | ROUGE-L 19.4 | 14 |
| Output Equivalence | Vicuna | Exact Match 97.3 | 13 |
| Instruction Following | Vicuna-bench | Score 8.24 | 13 |
| Instruction Following | Vicuna Eval | ROUGE-L 16.31 | 11 |
| Language Instruction Following | Vicuna-80 v1 (test) | Score 85.6 | 10 |