| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Watermark Detection | Vicuna-7b 16k 50 samples v1.5 | AUROC (Overall): 0.986 | 94 |
| Watermark Attack Robustness | Vicuna 7b 16k v1.5 (test) | ASR: 62 | 30 |
| Instruction Following | Vicuna | SBERT Similarity: 73.6 | 24 |
| Instruction Following | Vicuna benchmark zero-shot | Pairwise Score (ChatGPT vs Sys): 119.4 | 21 |
| Instruction Following | Vicuna Eval | Win Rate (A): 66.3 | 19 |
| Instruction Following | Vicuna benchmark | GPT-4 Evaluation Score: 8.09 | 18 |
| Instruction Following | Vicuna | Score: 58.2 | 18 |
| Human alignment evaluation | Vicuna Evaluation Benchmark | Accuracy: 76.3 | 16 |
| Response generation | Vicuna 80 prompts (test) | Elo: 1,348 | 16 |
| Watermark Evasion | vicuna-7b 50 samples, UMD watermarking v1.5-16k (test) | ASR (0 Unattacked): 58 | 15 |
| Output Equivalence | Vicuna | Exact Match: 97.3 | 13 |
| Instruction Following | Vicuna-bench | Score: 8.24 | 13 |
| Language Instruction Following | Vicuna-80 v1 (test) | Score: 85.6 | 10 |
| Chatbot Evaluation | Vicuna benchmark | Elo Rating: 13,481 | 8 |
| Computational Efficiency Evaluation | Vicuna | ATGR: 0.88 | 7 |
| Open-ended instruction following | Vicuna Eval v1.3 (test) | A Win Rate: 65 | 7 |
| Instruction Following | Vicuna low-resource | Win Rate (bn): 0.85 | 7 |
| Instruction Following | Vicuna | Rouge-L: 17.8 | 6 |
| Jailbreak Attack | Vicuna | ASR: 96.67 | 5 |
| Instruction Following Evaluation | Vicuna Eval | Win Rate (A): 63.8 | 5 |
| Instruction Following | Vicuna (test) | Score A: 669 | 3 |