| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Circuit Discovery Evaluation | Gemma-2-2B | Clarity 82 | 70 |
| Automated Interpretability Evaluation | Gemma-2-2B | Clarity 80 | 50 |
| Watermarking Attack Robustness | Gemma 9B v2 (test) | TPR 100 | 49 |
| Negative Sentiment Backdoor Detection | Gemma 2 9B | Attack Success Rate (ASR) 0 | 48 |
| Refusal Backdoor Detection | Gemma-2-9B | ASR 0 | 42 |
| Model Steering | Gemma 2 2B Steering Evaluation Set | Granularity 1.2961 | 20 |
| Sparse Autoencoder Evaluation | Gemma-2-2B activations | L0 Count 320 | 20 |
| Jailbreak Attack | Gemma 4B 3 | NR 66 | 20 |
| Jailbreak Attack | Gemma-7b five finetuned variants | Average ASR 66.2 | 16 |
| Jailbreak Attack | gemma-7b v1 (pretrained) | ASR 6 | 13 |
| LLM Alignment | Gemma-3-4B | Win Rate 94.33 | 12 |
| LLM fingerprinting | Gemma 2 2B | AUC 1 | 10 |
| Language Modeling | Gemma 3 | Accuracy 47.06 | 10 |
| Semantic Attribute Alignment | Gemma animal-attribute prompts | Happy Score 26.11 | 9 |
| Jailbreak Attack | Gemma-3 27B-it | ASR 92 | 9 |
| Model Utility | Gemma-2B-IT | Utility 57.8 | 8 |
| Contextual Question Answering | Gemma-2B-IT 5% forget set | ROUGE-L 92.4 | 8 |
| Direct Question Answering | Gemma-2B-IT 5% forget set | ROUGE-L 47.1 | 8 |
| Adversarial Attack | Gemma 27B-it 3 | Attack Success Rate (ASR) 10 | 8 |
| Transferable Adversarial Attack | Gemma 27B-it 3 | ASR (%) 30.2 | 8 |
| Neuron Description | Gemma 2 | Faithfulness 47 | 8 |
| Output-based feature description evaluation | Gemma-2 MLP SAE features | Score 49.9 | 8 |
| Output-based feature description evaluation | Gemma-2 Residual SAE features | Score 66.9 | 8 |
| Watermark Detection Robustness | Gemma-2 2B Pre-trained (PT) (test) | TPR (None) 100 | 7 |
| Watermarked text generation and detection | Gemma-2 9B Pre-trained | TPR 100 | 7 |