| Dataset Name | SOTA Method | Metric | Trend | | |
|---|---|---|---|---|---|
| MT-Bench | Qwen3-32B REAL (ours) | Pearson's r 0.689 | 36 | 1mo ago | |
| FLASK | Qwen3-32B REAL (ours) | Pearson's r 0.589 | 36 | 1mo ago | |
| FB Bench (Feedback Bench) | Qwen3-8B TRACT | Pearson's r 0.949 | 36 | 1mo ago | |
| JudgeBench (test) | Skywork-Reward-V2-Llama-3.1-8B-40M | Score 83.4 | 22 | 1mo ago | |
| Average Across FB Bench, FLASK, Vic. Bench, MT Bench | Qwen3-32B REAL (ours) | Pearson (r) 71 | 20 | 1mo ago | |
| Vicuna Benchmark | Qwen3-32B REAL (ours) | Pearson Correlation (r) 65.1 | 20 | 1mo ago | |
| Vicuna Bench | TRACT | Pearson Correlation (r) 0.605 | 16 | 1mo ago | |
| 100 Romanian synthetic prompts (test) | | Fluency 4.71 | 7 | 1mo ago | |