| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Triton Kernel Generation | Synthetic Benchmark Overall All Levels | Average Speedup 1.57 | 7 |
| Triton Kernel Generation | Synthetic Benchmark Level 20 | Accuracy 99 | 7 |
| Triton Kernel Generation | Synthetic Benchmark (Level 5) | Acc 99 | 7 |
| Triton Kernel Generation | Synthetic Benchmark Level 2 | Accuracy 96 | 7 |
| Triton Kernel Generation | Synthetic Benchmark Level 1 | Accuracy 86.8 | 7 |
| Shortest Path | synthetic benchmark | Accuracy 95 | 7 |
| Edge Existence | Synthetic Benchmark | Accuracy 99.7 | 7 |
| Node Degree | Synthetic Benchmark | Accuracy 99.75 | 7 |
| Triangle Count | synthetic benchmark | Accuracy 74.35 | 7 |
| Cycle Check | synthetic benchmark | Accuracy 99.9 | 7 |
| Edge Count | Synthetic Benchmark | Accuracy 94.95 | 7 |
| Node Count | synthetic benchmark 1.0 (test) | Accuracy 100 | 7 |
| Feature Attribution | Synthetic benchmark softplus aggregator nonlinear f (test) | MAE 0.365 | 6 |
| Dynamic causal graph tracking | Synthetic benchmark semi-synthetic health data (test) | Direction Accuracy 91 | 6 |
| Learning to Defer | Synthetic benchmark (test) | Test True Risk 28.1 | 6 |
| CATE estimation | Synthetic Benchmark range do(D) ∈ [-2.5, 2.5] (in-sample) | RMSE 0.36 | 5 |
| Regression | Synthetic benchmark with planted ground truth N=1,000, d=8 (test) | R² 0.961 | 5 |
| Cluster Validity Index Evaluation | 10 Synthetic Benchmark Datasets varying d from 10 to 500 | Mean SCOPE 96.3 | 5 |
| Online Bayesian calibration | Synthetic benchmark Mixed(3) | Theta RMSE 0.02 | 5 |
| Online Bayesian calibration | Synthetic benchmark Sudden(3) | RMSE (Theta) 0.018 | 5 |
| Online Bayesian calibration | Synthetic benchmark Drifting | RMSE ($\theta$) 0.014 | 5 |
| Bokeh Rendering | Synthetic Benchmark | RMSE 0.0133 | 5 |
| Domain Adaptation | Synthetic Benchmark | Geometry Score 58 | 4 |
| Generative model evaluation metric validation | Synthetic benchmark 2025 (test) | Metric - | 0 |