| Dataset Name | SOTA Method | Metric | Trend | Updated |
|---|---|---|---|---|
| Counterfactual Eval (dev) | PE2 | Mean Score 63.4 | 52 | 4d ago |
| CVQA | | Accuracy 71.41 | 40 | 2d ago |
| MMLU-CF | GHS-TDA | EM 71.6 | 30 | 4d ago |
| UCI Adult missing values Agent (test) | SHAP | Accuracy 100 | 16 | 4d ago |
| Counterfactual reasoning Agent synthetic (test) | LIME | Accuracy 99.7 | 16 | 4d ago |
| CRASS | GPT-4 | Exact Match Performance 94.53 | 11 | 4d ago |
| CRAFT Hard Split (test) | CRCG_GPT4 | Accuracy 83.64 | 8 | 4d ago |
| CRAFT Easy Split (test) | BERT-D | Accuracy 80.05 | 8 | 4d ago |
| OmniDrive | Omni-L | Safe Precision 72.1 | 4 | 4d ago |
| C-VQA | ViperGPT | Numerical Direct Accuracy 80.6 | 4 | 4d ago |
| UCI Adult missing values Human Survey (test) | - | - | 0 | 4d ago |
| Counterfactual reasoning Human Amazon Mechanical Turk (test) | - | - | 0 | 4d ago |