| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| Counterfactual Eval (dev) | PE2 | Mean Score | 63.4 | 52 | 1mo ago |
| CVQA | | Accuracy | 71.41 | 40 | 1mo ago |
| MMLU-CF | GHS-TDA | EM | 71.6 | 30 | 1mo ago |
| CounterBench | | Basic Score | 80.8 | 20 | 4d ago |
| UCI Adult missing values Agent (test) | SHAP | Accuracy | 100 | 16 | 1mo ago |
| Counterfactual reasoning Agent synthetic (test) | LIME | Accuracy | 99.7 | 16 | 1mo ago |
| CRASS | GPT-4 | Exact Match Performance | 94.53 | 11 | 1mo ago |
| CRAFT Hard Split (test) | CRCG_GPT4 | Accuracy | 83.64 | 8 | 1mo ago |
| CRAFT Easy Split (test) | BERT-D | Accuracy | 80.05 | 8 | 1mo ago |
| Y-struct NADD | JANUS | MSE | 16,515 | 5 | 1mo ago |
| Diamond NADD | JANUS | MSE | 599 | 5 | 1mo ago |
| Triangle NADD | JANUS | MSE | 150 | 5 | 1mo ago |
| Chain NADD | JANUS | MSE | 48.8 | 5 | 1mo ago |
| OmniDrive | Omni-L | Safe Precision | 72.1 | 4 | 1mo ago |
| C-VQA | ViperGPT | Numerical Direct Accuracy | 80.6 | 4 | 1mo ago |
| UCI Adult missing values Human Survey (test) | - | - | - | 0 | 1mo ago |
| Counterfactual reasoning Human Amazon Mechanical Turk (test) | - | - | - | 0 | 1mo ago |