| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Reward Modeling | PPE-Preference | Accuracy 79.8 | 60 |
| Reward Modeling | PPE Correlation | Correlation 87.2 | 40 |
| Reward Modeling | PPE Correctness | Accuracy 67.3 | 33 |
| Preference Calibration | PPE | Kuiper 0.034 | 24 |
| Correctness Calibration | PPE (Preference Policy Evaluation) | Kuiper 0.017 | 24 |
| Reward Modeling | PPE-P | Accuracy 68.3 | 23 |
| Reward Modeling | PPE Pref | Accuracy 67.7 | 15 |
| Reward Modeling | PPE | Accuracy 76.4 | 13 |
| Reward Modeling | PPE Human | Accuracy 64.6 | 10 |
| Reward Modeling | PPE | PPE Human Preference 76.9 | 8 |
| Information Extraction | PPE 10-PDF subsample | F1 Score 62.69 | 6 |
| Information Extraction | PPE | Precision 52.5 | 6 |
| Scientific Information Extraction | PPE (full) | Precision 53.91 | 4 |
| Precision Prediction | PPE unseen pipeline configurations | MSE 0.0131 | 3 |
| Accuracy Prediction | PPE unseen pipeline configurations | MSE 0.007 | 3 |