Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations
About
Reproducibility is fundamental to the scientific method, yet it remains a critical challenge in machine learning. Contributing factors include underspecified execution details and brittle software environments. Human-centric remedies, such as checklists and manual verification, help but require intensive effort and fail to scale. To address this, we introduce Croissant Tasks: a declarative, machine-actionable metadata format that abstracts low-level implementation details into high-level specifications. This format enables conceptual reproducibility: verifying claims via independent, agent-generated implementations rather than brittle source code replication. We contribute: (1) the Croissant Tasks specification, which formally decouples a task's problem statement from its solution; (2) an automated LLM pipeline that retrofits existing benchmarks into this format; and (3) empirical validation showing that autonomous agents can ingest these specifications to generate functional, accurate reproduction pipelines from scratch. We envision this format as a new foundation for automated and conceptual reproducibility in machine learning.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Patch Grounding | MedSG-Bench n=1000 (VPG) | IoU 30.81 | 12 |
| Anomaly Localization | nova | mAP@30 35.86 | 3 |
| Control-Dependency / Trace extraction | CoRe Lite Control-Dependency Trace subtask n=489 | F1 Score 94.58 | 3 |
| Diagnostic Reasoning | nova | Top-1 Accuracy 24.2 | 3 |
| Image Description | nova | BLEU-4 1.83 | 3 |
| Safety Evaluation | SAGE-Eval 1.0 (test) | Model-level Safety Score 34.62 | 3 |
| Omission Detection | AbsenceBench (val) | -- | 3 |