
Counterfactual Eval

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
Counterfactual Reasoning | Counterfactual Eval (dev) | Mean Score: 63.4 | 52
Logical and Mathematical Reasoning under Counterfactuals | Counterfactual Eval Manual Initialization 5 random samples 1.0 (train and dev) | Arithmetic Base 8 (Mean): 32 | 4