S-Eval, ORFuzzSet, and NQ

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
Safety-Utility Trade-off Evaluation | S-Eval, ORFuzzSet, and NQ Aggregated | F1 Score: 86.81 | 72