
LLM-as-a-Judge Robustness on Sage (Hard)

Factuality (IPI): 55.9

Llama-3.1-8B-Instruct

[Chart: Factuality (IPI); y-axis ticks 22.724, 31.337, 39.95, 48.563; x-axis: Dec 17, 2025]
Updated 4d ago

Evaluation Results

Method  Links
55.9  8.507  49.6  7.6    65.3  10.254  58.1  9.246  60.8  9.265   57.3  8.869
54.6  8.261  47.8  7.276  45.2  7.803   35.3  5.755  66.1  9.974   50.2  7.865
52.2  8.442  50.7  7.985  29.2  5.213   35.2  6.171  61    10.161  46.4  7.687
51.3  8.63   58.5  9.171  21.4  3.893   43.8  7.235  52.3  9.47    46    7.756
45.5  7.146  45.5  7.06   49    8.286   40.3  6.331  53.3  8.492   46.7  7.446
43.4  6.594  37.6  5.764  45.1  7.417   33.6  5.919  48    7.282   41.5  6.571
2025.12
43.3  6.563  57.1  8.619  28.7  4.85    28.6  4.431  58.3  8.764   43.6  6.7
42.4  6.458  39.3  6.059  33.7  5.625   37.9  6.065  53    7.968   41.3  6.435
41    6.397  45.1  7.067  27.6  5.032   39.8  6.404  43.4  6.828   39.6  6.362
37.6  5.734  40.5  6.215  37.6  6.254   34.6  5.653  43.3  6.573   38.8  6.078
34.7  5.326  35.7  5.585  25.6  4.404   26.8  4.124  42    6.409   33.1  5.183
26.5  4.175  32    5.007  25.7  4.457   22.8  3.732  26.7  4.331   26.9  4.35
24    3.769  29.1  4.586  19.3  3.661   26.8  4.439  25    3.903   25    4.079