
Model Performance Evaluation on Table 1 Aggregate excluding Human-Internal

Top average score: 79.74 (LMUNIT LLaMA3.1-70B-Decomposed-Weighted)

Updated 4mo ago

Evaluation Results

Method	Date	Average Score
LMUNIT LLaMA3.1-70B-Decomposed-Weighted	2024.12	79.74
LMUNIT LLaMA3.1-70B	2024.12	79.29
LMUNIT LLaMA3.1-70B-Decomposed	2024.12	78.78
SFR-LLaMA-3.1-70B-Judge	2024.12	78.26
GPT-4o	2024.12	77.59
Claude-3.5 Sonnet	2024.12	76.43
LMUNIT LLaMA3.1-8B	2024.12	74.1
SFR-LLaMA-3.1-8B-Judge	2024.12	72.27
Prometheus-2-8x7B	2024.12	68.52
Prometheus-2-BGB-8x7B	2024.12	59.74
Prometheus-2-7B	2024.12	57.98
Llama-3-OffsetBias-8B	2024.12	53.85