Loong

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Long-context evaluation (Financial) | Loong Fin | Fin Judge Score 58.8 | 13 |
| Overall | Loong Set 4: 200K–250K Tokens | LLM Score 54.62 | 12 |
| Chain-of-reasoning | Loong Set 4: 200K–250K Tokens | LLM Score 36.17 | 12 |
| Clustering | Loong Set 4: 200K–250K Tokens | LLM Score 57.53 | 12 |
| Comparison | Loong Set 4: 200K–250K Tokens | LLM Score 55.8 | 12 |
| Spotting | Loong Set 4: 200K–250K Tokens | LLM Score 57.74 | 12 |
| Overall | Loong Set 3: 100K–200K Tokens | LLM Score 58.86 | 12 |
| Chain-of-reasoning | Loong Set 3: 100K–200K Tokens | LLM Score 0.5217 | 12 |
| Clustering | Loong Set 3: 100K–200K Tokens | LLM Score 58.85 | 12 |
| Comparison | Loong Set 3: 100K–200K Tokens | LLM Score 57.84 | 12 |
| Spotting | Loong Set 3: 100K–200K Tokens | LLM Score 0.6862 | 12 |
| Overall | Loong Set 2: 50K–100K Tokens | LLM Score 0.6361 | 12 |
| Chain-of-reasoning | Loong Set 2: 50K–100K Tokens | LLM Score 58.23 | 12 |
| Clustering | Loong Set 2: 50K–100K Tokens | LLM Score 61.67 | 12 |
| Comparison | Loong Set 2: 50K–100K Tokens | LLM Score 64.34 | 12 |
| Spotting | Loong Set 2: 50K–100K Tokens | LLM Score 69.92 | 12 |
| Overall | Loong Set 1: 10K–50K Tokens | LLM Score 71 | 12 |
| Chain-of-reasoning | Loong Set 1: 10K–50K Tokens | LLM Score 70.31 | 12 |
| Clustering | Loong Set 1: 10K–50K Tokens | LLM Score 0.6536 | 12 |
| Comparison | Loong Set 1: 10K–50K Tokens | LLM Score 75.65 | 12 |
| Spotting | Loong Set 1: 10K–50K Tokens | LLM Score 0.766 | 12 |