Reasoning Generalization on Out-of-Distribution Avg
[Chart: Avg Score (OOD) by date. Top entry: TRAPO, 59.7 (Dec 15, 2025).]
Evaluation Results
Method                  Training Paradigm  Date     Avg Score (OOD)
----------------------  -----------------  -------  ---------------
TRAPO                   Semi...            2025.12  59.7
Fully Supervised        Supe...            2025.12  57.3
Fully Supervised        Supe...            2025.12  56.7
TRAPO                   Semi...            2025.12  56.1
Sentence-level Entropy  Semi...            2025.12  52.6
TTRL                    Unsu...            2025.12  52.4
Fully Supervised        Supe...            2025.12  52.1
Sentence-level Entropy  Unsu...            2025.12  51.5
TTRL                    Semi...            2025.12  50.2
Token-level Entropy     Unsu...            2025.12  49.9
Token-level Entropy     Semi...            2025.12  49.7
Self-certainty          Unsu...            2025.12  48.4
Self-certainty          Semi...            2025.12  45.6
Qwen-Instruct           Orig...            2025.12  43
Qwen-Base               Orig...            2025.12  15.4