Data Contamination Detection on AIME 2024
[Chart: top F1 Score on this benchmark. Self-Critique, F1 Score 76, Oct 10, 2025. Selectable metrics: F1 Score, AUC.]
Evaluation Results
Method          Target Model    Date      F1 Score   AUC
Self-Critique   DeepSeek-...    2025.10   76         67
Min-K%++        Qwen2.5-7...    2025.10   73         58
Entropy-Temp    Qwen2.5-7...    2025.10   73         64
Entropy-Noise   Qwen2.5-7...    2025.10   70         57
Entropy-Noise   DeepSeek-...    2025.10   70         56
Self-Critique   Qwen2.5-7...    2025.10   69         72
Min-K%++        DeepSeek-...    2025.10   67         53
Recall          Qwen2.5-7...    2025.10   62         61
Entropy-Temp    DeepSeek-...    2025.10   60         48
Min-K%          Qwen2.5-7...    2025.10   59         49
Min-K%          DeepSeek-...    2025.10   55         47
Recall          DeepSeek-...    2025.10   54         46
CDD             Qwen2.5-7...    2025.10   50         57
PPL             DeepSeek-...    2025.10   42         53
PPL             Qwen2.5-7...    2025.10   33         51
CDD             DeepSeek-...    2025.10   30         49
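
The two reported metrics can be illustrated with a minimal sketch. It assumes contamination detection is scored as binary classification (1 = contaminated, 0 = clean) over per-example detector scores, and that the leaderboard values are these metrics multiplied by 100; neither assumption is confirmed by the page. The function names, the 0.5 threshold, and the toy data are hypothetical and are not results from this leaderboard.

```python
from typing import Sequence


def f1_score(labels: Sequence[int], preds: Sequence[int]) -> float:
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half (pairwise formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the leaderboard.
    labels = [1, 1, 1, 0, 0, 0]
    scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
    preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold
    print(f"F1  = {100 * f1_score(labels, preds):.1f}")
    print(f"AUC = {100 * auc(labels, scores):.1f}")
```

F1 depends on the chosen decision threshold while AUC does not, which is one reason a method can rank differently on the two columns (for example, Self-Critique on Qwen2.5-7... has a lower F1 Score but a higher AUC than on DeepSeek-...).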