Legitimate Task Completion on Skill-Inject 100 sandbox
[Chart: TSR by date; leading entry Deepseek-V4-Flash+OC at 88 TSR, Jun 1, 2026]
Evaluation Results
Method                  Condition     Date      TSR
Deepseek-V4-Flash+OC    Vanilla       2026.06   88
Sonnet-4.5+CC           Static        2026.06   87
Deepseek-V4-Flash+OC    Dynamic       2026.06   83.8
Sonnet-4.5+CC           SysTargeted   2026.06   83
Sonnet-4.5+CC           Dynamic       2026.06   82.8
Sonnet-4.5+CC           SysGeneric    2026.06   81
Deepseek-V4-Flash+OC    Static        2026.06   81
Sonnet-4.5+CC           Vanilla       2026.06   80
Nemotron3-Super+OC      Vanilla       2026.06   72
Nemotron3-Super+OC      Static        2026.06   67
Nemotron3-Super+OC      Dynamic       2026.06   66