
Multi-task model alignment and mixing on Math, Chat, instruction-following (IF), and General QA tasks: Llama-3.1-8B (test)

Top Math Accuracy: 36 (Mod. Surgery)


Evaluation Results

Method	Date	Math Accuracy				
Mod. Surgery	2026.02	36	30.5	30	33.1	32.6
Global Surgery	2026.02	35.8	24.2	31.1	30.3	30.3
Naïve Mixing	2026.02	35	22.9	25.4	25.2	27.3
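To compare each method against the Naïve Mixing baseline, the table rows can be loaded and differenced column by column. This is a minimal sketch: the scores are copied from the table, only the first score column is labeled (Math Accuracy), so the other columns are addressed by position, and `gains_over` is a hypothetical helper written for illustration.

```python
# Scores copied from the table (Llama-3.1-8B, test).
# Only the first score column is labeled (Math Accuracy);
# the remaining columns are referred to by position.
SCORES = {
    "Mod. Surgery": [36.0, 30.5, 30.0, 33.1, 32.6],
    "Global Surgery": [35.8, 24.2, 31.1, 30.3, 30.3],
    "Naïve Mixing": [35.0, 22.9, 25.4, 25.2, 27.3],
}


def gains_over(baseline: str, scores: dict[str, list[float]]) -> dict[str, list[float]]:
    """Hypothetical helper: per-column score difference of each method
    relative to `baseline`, rounded to one decimal place."""
    base = scores[baseline]
    return {
        name: [round(s - b, 1) for s, b in zip(row, base)]
        for name, row in scores.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, deltas in gains_over("Naïve Mixing", SCORES).items():
        print(name, deltas)
```

Running it shows that Mod. Surgery scores higher than Naïve Mixing in every column (by 1.0 point on Math Accuracy), while Global Surgery has the highest score of the three methods in the third column.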