DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
About
Deep research agents perform multi-step research to produce long-form, well-attributed answers. However, most open deep research agents are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards, an approach that does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which rubrics are constructed and maintained to co-evolve with the policy model during training. This allows the rubrics to incorporate newly explored information from search and contrasting model responses, enabling better fact checking and more discriminative on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first fully open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare, and general domains, DR Tulu substantially outperforms existing open deep research agents (by 15.6% over Tongyi DR on average) and matches or exceeds proprietary deep research agents (by 0.7% over OpenAI DR on average), while being significantly smaller and cheaper (1000x cheaper than OpenAI DR per query).
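To make the idea of rubrics that co-evolve with the policy more concrete, the following is a toy sketch only, not the authors' implementation. It assumes rubrics can be modelled as named yes/no checks on a response, that new candidate rubrics arrive from some external source (here they are hard-coded), and that "discriminative" means a rubric's verdicts vary across the current batch of responses. The names `score`, `evolve`, and `max_size` are hypothetical and appear nowhere in the paper.

```python
from statistics import pvariance
from typing import Callable, Dict, List

# Assumption: a rubric is a named predicate over a response string.
Rubric = Callable[[str], bool]


def score(response: str, rubrics: Dict[str, Rubric]) -> float:
    """Fraction of rubric items the response satisfies (0.0 if no rubrics)."""
    if not rubrics:
        return 0.0
    return sum(check(response) for check in rubrics.values()) / len(rubrics)


def evolve(
    rubrics: Dict[str, Rubric],
    candidates: Dict[str, Rubric],
    responses: List[str],
    max_size: int = 4,
) -> Dict[str, Rubric]:
    """Merge candidate rubrics, then keep only those that separate responses.

    A rubric that every response passes (or every response fails) gives no
    training signal on this batch, so it is dropped.
    """
    pool = {**rubrics, **candidates}

    def spread(name: str) -> float:
        # Variance of the 0/1 verdicts across the current batch of responses.
        return pvariance([float(pool[name](r)) for r in responses])

    discriminative = [name for name in pool if spread(name) > 0.0]
    kept = sorted(discriminative, key=spread, reverse=True)[:max_size]
    return {name: pool[name] for name in kept}


# Hypothetical batch of on-policy responses to one prompt.
responses = [
    "Aspirin reduces fever [1].",
    "Aspirin reduces fever and thins blood [1][2].",
    "Aspirin is a drug.",
]
rubrics = {
    "mentions_aspirin": lambda r: "Aspirin" in r,
    "has_citation": lambda r: "[1]" in r,
}
# Hypothetical new rubric, e.g. derived from contrasting the responses above.
candidates = {"mentions_blood_thinning": lambda r: "thins blood" in r}

new_rubrics = evolve(rubrics, candidates, responses)
rewards = [score(r, new_rubrics) for r in responses]
```

In this toy batch, `mentions_aspirin` is satisfied by every response and is dropped, while the newly added `mentions_blood_thinning` check is kept because it separates the second response from the others. A real system would presumably use a language-model judge rather than string matching; that substitution is made here only to keep the sketch self-contained.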
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Deep Research Report Generation | DeepResearch Bench | Comprehensiveness | 41.7 | 81 |
| Reasoning | MMLU | Accuracy | 79 | 54 |
| Creative Writing | WildBench | WildBench Score | 36 | 49 |
| Long-form research | DRB | Score | 45.4 | 39 |
| Reasoning | MMLU-Pro | MMLU-Pro Reasoning Score | 71 | 36 |
| Research Idea Evaluation | ScholarIdeas-AI contribution rubrics | Coverage (mean) | 2.35 | 31 |
| Deep Research | ResearchQA | Score | 75.7 | 21 |
| Open-ended writing | WritingBench | Score | 74.49 | 20 |
| Deep Research | SQA v2 | Score | 88.3 | 18 |
| Long-form research | ResearchQA | Score | 74.3 | 18 |