Tucano 2 Cool: Better Open Source LLMs for Portuguese
About
We present Tucano 2, a fully open suite of large language models (LLMs) with 0.5 to 3.7 billion parameters, designed to address gaps in open-source development for Portuguese LLMs. Building on our previous work, we extend our dataset, GigaVerbo-v2, to a new degree of quality and scale. We also introduce a new synthetic dataset, GigaVerbo-v2 Synth, aimed at filling gaps in GigaVerbo-v2, and two post-training datasets, GigaVerbo-v2 SFT and GigaVerbo-v2 Preferences, that allow Portuguese LLMs to be trained in domains such as retrieval-augmented generation, coding, tool use, and chain-of-thought reasoning, among others. Through extensive ablation studies, we design both pretraining and continual pretraining recipes for the Tucano 2 suite (Base, Instruct, and Think), which achieve state-of-the-art performance on several Portuguese-language modeling benchmarks. We also extend and refine the evaluation harness introduced in our earlier work, yielding a comprehensive evaluation suite that provides strong signals across different pretraining, continual pretraining, and post-training regimes. All artifacts associated with Tucano 2 are openly released, including training recipes, logs, and source code, ensuring that our work is reproducible, accessible, and extendable by the broader Portuguese NLP community.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | Portuguese evaluation suite (test) | NPM | 20.63 | 27 |
| Language Modeling | Portuguese Evaluation Suite Hard Set | NPM | 0.99 | 15 |
| Language Modeling | Portuguese Evaluation Suite Total | NPM | 20.64 | 15 |
| Language Modeling | Portuguese Evaluation Suite Easy Set | NPM | 39.93 | 15 |
| General Language Capability | Aggregate K&R, IFEval-PT, HumanEval | Average Score | 53.64 | 14 |
| Knowledge & Reasoning | ARC-Challenge, ENEM, BLUEX, OAB Exams, BELEBELE, MMLU, GSM8K-PT | K&R Score (NPM) | 56.22 | 14 |
| Coding | HumanEval | Coding Score | 47.56 | 14 |
| Instruction Following | IFEval-PT | Instruction Score | 41.67 | 14 |
| Long-context reasoning and retrieval | RULER-PT (aggregate) | RULER-PT (Aggregate) Score @ 1024 Context | 81.7 | 9 |
| Natural Language Understanding | Portuguese Benchmarks Easy Set | NPM | 40.28 | 8 |