Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs

About

Fine-tuning pretrained LLMs has been shown to be an effective strategy for reaching state-of-the-art performance on specific tasks like machine translation. However, this process of adaptation often implies sacrificing general-purpose capabilities, such as conversational reasoning and instruction-following, hampering the utility of the system in real-world applications that require a mixture of skills. In this paper, we introduce Tower+, a suite of models designed to deliver strong performance across both translation and multilingual general-purpose text capabilities. We achieve a Pareto frontier between translation specialization and multilingual general-purpose capabilities by introducing a novel training recipe that builds on Tower (Alves et al., 2024), comprising continued pretraining, supervised fine-tuning, preference optimization, and reinforcement learning with verifiable rewards. At each stage of training, we carefully generate and curate data to strengthen performance on translation as well as general-purpose tasks involving code generation, mathematics problem solving, and general instruction-following. We develop models at multiple scales: 2B, 9B, and 72B. Our smaller models often outperform larger general-purpose open-weight and proprietary LLMs (e.g., Llama 3.3 70B, GPT-4o). Our largest model delivers best-in-class translation performance for high-resource languages and top results in multilingual Arena Hard evaluations and in IF-MT, a benchmark we introduce for evaluating both translation and instruction-following. Our findings highlight that it is possible to rival frontier models in general capabilities, while optimizing for specific business domains, such as translation and localization.
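The abstract mentions reinforcement learning with verifiable rewards as one training stage. As a toy illustration of the general idea (this is not the paper's actual reward design; the checks and function name below are hypothetical), a "verifiable" reward replaces a learned judge with an automatic, deterministic check of the model's output:

```python
# Toy sketch of a verifiable reward for instruction-following.
# The constraint set (required terms, word limit) is hypothetical,
# chosen only to illustrate rule-based, judge-free scoring.
import re


def instruction_following_reward(output: str, must_include: list[str],
                                 max_words: int) -> float:
    """Return 1.0 only if every required term appears in the output and
    the word-count limit holds; otherwise 0.0. Binary checks like these
    are 'verifiable' because they need no learned reward model."""
    words = re.findall(r"\w+", output.lower())
    if len(words) > max_words:
        return 0.0
    lowered = output.lower()
    return 1.0 if all(term.lower() in lowered for term in must_include) else 0.0
```

Rewards of this shape can be computed exactly during RL training, which avoids the reward-hacking risks of an imperfect learned scorer.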

Ricardo Rei, Nuno M. Guerreiro, José Pombal, João Alves, Pedro Teixeirinha, Amin Farajian, André F. T. Martins • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Machine Translation | FLORES+ (test) | spBLEU | 45.32 | 128 |
| Machine Translation | WMT24++ v1.0 (test) | XCOMET | 88.19 | 49 |
| Machine Translation (xx → zh) | FLORES+ latest (test) | spBLEU | 33.25 | 30 |
| Machine Translation | WMT 2025 (test) | XCOMET-XXL | 41 | 17 |
| Machine Translation | FLORES-200 EN ⇔ XX (2022) | XCOMET-XXL | 84.16 | 17 |
| Machine Translation | FLORES-200 ZH ⇔ XX (2022) | XCOMET-XXL | 0.7969 | 17 |
| Machine Translation | FLORES-200 XX ⇔ XX (2022) | XCOMET-XXL | 70.02 | 17 |
| Machine Translation | Mandarin ⇔ Minority (test) | XCOMET-XXL | 0.3855 | 16 |
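The spBLEU scores above are BLEU computed over SentencePiece-tokenized text (in practice produced with sacreBLEU and the FLORES tokenizer). As a reference for what the metric measures, here is a minimal sketch of the underlying corpus-level BLEU computation, assuming already-tokenized, whitespace-separated input; the example sentences are hypothetical, and real benchmark numbers should come from sacreBLEU:

```python
# Minimal corpus BLEU sketch: clipped n-gram precision (orders 1..4),
# geometric mean, and brevity penalty. spBLEU applies the same formula
# after SentencePiece tokenization.
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams of order n."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            matches[n - 1] += sum((ngrams(h, n) & ngrams(r, n)).values())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0:  # any zero precision -> score is 0
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = min(1.0, math.exp(1 - ref_len / hyp_len))
    return 100.0 * brevity * math.exp(log_prec)
```

Note that BLEU variants differ mainly in tokenization and smoothing, which is why comparable numbers require a fixed implementation such as sacreBLEU.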
