Pantagruel: Unified Self-Supervised Encoders for French Text and Speech

About

We release Pantagruel models, a new family of self-supervised encoder models for French text and speech. Instead of predicting modality-tailored targets such as textual tokens or speech units, Pantagruel learns contextualized target representations in the feature space, allowing modality-specific encoders to capture linguistic and acoustic regularities more effectively. Separate models are pre-trained on large-scale French corpora, including Wikipedia, OSCAR and CroissantLLM for text, together with MultilingualLibriSpeech, LeBenchmark, and INA-100k for speech. INA-100k is a newly introduced 100,000-hour corpus of French audio derived from the archives of the Institut National de l'Audiovisuel (INA), the national repository of French radio and television broadcasts, providing highly diverse audio data. We evaluate Pantagruel across a broad range of downstream tasks spanning both modalities, including those from the standard French benchmarks such as FLUE or LeBenchmark. Across these tasks, Pantagruel models show competitive or superior performance compared to strong French baselines such as CamemBERT, FlauBERT, and LeBenchmark2.0, while maintaining a shared architecture that can seamlessly handle either speech or text inputs. These results confirm the effectiveness of feature-space self-supervised objectives for French representation learning and highlight Pantagruel as a robust foundation for multimodal speech-text understanding.
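The idea of predicting targets in feature space, rather than discrete tokens or speech units, can be illustrated with a small teacher-student sketch. This is a generic toy example and not Pantagruel's actual training code: the scalar "frames", the three-weight `encode` function, the zero mask value, and the names `ema_update` and `feature_space_loss` are all assumptions made for illustration. A teacher encodes the full input to produce contextualized targets, a student encodes a masked copy, and the loss is a regression error at the masked positions.

```python
def encode(weights, frames):
    """Toy stand-in for an encoder: each output mixes a frame with its two neighbours."""
    padded = [0.0] + list(frames) + [0.0]
    return [
        weights[0] * padded[i] + weights[1] * padded[i + 1] + weights[2] * padded[i + 2]
        for i in range(len(frames))
    ]


def feature_space_loss(student_w, teacher_w, frames, masked_positions):
    """Mean squared error between student predictions and teacher targets at masked positions."""
    # The teacher sees the unmasked input, so its outputs are contextualized targets.
    targets = encode(teacher_w, frames)
    # The student sees a corrupted input (masked frames replaced by 0.0, an assumed mask value).
    corrupted = [0.0 if i in masked_positions else x for i, x in enumerate(frames)]
    predictions = encode(student_w, corrupted)
    errors = [(predictions[i] - targets[i]) ** 2 for i in sorted(masked_positions)]
    return sum(errors) / len(errors)


def ema_update(teacher_w, student_w, decay=0.999):
    """Move each teacher weight toward the student weight (exponential moving average)."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher_w, student_w)]


if __name__ == "__main__":
    frames = [1.0, 2.0, 3.0, 4.0]
    weights = [0.25, 0.5, 0.25]
    print(feature_space_loss(weights, weights, frames, {1}))
```

In this sketch the same loss applies whether the frames come from text embeddings or audio features, which is the sense in which a feature-space objective is independent of the modality; the averaging of several teacher layers and the normalization used by real systems of this kind are omitted.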

Phuong-Hang Le, Valentin Pelloin, Arnault Chatelain, Maryem Bouziane, Mohammed Ghennai, Qianwen Guan, Kirill Milintsevich, Salima Mdhaffar, Aidan Mannion, Nils Defauw, Shuyue Gu, Alexandre Audibert, Marco Dinarelli, Yannick Estève, Lorraine Goeuriot, Steffen Lalande, Nicolas Hervé, Maximin Coavoux, François Portet, Étienne Ollion, Marie Candito, Maxime Peyrard, Solange Rossato, Benjamin Lecouteux, Aurélie Nardy, Gilles Sérasset, Vincent Segonne, Solène Evain, Diandra Fabre, Didier Schwab • 2026

Related benchmarks

Task	Dataset	Result
Spoken Language Understanding	MEDIA	SLU CER 10.5	14
Named Entity Recognition	PxCorpus	NER F1 87	14
Universal Dependency Parsing	German GSD v2.2 (test)	UPOS 98.4	12
Natural Language Understanding	FLUE	Classification F1 94.1	8
Part-of-Speech Tagging	CAS-POS	MacF1 96.8	8
Question Answering	FrMMCQA	Hamming Distance 22.4	8
Sentence Grouping	CAS-SG	WF1 75.7	8
Biomedical Text Processing	EMEA	Word F1 93.7	8
Part-of-Speech Tagging	ESSAI POS	MacF1 96.3	8
Biomedical Text Processing	MEDLINE	WF1 84.2	8

Showing 10 of 26 rows
