The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers

About

Recently, many datasets have been proposed to test the systematic generalization ability of neural networks. The companion baseline Transformers, typically trained with default hyper-parameters from standard tasks, are shown to fail dramatically. Here we demonstrate that by revisiting model configurations as basic as scaling of embeddings, early stopping, relative positional embedding, and Universal Transformer variants, we can drastically improve the performance of Transformers on systematic generalization. We report improvements on five popular datasets: SCAN, CFQ, PCFG, COGS, and the Mathematics dataset. Our models improve accuracy from 50% to 85% on the PCFG productivity split, and from 35% to 81% on COGS. On SCAN, relative positional embedding largely mitigates the EOS decision problem (Newman et al., 2020), yielding 100% accuracy on the length split with a cutoff at 26. Importantly, performance differences between these models are typically invisible on the IID data split. This calls for proper generalization validation sets for developing neural networks that generalize systematically. We publicly release the code to reproduce our results.
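The point about IID validation hiding differences between models can be illustrated with a small checkpoint-selection sketch. This is not the authors' code: the `Checkpoint` record, the `select_checkpoint` helper, and all accuracy values below are hypothetical, chosen only to show how selecting on an IID split versus a generalization split can pick different checkpoints.

```python
from typing import List, NamedTuple


class Checkpoint(NamedTuple):
    """Hypothetical record of one evaluation point during training."""
    step: int
    iid_acc: float  # accuracy on the IID validation split
    gen_acc: float  # accuracy on a held-out generalization validation split


def select_checkpoint(history: List[Checkpoint], metric: str) -> Checkpoint:
    """Return the checkpoint with the highest value of `metric`.

    On ties, max() keeps the earliest entry, which mimics stopping
    as soon as the monitored metric stops improving.
    """
    return max(history, key=lambda c: getattr(c, metric))


# Made-up numbers for illustration only; these are not results from the paper.
history = [
    Checkpoint(step=10_000, iid_acc=1.00, gen_acc=0.40),
    Checkpoint(step=20_000, iid_acc=1.00, gen_acc=0.55),
    Checkpoint(step=30_000, iid_acc=1.00, gen_acc=0.70),
]

by_iid = select_checkpoint(history, "iid_acc")  # IID accuracy is already saturated
by_gen = select_checkpoint(history, "gen_acc")  # generalization accuracy still rising

print("selected by IID validation:", by_iid.step)
print("selected by generalization validation:", by_gen.step)
```

In this toy setting the IID metric cannot tell the three checkpoints apart, so selection on it stops at the first one, while selection on the generalization split keeps the later, better-generalizing checkpoint.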

Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber • 2021

Related benchmarks

Task	Dataset	Result
Semantic Parsing	GeoQuery (i.i.d.)	Exact Match Accuracy: 87	32
Semantic Parsing	GeoQuery (TMCD)	Exact Match Acc: 37	12
Semantic Parsing	ATIS iid	Accuracy: 75.98	7
Semantic Parsing	ATIS length split	Accuracy: 4.95	7
Active↔Logical	Active↔Logical (IID)	Accuracy: 100	6
Semantic Parsing	SCAN (IID)	Accuracy: 100	6
Semantic Parsing	SCAN Template	Accuracy: 100	6
Semantic Parsing	SCAN (length)	Accuracy: 19	6
Active↔Logical	Active↔Logical (Structural)	Accuracy: 0.12	6
Semantic Parsing	SCAN 1-shot lexical	Accuracy: 11	6

Showing 10 of 17 rows
