
FitNets: Hints for Thin Deep Nets

About

While depth tends to improve network performance, it also makes gradient-based training more difficult, since deeper networks tend to be more non-linear. The recently proposed knowledge distillation approach aims to obtain small, fast-to-execute models, and it has shown that a student network can imitate the soft output of a larger teacher network or ensemble of networks. In this paper, we extend this idea to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student. Because the student's intermediate hidden layer will generally be smaller than the teacher's, additional parameters are introduced to map the student hidden layer to the prediction of the teacher hidden layer. This allows one to train deeper students that can generalize better or run faster, a trade-off controlled by the chosen student capacity. For example, on CIFAR-10, a deep student network with almost 10.4 times fewer parameters outperforms a larger, state-of-the-art teacher network.
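The hint idea above can be sketched numerically. In the sketch below, a learned regressor bridges the dimension gap between the thinner student's guided layer and the teacher's hint layer, and gradient descent minimizes the L2 distance between them. All sizes, variable names, and the use of a plain linear regressor are illustrative assumptions; the paper uses a convolutional regressor for spatial feature maps and also trains the student's own weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the student's guided layer is thinner than the
# teacher's hint layer, so a regressor must bridge the dimensions.
d_student, d_teacher, batch = 32, 64, 8

student_feat = rng.standard_normal((batch, d_student))  # student guided-layer output
teacher_feat = rng.standard_normal((batch, d_teacher))  # teacher hint-layer output

# Regressor parameters (a simple linear map here, for illustration only).
W = rng.standard_normal((d_student, d_teacher)) * 0.01

def hint_loss(W):
    # 1/2 * ||teacher_feat - student_feat @ W||^2, averaged over the batch
    pred = student_feat @ W
    return 0.5 * np.mean(np.sum((teacher_feat - pred) ** 2, axis=1))

def hint_grad(W):
    # Gradient of the hint loss w.r.t. the regressor weights
    pred = student_feat @ W
    return -student_feat.T @ (teacher_feat - pred) / batch

lr = 0.05
losses = [hint_loss(W)]
for _ in range(200):
    W -= lr * hint_grad(W)
    losses.append(hint_loss(W))

print(f"hint loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

In the full method this hint term is only a pre-training stage; the student is afterwards trained with the usual distillation loss on the teacher's soft outputs.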

Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, Yoshua Bengio • 2014

Related benchmarks

Task                  Dataset                 Result                 Rank
Image Classification  CIFAR-100 (test)        Accuracy 73.73         3518
Image Classification  CIFAR-10 (test)         --                     3381
Object Detection      COCO 2017 (val)         AP 39.9                2454
Image Classification  ImageNet-1K 1.0 (val)   Top-1 Accuracy 70.44   1866
Image Classification  ImageNet (val)          --                     1206
3D Object Detection   nuScenes (val)          NDS 57.97              941
Image Classification  CIFAR-10 (test)         Accuracy 88.57         906
Image Classification  MNIST (test)            --                     882
Image Classification  ImageNet-1K             Top-1 Acc 71.75        836
Object Detection      PASCAL VOC 2007 (test)  mAP 57                 821

Showing 10 of 98 rows
