
Mish: A Self Regularized Non-Monotonic Activation Function

About

We propose $\textit{Mish}$, a novel self-regularized non-monotonic activation function which can be mathematically defined as: $f(x)=x\tanh(softplus(x))$. As activation functions play a crucial role in the performance and training dynamics of neural networks, we validated Mish experimentally on several well-known benchmarks against the best combinations of architectures and activation functions. We also observe that data augmentation techniques have a favorable effect on benchmarks like ImageNet-1k and MS-COCO across multiple architectures. For example, Mish outperformed Leaky ReLU on YOLOv4 with a CSP-DarkNet-53 backbone on average precision ($AP_{50}^{val}$) by 2.1$\%$ in MS-COCO object detection, and outperformed ReLU on ResNet-50 on ImageNet-1k in Top-1 accuracy by $\approx$1$\%$, while keeping all other network parameters and hyperparameters constant. Furthermore, we explore the mathematical formulation of Mish in relation to the Swish family of functions and propose an intuitive understanding of how the first-derivative behavior may act as a regularizer that helps the optimization of deep neural networks. Code is publicly available at https://github.com/digantamisra98/Mish.
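The definition above, $f(x)=x\tanh(\mathrm{softplus}(x))$, is straightforward to sketch in plain Python. This is a minimal scalar illustration, not the paper's reference implementation; in practice a framework version (e.g. `torch.nn.Mish` in PyTorch) would be used on tensors.

```python
import math

def softplus(x: float) -> float:
    # Numerically stable softplus: log(1 + exp(x)),
    # rewritten to avoid overflow for large positive x.
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def mish(x: float) -> float:
    # Mish activation: f(x) = x * tanh(softplus(x))
    return x * math.tanh(softplus(x))
```

Like Swish, Mish is smooth and non-monotonic: it is unbounded above (`mish(x)` approaches `x` for large positive inputs), bounded below, and dips slightly negative for moderately negative inputs before approaching zero.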

Diganta Misra • 2019

Related benchmarks

Task                 | Dataset                            | Result           | Rank
Image Classification | Fashion MNIST (test)               | Accuracy 89.8    | 592
Language Modeling    | WikiText-103 (test)                | Perplexity 15.9  | 579
Image Classification | SVHN (test)                        | --               | 401
Image Classification | MNIST (test)                       | --               | 196
Image Classification | CIFAR100-LT (test)                 | --               | 45
Classification       | Evaluation Benchmark (aggregated)  | Accuracy 78.91   | 27
Image Classification | CIFAR-100 LT (500:1 ratio) (test)  | Loss 6.132       | 15
Image Classification | CIFAR-100 LT (50:1 ratio) (test)   | Loss 4.619       | 15
Image Classification | CIFAR-100-LT (100:1 ratio) (test)  | Loss 5.578       | 15
Image Classification | CIFAR-100 LT (10:1 ratio) (test)   | Loss 3.028       | 15

Showing 10 of 11 rows.
