A Framework to Learn with Interpretation
About
To tackle interpretability in deep learning, we present a novel framework to jointly learn a predictive model and its associated interpretation model. The interpreter provides both local and global interpretability about the predictive model in terms of human-understandable high-level attribute functions, with minimal loss of accuracy. This is achieved by a dedicated architecture and well-chosen regularization penalties. We seek a small-size dictionary of high-level attribute functions that take as inputs the outputs of selected hidden layers and whose outputs feed a linear classifier. We impose strong conciseness on the attribute activations with an entropy-based criterion while enforcing fidelity to both the inputs and the outputs of the predictive model. A detailed pipeline to visualize the learnt features is also developed. Moreover, besides generating interpretable models by design, our approach can be specialized to provide post-hoc interpretations for a pre-trained neural network. We validate our approach against several state-of-the-art methods on multiple datasets and show its efficacy on both kinds of tasks.
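The sketch below illustrates, in PyTorch, how such a joint predictor/interpreter setup can be wired up: a predictor exposing selected hidden layers, a small dictionary of attribute functions feeding a linear classifier, an entropy-based conciseness penalty on the attribute activations, and an output-fidelity term. All names, layer sizes, and loss weights (`Predictor`, `Interpreter`, `num_attributes`, `beta`, `gamma`) are illustrative assumptions, not the paper's actual implementation; the input-fidelity term (a decoder reconstructing the input from the attributes) is omitted for brevity.

```python
# Minimal sketch of a jointly trained predictor + interpreter.
# Assumes CIFAR-like inputs of shape (B, 3, 32, 32); all hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Predictor(nn.Module):
    """Predictive model f; exposes selected hidden layers to the interpreter."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.block2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.head = nn.Linear(64 * 8 * 8, num_classes)

    def forward(self, x):
        h1 = self.block1(x)                       # selected hidden layer 1
        h2 = self.block2(h1)                      # selected hidden layer 2
        return self.head(h2.flatten(1)), (h1, h2)

class Interpreter(nn.Module):
    """Small dictionary of attribute functions over the selected hidden layers,
    followed by a linear classifier over the attribute activations."""
    def __init__(self, num_attributes=24, num_classes=10):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)       # fixed-size summary of each layer
        self.attributes = nn.Sequential(nn.Linear(32 + 64, num_attributes), nn.ReLU())
        self.classifier = nn.Linear(num_attributes, num_classes)

    def forward(self, hidden):
        z = torch.cat([self.pool(h).flatten(1) for h in hidden], dim=1)
        phi = self.attributes(z)                  # non-negative attribute activations
        return self.classifier(phi), phi

def conciseness_loss(phi, eps=1e-8):
    """Entropy-based conciseness: pushes each sample's attribute activations
    toward a sparse (low-entropy) distribution over the dictionary."""
    p = phi / (phi.sum(dim=1, keepdim=True) + eps)
    return -(p * torch.log(p + eps)).sum(dim=1).mean()

def training_step(predictor, interpreter, x, y, beta=0.1, gamma=0.1):
    logits_f, hidden = predictor(x)
    logits_g, phi = interpreter(hidden)
    loss_pred = F.cross_entropy(logits_f, y)      # predictive accuracy
    # Output fidelity: the interpreter's prediction should match the
    # predictor's soft output (both models are trained jointly).
    loss_fid = F.kl_div(F.log_softmax(logits_g, dim=1),
                        F.softmax(logits_f, dim=1),
                        reduction="batchmean")
    loss_ent = conciseness_loss(phi)              # conciseness of activations
    return loss_pred + beta * loss_fid + gamma * loss_ent
```

Keeping the final classifier linear over a concise set of attribute activations is what makes each prediction directly readable: the class score decomposes into a few attribute contributions, which is the property the visualization pipeline then exploits.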
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | CIFAR-10 (test) | Accuracy | 79.6 | 3381 |
| Image Classification | MNIST (test) | Accuracy | 99.4 | 882 |
| Image Classification | SVHN (test) | Accuracy | 90.8 | 362 |
| Image Classification | F-MNIST (test) | Accuracy | 91.5 | 64 |
| Environmental Sound Classification | ESC-50 (test) | Top-1 Fidelity | 73.5 | 14 |
| Image Classification | QuickDraw (test) | Accuracy | 82.6 | 5 |
| Multi-Label Urban Sound Tagging | SONYC-UST | Macro AUPRC | 81.6 | 4 |