Feature Fusion Vision Transformer for Fine-Grained Visual Categorization
About
The core for tackling the fine-grained visual categorization (FGVC) is to learn subtle yet discriminative features. Most previous works achieve this by explicitly selecting the discriminative parts or integrating the attention mechanism via CNN-based approaches. However, these methods enhance the computational complexity and make the model dominated by the regions containing the most of the objects. Recently, vision transformer (ViT) has achieved SOTA performance on general image recognition tasks. The self-attention mechanism aggregates and weights the information from all patches to the classification token, making it perfectly suitable for FGVC. Nonetheless, the classification token in the deep layer pays more attention to the global information, lacking the local and low-level features that are essential for FGVC. In this work, we propose a novel pure transformer-based framework Feature Fusion Vision Transformer (FFVT) where we aggregate the important tokens from each transformer layer to compensate the local, low-level and middle-level information. We design a novel token selection module called mutual attention weight selection (MAWS) to guide the network effectively and efficiently towards selecting discriminative tokens without introducing extra parameters. We verify the effectiveness of FFVT on three benchmarks where FFVT achieves the state-of-the-art performance.
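The abstract does not spell out how MAWS scores tokens, so the following is only a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes the "mutual" weight of a patch is the product of two normalized attention terms: how strongly the classification token attends to the patch, and how strongly the patch attends back to the classification token. The function names `mutual_attention_select` and `fuse_layers`, the head-averaged score matrix, and the choice of `k` are all hypothetical. Selection reuses the attention scores the layer already computes, which is consistent with the claim that no extra parameters are introduced.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mutual_attention_select(scores, tokens, k):
    """Pick the k patch tokens with the highest mutual attention weight.

    scores: (N+1, N+1) raw attention logits of one layer (assumed already
            averaged over heads); index 0 is the classification token.
    tokens: (N+1, D) hidden states of the same layer.
    """
    # Assumption: classification token -> patch weights, normalized over the row.
    cls_to_patch = softmax(scores, axis=1)[0, 1:]
    # Assumption: patch -> classification token weights, normalized over the column.
    patch_to_cls = softmax(scores, axis=0)[1:, 0]
    mutual = cls_to_patch * patch_to_cls
    # Indices of the k largest mutual weights, shifted by 1 to skip index 0.
    top = np.argsort(-mutual, kind="stable")[:k] + 1
    return top, tokens[top]

def fuse_layers(per_layer_scores, per_layer_tokens, k):
    # Hypothetical fusion step: keep the last layer's classification token
    # and append the k tokens selected from every layer.
    cls = per_layer_tokens[-1][:1]
    picked = [
        mutual_attention_select(s, t, k)[1]
        for s, t in zip(per_layer_scores, per_layer_tokens)
    ]
    return np.concatenate([cls] + picked, axis=0)
```

With `L` layers and `k` tokens per layer, `fuse_layers` returns `1 + L * k` tokens, which a final transformer layer could then process in place of the full patch sequence.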
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Fine-grained Image Classification | CUB200 2011 (test) | Accuracy | 91.65 | 536 |
| Fine-grained visual classification | FGVC-Aircraft (test) | Top-1 Acc | 91.6 | 287 |
| Image Classification | CUB-200-2011 (test) | Top-1 Acc | 91.6 | 276 |
| Fine-grained visual classification | NABirds (test) | Top-1 Accuracy | 89.42 | 157 |
| Fine-grained Visual Categorization | Stanford Cars (test) | Accuracy | 91.25 | 110 |
| Image Classification | Stanford Dogs (test) | Top-1 Acc | 91.5 | 85 |
| Fine-grained Visual Categorization | FGVCAircraft | Accuracy | 79.8 | 60 |
| Fine-grained Image Classification | NABirds | Accuracy | 89.42 | 22 |
| Fine-grained Visual Categorization | CUB | Accuracy | 91.65 | 20 |
| Fine-grained Image Classification | iNaturalist 2017 (test) | Accuracy | 70.3 | 19 |