Event Transformer. A sparse-aware solution for efficient event data processing
About
Event cameras are sensors of great interest for many applications that run in low-resource and challenging environments. They log sparse illumination changes with high temporal resolution and high dynamic range while consuming minimal power. However, top-performing methods often ignore specific event-data properties, which leads to generic but computationally expensive algorithms. Efforts toward efficient solutions usually do not achieve top accuracy on complex tasks. This work proposes a novel framework, Event Transformer (EvT), that exploits event-data properties to be both highly efficient and accurate. We introduce a new patch-based event representation and a compact transformer-like architecture to process it. EvT is evaluated on different event-based benchmarks for action and gesture recognition. The results show accuracy better than or comparable to the state of the art while requiring significantly fewer computational resources, which allows EvT to run with minimal latency on both GPU and CPU.
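To make the idea of a sparse, patch-based event representation more concrete, here is a minimal sketch in Python. It is not EvT's actual implementation: the function name `events_to_active_patches`, the `(x, y, t, polarity)` event layout, the two-channel polarity histogram, and the `min_events` threshold are all assumptions chosen for illustration. The point it shows is that only patches that actually received events become tokens, so the downstream cost scales with scene activity and not with sensor resolution.

```python
import numpy as np


def events_to_active_patches(events, height, width, patch_size, min_events=1):
    """Turn a window of events into tokens for the non-empty patches only.

    events: array of shape (N, 4) with columns (x, y, t, polarity),
            polarity in {0, 1}. This layout is an assumption of the sketch.
    Returns (tokens, active_idx), where tokens has shape
    (num_active, 2 * patch_size * patch_size) and active_idx holds the
    flat grid index of each kept patch.
    """
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    p = events[:, 3].astype(np.int64)

    # Accumulate events into a 2-channel (one per polarity) count frame.
    frame = np.zeros((2, height, width), dtype=np.float32)
    np.add.at(frame, (p, y, x), 1.0)

    # Split the frame into non-overlapping patches, dropping any border
    # that does not fill a whole patch.
    gh, gw = height // patch_size, width // patch_size
    frame = frame[:, : gh * patch_size, : gw * patch_size]
    patches = frame.reshape(2, gh, patch_size, gw, patch_size)
    patches = patches.transpose(1, 3, 0, 2, 4).reshape(gh * gw, -1)

    # Keep only patches with enough events; the rest are never processed.
    counts = patches.sum(axis=1)
    active_idx = np.nonzero(counts >= min_events)[0]
    return patches[active_idx], active_idx
```

With an 8×8 sensor, 4×4 patches, and three events located in two different patches, the function returns two tokens out of the four possible grid cells; a transformer-like backbone would then attend over those two tokens only.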
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Gesture Recognition | DVS128-Gesture (test) | Accuracy 96.2 | 30 |
| Action Recognition | SL-Animals 4Sets | Accuracy 88.12 | 15 |
| Action Recognition | DVS128Gesture | Accuracy 94.4 | 15 |
| Action Recognition | SL-Animals 3Sets | Accuracy 87.45 | 13 |
| Action Recognition | DVSGesture (full) | Accuracy 96.2 | 11 |
| Event-based action recognition | DVS128 Gesture | Top-1 Acc 96.2 | 8 |
| Event-based action recognition | SeAct | Top-1 Acc 61.3 | 4 |