LiteEval: A Coarse-to-Fine Framework for Resource Efficient Video Recognition
About
This paper presents LiteEval, a simple yet effective coarse-to-fine framework for resource-efficient video recognition, suitable for both online and offline scenarios. Exploiting decent yet computationally efficient features derived at a coarse scale with a lightweight CNN model, LiteEval decides on the fly whether to compute more powerful features for incoming video frames at a finer scale to obtain more details. This is achieved by a coarse LSTM and a fine LSTM operating cooperatively, together with a conditional gating module that learns when to allocate more computation. Extensive experiments are conducted on two large-scale video benchmarks, FCVID and ActivityNet, and the results demonstrate that LiteEval requires substantially less computation while offering excellent classification accuracy for both online and offline predictions.
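The control flow described above can be illustrated with a toy sketch. It is not the authors' implementation: the two LSTMs are replaced by simple running averages, the learned gating module is replaced by a caller-supplied rule, and carrying the fine state forward unchanged when the gate stays closed is an assumption. All names (`run_coarse_to_fine`, `coarse_fn`, `fine_fn`, `gate_fn`, the cost values) are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]


def run_coarse_to_fine(
    frames: Sequence[Frame],
    coarse_fn: Callable[[Frame], float],
    fine_fn: Callable[[Frame], float],
    gate_fn: Callable[[float, float], bool],
    coarse_cost: float = 1.0,   # assumed relative cost of the cheap features
    fine_cost: float = 10.0,    # assumed relative cost of the expensive features
    momentum: float = 0.5,
) -> Tuple[float, List[bool], float]:
    """Process frames in order.

    Returns (final fine state, per-frame gate decisions, total cost).
    """
    coarse_state = 0.0  # stand-in for the coarse LSTM state (updated every frame)
    fine_state = 0.0    # stand-in for the fine LSTM state (updated only when gated)
    decisions: List[bool] = []
    total_cost = 0.0

    for frame in frames:
        # Cheap features are always computed.
        coarse = coarse_fn(frame)
        total_cost += coarse_cost

        # The gate sees only cheap information: the coarse feature and
        # the coarse state from the previous step.
        use_fine = gate_fn(coarse, coarse_state)
        decisions.append(use_fine)

        if use_fine:
            # Spend extra computation on this frame.
            fine = fine_fn(frame)
            total_cost += fine_cost
            fine_state = momentum * fine_state + (1 - momentum) * fine
        # Otherwise fine_state is carried forward unchanged (assumption).

        coarse_state = momentum * coarse_state + (1 - momentum) * coarse

    return fine_state, decisions, total_cost
```

For example, with the frame mean as the coarse feature, the frame maximum as the fine feature, and a gate that opens when the coarse feature differs from the running coarse state by more than 0.5, static frames are handled at the cheap cost only, and the expensive path is taken only around the frame that changes.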
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Recognition | FCVID (test) | mAP | 80 | 28 |
| Action Recognition | ActivityNet | Accuracy | 72.7 | 22 |
| Action Recognition | ActivityNet v1.3 (test) | mAP | 72.7 | 19 |
| Video Recognition | Kinetics Mini | Top-1 Acc | 61 | 18 |
| Video Recognition | Mini-Kinetics (test) | Accuracy | 61 | 17 |
| Online action recognition | 50Salads (test) | Accuracy | 40.3 | 7 |
| Action Recognition | FCVID | Accuracy | 80 | 6 |