Positional Prompt Tuning for Efficient 3D Representation Learning
About
We rethink the role of positional encoding in 3D representation learning and fine-tuning. We argue that positional encoding in point Transformer-based methods serves to aggregate multi-scale features of point clouds. We further explore parameter-efficient fine-tuning (PEFT) through the lens of prompts and adapters, introducing a straightforward yet effective method called PPT for point cloud analysis. PPT adds extra patch tokens and a trainable positional encoding while keeping most pre-trained model parameters frozen. Extensive experiments validate that PPT is both effective and efficient. With only 1.05M trainable parameters, PPT achieves state-of-the-art results on several mainstream datasets, e.g., 95.01% accuracy on the ScanObjectNN OBJ_BG dataset. Code and weights will be released at https://github.com/zsc000722/PPT.
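The core recipe above (freeze the pre-trained point Transformer, then train only the extra prompt/patch tokens and the positional encoding) can be sketched in PyTorch. This is a minimal illustration, not the released implementation: the module names, token counts, and embedding sizes below are assumptions chosen for clarity, and a generic `nn.TransformerEncoder` stands in for the pre-trained point-cloud backbone.

```python
# Hedged sketch of the PPT fine-tuning setup described above.
# Assumptions (not from the paper's code): dim=384, 64 point patches,
# 16 extra prompt tokens, and a stock TransformerEncoder as the
# stand-in for a pre-trained point Transformer backbone.
import torch
import torch.nn as nn


class PPTSketch(nn.Module):
    def __init__(self, dim=384, num_patches=64, num_prompts=16, depth=4):
        super().__init__()
        # "Pre-trained" backbone: frozen, so it contributes no
        # trainable parameters during fine-tuning.
        layer = nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Trainable parts only: a positional encoding over all tokens
        # and the extra (prompt) patch tokens.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + num_prompts, dim))
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)

    def forward(self, patch_tokens):
        # patch_tokens: (B, num_patches, dim) point-patch embeddings.
        b = patch_tokens.size(0)
        tokens = torch.cat([self.prompts.expand(b, -1, -1), patch_tokens], dim=1)
        # Trainable positional encoding is added before the frozen backbone.
        return self.backbone(tokens + self.pos_embed)


model = PPTSketch()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / total: {total}")
```

An optimizer for fine-tuning would then be built only over `filter(lambda p: p.requires_grad, model.parameters())`, which is what keeps the trainable budget in the ~1M-parameter range rather than the full backbone size.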
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | S3DIS (Area 5) | mIoU | 54.8 | 799 |
| Object classification | ScanObjectNN OBJ_BG | Accuracy | 89.84 | 215 |
| Part segmentation | ShapeNetPart | Instance mIoU | 85.7 | 198 |
| Object classification | ScanObjectNN PB_T50_RS | Accuracy | 84.45 | 195 |
| Object classification | ScanObjectNN OBJ_ONLY | Overall Accuracy | 88.98 | 166 |
| Few-shot classification | ModelNet40 5-way 10-shot | Accuracy | 97.0 | 79 |
| Few-shot classification | ModelNet40 5-way 20-shot | Accuracy | 98.7 | 79 |
| Few-shot classification | ModelNet40 10-way 20-shot | Accuracy | 95.6 | 79 |
| Few-shot classification | ModelNet40 10-way 10-shot | Accuracy | 92.2 | 79 |
| Shape classification | ScanObjectNN PB_T50_RS | Overall Accuracy | 89.52 | 72 |