Point Transformer, Zhao, Jiang, Jia, Torr, Koltun; 2020 - Summary
author: kelseyball
score: 10 / 10
What is the core idea?
- Self-attention is fundamentally a set operator (permutation-invariant) and thus a natural fit for modeling 3D point clouds, which are sets of 3D points
- The authors investigate how to apply self-attention/transformers to 3D point cloud processing
- “Point Transformer” networks outperform a variety of models on large- and small-scale 3D point cloud understanding tasks
How is it realized (technically)?
Point Transformer Layer
- The point transformer layer uses vector self-attention (rather than scalar dot-product attention) with a learnable position encoding function
- Attention is computed per point over its k nearest neighbors
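To make the scalar/vector distinction concrete, the two forms can be written roughly as follows. This is a sketch in notation along the lines of the paper's, with the position encoding left out for brevity. The symbols are assumptions of this note: $\mathcal{X}(i)$ is the set of k nearest neighbors of point $i$, $\varphi$, $\psi$, $\alpha$ are pointwise feature transformations (e.g. linear layers), $\gamma$ is a small MLP that produces the attention vector, $\rho$ is a normalization such as softmax over the neighbors, and $\odot$ is the elementwise product.

```latex
% Scalar dot-product attention: one scalar weight per neighbor j
\[
y_i = \sum_{x_j \in \mathcal{X}(i)} \rho\big(\varphi(x_i)^\top \psi(x_j)\big)\, \alpha(x_j)
\]

% Vector attention: one weight per neighbor j AND per feature channel
\[
y_i = \sum_{x_j \in \mathcal{X}(i)} \rho\big(\gamma(\varphi(x_i) - \psi(x_j))\big) \odot \alpha(x_j)
\]
```

In the scalar form every channel of $\alpha(x_j)$ is scaled by the same weight; in the vector form each channel gets its own weight.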
Position Encoding Function
- The position encoding takes the difference between two 3D point coordinates (the anchor point and a neighboring point) and passes it through a 2-layer MLP with ReLU, which is learned end-to-end
- The resulting encoding is added in two places: to the input from which the attention vector is computed, and to the transformed feature vector
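Putting the layer and the position encoding together, a forward pass might look like the NumPy sketch below. It assumes the subtraction relation between query and key features, brute-force kNN, and randomly initialized weights. The helper names (`knn_indices`, `mlp2`, `init_params`) and the parameter dictionary layout are illustrative only and are not taken from the authors' code.

```python
import numpy as np

def knn_indices(p, k):
    """Brute-force kNN: (n, k) neighbor indices per point (a point is its own nearest neighbor)."""
    d2 = ((p[:, None, :] - p[None, :, :]) ** 2).sum(-1)   # (n, n) squared distances
    return np.argsort(d2, axis=1)[:, :k]

def mlp2(x, w1, b1, w2, b2):
    """Two linear layers with one ReLU in between."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)               # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(d_in, d, rng):
    """Hypothetical random initialization, for illustration only."""
    lin = lambda a, b: rng.normal(scale=0.1, size=(a, b))
    return {
        "phi": lin(d_in, d), "psi": lin(d_in, d), "alpha": lin(d_in, d),
        "theta": (lin(3, d), np.zeros(d), lin(d, d), np.zeros(d)),  # position encoding MLP
        "gamma": (lin(d, d), np.zeros(d), lin(d, d), np.zeros(d)),  # attention MLP
    }

def point_transformer_layer(x, p, params, k=16):
    """x: (n, d_in) features, p: (n, 3) coordinates -> (n, d) output features."""
    idx = knn_indices(p, k)                                # (n, k)
    query = x @ params["phi"]                              # (n, d)
    key = (x @ params["psi"])[idx]                         # (n, k, d)
    value = (x @ params["alpha"])[idx]                     # (n, k, d)
    # Relative position encoding: MLP applied to p_i - p_j
    delta = mlp2(p[:, None, :] - p[idx], *params["theta"])            # (n, k, d)
    # Vector attention: one weight per neighbor and per channel
    logits = mlp2(query[:, None, :] - key + delta, *params["gamma"])  # (n, k, d)
    attn = softmax(logits, axis=1)                         # normalize over the k neighbors
    return (attn * (value + delta)).sum(axis=1)            # (n, d)
```

Because neighbors are found by distance and aggregated by a sum, reordering the input points only reorders the output rows, which is the set-operator property the summary starts from.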
Point Transformer Block
- Consists of linear projections, a point transformer layer, and a residual connection
- Input x is a set of feature vectors with associated coordinates p
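The block structure can be sketched as a thin wrapper around any layer function. This is a minimal illustration, assuming one linear projection before and one after the layer and no biases or normalization. `point_transformer_block`, `w_in`, `w_out`, and the `layer` callable are hypothetical names.

```python
import numpy as np

def point_transformer_block(x, p, w_in, w_out, layer):
    """Residual block: y = x + linear_out(layer(linear_in(x), p)).

    x: (n, d) features, p: (n, 3) coordinates.
    layer: any callable (features, coordinates) -> features, standing in
           for the point transformer layer.
    The coordinates p are passed through unchanged.
    """
    h = x @ w_in          # linear projection in
    h = layer(h, p)       # local attention over neighbors
    h = h @ w_out         # linear projection out, back to d channels
    return x + h, p       # residual connection
```

The residual sum requires `w_out` to map back to the input feature dimension, so the block preserves both the number of points and the channel count.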
Point Transformer Network
- Point transformer blocks are combined with downsampling and interpolation modules which reduce and increase the cardinality of the point set as needed
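The summary does not spell out how points are selected when the set is downsampled. A minimal sketch, assuming farthest point sampling as the selection rule (a common choice in point cloud networks); `farthest_point_sampling` and its `start` parameter are illustrative names, and the interpolation (upsampling) side is not shown.

```python
import numpy as np

def farthest_point_sampling(p, m, start=0):
    """Pick m indices from p (n, 3) so that the chosen points are spread out.

    Greedy rule: repeatedly add the point whose distance to the
    already-chosen set is largest.
    """
    chosen = [start]
    min_d2 = ((p - p[start]) ** 2).sum(-1)        # squared distance to the chosen set
    for _ in range(m - 1):
        nxt = int(np.argmax(min_d2))              # farthest remaining point
        chosen.append(nxt)
        min_d2 = np.minimum(min_d2, ((p - p[nxt]) ** 2).sum(-1))
    return np.array(chosen)

# Example: five points on a line, keep three of them
pts = np.array([[0.0, 0, 0], [1.0, 0, 0], [2.0, 0, 0], [3.0, 0, 0], [10.0, 0, 0]])
subset = farthest_point_sampling(pts, 3)          # indices [0, 4, 3]
```

Features for the kept points would then be pooled from their neighbors in the original set, which is how the cardinality is reduced without simply discarding information.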
Training details
- SGD optimizer, momentum=0.9, weight decay=0.0001.
- For semantic segmentation: 40K iterations with initial LR=0.5, dropped by 10x at steps 24K and 32K.
- For shape classification/object part segmentation: 200 epochs, initial LR=0.05, dropped by 10x at epochs 120 and 160.
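The two schedules above are plain step decays and can be written as one small function. This is a sketch; whether the drop takes effect exactly at the milestone or one step later is an assumption here, and `step_lr` is an illustrative name.

```python
def step_lr(step, base_lr, milestones, gamma=0.1):
    """Learning rate after multiplying by gamma at every milestone already reached."""
    drops = sum(1 for m in milestones if step >= m)
    return base_lr * gamma ** drops

# Semantic segmentation: 40K iterations, LR 0.5, drops at 24K and 32K
seg_lrs = [step_lr(s, 0.5, (24_000, 32_000)) for s in (0, 24_000, 32_000)]

# Classification / part segmentation: 200 epochs, LR 0.05, drops at epochs 120 and 160
cls_lrs = [step_lr(e, 0.05, (120, 160)) for e in (0, 120, 160)]
```

With these settings the segmentation run ends at a learning rate of about 0.005 and the classification run at about 0.0005.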
How well does the paper perform?
Point Transformers achieve new state-of-the-art results on semantic segmentation, shape classification, and object part segmentation, outperforming a variety of models including pointwise MLPs, voxel-based models, graph-based models, sparse convolutional networks, and continuous convolutional networks.
Semantic Segmentation
- Task/Dataset: S3DIS – 271 rooms, each point has a semantic label (floor, chair, etc.). The task is to label each point.
- Eval metrics: mean classwise accuracy (mAcc), overall pointwise accuracy (OA), and mean classwise Intersection over Union (mIoU). For each class, IoU is the ratio of the intersection of the predicted and ground-truth point sets to their union.
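All three metrics can be read off a confusion matrix. A minimal sketch, assuming integer class labels per point and assuming that classes with no ground-truth points and no predictions are skipped when averaging (benchmarks differ on this detail); `segmentation_metrics` is an illustrative name.

```python
import numpy as np

def segmentation_metrics(pred, target, num_classes):
    """Return OA, mAcc and mIoU for per-point integer labels."""
    pred, target = np.asarray(pred), np.asarray(target)
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (target, pred), 1)           # rows: ground truth, cols: prediction
    tp = np.diag(conf).astype(float)             # correctly labeled points per class
    gt = conf.sum(axis=1)                        # ground-truth points per class
    pd = conf.sum(axis=0)                        # predicted points per class
    union = gt + pd - tp
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = tp / union                         # NaN where a class never occurs
        acc = tp / gt
    return {
        "OA": tp.sum() / conf.sum(),
        "mAcc": float(np.nanmean(acc)),
        "mIoU": float(np.nanmean(iou)),
    }

# Toy example with 6 points and 3 classes
m = segmentation_metrics(pred=[0, 1, 1, 1, 2, 0], target=[0, 0, 1, 1, 2, 2], num_classes=3)
```

In the toy example four of six points are correct (OA = 2/3), the per-class IoUs are 1/3, 2/3 and 1/2, and their mean gives mIoU = 0.5.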
Shape Classification
- Task/Dataset: ModelNet40; classify 12,311 CAD models into 40 object categories
- Eval metrics: mean classwise accuracy (mAcc) and overall accuracy over all classes (OA)
Object Part Segmentation
- Task/Dataset: ShapeNetPart; 16k models of 16 shape types, each annotated with 2-6 parts, 50 part types total. The task is to assign a part label to each point.
- Eval metrics: IoU averaged over shape categories (cat. mIoU) and IoU averaged over all instances (inst. mIoU)
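The difference between the two averages shows up when categories have very different sizes. The sketch below assumes the commonly used convention: each shape first gets one IoU (its mean over parts), instance mIoU averages those over all shapes, and category mIoU averages the per-category means. `category_and_instance_miou` and the toy numbers are illustrative only.

```python
from collections import defaultdict

def category_and_instance_miou(shape_ious):
    """shape_ious: list of (category, iou) pairs, one per shape instance."""
    by_cat = defaultdict(list)
    for cat, iou in shape_ious:
        by_cat[cat].append(iou)
    # Instance mIoU: every shape counts equally
    inst_miou = sum(iou for _, iou in shape_ious) / len(shape_ious)
    # Category mIoU: every category counts equally, regardless of its size
    cat_miou = sum(sum(v) / len(v) for v in by_cat.values()) / len(by_cat)
    return cat_miou, inst_miou

# Three chairs and one lamp (made-up values)
toy = [("chair", 0.9), ("chair", 0.7), ("chair", 0.8), ("lamp", 0.4)]
cat_miou, inst_miou = category_and_instance_miou(toy)
```

Here the instance average is about 0.7 while the category average is about 0.6, because the single poorly segmented lamp weighs as much as all chairs together in the category average.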
What interesting variants are explored?
Takeaways from ablation study:
- Vector attention significantly outperforms scalar dot-product attention
- k=16 neighbors works better than smaller or larger values
- Relative position encoding works better than absolute, and adding the encoding to both the attention vector and feature vectors works better than either alone
- Applying softmax normalization to the attention weights (the paper's "softmax regularization") is essential
TL;DR
- Point transformer layers apply vector attention to local KNN neighborhoods of a point in a 3D point cloud; this operation is the basis of Point Transformer networks
- The authors experiment with different values of k, types of attention, and types of position encoding functions
- Point Transformer Networks achieve new SoTA on semantic segmentation, shape classification, and object-part segmentation