3D vision, PointNet

Prerequisite

3D geometric data can be represented as a point cloud or a mesh.

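A mesh is a 3D model built from polygons. Polygons are the small faces that cover the surface of an object; each is made of points (vertices) and edges, and triangles are the most common choice. An object can therefore be described by the (x, y, z) coordinates of its points together with the edges that connect them. [1]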
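As a small illustration of this representation, the sketch below stores a tetrahedron as a vertex list plus triangular faces and derives the edges from the faces. The names `vertices`, `faces`, and `mesh_edges` are hypothetical and chosen only for this example; real mesh libraries use richer structures.

```python
# A tetrahedron as a minimal triangle mesh: 4 vertices, 4 triangular faces.
vertices = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (0.0, 0.0, 1.0),
]
# Each face lists three indices into `vertices`.
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]


def mesh_edges(faces):
    """Collect the undirected edges implied by the triangle faces."""
    edges = set()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edges.add((min(u, v), max(u, v)))  # store each edge once, smaller index first
    return sorted(edges)


edges = mesh_edges(faces)
```

Dropping `faces` and keeping only `vertices` leaves exactly the kind of unordered point set that the rest of this post deals with.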

Image retrieved from https://en.wikipedia.org/wiki/Polygon_mesh

A point cloud is a discrete set of points in 3D space. Various algorithms can convert a 3D point cloud into a mesh for rendering. [2]

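Previously, 3D geometric data was typically converted to a voxel grid or a collection of 2D views before processing. This makes the data unnecessarily large, and information is lost when the geometry is quantized to fit the grid.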
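The trade-off of a voxel grid can be seen in a toy example. The sketch below is not from the paper: `voxelize` and `grid_cells` are hypothetical helpers, and the points and voxel sizes are made-up values. A coarse grid merges nearby points into one cell, while a fine grid keeps them apart at the cost of far more cells.

```python
def voxelize(points, voxel_size):
    """Return the set of integer voxel indices occupied by the points."""
    occupied = set()
    for x, y, z in points:
        occupied.add((int(x // voxel_size), int(y // voxel_size), int(z // voxel_size)))
    return occupied


def grid_cells(extent, voxel_size):
    """Number of cells in a dense cubic grid covering [0, extent) on each axis."""
    return round(extent / voxel_size) ** 3


points = [
    (0.10, 0.10, 0.10),
    (0.12, 0.14, 0.11),  # very close to the first point
    (0.90, 0.20, 0.30),
    (0.55, 0.55, 0.55),
]

coarse = voxelize(points, voxel_size=0.5)   # nearby points fall into the same voxel
fine = voxelize(points, voxel_size=0.01)    # every point gets its own voxel
coarse_cells = grid_cells(1.0, 0.5)         # size of the dense coarse grid
fine_cells = grid_cells(1.0, 0.01)          # size of the dense fine grid
```

With these values the coarse grid can no longer distinguish the first two points, whereas the fine grid needs a million cells to describe four points.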

PointNet therefore proposes a deep neural network that takes point clouds directly as input to understand 3D geometry.

However, because a point cloud is an unordered set, the network must be permutation invariant. The paper presents a simple architecture that satisfies this requirement and applies it to downstream tasks such as classification, part segmentation, and semantic segmentation.

Method

Properties of Point Sets in $\mathbb R^n$

A point cloud in Euclidean space has the following properties.

Unordered: a point cloud is a set of $N$ points with no inherent order, so the network must be invariant to all $N!$ permutations of the input.
Interaction among points: neighboring points are related, so together they form meaningful local structures.
Invariance under transformations: rotating or translating the points should not change the segmentation or classification result.

PointNet Architecture

The architecture consists of three key modules:

max pooling layer as a symmetric function to aggregate information from all the points
a local and global information combination structure
two joint alignment networks that align both input points and point features

Symmetry Function for Unordered Input

There are three ways to make a model permutation invariant: 1) sort the input into a canonical order, 2) augment the training data with all kinds of permutations, or 3) use a symmetric function. No canonical ordering of high-dimensional data is stable under point perturbations, so the paper takes option 3. Fig. 5 supports this choice: applying an MLP to sorted data performs poorly. One could also feed the points in random order and hope that an RNN learns to be permutation invariant, but Fig. 5 shows that this approach falls short as well.

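Equation 1: a general function defined on a point set is approximated by applying a symmetric function to transformed elements, $f(\{x_1, \dots, x_n\}) \approx g(h(x_1), \dots, h(x_n))$.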

$h$ is approximated by an MLP, and $g$ is the composition of a single-variable function and a max pooling function.
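The max-pooling construction can be checked with a minimal sketch. It is an illustration under simplifying assumptions, not the paper's network: `h` is a single linear layer with ReLU standing in for the shared MLP, the single-variable function after pooling is omitted, and the weights, the feature size of 8, and the 16 random points are arbitrary.

```python
import random


def h(point, weights):
    """Per-point feature: one linear layer followed by ReLU (stand-in for the shared MLP)."""
    return [max(0.0, sum(w * c for w, c in zip(row, point))) for row in weights]


def global_feature(points, weights):
    """Element-wise max over all per-point features, so point order cannot matter."""
    features = [h(p, weights) for p in points]
    return [max(column) for column in zip(*features)]


rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(8)]   # K = 8 features
points = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(16)]   # N = 16 points

shuffled = points[:]
rng.shuffle(shuffled)

original_out = global_feature(points, weights)
shuffled_out = global_feature(shuffled, weights)
```

Because every point passes through the same `h` and the maximum of a set does not depend on ordering, `original_out` and `shuffled_out` are identical.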

Local and Global Information Aggregation

The result above is a vector $[f_1, \dots, f_K]$ that serves as the global signature of the input. Per-point tasks such as segmentation additionally require combining local and global knowledge.

To achieve this, as shown in Fig. 2, the global feature is concatenated with the feature vector of each point once it has been computed. New combined per-point features are then extracted from these concatenated vectors. The final per-point features are therefore aware of both local and global information.
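The concatenation step can be sketched in a few lines. The feature values and the 2-dimensional feature size are toy assumptions, and `concat_local_global` is a hypothetical helper name; the subsequent MLP that extracts the combined features is left out.

```python
def concat_local_global(point_features, global_feature):
    """Append the same global feature to every per-point feature vector."""
    return [local + global_feature for local in point_features]


# Three points with 2-dimensional local features (toy values).
point_features = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.3]]

# Global feature: element-wise max over all points.
global_feature = [max(column) for column in zip(*point_features)]

# Each row now holds [local features | global features].
combined = concat_local_global(point_features, global_feature)
```

Every row of `combined` shares the same trailing global part, so a per-point classifier applied to it sees both the point's own feature and a summary of the whole shape.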

Joint Alignment Network

The semantic labeling of a point cloud must be invariant when the cloud undergoes certain geometric transformations, such as a rigid transformation.

Earlier approaches align every input to a canonical space before feature extraction. [3]

PointNet takes a simpler route with T-Net: a mini-network predicts an affine transformation matrix, which is applied directly to the coordinates of the input points.
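Applying the predicted matrix is a plain matrix multiplication. In the sketch below the T-Net itself is not implemented: `predicted` is a hypothetical output, fixed here to a 90-degree rotation about the z-axis, and `apply_transform` is an illustrative helper that treats each point as a row vector.

```python
import math


def apply_transform(points, matrix):
    """Multiply every point (as a row vector) by a 3x3 matrix: p' = p @ matrix."""
    return [
        [sum(p[k] * matrix[k][j] for k in range(3)) for j in range(3)]
        for p in points
    ]


# Hypothetical T-Net output: a fixed 90-degree rotation about the z-axis.
theta = math.pi / 2
predicted = [
    [math.cos(theta), math.sin(theta), 0.0],
    [-math.sin(theta), math.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
]

points = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.5]]
aligned = apply_transform(points, predicted)
```

In the actual model the matrix would be regressed from the input cloud, so each cloud gets its own alignment before features are extracted.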

The same idea can be applied not only in the input coordinate space but also in feature space. However, the transformation matrix in feature space has a much higher dimension, which makes optimization harder. A regularization term is therefore added so that the feature transformation matrix stays close to orthogonal.

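In Eq. 2, $L_{reg} = \|I - AA^\top\|_F^2$, $A$ is the feature alignment matrix. Writing the rows of $A$ as $a_i$, the loss expands to $\sum_i (1 - \|a_i\|^2)^2 + 2\sum_{i<j} (a_i \cdot a_j)^2$. The off-diagonal entries of $AA^\top$ measure how far each pair of rows is from being orthogonal, and because $AA^\top$ is symmetric every pair is counted twice, hence the factor of 2. The diagonal entries penalize rows whose norm differs from 1. Minimizing the loss therefore pushes the rows of $A$ toward an orthonormal set.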
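The regularization term, the squared Frobenius norm of $I - AA^\top$, can be evaluated on small matrices to see what it penalizes. This is a sketch on hand-picked 2x2 matrices, not the paper's training code; `orthogonality_loss` and the three example matrices are assumptions made for illustration.

```python
import math


def orthogonality_loss(A):
    """Squared Frobenius norm of (I - A A^T) for a square matrix A given as nested lists."""
    n = len(A)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            gram_ij = sum(A[i][k] * A[j][k] for k in range(n))  # entry (i, j) of A A^T
            target = 1.0 if i == j else 0.0                     # entry (i, j) of I
            loss += (target - gram_ij) ** 2
    return loss


c, s = math.cos(0.3), math.sin(0.3)
rotation = [[c, -s], [s, c]]        # orthogonal matrix
scaled = [[2.0, 0.0], [0.0, 1.0]]   # rows are orthogonal, but the first has norm 2
skewed = [[1.0, 0.0], [1.0, 0.0]]   # unit-norm rows that are linearly dependent

loss_rotation = orthogonality_loss(rotation)
loss_scaled = orthogonality_loss(scaled)
loss_skewed = orthogonality_loss(skewed)
```

The rotation gives a loss of essentially zero, the scaled matrix is penalized only through its diagonal term, (1 - 4)^2 = 9, and the rank-deficient matrix is penalized only through the off-diagonal term, which is counted twice and gives 2.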

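The paper gives only a brief justification (an orthogonal transformation does not lose information in the input) and reports that with this term optimization becomes more stable and the model achieves better performance. Thinking about why, two reasons seem plausible.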

First, orthonormal vectors have unit norm, which should bring some numerical stability.

Second, it ensures that the transformation matrix has full rank. A rank-deficient matrix may not be very likely in practice, but any linear dependency would collapse feature dimensions, which could be a serious problem.

Result

The experiments are divided into four parts: 1) 3D recognition, 2) network design, 3) network visualization, and 4) time/space complexity.

Applications

3D object classification

Classification is evaluated on the ModelNet40 shape classification benchmark. It contains 12,311 CAD models from 40 object categories, split into 9,843 models for training and 2,468 for testing.

For each model, 1024 points are sampled uniformly from the mesh faces. The reported results are obtained after augmenting the point clouds with rotation and jittering. Table 1 shows that PointNet performs strongly against the existing baselines.
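The rotation and jittering augmentation can be sketched as follows. The details are assumptions based on the paper's description of its training setup (a random rotation about the up-axis and Gaussian jitter with standard deviation 0.02); treating y as the up-axis, the helper name `augment`, and the random input cloud are choices made only for this example.

```python
import math
import random


def augment(points, rng, sigma=0.02):
    """Rotate the cloud by a random angle about the y (up) axis, then jitter each coordinate."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(theta), math.sin(theta)
    out = []
    for x, y, z in points:
        # Rotation in the x-z plane; the y coordinate is unchanged.
        xr, zr = c * x + s * z, -s * x + c * z
        out.append((
            xr + rng.gauss(0.0, sigma),
            y + rng.gauss(0.0, sigma),
            zr + rng.gauss(0.0, sigma),
        ))
    return out


rng = random.Random(0)
# Stand-in for 1024 points sampled from a mesh (random values, not real data).
cloud = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(1024)]
augmented = augment(cloud, rng)
```

Calling `augment` again on the same cloud draws a new angle and new noise, so the network sees a different variant of each shape on every pass.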

3D object part segmentation

The ShapeNet part dataset contains 16,881 shapes from 16 categories, annotated with 50 parts in total. Most object categories are labeled with 2 to 5 parts.

References

[1] Awati, R. (2024, January 22). 3D mesh. WhatIs. https://www.techtarget.com/whatis/definition/3D-mesh

[2] Wikipedia contributors. (2024, March 11). Point cloud. Wikipedia. https://en.wikipedia.org/wiki/Point_cloud

[3] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS 2015.
