[Paper Review] Emerging Properties in Self-Supervised Vision Transformers


Abstract

Can self-supervised learning give ViT new properties?
Self-supervised ViT features contain information about the semantic segmentation of an image.
- This is a property that neither supervised ViTs nor CNNs showed.
These features can also be used with a k-NN classifier.
The proposed method is DINO (self-DIstillation with NO labels).

Motivation

The success of NLP can largely be attributed to self-supervised learning.
Existing ViTs are trained with supervised learning, which fails to capture all the rich features in an image and reduces them to a simple category.
The paper applies self-supervised learning to ViT and examines the new properties that emerge.
- The self-attention modules of the last block yield scene layout and object boundary information.
- Self-supervised ViT features perform well with k-NN (semantically similar images end up numerically close in feature space).
- With ViT, smaller patches give better performance.
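
To make the k-NN point concrete, here is a minimal sketch of a k-NN classifier on frozen features. It is a simplified stand-in, not the paper's evaluation protocol: the function name `knn_predict`, the plain majority vote, the value of `k`, and the toy 2-D features are all assumptions made for illustration. In practice the features would come from a frozen self-supervised ViT.

```python
import numpy as np

def knn_predict(train_feats, train_labels, query_feats, k=5):
    """Majority vote among the k most cosine-similar training features."""
    def l2norm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    sims = l2norm(query_feats) @ l2norm(train_feats).T   # (Q, N) cosine similarities
    nearest = np.argsort(-sims, axis=1)[:, :k]           # indices of the k nearest
    votes = train_labels[nearest]                        # (Q, k) neighbour labels
    return np.array([np.bincount(v).argmax() for v in votes])

# Toy stand-in for frozen features: two well-separated clusters.
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0], [0.0, 5.0]])
train_labels = np.repeat([0, 1], 20)
train_feats = centers[train_labels] + rng.normal(scale=0.5, size=(40, 2))

query = np.array([[4.0, 0.5], [0.5, 4.0]])
pred = knn_predict(train_feats, train_labels, query, k=5)
```

No training happens here: the classifier only works if nearby features share a label, which is why k-NN accuracy is a direct probe of feature quality.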

Method

SSL with Knowledge Distillation

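The student network's probability distribution $P_s$ is obtained by applying a softmax with temperature $\tau_s$ to the $K$-dimensional network output:

$$P_s(x)^{(i)} = \frac{\exp\left(g_{\theta_s}(x)^{(i)}/\tau_s\right)}{\sum_{k=1}^{K}\exp\left(g_{\theta_s}(x)^{(k)}/\tau_s\right)}$$

The teacher distribution $P_t$ is defined the same way with temperature $\tau_t$.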

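Knowledge distillation is done by minimizing the cross-entropy between the teacher output distribution $P_t$ and the student output distribution $P_s$ with respect to the student parameters $\theta_s$:

$$\min_{\theta_s} H(P_t(x), P_s(x)), \qquad H(a, b) = -a \log b$$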
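
The distillation objective described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the random arrays stand in for the student and teacher network outputs, and the temperature values and helper names (`softmax_with_temperature`, `cross_entropy`) are chosen for the example.

```python
import numpy as np

def softmax_with_temperature(logits, tau):
    """Softmax over the last axis of logits / tau (numerically stable)."""
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p_teacher, p_student, eps=1e-12):
    """H(a, b) = -sum_k a_k log b_k, averaged over the batch."""
    return float(-(p_teacher * np.log(p_student + eps)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
student_logits = rng.normal(size=(4, 8))  # stand-in for the student output, K = 8
teacher_logits = rng.normal(size=(4, 8))  # stand-in for the teacher output

p_s = softmax_with_temperature(student_logits, tau=0.1)   # illustrative tau_s
p_t = softmax_with_temperature(teacher_logits, tau=0.04)  # illustrative tau_t
loss = cross_entropy(p_t, p_s)
```

Only the student would receive gradients from this loss; the teacher probabilities are treated as fixed targets.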

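To apply this objective to self-supervised learning, the following approach is used.
- A set $V$ of different views of an image is generated (a view here means a crop).
- It contains two global views $x_1^g, x_2^g$ and several local views at lower resolution.
- All crops are passed through the student, while only the global views are passed through the teacher.
- Minimizing the loss $\min_{\theta_s} \sum_{x \in \{x_1^g, x_2^g\}} \sum_{x' \in V,\, x' \neq x} H(P_t(x), P_s(x'))$, where $H$ is the cross-entropy between the teacher distribution $P_t$ and the student distribution $P_s$, strengthens local-to-global correspondence.
- By default, global views have size $224^2$ and local views have size $96^2$.
- Both networks share the same architecture $g$ and differ only in their parameters.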
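
The multi-crop objective in the list above can be sketched as follows. This is a hedged illustration, not the authors' code: random arrays stand in for network outputs, `multicrop_loss` is a hypothetical helper, the temperatures are example values, centering is left out for brevity, and the sum over pairs is averaged rather than summed (an assumption that only rescales the loss).

```python
import numpy as np

def softmax(logits, tau):
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multicrop_loss(student_out, teacher_out, tau_s=0.1, tau_t=0.04):
    """student_out: list of (B, K) arrays, one per view, global views first.
    teacher_out: list of (B, K) arrays for the global views only."""
    total, n_terms = 0.0, 0
    for i, t in enumerate(teacher_out):
        p_t = softmax(t, tau_t)
        for j, s in enumerate(student_out):
            if i == j:  # skip the pair where student and teacher see the same view
                continue
            log_p_s = np.log(softmax(s, tau_s) + 1e-12)
            total += float(-(p_t * log_p_s).sum(axis=-1).mean())
            n_terms += 1
    return total / n_terms

rng = np.random.default_rng(0)
B, K = 4, 16
n_global, n_local = 2, 4
student_out = [rng.normal(size=(B, K)) for _ in range(n_global + n_local)]
teacher_out = [rng.normal(size=(B, K)) for _ in range(n_global)]
loss = multicrop_loss(student_out, teacher_out)
```

With 2 global and 4 local views, each teacher view is compared against the 5 other student views, giving 10 cross-entropy terms in total.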

Teacher network
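- The teacher is the component that makes the self-supervised setup possible.
- Unlike standard knowledge distillation, no teacher network is given a priori.
- The teacher network is built from past iterations of the student network.
- Specifically, the teacher is an exponential moving average (EMA) of the student weights.
- This has an effect similar to model ensembling.
- Throughout training, the teacher network performs better than the student network.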
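
The EMA teacher described in the list above amounts to one line per parameter. The sketch below assumes parameters are stored in a dict of NumPy arrays; the helper name `ema_update` and the momentum value are chosen for illustration only (a momentum this low is used here just to make the effect visible).

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum):
    """theta_t <- momentum * theta_t + (1 - momentum) * theta_s, per parameter."""
    return {
        name: momentum * teacher_params[name] + (1.0 - momentum) * student_params[name]
        for name in teacher_params
    }

student = {"w": np.array([1.0, 2.0]), "b": np.array([0.5])}
teacher = {"w": np.array([0.0, 0.0]), "b": np.array([0.0])}

# One update moves the teacher 10% of the way toward the student.
teacher = ema_update(teacher, student, momentum=0.9)
```

The teacher receives no gradients; it only tracks the student through this update.
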
Network architecture
- The backbone $f$ is a ViT or a ResNet.
- A projection head $h$ is placed on top of it, so the full network is $g = h \circ f$.
Avoiding Collapse
- Collapse is controlled with centering and sharpening. Without the right balance, the output collapses, for example to a uniform distribution.
- There are two kinds of collapse:
  - the output becomes uniform regardless of the input
  - the output is dominated by a single dimension
- Centering prevents one dimension from dominating, but pushes the output toward the uniform distribution.
- Sharpening has the opposite effect.
- The centering bias term is updated with an EMA.
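
The interplay of centering and sharpening can be illustrated with a small NumPy sketch. Assumptions to note: the center is subtracted from the teacher output before the softmax (a sign convention chosen here), the center is an EMA of the batch mean of the teacher outputs, and the names `teacher_probs`, `update_center`, and the values of `m` and `tau_t` are illustrative.

```python
import numpy as np

def softmax(logits, tau):
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def teacher_probs(teacher_logits, center, tau_t):
    """Centering removes the running mean; a low tau_t sharpens the result."""
    return softmax(teacher_logits - center, tau_t)

def update_center(center, teacher_logits, m):
    """c <- m * c + (1 - m) * batch mean of the teacher outputs."""
    return m * center + (1.0 - m) * teacher_logits.mean(axis=0)

rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 8))
logits[:, 3] += 5.0  # dimension 3 dominates every sample

center = np.zeros(8)
for _ in range(200):  # let the EMA converge on this fixed batch
    center = update_center(center, logits, m=0.9)

p_raw = teacher_probs(logits, np.zeros(8), tau_t=1.0)   # no centering
p_centered = teacher_probs(logits, center, tau_t=1.0)   # centering only
p_sharp = teacher_probs(logits, center, tau_t=0.04)     # centering + sharpening
```

In this toy batch, centering removes the dominance of dimension 3, and the low temperature then makes each sample's distribution peaked again, so neither failure mode takes over.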

Implementation and evaluation protocols

Vision Transformer
- Models with 8×8 and 16×16 image patches are used.
- An extra learnable [CLS] token is added to the patch sequence to aggregate information, and the projection head is attached to its output.
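
As a quick check of what the patch size means for sequence length, the sketch below counts tokens for a square image, assuming non-overlapping patches and one [CLS] token. The helper name `num_tokens` is made up for this example.

```python
def num_tokens(image_size, patch_size):
    """Number of patch tokens for a square image, plus one [CLS] token."""
    assert image_size % patch_size == 0, "image size must be divisible by patch size"
    per_side = image_size // patch_size
    return per_side * per_side + 1

for patch in (16, 8):
    print(f"224x224 image, {patch}x{patch} patches: {num_tokens(224, patch)} tokens")
```

Halving the patch size roughly quadruples the number of tokens (197 vs. 785 for a 224×224 image), so the better performance of small patches comes with a higher compute cost.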

Results

Figure 3 shows that different heads attend to different semantic regions.
Compared with the self-attention maps of a supervised ViT, these maps also give better segmentation.

References

[1] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9650-9660).
