[ZSD] GLIP

Grounded Language-Image Pre-training


Abstract

Proposes the GLIP model.
Unifies the object detection task and the phrase grounding task.
- Performance improves on both tasks
- Image-text pair data can be leveraged through self-training

Motivation

Phrase grounding task: match the phrases in a sentence to the objects in an image.
GLIP: unifies phrase grounding and object detection.
- Object detection can be interpreted as a context-free phrase grounding task

⇒ The detection task can be unified into the phrase grounding task.

Any object detection model can be converted into a grounding model by replacing the object classification logits in its box classifier with word-region alignment scores.
- Detection gains a larger pool of visual concepts from grounding data
- Grounding gains more bounding box annotations from detection data

Method

Proposes a unified formulation that exploits the similarity between the two tasks.
- Proposes deep fusion between image and text, which makes the detection model language-aware and the grounding model stronger
- Pre-trains GLIP

Unified Formulation

Object detection as phrase grounding
- The original object detection task classifies each region into one of $c$ classes
- The task is reformulated so that each region is aligned with the $c$ phrases in a text prompt
- Prompt = "Detect: person, bicycle, car, ... , toothbrush"
  - This prompt format is provisional and leaves room for improvement
- The alignment score is computed as follows
  - $O\in \mathbb R^{N\times d}$: object features of the input image
  - $P\in\mathbb R^{M\times d}$: contextual word features from the language encoder
  - $S_{ground}\in\mathbb R^{N\times M}$: region-word alignment scores
- $$
  O=\mathrm{Enc}_I(\mathrm{Img}),\quad P=\mathrm{Enc}_L(\mathrm{Prompt}),\quad S_{ground}=OP^\top
  $$
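
The formulation above can be illustrated with a minimal sketch in plain Python. The helper names `build_prompt` and `alignment_scores` and all feature values are made up for illustration. In a real model, $O$ and $P$ come from the image and language encoders, and the tokenizer usually produces more tokens than there are classes, so $M$ does not generally equal $c$.

```python
# Toy sketch of S_ground = O P^T. All feature values are invented.

def build_prompt(class_names):
    """Join detection class names into a single text prompt."""
    return "Detect: " + ", ".join(class_names)

def alignment_scores(O, P):
    """Return S_ground = O P^T as an N x M nested list of dot products."""
    return [[sum(o * p for o, p in zip(region, token)) for token in P]
            for region in O]

class_names = ["person", "bicycle", "car"]
prompt = build_prompt(class_names)

# N = 2 region features and M = 3 token features, both with d = 2.
O = [[1.0, 0.0],
     [0.0, 2.0]]
P = [[1.0, 1.0],
     [0.5, 0.0],
     [0.0, 3.0]]

S_ground = alignment_scores(O, P)
print(prompt)
print(S_ground)
```

Each row of `S_ground` scores one region against every prompt token, which takes the place of the per-class logits of a standard detector.
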
Equivalence between detection and grounding
- Because of this equivalence, a pre-trained phrase grounding model can be applied to any object detection task
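
To use region-word scores for detection, the $N\times M$ token-level scores have to be reduced to $N\times c$ class-level scores. One simple rule is to average the token probabilities over each class phrase. The sketch below assumes a sigmoid over each score and a hand-written token layout; `class_probabilities` and `token_spans` are hypothetical names, not part of any released API.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def class_probabilities(S_ground, token_spans):
    """Average per-token probabilities over the tokens of each class phrase.

    S_ground: N x M alignment scores (regions x prompt tokens).
    token_spans: for each class, the token indices that spell its phrase.
    """
    probs = []
    for row in S_ground:
        token_probs = [sigmoid(s) for s in row]
        probs.append([sum(token_probs[j] for j in span) / len(span)
                      for span in token_spans])
    return probs

# Assumed prompt tokens: ["person", "traffic", "light"], so c = 2 and M = 3.
token_spans = [[0], [1, 2]]
S_ground = [[4.0, -4.0, -4.0],
            [-3.0, 2.0, 2.0]]

probs = class_probabilities(S_ground, token_spans)
# Index of the highest-probability class for each region.
predicted = [max(range(len(p)), key=p.__getitem__) for p in probs]
print(predicted)
```
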

Language-Aware Deep Fusion

Late fusion: the image and text are encoded separately, and the alignment score is computed only at the end.

Deep fusion: information is fused inside the image and text encoders (see Figure 2).
- $L$: number of DyHeadModules
- $O^0$: visual features from the vision backbone
- $P^0$: token features from the language backbone
- The modalities communicate through cross-modality multi-head attention
- The attention output is then added to the single-modality features

[Equation: attention score computation and multi-head attention]
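
The fusion loop described above can be sketched in plain Python under strong simplifications: a single attention head, identity projections instead of learned weight matrices, and identity functions standing in for DyHeadModule and the language encoder layer. The names `cross_attention`, `deep_fusion`, `O_t2i`, and `P_i2t` are labels chosen for this sketch.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(A):
    out = []
    for row in A:
        m = max(row)                       # subtract the max for stability
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def cross_attention(O, P):
    """Single-head cross-modality attention (identity projections assumed)."""
    d = len(O[0])
    attn = [[s / math.sqrt(d) for s in row] for row in matmul(O, transpose(P))]
    O_t2i = matmul(softmax_rows(attn), P)             # text -> image, N x d
    P_i2t = matmul(softmax_rows(transpose(attn)), O)  # image -> text, M x d
    return O_t2i, P_i2t

def deep_fusion(O0, P0, num_layers,
                vision_layer=lambda x: x, language_layer=lambda x: x):
    """Run L fusion layers; the layer arguments are identity stand-ins."""
    O, P = O0, P0
    for _ in range(num_layers):
        O_t2i, P_i2t = cross_attention(O, P)
        O = vision_layer(add(O, O_t2i))    # add to single-modality features
        P = language_layer(add(P, P_i2t))
    return O, P

O0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N = 3 regions, d = 2
P0 = [[1.0, 0.0], [0.0, 1.0]]              # M = 2 tokens, d = 2
O_L, P_L = deep_fusion(O0, P0, num_layers=2)
print(O_L)
print(P_L)
```

Because the text features are mixed into the visual features at every layer, the final `O_L` depends on the prompt, which is the sense in which the visual features become language-aware.
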

Advantages of deep fusion
- Improves phrase grounding performance
- Makes the visual features language-aware → conditioned on the text prompt

Pre-training with Scalable Semantic Rich Data

Existing approaches that rely on a teacher network can only detect a fixed set of labels.
Grounding data provides richer semantics and can be used in a self-training fashion.
1. Grounding data contains far more diverse captions
  - Object detection → at most 2,000 categories / VG Caption → 110,689 unique phrases
2. Scaling up grounding data is therefore more effective than scaling up detection data
  - GLIP is first trained on human-annotated data and then used as the teacher
3. The student model is additionally trained on web-collected data annotated by the teacher model
  - This lets the student learn concepts that the teacher does not know
  - → The teacher model annotates from context even without knowing "vaccine" or "turquoise" (Figure 3)
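
The teacher-student recipe above can be sketched as a small pseudo-labeling routine. Everything here is hypothetical: `toy_teacher` returns canned predictions in place of a trained GLIP teacher, the image IDs, captions, boxes, and scores are invented, and the 0.5 confidence threshold is an arbitrary choice for illustration.

```python
# Canned teacher outputs keyed by image ID. All values are invented.
FAKE_PREDICTIONS = {
    "img_001": [
        {"phrase": "vaccine", "box": (10, 20, 50, 80), "score": 0.9},
        {"phrase": "table", "box": (5, 5, 30, 40), "score": 0.3},
    ],
    "img_002": [
        {"phrase": "turquoise", "box": (0, 0, 100, 60), "score": 0.2},
    ],
}

def toy_teacher(image, caption):
    """Stand-in for a teacher that grounds caption phrases to boxes."""
    return FAKE_PREDICTIONS.get(image, [])

def pseudo_label(teacher, image_text_pairs, score_threshold=0.5):
    """Keep teacher-predicted boxes whose confidence passes the threshold."""
    pseudo = []
    for image, caption in image_text_pairs:
        boxes = [b for b in teacher(image, caption)
                 if b["score"] >= score_threshold]
        if boxes:  # drop pairs where no box survives the filter
            pseudo.append({"image": image, "caption": caption, "boxes": boxes})
    return pseudo

human_annotated = [
    {"image": "img_gold", "caption": "a person riding a bicycle",
     "boxes": [{"phrase": "person", "box": (1, 2, 3, 4), "score": 1.0}]},
]
web_pairs = [("img_001", "a vial of vaccine on a table"),
             ("img_002", "turquoise water near the beach")]

pseudo = pseudo_label(toy_teacher, web_pairs)
student_data = human_annotated + pseudo  # the student trains on both sources
print(len(student_data))
```

The phrases attached to the surviving boxes come from the captions, not from a fixed label set, so the student's training data can contain concepts that never appeared in the human-annotated data.
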
