VideoChat2 (CVPR 2024, MLLM)

Methods

The authors attribute the suboptimal behavior of existing MLLMs to the limited diversity of instruction-tuning data. Following M^3IT, they therefore convert every data sample into a uniform format, shown at the bottom right of Fig. 1.

Figure 1: Instruction tuning data for VideoChat2.

Each sample is a dictionary whose first key is 'image' or 'video' and whose second key is 'QA'. The first key holds the vision data; the second holds the task instruction under 'i', the question under 'q', and the answer under 'a'.
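To make the format concrete, here is a minimal sketch of one sample and a small validity check. The file path and the text are made up for illustration, and storing 'QA' as a list of turns is an assumption; the description above only fixes the key names.

```python
import json

# Hypothetical sample in the uniform format. The path and the text are
# invented, and keeping 'QA' as a list of turns is an assumption.
sample = {
    "video": "videos/example_0001.mp4",
    "QA": [
        {
            "i": "Watch the video and answer the question briefly.",
            "q": "What is the person doing?",
            "a": "The person is slicing a tomato.",
        }
    ],
}


def validate_sample(sample: dict) -> str:
    """Return the vision key ('image' or 'video') if the sample is well formed."""
    vision_keys = [k for k in ("image", "video") if k in sample]
    if len(vision_keys) != 1:
        raise ValueError("expected exactly one of 'image' or 'video'")
    if "QA" not in sample:
        raise ValueError("missing 'QA'")
    for turn in sample["QA"]:
        missing = {"i", "q", "a"} - set(turn)
        if missing:
            raise ValueError(f"QA turn is missing keys: {sorted(missing)}")
    return vision_keys[0]


print(validate_sample(sample))
print(json.dumps(sample, indent=2))
```

A check like this is handy when merging many source datasets into one format, since a single malformed entry is otherwise easy to miss.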

The full instruction-tuning dataset can be divided into six categories:

1) Conversation

2) Simple Caption

3) Detailed Caption

4) VQA

5) Reasoning

6) Classification

Progressive Multi-Modal Training

Stage1: Vision-Language Alignment

The first stage aligns vision and text. The visual encoder is frozen and a QFormer (from BLIP-2) is trained. The point of this step is to compress the visual tokens into a smaller number of query tokens.
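The compression idea can be sketched with a single cross-attention step in which a few query vectors attend over many visual tokens. This is only a toy illustration: the real QFormer stacks several layers with learned projections, self-attention, and a text branch, and the sizes below are arbitrary rather than the paper's.

```python
import math
import random


def cross_attend(queries, visual_tokens):
    """Single-head cross-attention without learned projections.

    queries: list of Nq vectors; visual_tokens: list of Nv vectors (same dim).
    Returns Nq vectors, each a softmax-weighted average of the visual tokens.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, v)) / math.sqrt(dim)
            for v in visual_tokens
        ]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, visual_tokens)) for d in range(dim)]
        )
    return outputs


random.seed(0)
dim, num_visual, num_queries = 8, 64, 4  # toy sizes, chosen for illustration
visual_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_visual)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

compressed = cross_attend(queries, visual_tokens)
print(len(visual_tokens), "->", len(compressed))
```

However many visual tokens go in, the output always has as many tokens as there are queries, which is what keeps the LLM input short.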

Training in this stage uses three objectives: VTM (Vision-Text Matching), VTC (Vision-Text Contrastive learning), and VTG (Vision-grounded Text Generation).
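Of the three objectives, VTC is the easiest to write down. The sketch below is a generic symmetric InfoNCE loss over a batch of paired embeddings, not the authors' implementation; the temperature of 0.07 is a commonly used value and an assumption here.

```python
import math


def _diag_cross_entropy(sim):
    """Mean of -log softmax(sim[i])[i] over rows i (matching pairs on the diagonal)."""
    loss = 0.0
    for i, row in enumerate(sim):
        peak = max(row)
        log_sum = peak + math.log(sum(math.exp(s - peak) for s in row))
        loss += log_sum - row[i]
    return loss / len(sim)


def contrastive_loss(vision_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings (assumed L2-normalised)."""
    sim = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in text_emb]
        for v in vision_emb
    ]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-vision direction
    return 0.5 * (_diag_cross_entropy(sim) + _diag_cross_entropy(sim_t))


# Toy batch: aligned pairs should score a much lower loss than shuffled ones.
vision = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
shuffled_text = [[0.0, 1.0], [1.0, 0.0]]
print(contrastive_loss(vision, aligned_text) < contrastive_loss(vision, shuffled_text))
```

VTM and VTG would add a binary match classifier and a captioning loss on top of the same QFormer outputs; they are left out to keep the sketch short.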

UMT-L is used as the visual encoder. UMT is a video encoder; see https://jordano-jackson.tistory.com/149 for details.

The QFormer is trained on 15M image captions from CC3M and CC12M, together with 10M video captions from WebVid.

Stage2: Vision-Language Connection

After stage 1 pretraining, the visual encoder is connected to a pretrained LLM. To do this, the query tokens are passed through a linear projection, concatenated with the text tokens, and fed to the LLM.
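The connection step amounts to a dimension change followed by a concatenation. Below is a minimal sketch with toy dimensions; the weight values are arbitrary, and placing the visual tokens before the text tokens is an assumption, since the real prompt template may interleave them differently.

```python
def linear(x, weight, bias):
    """Apply y = W x + b to one vector; weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def build_llm_input(query_tokens, text_embeddings, weight, bias):
    """Project query tokens to the LLM width, then prepend them to the text tokens."""
    visual = [linear(q, weight, bias) for q in query_tokens]
    return visual + text_embeddings  # ordering is an assumption


# Toy sizes: QFormer width 3, LLM width 4.
weight = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
bias = [0, 0, 0, 0]
query_tokens = [[1, 2, 3], [4, 5, 6]]            # 2 query tokens
text_embeddings = [[0, 0, 0, 0]] * 3             # 3 text tokens, already LLM width

llm_input = build_llm_input(query_tokens, text_embeddings, weight, bias)
print(len(llm_input), "tokens of width", len(llm_input[0]))
```

After the projection, the LLM sees the visual tokens as ordinary embeddings in its own input space.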

In stage 2 the visual encoder is fine-tuned rather than frozen, for better alignment with the LLM.

Training uses 2M image captions from COCO, Visual Genome, and SBU, plus 10M video captions from InternVid.

Stage3: Instruction Tuning

In the final instruction-tuning stage, the model is fine-tuned on the instruction data from the MVBench paper (the data shown in Fig. 1). LoRA is applied to the frozen LLM, which is tuned together with the visual encoder and QFormer.
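As a reminder of what LoRA does, the sketch below adds a low-rank update to a frozen weight matrix. It is the generic formulation, y = xW + (alpha / r) · xAB, written in plain Python; the rank, alpha, and matrix values are toy choices and say nothing about the settings used for VideoChat2.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w_frozen, lora_a, lora_b, alpha, rank):
    """Compute x @ W + (alpha / rank) * x @ A @ B.

    x: (n, in), w_frozen: (in, out), lora_a: (in, rank), lora_b: (rank, out).
    Only lora_a and lora_b would receive gradients; w_frozen stays fixed.
    """
    base = matmul(x, w_frozen)
    delta = matmul(matmul(x, lora_a), lora_b)
    scale = alpha / rank
    return [[b + scale * d for b, d in zip(b_row, d_row)] for b_row, d_row in zip(base, delta)]


x = [[1, 2]]
w_frozen = [[1, 0], [0, 1]]   # frozen pretrained weight (identity, for clarity)
lora_a = [[1], [1]]           # rank-1 down-projection
lora_b_init = [[0, 0]]        # B starts at zero, so the update starts at zero
lora_b_trained = [[1, -1]]    # hypothetical values after some training

print(lora_forward(x, w_frozen, lora_a, lora_b_init, alpha=2, rank=1))
print(lora_forward(x, w_frozen, lora_a, lora_b_trained, alpha=2, rank=1))
```

Because B starts at zero, the adapted model initially behaves exactly like the frozen one, and only the small A and B matrices need to be stored and updated.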

Notably, the instruction fed to the QFormer also follows the format in Figure 1, so the tokens passed to the LLM are generated to be instruction-relevant. Only the 'i' field of 'QA' is used here; 'q' is not. According to Appendix B, adding the question led to a performance drop, presumably because the QFormer then had difficulty extracting information.
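In terms of the data format, this design choice is just a matter of which fields are passed as the QFormer's text input. The helper below is hypothetical (its name and the `include_question` flag are made up for illustration) and only shows the selection logic, not the actual tokenization.

```python
def qformer_text_input(qa_turn: dict, include_question: bool = False) -> str:
    """Build the text given to the QFormer from one QA turn.

    By default only the instruction 'i' is used, matching the setting
    described above; include_question=True mimics the ablated variant.
    """
    text = qa_turn["i"]
    if include_question:
        text = f"{text} {qa_turn['q']}"
    return text


# Made-up QA turn for illustration.
turn = {
    "i": "Watch the video and answer the question briefly.",
    "q": "What is the person doing?",
    "a": "The person is slicing a tomato.",
}

print(qformer_text_input(turn))
print(qformer_text_input(turn, include_question=True))
```

The question and answer still reach the LLM as text; only the QFormer's conditioning is restricted to the instruction.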

Figure 2: Progressive multi-modal training of VideoChat2.
