VoT (ICML oral, video understanding)

Abstract

Suggests MotionEpic, a model that integrates STSG to video
Suggests VoT(Video of Thought) framework

Motivation

Figure 1: Human-like multi-step video reasoning procedure.

video에 대한 reasoning을 수행하기 위해서는 두 종류의 ability가 필요하다.

fine-grained perceptive pixel understanding of the video movement
cognitive ability allowing reasonable explanation and causal imagination

실제로 사람이 video에 대해서 reasoning을 할 때에는 multi-hop으로 추론하므로 이를 모방하는 것이 필요하다고 짐작하기는 어렵지 않다. intuitive하게 따라서, CoT와 비슷한 형태의 formatting을 통해 문제를 subproblem으로 divide하여 해결하도록 만들었다.

또한 여기에 STSG(SpatioTemporal Scene Graph) representation을 intergrate하여 이를 활용할 수 있도록 만들었다. 전체 architecture는 MotionEpic으로 명명되었다.

VoT는 다음과 같은 procedure로 구현된다:

VoT identifies the possible target(s) involved in the question to observe
The system grounds the temporal tracklet(s)
VoT interprets the target object's trajectory and interactions
LLM examines each optional answer
VoT performs verification for the answer

MotionEpic

Overview

LLM으로는 Vicuan-7B-1.5v를 사용하고, video input으로는 ViT-L/14 encoder와 Q-Former projector를 사용한다.

STSG를 위해서는 Graph Transformer를 수정하여 사용한다.

Integrating STSG Representation

STSG definition을 수정하여 사용한다. 먼저 video에서 uniform하게 frame을 sample한다. $k$th frame의 SG를 $G_k = (V_k ; E_k)$라고 하면, 각각의 vertex는 다음과 같이 표현된다:

$$ v_{k,i}=(c_i,f_i,b_i)_k$$

이때 c는 category label, f는 neural representation, b는 $(x,y,w,h)$로 2D coordinate을 의미한다.

일반적인 SG와 같이 각 subject와 object는 predicate로 연결된다. 여기에 이전 frame의 동일한 object에서 다음 frame의 object로 'tracking' edge가 temporality를 위해 추가된다.

Figure 3: The STSG expression generated by MotionEpic.

Fig.3 에서 STSG representation의 예를 볼 수 있다.

Fine-grained Video-Scene Grounding-aware Tuning

실제 inference 과정에서는 STSG가 들어오길 기대하기 어려워 STSG-free inference가 수행될 수 있어야 한다. 따라서 MotionEpic이 자동으로 STSG를 parse할 수 있도록 만든다. 이때의 objectives는 다음과 같다:

Enhancing coarse-grained correspondence:
- $\mathcal{L_1}$: predicting if the overall input video and STSG are paired.
- $\mathcal{L_2}$: generating the whole STSG of the video
Enhancing fine-graiend correspondence:
- $\mathcal{L_3}$: video and action description에 대해 object tracklet 출력
- $\mathcal{L_4}$: video and key object(s)에 대해 temporal action description 출력
- $\mathcal{L_5}$: video와 특정 frame bbox에 대해 object label 출력

video encoder는 frozen되었고, LLM은 LoRA finetuning했으며 video projector와 STSG encoder는 train되었다.

-> STSG는 정확히 어떻게 train했는지?

Video-of-Thought Reasoning Framework

Figure 4: An illustrative view of VoT framework.

VoT는 다음과 같이 subproblem들을 구성한다:

Task Definition and Target Identification
MotionEpic은 text prompt, task definition, format, raw question과 raw video를 입력받는다.

Given the question [Question], what are the possible targets of the [Video] mainly mentioned or involved?

Object Tracking
STSG를 사용하여 temporal grounding을 한다. 이때 STSG는 concise하여 MLLM의 hallucination issue를 줄이는 역할을 한다.

Provide the tracklet of involved [Target] by outputting the corresponding partial [STSG] expression.

MotionEpic은 이미 object와 STSG를 grounding할 수 있으므로, [Target Tracklet]을 yield한다.

Action Analyzing

STSG와 pixel observation, 그리고 commonsense knowledge까지 결합해서 in-depth understanding을 하도록 prompting한다.

Combining all possible related commonsense, analyze the motion behavior based on the [Target Tracklet] and the neighbor scene within [STSG]. Describing the action observations and implications.

이를 통해 target에 대한 자세한 description인 [Observation and Implication] 을 얻는다.

Question Answering via Ranking

이제 original question에 대해서 대답한다. multiple choise QA problem으로 문제를 formatting한 뒤, 각각의 answer에 대해 likelihood를 1 to 10의 값으로 평가하도록 한다.

For the question [Question], given a candidate answer [Answer], please based on the action's [Observation and Implication] combined with commonsense, score the rationality of this answer with a 1-10 scale, and also output the rationale.

여기서 얻은 option들의 rank에 따라서 optimal answer인 [Answer]를 만든다.

Answer Verification

정답이 올바른지 check하는데, 이때 이미 얻은 정답이 correct하다고 가정하고 두 aspect에 대해 평가한다:

pixel grounding information이 align되는지
commonsnse와 contradict하는지

Given the [Video], and the raw question [Question], now you need to verify the previous answer by 
1) checking the pixel grounding information if the answer [Answer] aligns with the facts presented in teh video from a prrception standpoint;
2) determining from a cognitive perspective if the commonsense implications inherent in the answer contradict any of the main [Observations] inferred in the 3-rd reasoning step.
Output the verification result with rationale.

만약 inconsistency가 발견되면 4-th step으로 돌아가 다시 answer를 select한다.

Experiments

Implementations

LLM으로 Vicuna-7B-v1.5, video encoder로 ViT-L/14와 projector로 Q-former를 사용했다. Graph Transformer encoding STSG는 768d hidden size의 6 layer architecture로 사용되었다. object neural representation은 CLIP을 사용했다.

video는 8fps로 sampling되었다.

Results

Figure 5: Visualization of qualitative example showcasing the VoT reasoning.

Discussion

References

[1] Fei, Hao, et al. "Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition." Proceedings of the International Conference on Machine Learning (ICML), 2024.

Footnotes

'DL·ML > Paper' 카테고리의 다른 글

InstructGPT / RLHF (NeurIPS 2022) (1)	2024.08.21
PPO (Policy Proximal Optimization) (0)	2024.08.20
VPD (CVPR 2024 Oral, VLM) (0)	2024.08.05
InternVideo2 (VFM) (1)	2024.07.25
ChatPose (CVPR 2024) (0)	2024.07.17

Abstract

Motivation

MotionEpic

Overview

Integrating STSG Representation

Fine-grained Video-Scene Grounding-aware Tuning

Video-of-Thought Reasoning Framework

Experiments

Implementations

Results

Discussion

References

Footnotes

'DL·ML > Paper' 카테고리의 다른 글

티스토리툴바