TemporalVQA

Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No!

Multimodal Large Language Models (MLLMs) have achieved significant advancements in tasks like Visual Question Answering (VQA) by leveraging foundational Large Language Models (LLMs). However, their abilities in specific areas such as temporal understanding

arxiv.org

Abstract

Proposes the TemporalVQA benchmark
- Temporal Order Understanding: determine the sequence of events
- Time-lapse Estimation: estimate the time lapse between two images

Motivation

The authors built TemporalVQA, which consists of two tasks.

Temporal Order Understanding is the task of identifying which of several frames comes first.

Time-lapse Estimation is the task of picking the time lapse between two images from multiple-choice options.
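Since Time-lapse Estimation is multiple-choice, scoring comes down to exact-match accuracy over the chosen options. Here is a minimal sketch of that idea. The function name and the case/whitespace normalization are my own assumptions, not the paper's evaluation code, and the sample options are only the two strings quoted later in this post.

```python
from typing import Sequence


def multiple_choice_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that match the ground-truth option.

    Hypothetical helper: matching ignores case and surrounding whitespace.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Illustrative data only, not results from the paper.
sample_predictions = ["Between 2-15 minutes", "between 1-12 hours "]
sample_answers = ["Between 1-12 hours", "Between 1-12 hours"]
sample_accuracy = multiple_choice_accuracy(sample_predictions, sample_answers)
```

With real model outputs the hard part is usually extracting the chosen option from free-form text, which this sketch does not attempt.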

TemporalVQA Benchmark Construction

Figure 1: An introductory diagram illustrating the task setup for the TemporalVQA benchmark.

Experiments

For models that cannot process multiple images, the two images were concatenated into a single image.
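To make the concatenation concrete, here is a rough sketch using plain nested lists as grayscale images. Everything in it is an assumption for illustration: the post does not say how the paper pads mismatched sizes or which layouts it uses, and a real pipeline would use an image library rather than lists.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixel values (illustrative format)


def concat_horizontal(left: Image, right: Image, pad: int = 0) -> Image:
    """Place two images side by side, padding the shorter one at the bottom."""
    height = max(len(left), len(right))
    left_w = len(left[0]) if left else 0
    right_w = len(right[0]) if right else 0
    rows = []
    for y in range(height):
        l_row = left[y] if y < len(left) else [pad] * left_w
        r_row = right[y] if y < len(right) else [pad] * right_w
        rows.append(l_row + r_row)
    return rows


def concat_vertical(top: Image, bottom: Image, pad: int = 0) -> Image:
    """Stack two images, padding the narrower one on the right."""
    top_w = len(top[0]) if top else 0
    bottom_w = len(bottom[0]) if bottom else 0
    width = max(top_w, bottom_w)

    def widen(img: Image) -> Image:
        return [row + [pad] * (width - len(row)) for row in img]

    return widen(top) + widen(bottom)


first = [[1, 2], [3, 4]]
second = [[5], [6], [7]]
side_by_side = concat_horizontal(first, second)
stacked = concat_vertical(first, second)
```

Note that the argument order decides which image ends up on the left or on top, so the same pair can be presented to the model in several ways.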

Prompts #1 and #2 were used for Temporal Order Understanding (TOU).

Prompt #3 was used for Time-lapse Estimation.

Results

In Tab. 1, P1 and P2 denote the prompt settings.

Table 2: Time-lapse estimation accuracy across different temporal scales.

Table 3: Accuracy comparison of the best performing MLLM with human performance.

Figure 2: Only instance where all participants in the human evaluation provided incorrect responses.

The correct answer for Fig. 2 is "Between 1-12 hours", but all participants chose "Between 2-15 minutes".

Table 5: Breakdown of changes in reasoning (valid or invalid) when the order of the image pairs fed to the model is swapped.
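Table 5 is about reasoning validity, which needs a human to judge. A simpler, related check is whether the answer itself stays consistent when the pair is swapped: if the model said the first image came earlier, it should say the second image came earlier after the swap. Below is a minimal sketch of that check. The "first"/"second" labels and the function name are hypothetical, not the paper's protocol.

```python
from typing import Sequence

# Hypothetical answer labels: which input image the model says happened earlier.
FLIPPED = {"first": "second", "second": "first"}


def order_swap_consistency(original: Sequence[str], swapped: Sequence[str]) -> float:
    """Fraction of pairs whose answer flips correctly when the input order is swapped."""
    if len(original) != len(swapped):
        raise ValueError("both runs must cover the same pairs")
    if not original:
        return 0.0
    consistent = sum(
        1 for before, after in zip(original, swapped)
        if FLIPPED.get(before) == after
    )
    return consistent / len(original)


# Illustrative answers only, not results from the paper.
answers_original = ["first", "first", "second", "second"]
answers_swapped = ["second", "first", "first", "second"]
consistency = order_swap_consistency(answers_original, answers_swapped)
```

A model that always answers "first" regardless of content scores 0.0 here, which is the kind of order sensitivity the paper points at.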

Discussion

* The conclusion makes the following points:

(1) high sensitivity to input order and layout, indicating a lack of robust temporal understanding

(2) inconsistent reasoning patterns, where models often provide valid reasoning but arrive at incorrect conclusions

(3) tendency to derive illogical reasoning and hallucinate when making temporal judgements

* Everyone vaguely suspected this, but it is good that the authors turned it into a benchmark and made it visible.

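* The input method is a problem to begin with. Concatenating two images is not very ideal. This looks less like a problem with the benchmark and more like a problem with the current LLM input format. In Tab. 1 as well, the last row performs best among the GPT-4o settings.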

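* When humans look at multiple images, they do not view them concatenated; they look back and forth between the two images several times. Multiple images need to be viewed temporally. I wonder what would have happened if this had been run through a video MLLM.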

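* This kind of data is inevitably zero-shot anyway, and it is realistically impossible to build a dataset for every unexplored domain like this one in order to cover them all. This is a problem that only physical AI can solve.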

* That said, even with everything in a zero-shot setting, GPT-4o performs meaningfully well. One could expect that as the parameter count grows, this gets covered emergently.
