Video Token Merging(VTM) (NeurIPS 2024, long video)


Jordano 2025. 1. 24. 15:23

Video Token Merging for Long-form Video Understanding


Abstract

A paper on token merging for long videos.
It argues that video token merging should not rely on similarity alone.
It proposes a learnable video token merging (VTM) algorithm.

→ There are already many algorithms that merge video tokens adaptively, so this paper is best read in comparison with them.

See https://jordano-jackson.tistory.com/203.

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding (https://arxiv.org/abs/2410.17434)

Motivation

Figure 1: Comparisons of GPU memory footprint and throughput against scene prediction accuracy on the LVU dataset.

Token merging schemes usually merge tokens by similarity, which carries a potential problem: tokens of different granularity can end up being processed together. As a result, the tokens carry different information densities.

The paper first examines several video token merging methods that have been used so far, and then looks for an effective one.

Methods

Preliminary -- Token Merging

Visual token merging was proposed in [1] and consists of three steps: partitioning, matching, and merging.

Partitioning uniformly samples $|\mathcal{T}|$ target tokens $\mathcal{T}$. Each of the remaining source tokens $\mathcal{S}$ is then matched to its nearest target token, measured by the cosine similarity of their keys. Finally, each target token is merged with its matched source tokens by average pooling, which yields $|\mathcal{T}|$ merged tokens.

See [1] for details.
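The three steps above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the implementation from [1]: the function names `cosine` and `merge_tokens` are hypothetical, "uniform sampling" is interpreted as a fixed stride, and tokens are plain lists of floats.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def merge_tokens(tokens, keys, num_targets):
    """Partition -> match -> merge; returns num_targets merged tokens.

    tokens, keys: lists of N vectors (lists of floats), num_targets <= N.
    """
    n = len(tokens)
    # Partition: choose target indices at a uniform stride (assumed meaning
    # of "uniform sampling"); all other indices are source tokens.
    target_idx = [(i * n) // num_targets for i in range(num_targets)]
    target_set = set(target_idx)
    # Each group starts with its own target token.
    groups = {t: [tokens[t]] for t in target_idx}
    # Match: each source token joins the target with the most similar key.
    for s in range(n):
        if s in target_set:
            continue
        best = max(target_idx, key=lambda t: cosine(keys[s], keys[t]))
        groups[best].append(tokens[s])
    # Merge: average-pool every group, one output token per target.
    return [
        [sum(column) / len(groups[t]) for column in zip(*groups[t])]
        for t in target_idx
    ]
```

For example, with four 2-D tokens where tokens 0 and 1 point in nearly the same direction and tokens 2 and 3 point in another, `merge_tokens(tokens, tokens, 2)` averages each pair into a single token.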

Problem Definition

Figure 2: The architectures of (a) the baseline network, (b) the transformer block, and (c) the video token merging block.

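Given a video with $L$ frames, each split into $H \times W$ tokens of dimension $D$, the time complexity of self-attention is $\mathcal{O}(L^2H^2W^2D)$: quadratic in the number of tokens $LHW$ and linear in $D$.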
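To get a feel for why the token count dominates, the short calculation below counts the token pairs that self-attention has to compare. The sizes are hypothetical and not taken from the paper; the per-pair cost, which depends on $D$, is left out because it cancels in the ratio.

```python
def attention_pair_count(num_tokens):
    """Number of token pairs compared by self-attention: N^2."""
    return num_tokens ** 2

# Hypothetical sizes, chosen only for illustration.
L, H, W = 60, 14, 14
num_tokens = L * H * W                      # 11760 tokens in total

full_pairs = attention_pair_count(num_tokens)
kept_tokens = num_tokens // 2               # suppose merging keeps half
reduced_pairs = attention_pair_count(kept_tokens)

speedup = full_pairs / reduced_pairs
print(num_tokens, speedup)                  # 11760 4.0
```

Halving the number of tokens cuts the attention cost by a factor of four, which is why merging tokens pays off more than the raw reduction ratio suggests.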

Video Token Merging -- Exploration

Table 1: Comparison of different VTM methods on the LVU dataset. The best results are boldfaced and the second-best ones are underlined.

The paper examines several video token merging strategies.

Naïve video token merging:
VTM blocks are inserted between the original transformer layers. With the token merging method described above, the number of tokens can be reduced gradually by 68% (see Fig. 3 (a)).
Region-concentrated video token merging:
Uniform sampling is not ideal for video tokens, so saliency is taken into account: more target tokens are drawn from the center of the frame, and merging then proceeds as above (see Fig. 3 (b)).
Motion-based video token merging:
Center-concentrated VTM heuristically assumes that most meaningful tokens lie near the center. This variant improves on that by merging tokens based on motion information vectors (see Fig. 3 (c)).
Learnable Video Token Merging:
Motion-based VTM may not be effective when there is camera movement. Learnable VTM instead estimates a saliency score and uses it to pick the target tokens.
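The first three strategies differ mainly in how target tokens are chosen during partitioning, so they can be compared side by side. The sketch below is an assumption-laden illustration, not the paper's procedure: all function names are hypothetical, the "center" is taken to be the middle half of the frame, `center_share` is an invented parameter, and motion is assumed to arrive as one `(dx, dy)` vector per token.

```python
import math

def uniform_targets(height, width, k):
    """Naive: pick k positions at a uniform stride over the H*W grid."""
    n = height * width
    return [(i * n) // k for i in range(k)]

def center_targets(height, width, k, center_share=0.75):
    """Region-concentrated: spend center_share of the k targets inside the
    central half of the frame (assumed definition of the center)."""
    def in_center(p):
        y, x = divmod(p, width)
        return height / 4 <= y < 3 * height / 4 and width / 4 <= x < 3 * width / 4

    def spread(pool, m):
        # Pick m positions at a uniform stride from the given pool.
        return [pool[(i * len(pool)) // m] for i in range(m)]

    inner = [p for p in range(height * width) if in_center(p)]
    outer = [p for p in range(height * width) if not in_center(p)]
    k_inner = min(len(inner), round(k * center_share))
    return sorted(spread(inner, k_inner) + spread(outer, k - k_inner))

def motion_targets(motion, k):
    """Motion-based: pick the k positions with the largest motion magnitude.
    motion is a list of (dx, dy) pairs, one per token (assumed input)."""
    order = sorted(range(len(motion)), key=lambda p: -math.hypot(*motion[p]))
    return sorted(order[:k])
```

On a 4x4 grid with `k=4`, `uniform_targets` returns `[0, 4, 8, 12]`, while `center_targets` places three of the four targets in the central 2x2 block. The matching and merging steps stay the same in all three cases.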

Learnable VTM consists of two forward paths (see Fig. 4). The main path computes the QKV of the tokens, performs self-attention, and estimates the saliency scores.

The saliency scores are passed through a softmax to choose the target tokens, after which the match & merge process is carried out.
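As a rough illustration of the saliency-driven partitioning, the sketch below turns per-token saliency logits into probabilities and keeps the most salient tokens as targets. Taking the top-k is an assumption made here for simplicity (the paper may sample from the distribution instead), and the names `softmax` and `select_targets` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_targets(saliency_logits, k):
    """Split token indices into k targets and the remaining sources.

    Targets are the k tokens with the highest saliency probability
    (top-k is an assumption, not necessarily the paper's rule).
    """
    probs = softmax(saliency_logits)
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    targets = sorted(order[:k])
    target_set = set(targets)
    sources = [i for i in range(len(probs)) if i not in target_set]
    return probs, targets, sources
```

The resulting `targets` and `sources` would then go through the same match & merge steps as in the other variants.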

However, this partitioning step is not differentiable, so the block also takes auxiliary tokens $X_{aux}$ as input and produces updated auxiliary tokens $X_{aux}'$ (see Eq. 10 of the paper).

Figure 4: An overview of the learnable video token merging block. The auxiliary path is used during training only.

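If the saliency values are high for tokens that matter for the prediction and low for the rest, this is reflected in the weights that produce the updated auxiliary tokens. Later layers can therefore use the auxiliary path to evaluate whether the tokens important for the prediction were given high weight.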
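One way to picture how saliency could shape the updated auxiliary tokens is attention whose weights are scaled by the saliency of each token. The form below is purely an assumption for illustration and is not Eq. 10 of the paper; the function name and the single-query setup are hypothetical.

```python
import math

def saliency_weighted_attention(query, keys, values, saliency):
    """Attention for one auxiliary query with saliency-scaled weights.

    weight_i is proportional to saliency_i * exp(q . k_i / sqrt(d)).
    This weighting is an assumed form, not the paper's Eq. 10.
    """
    d = len(query)
    logits = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    peak = max(logits)
    raw = [s * math.exp(l - peak) for s, l in zip(saliency, logits)]
    total = sum(raw)
    weights = [r / total for r in raw]
    # Weighted sum of the value vectors gives the updated auxiliary token.
    updated = [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return updated, weights
```

With two tokens that have identical keys, the attention weights simply follow the saliency values, so a token with saliency 0.9 contributes nine times as much as one with saliency 0.1. Because the output depends smoothly on the saliency values, a loss computed on it can push saliency toward useful tokens, which the hard target selection alone cannot do.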

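→ A clever approach. But couldn't the merging itself simply be changed so that gradients flow through it? Or does that fail because the network structure becomes dynamic?

On second thought, it is possible, but it would probably be far more expensive than this.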

Experiments

The datasets are LVU, Breakfast, and COIN, with videos around two minutes long. By today's standards that is not really long enough.

Table 3: Comparison on the Breakfast dataset. PT stands for pretraining.

Table 4: Comparison on the COIN dataset.

Figure 5: Visualizations of video token merging results on the LVU dataset. Patches with the same inner and border color are merged together. The tokens corresponding to the backgrounds are merged together, thereby increasing the influence of salient tokens in the attention process.

Discussion

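* Why do the experiments use two-minute clips when the paper is about long video? How does that count as long video?

* It is an obvious idea, so it is interesting that this paper only came out now.

* I suspect the whole thing could be made learnable, and that doing so would work better. Taking a longer-range view while merging would probably help too.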

References

[1] Bolya, Daniel, et al. "Token Merging: Your ViT But Faster." arXiv preprint arXiv:2210.09461 (2022).
