LongVU (Long Video Understanding)

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by LLM's context size. To address this limitation, we propose L

arxiv.org

Abstract

Proposes LongVU (Long Video-Language Understanding)
- an MLLM for long video understanding
- uses DINOv2 vision features to remove frames that are highly similar to each other
- selective frame feature reduction via a text-guided cross-modal query
- spatial-temporal token reduction

Motivation

Figure 1: Effectiveness of the LongVU over commonly-used uniform sampling and dense sampling.

Methods

The overall pipeline is shown in Fig. 2. Given a video, temporal reduction is first performed with DINOv2. For the remaining frames, tokens are pooled under the guidance of the text query, and spatial token reduction is then applied once more.

→ Then how is the time embedding provided?

Frame Feature Extractor and Temporal Reduction

The video is sampled at 1 fps, giving $N$ frames. For the sampled frames $I=\{I^1,\dots,I^N\}$, DINOv2 extracts features $\{V^1_{dino},\dots,V^N_{dino}\}$. For each frame, the average similarity within a window of $J=8$ neighboring frames, $sim^i=\frac{1}{J-1}\sum^J_{j=1,j\neq i}sim(V^i_{dino}, V^j_{dino})$, is computed, and frames that have high similarity to the other frames are removed. This compresses the video to $T$ frames, which is roughly half of $N$.
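The window-wise pruning above can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the windows are taken to be non-overlapping, `threshold` is a hypothetical cutoff (the notes above do not give the exact removal criterion), and the fallback that keeps one frame per window is also an assumption. Plain lists of floats stand in for the DINOv2 features.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def temporal_reduction(features, window=8, threshold=0.9):
    """Return the indices of frames kept after window-wise pruning.

    features:  one feature vector per frame (stand-in for DINOv2 features)
    window:    J, frames per non-overlapping window (assumption)
    threshold: hypothetical cutoff on the average similarity sim^i
    """
    kept = []
    for start in range(0, len(features), window):
        idx = list(range(start, min(start + window, len(features))))
        if len(idx) == 1:
            kept.extend(idx)  # a lone trailing frame has nothing to compare to
            continue
        survivors = []
        for i in idx:
            # average similarity of frame i to the other frames in its window
            avg = sum(cosine(features[i], features[j])
                      for j in idx if j != i) / (len(idx) - 1)
            if avg <= threshold:
                survivors.append(i)
        # assumption: never drop a whole window, keep its first frame instead
        kept.extend(survivors or idx[:1])
    return kept
```

With a threshold this simple, a window of near-identical frames collapses to a single representative, while a frame that differs from its neighbors survives.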

After that, the two kinds of visual features, from DINOv2 and SigLIP, are aggregated with the Spatial Vision Aggregator to obtain $V=\{V^1,\dots,V^T\}$.

Selective Feature Reduction via Cross-modal Query

Given the vision features $V∈ℝ^{T×(H_h×W_h)×D_v}$, a selective compression strategy is applied if the number of feature tokens exceeds the context length.

The LLM embedding of the text query, $Q∈ℝ^{L_q×D_q}$, is used to reduce $H_h×W_h$ to $H_l×W_l$. Only $N_h$ frames are kept at the original resolution, and spatial pooling is applied to the rest. The number of frames kept at the original resolution is as follows:

Here $\mathcal{F}$ is an MLP-based multimodal adapter. The right-hand term computes how many tokens can be kept at the original resolution, and the frames are then sampled in the manner of the left-hand term. The left-hand term can be understood as picking out the frames that attend strongly between the visual features and the text query features.
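The two roles described above (a token budget that bounds $N_h$, and an attention-style score that decides which frames keep full resolution) can be sketched as follows. This is a hedged reading, not the paper's exact equation: `max_len` (the context length) is a symbol introduced here, the score is assumed to be a mean dot product between visual tokens already projected by $\mathcal{F}$ and the query embeddings, and all function names are hypothetical.

```python
def num_full_res_frames(T, hw_high, hw_low, len_query, max_len):
    """Largest N_h satisfying the assumed budget
    N_h * hw_high + (T - N_h) * hw_low + len_query <= max_len,
    clamped to the range [0, T]."""
    if hw_high <= hw_low:
        raise ValueError("hw_high must exceed hw_low")
    spare = max_len - len_query - T * hw_low  # tokens left after pooling every frame
    return max(0, min(T, spare // (hw_high - hw_low)))


def frame_score(frame_tokens, query_tokens):
    """Assumed relevance score: mean dot product between every visual token
    (already mapped into the LLM space) and every query token."""
    total = sum(sum(v * q for v, q in zip(vt, qt))
                for vt in frame_tokens for qt in query_tokens)
    return total / (len(frame_tokens) * len(query_tokens))


def select_full_res(scores, n_high):
    """Indices, in temporal order, of the n_high highest-scoring frames."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_high])
```

For example, with `T=10` frames, 144 tokens per full-resolution frame, 64 per pooled frame, a 20-token query and a 1000-token budget (all illustrative numbers), four frames can stay at full resolution: 4·144 + 6·64 + 20 = 980 fits, while a fifth frame would raise the total to 1060.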

Spatial Token Compression

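If the number of tokens is still a problem after the reduction to low resolution, an additional compression step is applied. Spatial token compression (STC) is performed with a window of size $K<T$. The first frame of each window is kept at full token resolution, and the cosine similarity between it and the remaining frames in the window is computed element-wise. The rule is simple: a token is pruned when its cosine similarity $sim(\cdot,\cdot)$ exceeds the threshold $θ$ (see Eq. 2).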
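A minimal sketch of this pruning rule, assuming non-overlapping windows and that "element-wise" means comparing each token with the token at the same spatial position in the window's first frame. The function name and the default values of `window` and `theta` are illustrative, not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def spatial_token_compression(frames, window=8, theta=0.8):
    """Prune tokens that repeat the first frame of their window.

    frames: list of frames, each a list of token vectors with the same
            number of tokens per frame
    Returns one list per frame holding the surviving (position, token) pairs.
    """
    out = []
    anchor = None
    for t, frame in enumerate(frames):
        if t % window == 0:
            anchor = frame  # first frame of a window keeps every token
            out.append(list(enumerate(frame)))
        else:
            # drop a token when it is too similar to the anchor token
            # at the same spatial position
            out.append([(p, tok) for p, tok in enumerate(frame)
                        if cosine(tok, anchor[p]) <= theta])
    return out
```

Positions are returned alongside the tokens because the number of surviving tokens differs from frame to frame, so the spatial location of each token is no longer implied by its index.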

Experiments

Table 1: Results on comprehensive video understanding benchmarks.

Figure 3: Examples of various video understanding capabilities of LongVU model.

Results

Discussion

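* An interesting method that also performs well

* Because there are several compression steps, the number of output tokens was probably not consistent → this may have had a negative effect on learning

* However, since there is no way to embed timestamps, the model would likely struggle to perceive how much time actually separates frames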
