STVG (VidSTG, CVPR 2020)

https://arxiv.org/abs/2001.06891

Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences

In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG). Given an untrimmed video and a declarative/interrogative sentence depicting an object, STVG aims to localize the spatio-temporal tube of the queried o


Abstract

Proposes the STVG task
Introduces the VidSTG dataset
Proposes the STGRN architecture

Motivation

(1) Localize the spatio-temporal object tube in an untrimmed video. The target object is described by the specific action it is performing.

(2) The object is referred to not only by declarative sentences but also by interrogative ones.
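To make the task's input and output concrete, here is a minimal sketch of how a query and a spatio-temporal tube could be represented. The class names (`Query`, `Tube`), the `(x1, y1, x2, y2)` box format, and the example sentences are all assumptions for illustration. They are not taken from the paper or from the VidSTG annotation format.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Query:
    """A sentence that refers to one object in the video (hypothetical type)."""
    text: str
    form: str  # "declarative" or "interrogative"


@dataclass
class Tube:
    """A spatio-temporal tube: one box per frame over a contiguous segment."""
    boxes: Dict[int, Box]  # frame index -> box

    @property
    def start(self) -> int:
        return min(self.boxes)

    @property
    def end(self) -> int:
        return max(self.boxes)

    def length(self) -> int:
        # Number of frames in the temporal segment, both ends inclusive.
        return self.end - self.start + 1


# Made-up examples of the two sentence forms.
declarative = Query("A child kicks a ball on the grass.", "declarative")
interrogative = Query("What does the child kick on the grass?", "interrogative")

# Made-up tube covering frames 12 to 14 of an untrimmed video.
tube = Tube({
    12: (40.0, 60.0, 120.0, 200.0),
    13: (42.0, 61.0, 122.0, 201.0),
    14: (45.0, 63.0, 125.0, 203.0),
})
print(tube.start, tube.end, tube.length())
```

In both forms the expected output is the same kind of object: a temporal segment plus a box for each frame inside it.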

Methods

Figure 2: The overall architecture of STGRN.

In my view the method is outdated, so studying it in detail is not important.

The vision branch obtains object regions with R-CNN and then builds a spatio-temporal graph over them. This graph is fused cross-modally with the query embedding to produce a query-aware graph, and localization then proceeds in temporal → spatial order.
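The temporal → spatial ordering can be sketched with a toy example. This is not STGRN itself: the per-frame and per-region scores below are made-up numbers standing in for what a learned cross-modal model would produce, and the thresholding and argmax rules, along with the function names `localize_temporal` and `localize_spatial`, are assumptions chosen only to show the order of the two steps.

```python
from typing import Dict, List, Tuple


def localize_temporal(frame_scores: List[float], threshold: float = 0.5) -> Tuple[int, int]:
    """Return the longest contiguous run of frames scoring >= threshold.

    Indices are inclusive. The threshold rule is an illustrative assumption.
    """
    best, best_len = (-1, -1), 0
    start = None
    # A sentinel below any threshold closes a run that reaches the last frame.
    for i, score in enumerate(frame_scores + [float("-inf")]):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best_len:
                best, best_len = (start, i - 1), i - start
            start = None
    if best_len == 0:
        raise ValueError("no frame reaches the threshold")
    return best


def localize_spatial(region_scores: List[List[float]], segment: Tuple[int, int]) -> Dict[int, int]:
    """Within the chosen segment, pick the highest-scoring region in each frame."""
    start, end = segment
    return {
        t: max(range(len(region_scores[t])), key=region_scores[t].__getitem__)
        for t in range(start, end + 1)
    }


# Made-up query-relevance scores: one per frame, and one per region per frame.
frame_scores = [0.1, 0.7, 0.9, 0.8, 0.2, 0.6]
region_scores = [
    [0.2, 0.1],
    [0.3, 0.9],
    [0.8, 0.4],
    [0.1, 0.6],
    [0.5, 0.5],
    [0.7, 0.2],
]

segment = localize_temporal(frame_scores)            # temporal step first
tube = localize_spatial(region_scores, segment)      # then spatial step
print(segment, tube)
```

Running the temporal step first restricts the spatial step to frames inside the predicted segment, so regions in irrelevant frames are never selected.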

Dataset

VidSTG is built on top of the VidOR dataset.

Results

Discussion
