DL·ML
RVOS Datasets (In Progress)
This post covers RVOS (Referring Video Object Segmentation), one of the VOS tasks, and the datasets built for it. For a general overview of segmentation tasks, see "Types of segmentation tasks". Ref-DAVIS: the paper that first defined the RVOS task. Refer-YouTube-VOS (URVOS): an ECCV 2020 paper that scaled up the dataset available for the RVOS task. Dataset: 27,000+ referring expressions for 3,900 videos. Proposes an end-to-end architecture → the existing DAVIS-2017 dataset has too few samples, so en..
Video Token Merging(VTM) (NeurIPS 2024, long video)
https://arxiv.org/abs/2410.23782 Video Token Merging for Long-form Video Understanding. As the scale of data and models for video understanding rapidly expand, handling long-form video input in transformer-based models presents a practical challenge. Rather than resorting to input sampling or token dropping, which may result in information lo.. Abstract: A paper on token merging for long videos. vi..
TemporalVQA
https://arxiv.org/abs/2501.10674 Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! Multimodal Large Language Models (MLLMs) have achieved significant advancements in tasks like Visual Question Answering (VQA) by leveraging foundational Large Language Models (LLMs). However, their abilities in specific areas such as temporal understanding.. Abstract: Temp..
NExT-Chat (ICML 2024, MLLM for OD and Seg)
https://icml.cc/virtual/2024/poster/33745 ICML Poster: NExT-Chat: An LMM for Chat, Detection and Segmentation. Abstract: The development of large language models (LLMs) has greatly advanced the field of multimodal understanding, leading to the emergence of large multimodal models (LMMs). In order to enhance visual comprehension, recent studies have equipped LMMs wi.. Abstract: A pix2seq-inspired pi..
STVG (VidSTG, CVPR 2020)
https://arxiv.org/abs/2001.06891 Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences. In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG). Given an untrimmed video and a declarative/interrogative sentence depicting an object, STVG aims to localize the spatio-temporal tube of the queried o.. Abstract: Proposes the STVG task. V..
LongVU (Long Video Understanding)
https://arxiv.org/abs/2410.17434 LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding. Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by LLM's context size. To address this limitation, we propose L.. Abstract: LongVU (L..