Posts in the 'DL·ML' category

Abstract: https://arxiv.org/pdf/2403.19046 Recent works often overlook the importance of temporal localization. The key aspects that limit temporal localization abilities are: time representation, architecture, data. Hence, a new architecture, LITA, is proposed in this paper, which is capable of: leveraging time tokens to better represent time in videos; handling SlowFast tokens to capture temporal informat..

This post covers RVOS (Referring Video Object Segmentation), one of the VOS tasks, and the datasets that address it. For a general understanding of segmentation tasks, see the post on the types of segmentation tasks. Ref-DAVIS: the paper that first defined the RVOS task. Refer-YouTube-VOS (URVOS): an ECCV 2020 paper that scaled up the dataset for the RVOS task. Dataset: 27,000+ referring expressions for 3,900 videos. Proposes an end-to-end architecture → the existing DAVIS-2017 dataset has too few samples, so end..

https://arxiv.org/abs/2410.23782 Video Token Merging for Long-form Video Understanding. As the scale of data and models for video understanding rapidly expand, handling long-form video input in transformer-based models presents a practical challenge. Rather than resorting to input sampling or token dropping, which may result in information lo.. Abstract: a paper on token merging for long videos. vi..

https://arxiv.org/abs/2501.10674 Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! Multimodal Large Language Models (MLLMs) have achieved significant advancements in tasks like Visual Question Answering (VQA) by leveraging foundational Large Language Models (LLMs). However, their abilities in specific areas such as temporal understanding.. Abstract: Temp..

https://icml.cc/virtual/2024/poster/33745 ICML Poster: NExT-Chat: An LMM for Chat, Detection and Segmentation. Abstract: The development of large language models (LLMs) has greatly advanced the field of multimodal understanding, leading to the emergence of large multimodal models (LMMs). In order to enhance visual comprehension, recent studies have equipped LMMs wi.. Abstract: the pix2seq-inspired pi..

https://arxiv.org/abs/2001.06891 Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences. In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG). Given an untrimmed video and a declarative/interrogative sentence depicting an object, STVG aims to localize the spatio-temporal tube of the queried o.. Abstract: proposes the STVG task. V..

https://arxiv.org/abs/2410.17434 LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding. Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by LLM's context size. To address this limitation, we propose L.. Abstract: LongVU (L..

Segmentation tasks come in many varieties, and what each task aims to address differs subtly depending on its name. Because the names are in English, however, their nuances are hard to grasp, so this post examines the differences between the tasks. Object Segmentation: the term "object segmentation" does not appear to be used to refer to a specific task in the image domain. For images, "object segmentation" is used to mean segmentation tasks as a whole. In the video domain, object segmentation means segmenting and tracking foreground objects (see Fig. 1). 이..

https://arxiv.org/abs/2404.08506 LaSagnA: Language-based Segmentation Assistant for Complex Queries. Recent advancements have empowered Large Language Models for Vision (vLLMs) to generate detailed perceptual outcomes, including bounding boxes and masks. Nonetheless, there are two constraints that restrict the further application of these vLLMs: the incap.. Abstract: an MLLM for image-domain ma..
