Posts in the 'DL·ML/Paper' category

Abstract https://arxiv.org/pdf/2403.19046
Recent works often overlook the importance of temporal localization. The key aspects that limit temporal localization abilities are: time representation, architecture, and data. Hence, a new architecture, LITA, is proposed in this paper, which is capable of: leveraging time tokens to better represent time in videos; handling SlowFast tokens to capture temporal informat..

https://arxiv.org/abs/2410.23782
Video Token Merging for Long-form Video Understanding
As the scale of data and models for video understanding rapidly expand, handling long-form video input in transformer-based models presents a practical challenge. Rather than resorting to input sampling or token dropping, which may result in information lo..
Abstract: A paper on token merging for long videos. vi..

https://arxiv.org/abs/2501.10674
Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No!
Multimodal Large Language Models (MLLMs) have achieved significant advancements in tasks like Visual Question Answering (VQA) by leveraging foundational Large Language Models (LLMs). However, their abilities in specific areas such as temporal understanding..
Abstract: Temp..

https://icml.cc/virtual/2024/poster/33745
ICML Poster: NExT-Chat: An LMM for Chat, Detection and Segmentation
Abstract: The development of large language models (LLMs) has greatly advanced the field of multimodal understanding, leading to the emergence of large multimodal models (LMMs). In order to enhance visual comprehension, recent studies have equipped LMMs wi..
Abstract: The pix2seq-inspired pi..

https://arxiv.org/abs/2001.06891
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences
In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG). Given an untrimmed video and a declarative/interrogative sentence depicting an object, STVG aims to localize the spatio-temporal tube of the queried o..
Abstract: Proposes the STVG task. V..

https://arxiv.org/abs/2410.17434
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by LLM's context size. To address this limitation, we propose L..
Abstract: LongVU (L..

https://arxiv.org/abs/2404.08506
LaSagnA: Language-based Segmentation Assistant for Complex Queries
Recent advancements have empowered Large Language Models for Vision (vLLMs) to generate detailed perceptual outcomes, including bounding boxes and masks. Nonetheless, there are two constraints that restrict the further application of these vLLMs: the incap..
Abstract: MLLM for the image domain's ma..

https://arxiv.org/abs/2501.00599v1
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details. Besides, th..
On arXiv, on 241231..

https://arxiv.org/abs/2403.14598
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model
PSALM is a powerful extension of the Large Multi-modal Model (LMM) to address the segmentation task challenges. To overcome the limitation of the LMM being limited to textual output, PSALM incorporates a mask decoder and a well-designed input schema to han..
Abstract: PSALM (Pixelwise SegmentAtion wi..
