LITA (ECCV 2024)
DL·ML/Paper
https://arxiv.org/pdf/2403.19046
Abstract: Recent works often overlook the importance of temporal localization. The key aspects that limit temporal localization abilities are:
- time representation
- architecture
- data
Hence, this paper proposes a new architecture, LITA, which is capable of (see the time-token sketch below):
- leveraging time tokens to better represent time in videos
- handling SlowFast tokens to capture temporal information …
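The preview cuts off before explaining what a time token is; as a hedged illustration, here is a minimal sketch assuming the video is split into T equal chunks and timestamps are expressed as discrete tokens <1>…<T>. The function names and exact token format are my assumptions, not taken from the paper.

```python
# Minimal sketch of LITA-style time tokens: a continuous timestamp is mapped
# to one of T discrete chunk tokens, so the LLM can express temporal
# localization as ordinary next-token prediction. Token format is assumed.

def timestamp_to_time_token(t_sec: float, duration_sec: float, T: int = 100) -> str:
    """Map an absolute timestamp to a relative time token <1>..<T>."""
    t_sec = min(max(t_sec, 0.0), duration_sec)
    chunk = min(int(t_sec / duration_sec * T) + 1, T)  # 1-indexed chunk id
    return f"<{chunk}>"

def time_token_to_timestamp(token: str, duration_sec: float, T: int = 100) -> float:
    """Invert the mapping: return the chunk's center time in seconds."""
    chunk = int(token.strip("<>"))
    return (chunk - 0.5) / T * duration_sec

print(timestamp_to_time_token(42.0, 120.0))    # -> "<36>"
print(time_token_to_timestamp("<36>", 120.0))  # -> 42.6
```

Using relative chunk indices keeps the time vocabulary a fixed size regardless of video length.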
Video Token Merging (VTM) (NeurIPS 2024, long video)
DL·ML/Paper
https://arxiv.org/abs/2410.23782
Video Token Merging for Long-form Video Understanding: "As the scale of data and models for video understanding rapidly expands, handling long-form video input in transformer-based models presents a practical challenge. Rather than resorting to input sampling or token dropping, which may result in information loss …"
Abstract: A paper on token merging for long-form video; a generic merging sketch follows. …
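The preview stops before describing the merging itself, so below is a hedged greedy sketch of the generic token-merging idea (average the most similar token pairs), in the spirit of methods like ToMe rather than VTM's exact algorithm. The names and the cosine-similarity measure are assumptions.

```python
import torch

def merge_most_similar(tokens: torch.Tensor, r: int) -> torch.Tensor:
    """Greedy token-merging sketch: average the r most similar token pairs.
    tokens: (N, D) -> (N - r, D). Similarities are not refreshed after a
    merge, to keep the sketch short; real methods are more careful."""
    x = torch.nn.functional.normalize(tokens, dim=-1)
    sim = x @ x.t()
    sim.fill_diagonal_(float("-inf"))
    merged = tokens.clone()
    alive = torch.ones(len(tokens), dtype=torch.bool)
    for _ in range(r):
        i, j = divmod(sim.argmax().item(), sim.size(1))
        merged[i] = (merged[i] + merged[j]) / 2   # fold token j into token i
        alive[j] = False
        sim[j, :] = float("-inf")
        sim[:, j] = float("-inf")
    return merged[alive]

video_tokens = torch.randn(2048, 768)                  # long-video patch tokens
print(merge_most_similar(video_tokens, r=512).shape)   # torch.Size([1536, 768])
```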
TemporalVQA
DL·ML/Paper
https://arxiv.org/abs/2501.10674
Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No!: "Multimodal Large Language Models (MLLMs) have achieved significant advancements in tasks like Visual Question Answering (VQA) by leveraging foundational Large Language Models (LLMs). However, their abilities in specific areas such as temporal understanding …"
Abstract: Temp…
NExT-Chat (ICML 2024, MLLM for OD and Seg)
DL·ML/Paper
https://icml.cc/virtual/2024/poster/33745
NExT-Chat: An LMM for Chat, Detection and Segmentation: "The development of large language models (LLMs) has greatly advanced the field of multimodal understanding, leading to the emergence of large multimodal models (LMMs). In order to enhance visual comprehension, recent studies have equipped LMMs with …"
Abstract: Inspired by pix2seq, pi…
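The preview only says the method is pix2seq-inspired; for reference, here is a minimal sketch of pix2seq's box-as-tokens encoding, the baseline idea the post names. The bin count and helper names are illustrative assumptions.

```python
# pix2seq-style detection-as-language: box coordinates are quantized into
# discrete bins so a language model can emit boxes as plain tokens.

def box_to_tokens(box, img_w, img_h, n_bins=1000):
    """(x1, y1, x2, y2) in pixels -> four discrete bin ids."""
    x1, y1, x2, y2 = box
    norm = (x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h)
    return [min(int(v * n_bins), n_bins - 1) for v in norm]

def tokens_to_box(tokens, img_w, img_h, n_bins=1000):
    """Invert the quantization using each bin's center."""
    x1, y1, x2, y2 = ((t + 0.5) / n_bins for t in tokens)
    return (x1 * img_w, y1 * img_h, x2 * img_w, y2 * img_h)

print(box_to_tokens((64, 32, 512, 480), 640, 480))  # [100, 66, 800, 999]
```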
STVG (VidSTG, CVPR 2020)
DL·ML/Paper
https://arxiv.org/abs/2001.06891
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences: "In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG). Given an untrimmed video and a declarative/interrogative sentence depicting an object, STVG aims to localize the spatio-temporal tube of the queried object …"
Abstract: Proposes the STVG task. V…
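Since the task output is a "spatio-temporal tube", a sketch helps pin the structure down: one box per frame over a temporal segment, commonly scored with vIoU (per-frame IoU summed over shared frames, normalized by the union of the two segments). The code below is my own simplified rendering, not the benchmark's evaluation script.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class SpatioTemporalTube:
    """A grounding result: one box per frame over a temporal segment."""
    boxes: Dict[int, Box]  # frame index -> box

def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def v_iou(pred: SpatioTemporalTube, gt: SpatioTemporalTube) -> float:
    """vIoU: per-frame IoUs over shared frames, normalized by the union
    of the predicted and ground-truth temporal segments."""
    union = set(pred.boxes) | set(gt.boxes)
    shared = set(pred.boxes) & set(gt.boxes)
    return sum(iou(pred.boxes[f], gt.boxes[f]) for f in shared) / max(len(union), 1)
```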
LongVU (Long Video Understanding)
DL·ML/Paper
https://arxiv.org/abs/2410.17434
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding: "Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by LLM's context size. To address this limitation, we propose LongVU …"
Abstract: LongVU (L…
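LongVU's adaptive compression pipeline is richer than this preview shows; as a hedged stand-in, the sketch below captures only the generic first step of dropping temporally redundant frames whose features barely change. The encoder, threshold, and function name are assumptions, not LongVU's actual interface.

```python
import torch

def prune_redundant_frames(frame_feats: torch.Tensor, sim_thresh: float = 0.95):
    """Keep a frame only if it differs enough from the last kept frame.
    frame_feats: (num_frames, D), one global feature per frame (e.g., from a
    self-supervised image encoder). Returns indices of kept frames."""
    feats = torch.nn.functional.normalize(frame_feats, dim=-1)
    kept = [0]
    for i in range(1, feats.size(0)):
        if torch.dot(feats[i], feats[kept[-1]]) < sim_thresh:
            kept.append(i)
    return kept

frame_feats = torch.randn(600, 1024)   # e.g., 10 minutes sampled at 1 fps
print(len(prune_redundant_frames(frame_feats)))
```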
LaSagnA (Segmentation)
DL·ML/Paper
https://arxiv.org/abs/2404.08506
LaSagnA: Language-based Segmentation Assistant for Complex Queries: "Recent advancements have empowered Large Language Models for Vision (vLLMs) to generate detailed perceptual outcomes, including bounding boxes and masks. Nonetheless, there are two constraints that restrict the further application of these vLLMs: the incapability …"
Abstract: Applies an MLLM to image-domain ma…
VideoRefer Suite
DL·ML/Paper
https://arxiv.org/abs/2501.00599v1
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM: "Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details. Besides, th…"
Uploaded to arXiv on 241231 …
PSALM (ECCV 2024, Image Segmentation)
DL·ML/Paper
https://arxiv.org/abs/2403.14598
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model: "PSALM is a powerful extension of the Large Multi-modal Model (LMM) to address the segmentation task challenges. To overcome the limitation of the LMM being limited to textual output, PSALM incorporates a mask decoder and a well-designed input schema to handle …"
Abstract: PSALM (Pixelwise SegmentAtion with Large Multi-Modal Model) …
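The preview mentions a mask decoder bolted onto the LMM; as a hedged sketch of how decoders in the Mask2Former family (which PSALM's design reportedly resembles) typically turn queries into masks, each mask token is dotted with per-pixel features to produce a logit map. Shapes and names below are assumptions, not PSALM's actual interface.

```python
import torch

def predict_masks(mask_tokens: torch.Tensor, pixel_feats: torch.Tensor) -> torch.Tensor:
    """Dot each mask token with per-pixel features to get mask logits.
    mask_tokens: (Q, D), pixel_feats: (D, H, W) -> (Q, H, W)."""
    return torch.einsum("qd,dhw->qhw", mask_tokens, pixel_feats)

masks = predict_masks(torch.randn(16, 256), torch.randn(256, 64, 64))
print(masks.shape)  # torch.Size([16, 64, 64])
```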