Posts in the 'All Categories' category (Page 2)

https://arxiv.org/abs/2410.17434 LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding. Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by LLM's context size. To address this limitation, we propose L.. arxiv.org Abstract LongVU (L..

Segmentation tasks come in many varieties, and what each task sets out to handle differs subtly depending on its name. Because the names are in English, however, the nuances are hard to grasp, so I want to go over how the tasks differ. Object Segmentation: The term "object segmentation" does not appear to be used in the image domain to refer to one specific task. For images, "object segmentation" is used to mean segmentation tasks as a whole. In the video domain, object segmentation means segmenting and tracking foreground objects (see Fig. 1). ..

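A gyopyo (校標, school emblem), occasionally also called a gyojang (校章), is an insignia that symbolizes a school. Among the various symbols that represent a school, it is the one not bound by physical constraints, which lets it serve as a rallying point for members of the school. At universities, which have many members and often span several campuses, the emblem plays an even more important role. From the viewpoint that universities today function as the home base for scholars' work, the emblem can also be thought of as the face of the country's scholars. An emblem should therefore not function merely as a design element that is creative or packed with assorted meanings; it should be able to express the mission of the university as a hall of learning that looks back on the nation's past and lights the way for the world's future. The emblems of Western universities were in fact made from that perspective; to examine this ..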

https://arxiv.org/abs/2404.08506 LaSagnA: Language-based Segmentation Assistant for Complex Queries. Recent advancements have empowered Large Language Models for Vision (vLLMs) to generate detailed perceptual outcomes, including bounding boxes and masks. Nonetheless, there are two constraints that restrict the further application of these vLLMs: the incap.. arxiv.org Abstract: MLLMs for the image domain's ma..

https://arxiv.org/abs/2501.00599v1 VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM. Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details. Besides, th.. arxiv.org On arXiv on 241231..

https://arxiv.org/abs/2403.14598 PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model. PSALM is a powerful extension of the Large Multi-modal Model (LMM) to address the segmentation task challenges. To overcome the limitation of the LMM being limited to textual output, PSALM incorporates a mask decoder and a well-designed input schema to han.. arxiv.org Abstract PSALM (Pixelwise SegmentAtion wi..

Abstract https://arxiv.org/abs/2412.14006 The referring and reasoning tasks of the image and video domains are unified into a single Instructed Visual Segmentation (IVS) task. The paper proposes the InstructSeg model to solve it. It introduces vision-guided multi-granularity text fusion to integrate global and detailed text information with fine-grained visual guidance. Github repository: https://github.com/congvvc/InstructSeg Motivation: Several similar tasks (RES, ..

GIoU (Generalized Intersection over Union) [1]: IoU returns 0 whenever there is no overlapping region at all, regardless of how close the prediction is to the GT. A prediction that is actually somewhat closer to the GT can therefore still always receive 0. This acts as a plateau in the model's optimization process and makes optimization infeasible. Fig. 1 shows GIoU, IoU, and the norm. Even for the same representation, the three metrics differ greatly. The idea of GIoU is simple: the smallest con.. that encloses two convex shapes A and B ..
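To make the plateau concrete, here is a minimal sketch in Python. It is not taken from the post, whose excerpt is cut off before the formula. It assumes axis-aligned boxes given as (x1, y1, x2, y2) with positive area, and it uses the commonly stated definition GIoU = IoU - |C \ (A ∪ B)| / |C|, where C is the smallest axis-aligned box enclosing A and B. The function names `box_area` and `iou_and_giou` are hypothetical.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; 0 if the box is empty or inverted."""
    width = max(0.0, box[2] - box[0])
    height = max(0.0, box[3] - box[1])
    return width * height


def iou_and_giou(a: Box, b: Box) -> Tuple[float, float]:
    """Return (IoU, GIoU) for two axis-aligned boxes with positive area."""
    # Intersection rectangle; its area is 0 when the boxes do not overlap.
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter

    # Smallest axis-aligned box C that encloses both A and B.
    enclosing = box_area((min(a[0], b[0]), min(a[1], b[1]),
                          max(a[2], b[2]), max(a[3], b[3])))

    iou = inter / union
    # Penalize the part of C that is covered by neither A nor B.
    giou = iou - (enclosing - union) / enclosing
    return iou, giou


if __name__ == "__main__":
    gt = (0.0, 0.0, 1.0, 1.0)
    near = (2.0, 0.0, 3.0, 1.0)  # disjoint from gt, but close
    far = (5.0, 0.0, 6.0, 1.0)   # disjoint from gt, farther away
    print(iou_and_giou(gt, near))
    print(iou_and_giou(gt, far))
```

Both disjoint predictions get an IoU of 0, while GIoU is less negative for the nearer box, so it still separates the two cases where IoU is flat.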

Abstract: HyperSeg is a VLM-based universal segmentation model that works in both image and video scenarios. HyperSeg uses a hybrid entity recognition module and a fine-grained visual perceiver module. Motivation: Existing MLLM-based segmentation methods are limited in that they work only within a restricted domain. HyperSeg handles tasks that use both text prompts and visual prompts (box, mask, etc.). It also uses three approaches to solve problems across multiple visual domains: 1. existing enc..
