Posts in the 'DL·ML' category (Page 3)

Abstract. The weakly-supervised spatio-temporal video grounding (WS-STVG) task resembles the standard STVG task, but is carried out without densely annotated training data. The paper proposes VTP (Video-Text Prompting) to generate candidate features: visual markers such as red circles are added to the video as video prompts in order to build tubes. For cases where candidate features look alike, it proposes contrastive VTP (CVTP). Motivation. As for weakly supervised STVG, heavily annotated data..

STVG Task. The STVG (spatio-temporal video grounding) task grounds, within a video, the spatio-temporal region that matches a text query. The resulting sequence of bounding boxes is called a spatio-temporal tube. The text query can be declarative or interrogative, as in Fig. 1. Earlier methods, as in Fig. 2 (a), looked at the video and the query and retrieved the object. In general, however, a textual query alone is not sufficient to retrieve the object. One could therefore make the text query longer..
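A tube, as described above, is just a sequence of bounding boxes indexed by frame. The sketch below is an illustration, not code from the post: it stores a tube as a dict from frame index to an (x1, y1, x2, y2) box and scores a predicted tube against a ground-truth tube with a vIoU-style measure (per-frame IoU summed over shared frames, divided by the number of frames in either tube). The metric choice and all names here are assumptions.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tube_viou(pred, gt):
    """vIoU-style score between two tubes (dict: frame index -> box).

    Frames present in only one tube contribute zero overlap, so temporal
    misalignment is penalized along with spatial error.
    """
    shared = pred.keys() & gt.keys()
    either = pred.keys() | gt.keys()
    if not either:
        return 0.0
    return sum(box_iou(pred[t], gt[t]) for t in shared) / len(either)


# Hypothetical tubes: ground truth covers frames 0-2, prediction frames 1-3.
gt_tube = {0: (0, 0, 2, 2), 1: (0, 0, 2, 2), 2: (0, 0, 2, 2)}
pred_tube = {1: (0, 0, 2, 2), 2: (1, 0, 3, 2), 3: (1, 0, 3, 2)}
score = tube_viou(pred_tube, gt_tube)
```

In this toy case the two tubes share frames 1 and 2 out of four frames overall, so even a perfect box on frame 1 cannot push the score close to 1.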

Motivation. When a multimodal LLM is instruction-tuned on a broad range of tasks, using LoRA is known to cause performance degradation due to task interference. This paper (1) confirms this and (2) proposes Conditional MixLoRA (Mixture-of-LoRA) as a solution. As Fig. 1 shows, conventional LoRA uses a single shared weight matrix, whereas Conditional MixLoRA keeps two matrices and dynamically selects between them according to the input instance, so that the task interference issue is mi..
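The contrast between one shared LoRA update and an input-conditioned choice can be sketched in a few lines. This is a deliberately simplified illustration of the idea as summarized above, not the paper's actual formulation: the argmax router, the tiny matrices, and every name below are assumptions.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def lora_delta(down, up, x):
    """Low-rank update up @ (down @ x); down is r x d_in, up is d_out x r."""
    return matvec(up, matvec(down, x))


def select_adapter(router, x):
    """Pick the adapter index with the highest router score for this input."""
    scores = matvec(router, x)
    return max(range(len(scores)), key=scores.__getitem__)


def conditional_forward(weight, adapters, router, x):
    """Frozen base output plus the update from the adapter chosen for x."""
    idx = select_adapter(router, x)
    down, up = adapters[idx]
    base = matvec(weight, x)
    delta = lora_delta(down, up, x)
    return [b + d for b, d in zip(base, delta)], idx


# Hypothetical toy setup: 2-d input and output, rank-1 adapters.
W = [[1, 0], [0, 1]]                      # frozen base weight
ADAPTERS = [
    ([[1, 0]], [[1], [0]]),               # adapter 0: boosts the first coordinate
    ([[0, 1]], [[0], [1]]),               # adapter 1: boosts the second coordinate
]
ROUTER = [[1, 0], [0, 1]]                 # scores each adapter from the input

out_a, idx_a = conditional_forward(W, ADAPTERS, ROUTER, [2, 1])
out_b, idx_b = conditional_forward(W, ADAPTERS, ROUTER, [1, 3])
```

Here the two inputs are routed to different adapters, so each receives a different low-rank update. With a single shared LoRA pair, both inputs would have to share one update.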

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization. Motivation. 1. Using an image encoder for video encoding is ill-suited to capturing a video's spatiotemporal features, temporal motion in particular. 2. Capturing both with 3D features is inefficient in terms of memory and token length because of the redundancy inherent in video. → If the video is encoded using a single key frame and optical flow (motion vectors), mot..

Motivation. Vision-language models have been applied to the Human-Object Interaction task before (PhraseHOI). That approach, however, has the following limitations: (1) Limited scalability: over-reliance on annotated data limits the categories. (2) Suboptimal adaptability in zero-shot settings: HOI-VLM approaches use only a small number of word-embedding categories, which limits their adaptability. (3) It is hard to extract behaviors from task descriptions. UniHOI uses an LLM instead of a VL model to address the limitations above..

Motivation. Methods. The "reasoner module" is described in grand terms, but it is simply YOLO with a transformer variant attached. Experiments. Naturally, with a transformer attached at the end, performance goes up and fps goes down. Figs. 3, 4, and 5 look cherry-picked. Discussion. But why not use ViT? → A senior labmate advised that ViT's strength is scalability, and that it is not a good fit for OD settings where the model is heavy and data is scarce. References [1] M. M. Gündoğan, T. Aksoy, A. Temizel and U. Halici, "IR Reasoner: Real-time Infrared Object Detect..

Datasets. TNO Image Fusion Dataset: 261 pairs of images, few objects. INO Videos Analytics Dataset & OTCBVS OSU Color-Thermal Database: few pedestrians. CVC-14: images are already clear, so the IR images are not needed. KAIST Multispectral Dataset: FLIR Thermal Dataset. LLVIP Dataset. Image capture: captured with a HIKVISION DS-2TD8166BJZFY-75H2F/V2 camera - 15,488 pairs of visible-IR images from 26 different loca..

Preliminaries. Reinforcement Learning Basics. PPO. Motivation. A model trained with next-token prediction may not be aligned with user intention, so a different fine-tuning method is proposed; here, RLHF is used. Specifically, a reward model (RM) is trained on a human-labeled comparison dataset. The RM is then used as the reward function, and the supervised model is fine-tuned with PPO to maximize the reward. As Fig. 2 shows, the method consists of three steps: (1) supervised fine-tuning (SFT) (2) reward model (..
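Training a reward model on comparison data can be made concrete with a pairwise loss. The sketch below assumes the commonly used form -log sigmoid(r_chosen - r_rejected), where each r is the scalar reward the RM assigns to a response; the excerpt itself does not spell out the loss, and the function names are illustrative.

```python
import math


def pairwise_rm_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    diff = r_chosen - r_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    # For negative diff, rewrite log(1 + e^{-diff}) to avoid overflow in exp.
    return -diff + math.log1p(math.exp(diff))


def mean_rm_loss(pairs):
    """Average loss over (chosen_reward, rejected_reward) pairs."""
    return sum(pairwise_rm_loss(c, r) for c, r in pairs) / len(pairs)
```

The loss is small when the RM scores the human-preferred response well above the rejected one, and grows roughly linearly when the ordering is reversed, which is what pushes the RM toward the labelers' rankings.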
