Posts in the 'DL·ML/Paper' category (Page 3)

Motivation: Vision-language models have been applied to the Human-Object Interaction (HOI) task before (PhraseHOI), but that approach has the following limitations. Limited scalability: it relies too heavily on annotated data, which limits the categories it can cover. Suboptimal adaptability in zero-shot settings: HOI-VLM approaches use only a small set of word-embedding categories, which limits their adaptability. It is also hard to extract behaviors from task descriptions. UniHOI addresses these limitations by using an LLM instead of a VL model..

This post is protected.

Motivation. Methods: It is grandly called a "reasoner module," but it is really just YOLO with a transformer variant attached. Experiments: Since a transformer is bolted onto the back, performance naturally goes up and FPS goes down. Figs. 3, 4, and 5 look cherry-picked. Discussion: But why not use ViT? → A senior labmate advised that ViT's strength is scalability; the model is heavy and does not suit OD settings with little data. References: [1] M. M. Gündoğan, T. Aksoy, A. Temizel and U. Halici, "IR Reasoner: Real-time Infrared Object Detect..

Datasets. TNO Image Fusion Dataset: 261 pairs of images, few objects. INO Videos Analytics Dataset & OTCBVS OSU Color-Thermal Database: few pedestrians. CVC-14: images are already clear, no need for the IR images!! KAIST Multispectral Dataset. FLIR Thermal Dataset. LLVIP Dataset. Image capture: captured with a HIKVISION DS-2TD8166BJZFY-75H2F/V2 camera - 15,488 pairs of visible-IR images from 26 different loca..

Preliminaries: Reinforcement Learning Basics, PPO. Motivation: A model trained with next-token prediction may not be aligned with user intent, so the paper proposes a different fine-tuning method, namely RLHF. Concretely, a reward model (RM) is trained on a human-labeled comparison dataset. The RM is then used as the reward function, and the supervised fine-tuned model is further fine-tuned with PPO to maximize that reward. As shown in Fig. 2, the method consists of three steps: (1) supervised fine-tuning (SFT) (2) reward model(..
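
The excerpt above says the RM is trained on human-labeled comparisons but is cut off before stating the loss. As a rough sketch only, assuming the pairwise logistic loss commonly used for comparison data (the function name `pairwise_rm_loss` and the scalar rewards are hypothetical, not taken from the post), the idea can be written as:

```python
import math

def pairwise_rm_loss(reward_chosen, reward_rejected):
    """Assumed loss: -log(sigmoid(r_chosen - r_rejected)) for one comparison.

    reward_chosen / reward_rejected are hypothetical scalar RM outputs for
    the response the labeler preferred and the one they did not.
    """
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) = log(1 + exp(-d)); branch to avoid overflow in exp().
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def mean_pairwise_rm_loss(pairs):
    """Average the assumed loss over a list of (r_chosen, r_rejected) pairs."""
    return sum(pairwise_rm_loss(c, r) for c, r in pairs) / len(pairs)
```

Under this assumption the loss shrinks as the RM scores the preferred response higher than the rejected one, which is what lets the RM act as a reward function in the PPO stage.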

Preliminaries: see Reinforcement Learning Basics. Policy Gradient Methods: $\hat g$ is the gradient estimator. It is the expectation, over sampled timesteps, of the gradient of the stochastic policy's log-probability weighted by the advantage function. The corresponding objective is naturally defined as $$\mathcal{L}^{PG}(\theta) = \hat{\mathbb{E}}_t \left[ \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right]$$ However, for multiple steps this..
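
To make the objective $\mathcal{L}^{PG}$ above concrete, here is a minimal sketch in plain Python. It assumes a discrete-action softmax policy parameterized directly by per-step logits; the function names and the toy inputs are illustrative assumptions, not something from the post.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pg_objective(logits_per_step, actions, advantages):
    """Empirical mean over timesteps of log pi(a_t | s_t) * A_t."""
    terms = []
    for logits, action, adv in zip(logits_per_step, actions, advantages):
        log_prob = math.log(softmax(logits)[action])
        terms.append(log_prob * adv)
    return sum(terms) / len(terms)

def pg_gradient_wrt_logits(logits, action, advantage):
    """Gradient of log pi(a|s) * A w.r.t. one step's logits.

    For a softmax policy this is A * (onehot(a) - pi).
    """
    probs = softmax(logits)
    return [advantage * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]
```

With a positive advantage the gradient raises the logit of the taken action and lowers the others; a negative advantage does the opposite.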

Abstract: Proposes MotionEpic, a model that integrates STSG into video. Proposes the VoT (Video-of-Thought) framework. Motivation: Reasoning over video requires two kinds of ability: fine-grained perceptive pixel understanding of the video movement, and cognitive ability allowing reasonable explanation and causal imagination. When people actually reason about a video they do so in multiple hops, so it is not hard to guess that imitating this is necessary. intuiti..

Abstract. Motivation: Having a VLM directly solve reasoning tasks within an image does not work well. Some approaches have an LLM generate programs that call tools so that it can reason, but the generated programs often do not work properly and still fall short of expert models. VPD (Visual Program Distillation) distills cross-modality reasoning capability into the VLM. It builds on two things: advances in tool-using visual programs, and distillation through CoT reasoning. Visual Program Distillati..

Motivation: InternVideo2 improves spatiotemporal perception with a three-stage learning scheme. First, it performs masked video token prediction, as in VideoMAE. In the second stage, it performs multimodal learning so that it can also handle tasks involving audio and text. Finally, InternVideo2 is attached to an LLM and trained with next-token prediction so that it learns to generate contextually appropriate tokens. Method: The video encoder is a ViT rather than CLIP. To this, attention pooli..
