Posts in the 'DL·ML' category (Page 4)

Preliminaries: see Reinforcement Learning Basics. Policy Gradient Methods: $\hat g$ is the gradient estimator. Its role is the expectation of the gradient measured over several samples; it is obtained by evaluating the stochastic policy on each sample and weighting that value by the advantage function. The objective here is naturally defined as $$ \mathcal{L}^{PG}(θ) = \hat{\mathbb E}_t \left[ \log π_θ(a_t|s_t) \hat A_t \right] $$ However, over multiple steps ..
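As a minimal sketch of the objective $\mathcal{L}^{PG}(θ)$ above, the snippet below averages $\log π_θ(a_t|s_t) \hat A_t$ over a small batch of timesteps. The softmax policy over discrete actions, the helper names `softmax` and `pg_objective`, and the toy numbers are all assumptions made for illustration; they do not come from the post.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one list of action logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pg_objective(logits_per_step, actions, advantages):
    # Empirical mean over timesteps of log pi(a_t | s_t) * A_t.
    # Assumption: pi is a softmax policy whose logits are given per step.
    terms = []
    for logits, action, advantage in zip(logits_per_step, actions, advantages):
        log_prob = math.log(softmax(logits)[action])
        terms.append(log_prob * advantage)
    return sum(terms) / len(terms)

# Toy batch (hypothetical values): two timesteps, two actions each.
logits_per_step = [[0.0, 0.0], [0.0, 0.0]]
actions = [0, 1]
advantages = [1.0, -1.0]
value = pg_objective(logits_per_step, actions, advantages)
```

With a uniform policy and advantages that cancel, the two terms offset each other. In practice the objective is maximized with respect to the policy parameters by an automatic differentiation library rather than evaluated by hand.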

Policies: A policy is a rule that determines what action to take. A deterministic policy is typically denoted $μ$; when the action is selected stochastically, the policy is written $π(⋅|s_t)$ at timestep $t$. When the policy is stochastic, the action is sampled from a categorical distribution if the action space is discrete, and from a Gaussian distribution if the action space is continuous. Value Functi..
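The two sampling cases can be sketched as follows, using only the Python standard library. The function names `sample_discrete` and `sample_continuous`, the diagonal (per-dimension) Gaussian, and the example probabilities are assumptions for illustration, not something the post specifies.

```python
import random

def sample_discrete(probs, rng):
    # Categorical sampling: pick an action index with the given probabilities.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_continuous(mean, std, rng):
    # Diagonal Gaussian sampling: one independent normal draw per action dimension.
    return [rng.gauss(m, s) for m, s in zip(mean, std)]

rng = random.Random(0)  # seeded generator so runs are repeatable

# Hypothetical policy outputs for a single state s_t.
a_disc = sample_discrete([0.1, 0.7, 0.2], rng)
a_cont = sample_continuous([0.0, 1.0], [0.5, 0.5], rng)
```

In a real agent, the probabilities (or the mean and standard deviation) would be produced by the policy network from the state $s_t$.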

Abstract: Proposes MotionEpic, a model that integrates STSG into video. Proposes the VoT (Video of Thought) framework. Motivation: Reasoning over video requires two kinds of ability: fine-grained perceptive pixel understanding of the video movement, and cognitive ability allowing reasonable explanation and causal imagination. When people actually reason about a video, they do so in multiple hops, so it is not hard to guess that imitating this is necessary. intuiti..

Abstract. Motivation: Having a VLM directly solve reasoning tasks within an image does not work well. Some approaches have an LLM write programs that call tools so that it can reason, but the generated programs often fail to run correctly, so they still fall short of expert models. VPD (Visual Program Distillation) distills cross-modality reasoning capability into a VLM. It builds on two things: the advancement of tool-using visual programs, and distillation through CoT reasoning. Visual Program Distillati..

Motivation: InternVideo2 improves spatiotemporal perception with a three-stage learning scheme. First, it performs masked video token prediction, as in VideoMAE. In the second stage, it performs multimodal learning so that it can also handle tasks involving audio and text. Finally, InternVideo2 is attached to an LLM and trained with next-token prediction so that it generates contextually appropriate tokens. Method: The video encoder is a ViT rather than CLIP. To this, attention pooli..

Motivation: Vision models that perform only pose estimation lack comprehensive understanding. Here, the prior knowledge of an LLM is used to generate 3D human poses in SMPL form. To this end, the work checks how much the LLM already understands about 3D pose and how it can be taught further. Methods. Architecture: The model accepts text or visual input and uses it to output either text or an SMPL pose. The model consists of an LLM $f_φ$, an embedding projection layer $g_Θ$, and SMPL. As for SMPL, pose and shap..

Motivation: LLaVA performs image-level analysis, so it cannot handle pixel-level tasks such as precise localization. Attaching an extra detection model is one way around this, but LLaVA then loses performance on image-level analysis such as image captioning and VQA. OMG-LLaVA aims to perform image-level, object-level, and pixel-level tasks with a single LLM, visual encoder, and decoder. In particular, it uses the OMG-Seg model as the universal perception model. OMG-..

The Linear Explanation of Adversarial Examples: In this paper, Goodfellow et al. explain that adversarial examples are possible because models behave linearly in high-dimensional space. Linearity makes a model easier to train but increases its vulnerability. The existence of adversarial examples for a linear model can be shown as follows. In general, the precision of an input feature is limited to 1/255, and anything below that is discarded. Therefore, a perturbation smaller than the feature precision $η, ..
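The linear argument can be illustrated numerically. The sketch below assumes a linear model $w^\top x$ and the perturbation $η = ε \cdot \mathrm{sign}(w)$ from Goodfellow et al., for which $w^\top (x + η) = w^\top x + ε \lVert w \rVert_1$. The weight and input values, the choice of $ε$ as half the feature precision, and the helper names are hypothetical.

```python
def sign(x):
    # Returns -1, 0, or 1.
    return (x > 0) - (x < 0)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

eps = 0.5 / 255.0            # assumed bound, below the 1/255 feature precision
w = [0.5, -0.25, 1.0, -2.0]  # hypothetical weights of a linear model
x = [0.2, 0.4, 0.6, 0.8]     # hypothetical input

# Perturbation with max-norm eps, aligned with the sign of each weight.
eta = [eps * sign(wi) for wi in w]
x_adv = [xi + ei for xi, ei in zip(x, eta)]

# Change in the activation versus the predicted eps * ||w||_1.
shift = dot(w, x_adv) - dot(w, x)
l1_norm = sum(abs(wi) for wi in w)
```

Each coordinate moves by at most $ε$, yet the activation shifts by $ε \lVert w \rVert_1$, which grows with the number of input dimensions.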

Introduction: An adversarial attack is an attack that causes a machine learning algorithm to behave incorrectly. Deep neural networks in particular are known to be vulnerable to adversarial attacks, and because DNN models are used in many critical functions, strong security is required of them. Methods for defending against such attacks are called adversarial defense, and the field as a whole is called adversarial machine learning. This vulnerability was first reported by Szegedy et al. [2] for the image classification task with DNNs. To an image, target..
