HyperSeg (arXiv preprint, seg)

Abstract

HyperSeg is a VLLM-based universal segmentation model that works in both image and video scenarios.
It uses a hybrid entity recognition module and a fine-grained visual perceiver module.

Motivation

Existing MLLM-based segmentation methods share a limitation: each works only within a limited domain.

HyperSeg handles tasks that use both text prompts and visual prompts (boxes, masks, etc.). To cover multiple visual domains, it relies on three components:

1. Existing generation-only or decode-only methods either only generate the mask tokens or only decode them. To tackle this, HyperSeg incorporates a hybrid entity recognition strategy → it uses the generative ability of the VLLM to produce mask tokens and decodes them at the same time.

2. The CLIP encoder provides only coarse-level visual features. To tackle this, a Fine-grained Visual Perceiver (FVP) condenses multi-scale visual features into fixed-length fine-grained tokens.

3. Existing methods have weak temporal understanding. To tackle this, a new temporal adapter is proposed.

Methods

Overview

The VLLM takes three kinds of input: 1) visual tokens, 2) fine-grained visual tokens, and 3) prompt tokens. Its outputs are mask tokens or prompt tokens, which are fed to the segmentation predictor to predict the segmentation masks. In addition, space-time information propagation and global prompt aggregation are used.

Visual Large Language Model

The LLM is a lightweight one, and the visual encoder is likewise a low-resolution encoder such as CLIP.

Eq. 1 describes how the CLIP encoder processes the visual input and feeds it to the LLM, which returns the output embedding $E_{O}$.

Above, $P$ denotes the fine-grained visual tokens. The prompt is split into an instruction $P_{I}$ and a task-specific condition $P_{C}$. Each task uses this format to supply its input in a different form.

OVS, VIS

$P_{I}$ : “Please segment all the positive objects according to the following potential categories.”

$P_{C}$ : “[category 1, category 2, category 3, ...]”

RES, R-VOS, ReasonVOS

$P_{I}$ : “Can you perform referring or reasoning segmentation according to the language expression?”

$P_{C}$ : “[referring / reasoning text]”

VOS

$P_{I}$ : “Please segment according to the given visual region reference”

$P_{C}$ : “[vision 1, vision 2, vision 3, ...]”
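The three prompt formats above can be sketched as one small helper. This is only an illustration of the $P_{I}$ / $P_{C}$ split, not the authors' code: the names `INSTRUCTIONS`, `TASK_FAMILY`, and `build_prompt` are hypothetical, and the visual references for VOS are represented by placeholder strings, whereas the real model would insert visual prompt embeddings there.

```python
from typing import Dict, Sequence, Tuple

# Instruction templates P_I, copied from the notes above.
INSTRUCTIONS: Dict[str, str] = {
    "category": "Please segment all the positive objects according to "
                "the following potential categories.",
    "referring": "Can you perform referring or reasoning segmentation "
                 "according to the language expression?",
    "visual": "Please segment according to the given visual region reference",
}

# Hypothetical mapping from task name to template family.
TASK_FAMILY: Dict[str, str] = {
    "OVS": "category", "VIS": "category",
    "RES": "referring", "R-VOS": "referring", "ReasonVOS": "referring",
    "VOS": "visual",
}


def build_prompt(task: str, condition: Sequence[str]) -> Tuple[str, str]:
    """Return (P_I, P_C) for a task and its task-specific condition."""
    family = TASK_FAMILY[task]
    p_i = INSTRUCTIONS[family]
    if family == "referring":
        # A referring / reasoning task carries one language expression.
        if len(condition) != 1:
            raise ValueError("referring tasks expect exactly one expression")
        p_c = condition[0]
    else:
        # Category lists and visual references are rendered as a bracketed list.
        p_c = "[" + ", ".join(condition) + "]"
    return p_i, p_c
```

For example, `build_prompt("OVS", ["person", "dog"])` pairs the category instruction with the condition string `[person, dog]`.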

Segmentation Predictor

The segmentation predictor $F$ generates a mask $m$, a class score $z$, and an instance embedding $e$ as in Eq. 2. $E_{P}$ is the task-specific prompt embedding and $E_{Q}$ is the semantically enhanced mask token.

The mask lies in $\mathbb{R}^{H \times W}$ and is generated directly. The instance embedding is used only for the video domain.
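To make the inputs and outputs of the predictor concrete, here is a minimal sketch. It assumes a Mask2Former-style readout that the notes do not confirm: mask logits come from dot products between each mask token in $E_{Q}$ and per-pixel features, and class scores come from dot products between $E_{Q}$ and $E_{P}$. The function name and the `pixel_feats` argument are hypothetical, and the instance embedding $e$ is omitted.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def predict_masks_and_scores(
    mask_tokens: List[Vector],        # E_Q, shape N x C
    prompt_embeds: List[Vector],      # E_P, shape K x C
    pixel_feats: List[List[Vector]],  # assumed per-pixel features, H x W x C
) -> Tuple[List[List[List[float]]], List[List[float]]]:
    """Return mask logits (N x H x W) and class scores (N x K).

    Assumption: both outputs are similarity maps, as in query-based
    segmentation heads; the real predictor F may differ.
    """
    masks = [
        [[dot(q, px) for px in row] for row in pixel_feats]
        for q in mask_tokens
    ]
    scores = [[dot(q, p) for p in prompt_embeds] for q in mask_tokens]
    return masks, scores
```

Each mask token yields one $H \times W$ map, which matches the note that the mask is produced directly at that resolution.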

Loss

The loss is defined as in Eqs. 3 and 4.

Hybrid Entity Recognition

Figure 3: The comparison of different recognition strategies. (a) Generation-only, (b) Decode-only, (c) Hybrid.

In Fig. 3, (a) finds objects through sequence generation, which can miss objects or generate repetitive ones.

(b) uses the VLLM to decode the class name for each mask.

→ "Decode" here seems to mean predicting the class from the prompt embeddings. I should re-read PSALM and OMG-LLaVA to check what this actually means.

Still, it is not clear to me what this is saying.

Fine-grained Visual Perceiver

Figure 4: Comparison between previous vision perceiver and the FVP. (a) previous vision perceiver, (b) FVP.

On the premise that the CLIP encoder alone cannot provide fine-grained features, a pyramid vision encoder is used. For a vision input $V$, the pyramid vision encoder $F_{seg}$ produces details-aware image features $f_{img}$. At the j-th scale, the previous fine-grained tokens $P_{j-1}$ are enriched by conditional weighted cross-attention with that scale's features.
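The scale-by-scale enrichment can be sketched as follows. This is not the paper's equation: it assumes that the "conditional weight" is a tanh gate computed from the previous tokens and applied to a residual cross-attention update, that attention is single-head without learned projections, and that every scale's features share the token channel size. `gate_weights` and the function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, feats: Matrix) -> Matrix:
    """Each token (query) attends over the features of one scale."""
    dim = len(queries[0])
    out = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in feats]
        weights = softmax(logits)
        out.append([sum(w * f[d] for w, f in zip(weights, feats)) for d in range(dim)])
    return out


def enrich_tokens(tokens: Matrix, pyramid: List[Matrix], gate_weights: List[float]) -> Matrix:
    """Assumed update: P_j = P_{j-1} + tanh(g(P_{j-1})) * CrossAttn(P_{j-1}, f_j).

    g is a hypothetical linear gate (dot product with gate_weights).
    The number of tokens stays fixed across all scales.
    """
    for feats in pyramid:  # one entry per pyramid scale
        attended = cross_attention(tokens, feats)
        updated = []
        for tok, att in zip(tokens, attended):
            gate = math.tanh(sum(w * t for w, t in zip(gate_weights, tok)))
            updated.append([t + gate * a for t, a in zip(tok, att)])
        tokens = updated
    return tokens
```

Whatever the exact gating is, the point the sketch illustrates is that the token count is fixed while the features at each scale can have any spatial size.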

→ How is frame sampling done?

Temporal Adapter

To make the model aware of the time dimension, the temporal adapter uses global prompt aggregation together with local space-time information injection.

Global prompt aggregation

Adaptive average pooling is applied to the current prompt embedding $E_{P}$ along the time dimension.
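As a rough sketch of pooling along time, the function below averages per-frame prompt embeddings into a fixed number of temporal bins, using the usual adaptive-pooling bin boundaries (start = floor(i·T/k), end = ceil((i+1)·T/k)). It simplifies $E_{P}$ to one vector per frame, and the default output length of 1 is an assumption, not something the notes state.

```python
from typing import List


def adaptive_avg_pool_time(frames: List[List[float]], out_len: int = 1) -> List[List[float]]:
    """Pool T per-frame embeddings (T x C) into out_len embeddings (out_len x C)."""
    t = len(frames)
    if t == 0 or out_len <= 0:
        raise ValueError("need at least one frame and a positive out_len")
    pooled = []
    for i in range(out_len):
        start = (i * t) // out_len
        end = -((-(i + 1) * t) // out_len)  # integer ceil of (i + 1) * t / out_len
        window = frames[start:end]
        # Average each channel over the frames that fall in this bin.
        pooled.append([sum(channel) / len(window) for channel in zip(*window)])
    return pooled
```

With `out_len=1` this reduces to a plain mean over all frames, giving one global prompt embedding per clip.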

Local space-time information injection

A projection function $G_{l}$, which carries information from the previous features, injects space-time information into the fine-grained tokens.
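One way to picture the injection is sketched below, under assumptions the notes do not confirm: $G_{l}$ is modelled as a single linear projection (`weight`, hypothetical), and the projected tokens of the previous frame are added residually to the current frame's fine-grained tokens, frame by frame.

```python
from typing import List, Optional

Matrix = List[List[float]]


def project(vec: List[float], weight: Matrix) -> List[float]:
    """Stand-in for G_l: a linear map, weight has shape C_out x C_in."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def propagate(frames_tokens: List[Matrix], weight: Matrix) -> List[Matrix]:
    """Inject the previous frame's (already updated) tokens into each frame.

    Assumed update: P_t <- P_t + G_l(P_{t-1}); the first frame is unchanged.
    """
    out: List[Matrix] = []
    prev: Optional[Matrix] = None
    for tokens in frames_tokens:
        if prev is not None:
            tokens = [
                [c + p for c, p in zip(cur, project(old, weight))]
                for cur, old in zip(tokens, prev)
            ]
        out.append(tokens)
        prev = tokens
    return out
```

Because each frame receives the already updated tokens of its predecessor, information from early frames can reach later ones step by step.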

Code reference: HyperSeg/eval/eval_ReasonVOS.py at main · congvvc/HyperSeg (github.com), the project for "HyperSeg: Towards Universal Visual Segmentation with Large Language Model".

See from line 120 onward.
