PSALM (ECCV 2024, Image Segmentation)

PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model

PSALM is a powerful extension of the Large Multi-modal Model (LMM) to address the segmentation task challenges. To overcome the limitation of the LMM being limited to textual output, PSALM incorporates a mask decoder and a well-designed input schema to han

arxiv.org

Abstract

Proposes PSALM (Pixelwise SegmentAtion with Large Multi-modal Model), a method for solving several kinds of segmentation tasks with an LMM.

Motivation

Figure 1: PSALM has capability to handle multiple segmentation tasks in only one single model.

This is one of the early papers that does mask generation with an MLLM. It came out after LISA, and its starting point is that LISA cannot handle segmentation tasks where the information is not given as text.

Semantic segmentation has to generate multiple masks along with an object ID for each, and interactive segmentation can take many forms of input (points, scribbles, bounding boxes, masks), so it is hard to fit a single input/output format to every task.

PSALM handles these different segmentation tasks with an MLLM by using a suitable input/output structure. The input consists of images, a task instruction prompt, a condition prompt, and mask tokens; the output returns mask tokens and condition embeddings.
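The input layout described above can be sketched as a small data structure. This is only an illustration of the ordering (image, task instruction prompt, condition prompt, mask tokens); the class name `PsalmStyleInput`, the placeholder strings `<image>` and `<mask>`, and the default number of mask tokens are assumptions made for this example, not names taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical placeholder strings, chosen only for this sketch.
IMAGE_PLACEHOLDER = "<image>"
MASK_TOKEN = "<mask>"


@dataclass
class PsalmStyleInput:
    """One input in the four-part layout: image, instruction, conditions, mask tokens."""

    instruction: str          # task instruction prompt (plain text)
    conditions: List[str]     # condition prompt entries, e.g. candidate category names
    num_mask_tokens: int = 4  # assumed value; the post does not state the real count

    def to_sequence(self) -> List[str]:
        """Lay the parts out in order, with the mask tokens at the end."""
        seq = [IMAGE_PLACEHOLDER, self.instruction]
        seq.extend(self.conditions)
        seq.extend([MASK_TOKEN] * self.num_mask_tokens)
        return seq


example = PsalmStyleInput(
    instruction="You need to segment all objects. This is all the candidate categories.",
    conditions=["person", "dog"],
    num_mask_tokens=3,
)
sequence = example.to_sequence()
```

Here the conditions are category names, as in panoptic segmentation; for interactive segmentation the condition entries would instead stand for the interactive input.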

Methods

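For the architecture, PSALM uses a LLaVA base model with a Swin Transformer as the vision encoder. Because of resource limitations, the LLM was swapped from Vicuna 7B to the Phi-1.5 1.3B model. The swap seems to have been necessary because vision-language alignment had to be redone: the new prompt format is taught to the LLM through full fine-tuning, so the VL alignment had to be repeated as well.

→ This is not a good approach. Keeping the large LLM as it is and doing the fine-tuning and instruction tuning with PEFT would be a more scalable way to go..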

Task Instruction Prompt

The task instruction prompt is a text sentence that describes and specifies the model's task. For example, in a panoptic segmentation or open-vocabulary segmentation setting it can look like this:

"You need to segment all objects. This is all the candidate categories."

Condition Prompt

Some tasks need additional information to be performed, such as the candidate set of categories in panoptic segmentation or the interactive input in interactive segmentation. The condition prompt is used in those cases.

Mask Token

The LLM does not produce segmentation masks directly, so mask tokens are appended to the input, and the outputs that come back at those mask tokens are fed into a segmentation decoder to generate the masks.
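The hand-off from LLM to decoder can be pictured as picking out the output states at the mask-token positions. The sketch below is a simplified illustration with plain Python lists; `gather_mask_states` and the `<mask>` string are hypothetical names for this example, and a real model would work on tensors of token ids and hidden states rather than strings.

```python
from typing import List, Sequence

MASK_TOKEN = "<mask>"  # hypothetical placeholder string for this sketch


def gather_mask_states(
    tokens: Sequence[str],
    hidden_states: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Return the output vectors that sit at mask-token positions.

    tokens[i] is the input token at position i and hidden_states[i] is the
    LLM output vector at the same position.
    """
    if len(tokens) != len(hidden_states):
        raise ValueError("tokens and hidden_states must have the same length")
    return [list(h) for t, h in zip(tokens, hidden_states) if t == MASK_TOKEN]


tokens = ["<image>", "segment", "<mask>", "<mask>"]
hidden_states = [[0.0, 0.0], [1.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
mask_states = gather_mask_states(tokens, hidden_states)
```

Only the selected vectors would be passed on to the segmentation decoder; the remaining positions are not used for mask generation in this sketch.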

Mask Generator

The mask generator is defined to take the visual feature $v$, the mask token $q$, and the condition embedding $c$ as input, and to return a predicted segmentation mask $m \in \mathbb{R}^{H \times W}$ together with the corresponding category probability $p \in \mathbb{R}^{K}$.
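To make the input and output shapes concrete, here is a minimal sketch of a mask generator with this interface. It is not PSALM's decoder: the learned decoder is replaced by plain dot products, with a sigmoid for per-pixel mask values and a softmax over the $K$ conditions, both of which are assumptions made for illustration. The function name `mask_generator` and the toy values are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs: List[float]) -> List[float]:
    top = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mask_generator(
    v: List[List[Vector]],  # visual features, H x W x D
    q: List[Vector],        # mask-token embeddings, N x D
    c: List[Vector],        # condition embeddings, K x D
) -> List[Tuple[List[List[float]], List[float]]]:
    """For each mask token, return (m, p) with m of shape H x W and p of length K."""
    outputs = []
    for q_i in q:
        # Mask: similarity between the mask token and every pixel feature.
        m = [[sigmoid(dot(q_i, pixel)) for pixel in row] for row in v]
        # Category probability: similarity between the mask token and each condition.
        p = softmax([dot(q_i, c_k) for c_k in c])
        outputs.append((m, p))
    return outputs


# Toy example: a 2 x 2 feature map, one mask token, two conditions.
v = [[[10.0, 0.0], [-10.0, 0.0]], [[0.0, 5.0], [3.0, 3.0]]]
q = [[1.0, 0.0]]
c = [[1.0, 0.0], [0.0, 1.0]]
(m, p), = mask_generator(v, q, c)
```

Each mask token yields one mask and one probability vector, so $N$ mask tokens give $N$ candidate masks that can each be assigned to one of the $K$ conditions.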
