Conditional MixLoRA (ACL 2024, MLLM PEFT)

Motivation

It is known that when a multimodal LLM is instruction-tuned with LoRA on a broad range of tasks, task interference degrades performance. This paper (1) confirms that this happens and (2) proposes Conditional MixLoRA (Mixture-of-LoRA) as a remedy.

Figure 1: Comparative overview of LoRA(left) and MixLoRA(right).

As Fig. 1 shows, standard LoRA uses a single shared weight matrix, whereas Conditional MixLoRA keeps two matrices of candidate factors and dynamically selects from both depending on the input instance, which mitigates the task interference issue.

→ This looks like an idea borrowed from Mixture-of-Experts.

- The paper says so. (update)

Methods

Investigating Task Interference in Multimodal Instruction Tuning

When training on tasks $i$ and $j$, if their gradient directions differ, this can be interpreted as interference between the two tasks. In that sense, eq. 2 describes how the task-$i$ loss changes when the parameters are updated on task $j$. The amount of task interference is then quantified by eq. 3.
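The equations themselves did not survive in these notes, so the block below is my reconstruction from the description in this section. The step size $\lambda$, the normalized gradient step, and the sample symbols $x_i$, $x_j$ are notational assumptions and should be checked against the paper.

$$
\Delta_j L_i(x_i) \doteq \mathbb{E}_{x_j}\left[ L_i(x_i; \theta) - L_i\left(x_i;\ \theta - \lambda \frac{\nabla_\theta L_j(x_j)}{\lVert \nabla_\theta L_j(x_j) \rVert}\right) \right]
$$

$$
I_{i,j} = \mathbb{E}_{x_i}\left[ \frac{\Delta_j L_i(x_i)}{\Delta_i L_i(x_i)} \right]
$$

Under this form, $I_{i,i} = 1$ by construction.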

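→ In the first line of eq. 2, $\Delta L$ is defined as the expectation of the difference between the original task-$i$ loss and the task-$i$ loss after the parameters are updated in the direction that optimizes task $j$. In other words, the effect of moving the parameters from $i$ toward $j$ is quantified as a scalar change in the loss.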

→ Connecting this to the interference score defined in eq. 3: if $i$ and $j$ are well aligned, so that each gradient update lowers the loss, eq. 2 is positive. If they are poorly aligned, each update raises the loss and eq. 2 becomes negative. Eq. 3 turns this into a ratio, so a positive value means the two tasks align well and a negative value means the two tasks differ in nature. A larger value may indicate better alignment, but it depends on the update rule and the model architecture, so the magnitude itself is hard to interpret.
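As a rough illustration of this sign argument, here is a minimal pure-Python sketch. The quadratic toy losses, the normalized gradient step, the step size `lam`, and every function name are my assumptions rather than the paper's setup, and the expectations over samples are dropped because each toy task has a single loss.

```python
import math

def quad_loss(theta, target):
    """Toy task loss 0.5 * ||theta - target||^2 (assumed, not from the paper)."""
    return 0.5 * sum((t - c) ** 2 for t, c in zip(theta, target))

def quad_grad(theta, target):
    """Gradient of quad_loss with respect to theta."""
    return [t - c for t, c in zip(theta, target)]

def loss_change(theta, target_i, target_j, lam=0.1):
    """Drop in the task-i loss after one normalized gradient step on task j."""
    g = quad_grad(theta, target_j)
    norm = math.sqrt(sum(x * x for x in g))
    stepped = [t - lam * x / norm for t, x in zip(theta, g)]
    return quad_loss(theta, target_i) - quad_loss(stepped, target_i)

def interference(theta, target_i, target_j, lam=0.1):
    """Ratio of the task-j-induced change to the task-i self-induced change."""
    cross = loss_change(theta, target_i, target_j, lam)
    own = loss_change(theta, target_i, target_i, lam)
    return cross / own

theta = [0.0, 0.0]
task_a = [1.0, 0.0]
task_b = [1.0, 0.2]   # optimum close to task_a: gradients mostly agree
task_c = [-1.0, 0.0]  # optimum on the opposite side: gradients conflict

print(interference(theta, task_a, task_a))  # same task
print(interference(theta, task_a, task_b))  # aligned task, positive score
print(interference(theta, task_a, task_c))  # conflicting task, negative score
```

In this toy setting the same-task score is 1, the nearby task scores positive, and the opposing task scores negative, which matches the reading of eq. 3 given above.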

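→ What strikes me as odd, though, is why this was quantified as a scalar rather than a vector. A task should be represented as a vector. Wouldn't it be more appropriate to compare the gradient update vector of the original task with the gradient update vector of task $j$? The loss could decrease by the same amount while the updates point in different directions..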

→ Maybe it is because gradient update vectors are not robust? I'm not sure..
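One observation that may bear on this question: by a standard first-order Taylor expansion, $L_i(\theta) - L_i(\theta - \lambda \hat{g}_j) \approx \lambda\, \hat{g}_j^{\top} \nabla_\theta L_i(\theta)$, where $\hat{g}_j$ is the unit-norm task-$j$ gradient. So for a small step the scalar loss change is itself an inner product between the two gradient vectors. The sketch below checks this numerically; the quadratic toy loss, the targets, and all names are assumptions made for illustration.

```python
import math

def loss(theta, target):
    """Toy quadratic loss 0.5 * ||theta - target||^2 (an assumption for illustration)."""
    return 0.5 * sum((t - c) ** 2 for t, c in zip(theta, target))

def grad(theta, target):
    """Gradient of the toy loss with respect to theta."""
    return [t - c for t, c in zip(theta, target)]

def unit(v):
    """Rescale v to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def exact_change(theta, target_i, target_j, lam):
    """Actual drop in the task-i loss after a normalized step on task j."""
    g_hat = unit(grad(theta, target_j))
    stepped = [t - lam * g for t, g in zip(theta, g_hat)]
    return loss(theta, target_i) - loss(stepped, target_i)

def first_order_change(theta, target_i, target_j, lam):
    """First-order estimate: lam times the inner product of the two gradients."""
    g_hat = unit(grad(theta, target_j))
    g_i = grad(theta, target_i)
    return lam * sum(a * b for a, b in zip(g_hat, g_i))

theta = [0.0, 0.0]
target_i = [1.0, 0.0]
orthogonal_j = [0.0, 1.0]  # task-j gradient orthogonal to task-i gradient
aligned_j = [1.0, 1.0]     # task-j gradient partly aligned with task-i gradient
lam = 1e-3

for target_j in (orthogonal_j, aligned_j):
    print(exact_change(theta, target_i, target_j, lam),
          first_order_change(theta, target_i, target_j, lam))
```

What the scalar discards is the part of the task-$j$ step that is orthogonal to $\nabla_\theta L_i$, so two tasks with different update directions can still receive similar scores.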

Figure 2: Task interference score matrices.

Fig. 2 shows the interference score for each pair of tasks. A task paired with itself scores 1; positive pairs are shown in red and negative pairs in blue.

Conditional Mixture-of-LoRA

The idea behind Conditional MixLoRA is simple: when the LoRA weight matrix is constructed as a sum of outer products, the vectors $a$ and $b$ that go into it are selected dynamically (Dynamic Factor Selection).
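A minimal sketch of that construction, assuming small hypothetical factor pools and hand-picked indices in place of a real router (the names `a_pool`, `b_pool`, `delta_w` and the dimensions are mine, not the paper's):

```python
def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[ui * vj for vj in v] for ui in u]

def add(m1, m2):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def delta_w(b_factors, a_factors):
    """Sum of outer products b_k a_k^T over the selected factor pairs."""
    rows, cols = len(b_factors[0]), len(a_factors[0])
    total = [[0.0] * cols for _ in range(rows)]
    for b, a in zip(b_factors, a_factors):
        total = add(total, outer(b, a))
    return total

# Hypothetical pools of E = 3 factors each (d_in = 2, d_out = 2).
a_pool = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_pool = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]

# Suppose a router picked these r = 2 indices for the current input instance.
a_idx, b_idx = [0, 2], [1, 2]
update = delta_w([b_pool[k] for k in b_idx], [a_pool[k] for k in a_idx])
print(update)
```

A different input instance would yield different indices and therefore a different low-rank update, which is the point of selecting factors per instance rather than sharing one matrix.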

The Dynamic Factor Selection module consists of an Independent Factor Selection (IFS) router and a Conditional Factor Selection (CFS) router.

Independent Factor Selection(IFS)

Figure 3: Dynamic Factor Selection in MixLoRA.

The idea is simple. As in eq. 6, the router $R^A_{IFS}$ averages the vectors $h$ obtained by processing the input instance, and a linear layer $W_A$ on top of it decides which factors to pick, selecting $r$ of them. The same procedure is applied to B.
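The routing step described here can be sketched as follows. This is a simplified illustration, assuming a hard top-$r$ choice over raw logits; any normalization or gating the paper applies to the scores is left out, and the names `ifs_select`, `w_a`, and `a_pool` are hypothetical.

```python
def mean_pool(hidden):
    """Average the token vectors h_1..h_T into a single vector."""
    n = len(hidden)
    return [sum(col) / n for col in zip(*hidden)]

def linear(x, weight):
    """One logit per factor; weight has shape (E, d)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def top_r(scores, r):
    """Indices of the r largest scores, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return sorted(ranked[:r])

def ifs_select(hidden, weight, pool, r):
    """Pick r factors from the pool based on the pooled input representation."""
    idx = top_r(linear(mean_pool(hidden), weight), r)
    return idx, [pool[k] for k in idx]

hidden = [[1.0, 0.0], [3.0, 2.0]]                         # T = 2 tokens, d = 2
w_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]   # E = 4 factors
a_pool = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

idx, a_selected = ifs_select(hidden, w_a, a_pool, r=2)
print(idx, a_selected)
```

Here the pooled vector is `[2.0, 1.0]`, the logits are `[2.0, 1.0, -2.0, 3.0]`, and factors 0 and 3 are selected.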

Conditional Factor Selection(CFS)

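The idea of CFS is to select B after it has received information about A.

Eq. 7 first maps $A$ to the same dimension as $B$. Then, as in the last line of eq. 8, $g_B$ is computed by adding the IFS term, which carries B's own information, and the CFS term, which carries A's information, and B is selected from $g_B$.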
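A rough sketch of this conditioning, under loudly stated assumptions: the selected A factors are mean-pooled and projected by a hypothetical matrix `w_ab` to one logit per B factor, the two logit vectors are summed without normalization, and the top $r$ entries are kept. The notes only say that A is mapped to B's dimension and added to the IFS term, so the pooling, the shape of `w_ab`, and the hard top-$r$ are my choices, not the paper's.

```python
def mean_pool(rows):
    """Average a list of equal-length vectors into one vector."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def linear(x, weight):
    """One logit per factor; weight has shape (E, len(x))."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def top_r(scores, r):
    """Indices of the r largest scores, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return sorted(ranked[:r])

def cfs_select(hidden, a_selected, w_b, w_ab, r):
    """Select B factors from input evidence plus evidence from the chosen A factors."""
    ifs_logits = linear(mean_pool(hidden), w_b)        # B's own view of the input
    cfs_logits = linear(mean_pool(a_selected), w_ab)   # information passed over from A
    g_b = [x + y for x, y in zip(ifs_logits, cfs_logits)]
    return top_r(g_b, r), g_b

hidden = [[2.0, 1.0]]                          # one pooled token, d = 2
a_selected = [[0.0, 1.0], [0.0, 3.0]]          # r = 2 chosen A factors
w_b = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]     # E = 3 B factors
w_ab = [[0.0, 0.0], [0.0, 0.0], [0.0, 1.5]]    # hypothetical A-to-B projection

ifs_only = top_r(linear(mean_pool(hidden), w_b), 2)
b_idx, g_b = cfs_select(hidden, a_selected, w_b, w_ab, r=2)
print(ifs_only, b_idx, g_b)
```

In this toy example the IFS logits alone would pick B factors 0 and 1, while adding the A-conditioned term shifts the choice to factors 0 and 2.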

The reconstruction step that follows is the same as in standard LoRA.
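For reference, the standard LoRA forward pass with the selected factors stacked into $A$ and $B$ can be written as below. The symbols $W_0$ (frozen pretrained weight) and $x$ (layer input), and the omission of LoRA's scaling factor, are my notational assumptions rather than a copy of the paper's equation.

$$
h' = W_0 x + \Delta W x, \qquad \Delta W = BA, \quad A \in \mathbb{R}^{r \times d_{in}},\ B \in \mathbb{R}^{d_{out} \times r}
$$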

Experiments

Table 1: Zero-shot multimodal evaluation on LLaVA-v1.

Table 2: Comparison between various routing strategies.

Figure 4: Effect of conditional factor selection.

In Fig. 4, $E$ is the total number of factors and $r$ is the number of factors selected from them.

I wonder whether Figs. 5 and 6 should have shown results for something other than MixLoRA.

Figure 7: The Comparison of Task Interference Score I between LoRA (r=16) and MixLoRA (E = 16, r = 4).

Discussion

- As mentioned above: why is task interference measured as a scalar?
