Grounding DINO architecture

Overall Architecture

Input

    def forward(self, samples: NestedTensor, targets: List = None, **kw):
        """The forward expects a NestedTensor, which consists of:
           - samples.tensor: batched images, of shape [batch_size x 3 x H x W]
           - samples.mask: a binary mask of shape [batch_size x H x W], containing 1 on padded pixels

        It returns a dict with the following elements:
           - "pred_logits": the classification logits (including no-object) for all queries.
                            Shape= [batch_size x num_queries x num_classes]
           - "pred_boxes": The normalized boxes coordinates for all queries, represented as
                           (center_x, center_y, width, height). These values are normalized in [0, 1],
                           relative to the size of each individual image (disregarding possible padding).
                           See PostProcess for information on how to retrieve the unnormalized bounding box.
           - "aux_outputs": Optional, only returned when auxiliary losses are activated. It is a list of
                            dictionaries containing the two above keys for each decoder layer.
        """
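
As a side note on the `pred_boxes` format described in the docstring, here is a minimal sketch of turning one normalized (center_x, center_y, width, height) box into pixel corner coordinates. The helper name `cxcywh_to_xyxy_pixels` and the example image size are made up for illustration; the repository's own PostProcess remains the reference for the real conversion.

```python
def cxcywh_to_xyxy_pixels(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1).

    Hypothetical helper for illustration; not part of the Grounding DINO code.
    """
    cx, cy, w, h = box
    x0 = (cx - w / 2) * img_w
    y0 = (cy - h / 2) * img_h
    x1 = (cx + w / 2) * img_w
    y1 = (cy + h / 2) * img_h
    return x0, y0, x1, y1


# A box centered in a 640x480 image, covering 20% of the width and 40% of the height.
print(cxcywh_to_xyxy_pixels((0.5, 0.5, 0.2, 0.4), img_w=640, img_h=480))
```

The width and height scale separately because the values are normalized by the size of each individual image, not by a shared padded size.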

The model takes an image and a caption as input. The predict method that calls it is implemented as follows.

with torch.no_grad():
    outputs = model(image[None], captions=[caption])

Text Encoding

The text is first tokenized, and BERT then produces the vanilla text features.

        # encoder texts
        tokenized = self.tokenizer(captions, padding="longest", return_tensors="pt").to(
            samples.device
        )
        (
            text_self_attention_masks,
            position_ids,
            cate_to_token_mask_list,
        ) = generate_masks_with_special_tokens_and_transfer_map(
            tokenized, self.specical_tokens, self.tokenizer
        )

        if text_self_attention_masks.shape[1] > self.max_text_len:
            text_self_attention_masks = text_self_attention_masks[
                :, : self.max_text_len, : self.max_text_len
            ]
            position_ids = position_ids[:, : self.max_text_len]
            tokenized["input_ids"] = tokenized["input_ids"][:, : self.max_text_len]
            tokenized["attention_mask"] = tokenized["attention_mask"][:, : self.max_text_len]
            tokenized["token_type_ids"] = tokenized["token_type_ids"][:, : self.max_text_len]

        # extract text embeddings
        if self.sub_sentence_present:
            tokenized_for_encoder = {k: v for k, v in tokenized.items() if k != "attention_mask"}
            tokenized_for_encoder["attention_mask"] = text_self_attention_masks
            tokenized_for_encoder["position_ids"] = position_ids
        else:
            # import ipdb; ipdb.set_trace()
            tokenized_for_encoder = tokenized

        bert_output = self.bert(**tokenized_for_encoder)  # bs, 195, 768

Going through this from the top:

Tokenize with the tokenizer object

tokenized = self.tokenizer(captions, padding="longest", return_tensors="pt").to(
    samples.device
)

Build the self-attention mask

(
    text_self_attention_masks,
    position_ids,
    cate_to_token_mask_list,
) = generate_masks_with_special_tokens_and_transfer_map(
    tokenized, self.specical_tokens, self.tokenizer
)

Here, `generate_masks_with_special_tokens_and_transfer_map` builds the self-attention mask for the spans of tokens that lie between special tokens.
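
To make the idea concrete, here is a simplified sketch for a single sequence in plain Python. It assumes that each segment ends at a special token, that tokens attend only within their own segment, and that the first and last tokens attend only to themselves. The name `block_masks` and the token ids are illustrative, and the third return value of the real function (`cate_to_token_mask_list`) is not modeled.

```python
def block_masks(input_ids, special_ids):
    """Return (attention mask, position ids) for one token sequence.

    Simplified illustration of a segment-wise (block-diagonal) mask;
    not the repository's implementation.
    """
    n = len(input_ids)
    # Start from the identity: every token attends at least to itself.
    attn = [[i == j for j in range(n)] for i in range(n)]
    pos = [0] * n
    prev = 0  # index of the previous special token
    for col, tok in enumerate(input_ids):
        if tok not in special_ids:
            continue
        if col == 0 or col == n - 1:
            # First/last special token: isolated, position 0.
            attn[col][col] = True
            pos[col] = 0
        else:
            # Tokens after the previous special token, up to and including
            # this one, form one block and get positions restarting at 0.
            for i in range(prev + 1, col + 1):
                for j in range(prev + 1, col + 1):
                    attn[i][j] = True
                pos[i] = i - (prev + 1)
        prev = col
    return attn, pos


# Illustrative ids standing for "[CLS] cat . dog . [SEP]".
ids = [101, 4937, 1012, 3899, 1012, 102]
attn, pos = block_masks(ids, special_ids={101, 102, 1012})
for row in attn:
    print("".join("1" if x else "0" for x in row))
print(pos)
```

With this kind of mask, "cat ." and "dog ." cannot attend to each other, which keeps unrelated category names in one caption from mixing.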

Limit the sequence length

if text_self_attention_masks.shape[1] > self.max_text_len:
    text_self_attention_masks = text_self_attention_masks[
        :, : self.max_text_len, : self.max_text_len
    ]
    position_ids = position_ids[:, : self.max_text_len]
    tokenized["input_ids"] = tokenized["input_ids"][:, : self.max_text_len]
    tokenized["attention_mask"] = tokenized["attention_mask"][:, : self.max_text_len]
    tokenized["token_type_ids"] = tokenized["token_type_ids"][:, : self.max_text_len]

If the self-attention mask is longer than max_text_len, it is truncated to that length, together with position_ids and the tokenized fields.
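
The one detail worth noticing is that the self-attention mask is square per sequence, so it has to be cut along both token axes, while the other fields are cut along one. A small plain-Python sketch for a single sequence (the helper name `truncate_text_inputs` is hypothetical):

```python
def truncate_text_inputs(self_attn_mask, position_ids, input_ids, max_text_len):
    """Clip one sequence's inputs to at most max_text_len tokens.

    Illustrative helper; the real code does the same with tensor slicing.
    """
    if len(self_attn_mask) <= max_text_len:
        return self_attn_mask, position_ids, input_ids
    # Square mask: keep the first max_text_len rows AND columns.
    clipped_mask = [row[:max_text_len] for row in self_attn_mask[:max_text_len]]
    return clipped_mask, position_ids[:max_text_len], input_ids[:max_text_len]


mask = [[i == j for j in range(5)] for i in range(5)]
m, p, ids = truncate_text_inputs(mask, [0, 0, 1, 2, 0], [101, 7, 8, 9, 102], 3)
print(len(m), len(m[0]), p, ids)
```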

Extract the text embedding

# extract text embeddings
if self.sub_sentence_present:
    tokenized_for_encoder = {k: v for k, v in tokenized.items() if k != "attention_mask"}
    tokenized_for_encoder["attention_mask"] = text_self_attention_masks
    tokenized_for_encoder["position_ids"] = position_ids
else:
    # import ipdb; ipdb.set_trace()
    tokenized_for_encoder = tokenized

bert_output = self.bert(**tokenized_for_encoder)  # bs, 195, 768

sub_sentence_present defaults to True.

So the attention mask and position ids built above replace the tokenizer's defaults, and the result is passed to BERT to obtain the text embedding.

The BERT used here is "bert-base-uncased", whose output dimension is 768.

linear layer

encoded_text = self.feat_map(bert_output["last_hidden_state"])  # bs, 195, d_model
text_token_mask = tokenized.attention_mask.bool()  # bs, 195
# text_token_mask: True for nomask, False for mask
# text_self_attention_masks: True for nomask, False for mask

Although it does not appear in Figure 1, the text features pass through one more mapping, feat_map, which is a single Linear layer. feat_map is defined as follows.

`self.feat_map = nn.Linear(self.bert.config.hidden_size, self.hidden_dim, bias=True)`
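
What this layer does to the shapes can be shown with a tiny plain-Python sketch. It assumes the usual nn.Linear convention (weight stored as [out_features, in_features], applied independently to every token) and uses toy sizes 4 → 2 in place of 768 → d_model. The function names are illustrative.

```python
def linear(x, weight, bias):
    """y = weight @ x + bias for one token vector x (weight is [out, in])."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def project_tokens(tokens, weight, bias):
    """Apply the same linear map to every token; only the last dim changes."""
    return [linear(t, weight, bias) for t in tokens]


weight = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 1.0, 0.0]]   # [out=2, in=4]
bias = [0.5, -1.0]
tokens = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]

projected = project_tokens(tokens, weight, bias)
print(len(projected), len(projected[0]))  # number of tokens is kept, feature size changes
```

The sequence length (195 in the comments above) is untouched; only the feature size goes from the BERT hidden size to the transformer's hidden_dim.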

Image Encoding

Image encoding starts at lines 299-303 of groundingdino.py.

# import ipdb; ipdb.set_trace()
if isinstance(samples, (list, torch.Tensor)):
    samples = nested_tensor_from_tensor_list(samples)
if not hasattr(self, 'features') or not hasattr(self, 'poss'):
    self.set_image_tensor(samples)

This step turns the image into an image embedding using the backbone.

The set_image_tensor method is defined as follows.

def set_image_tensor(self, samples: NestedTensor):
    if isinstance(samples, (list, torch.Tensor)):
        samples = nested_tensor_from_tensor_list(samples)
    self.features, self.poss = self.backbone(samples)

The image backbone can be either a ResNet (["resnet50", "resnet101"]) or a Swin Transformer (["swin_T_224_1k", "swin_B_224_22k", "swin_B_384_22k", "swin_L_224_22k", "swin_L_384_22k"]).

To understand the backbone architecture, the forward methods of Joiner and BackboneBase in backbone/backbone.py are a good reference.

The backbone features are then projected, and extra feature levels are added when needed:

        srcs = []
        masks = []
        for l, feat in enumerate(self.features):
            src, mask = feat.decompose()
            srcs.append(self.input_proj[l](src))
            masks.append(mask)
            assert mask is not None
        if self.num_feature_levels > len(srcs):
            _len_srcs = len(srcs)
            for l in range(_len_srcs, self.num_feature_levels):
                if l == _len_srcs:
                    src = self.input_proj[l](self.features[-1].tensors)
                else:
                    src = self.input_proj[l](srcs[-1])
                m = samples.mask
                mask = F.interpolate(m[None].float(), size=src.shape[-2:]).to(torch.bool)[0]
                pos_l = self.backbone[1](NestedTensor(src, mask)).to(src.dtype)
                srcs.append(src)
                masks.append(mask)
                self.poss.append(pos_l)

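self.features holds the image features obtained from the backbone.

self.input_proj is a set of 2D convolution layers, one per feature level. They are used to build src, and each src is paired with its padding mask.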
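
The second half of that loop, which adds feature levels beyond what the backbone provides, can be sketched in plain Python at the level of shapes and masks. This assumes the extra projection is a 3x3 convolution with stride 2 and padding 1 (an assumption borrowed from Deformable DETR style code, not stated in this post) and that the padding mask is resized with nearest-neighbor sampling, which is the default mode of F.interpolate. All function names are illustrative.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def resize_mask_nearest(mask, out_h, out_w):
    """Nearest-neighbor resize of a 2-D padding mask (True = padded pixel)."""
    in_h, in_w = len(mask), len(mask[0])
    return [
        [bool(mask[i * in_h // out_h][j * in_w // out_w]) for j in range(out_w)]
        for i in range(out_h)
    ]


def extra_level(last_h, last_w, image_mask):
    """Shape and padding mask of one extra level built from the previous one."""
    h, w = conv_out_size(last_h), conv_out_size(last_w)
    return (h, w), resize_mask_nearest(image_mask, h, w)


# An 8x8 input whose right half is padding; the last backbone level is 4x4.
image_mask = [[col >= 4 for col in range(8)] for _ in range(8)]
shape, mask = extra_level(4, 4, image_mask)
print(shape, mask)
```

Note that the mask is always resized from the full-resolution `samples.mask`, not from the previous level's mask, which matches the `m = samples.mask` line in the loop.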

Transformer Encoder

input_query_bbox = input_query_label = attn_mask = dn_meta = None
hs, reference, hs_enc, ref_enc, init_box_proposal = self.transformer(
    srcs, masks, input_query_bbox, self.poss, input_query_label, attn_mask, text_dict
)

self.transformer is the module that performs self-attention followed by cross-attention. It is implemented in transformer.py.

Text Enhancer

        if use_text_enhancer:
            text_enhance_layer = TransformerEncoderLayer(
                d_model=d_model,
                nhead=nhead // 2,
                dim_feedforward=dim_feedforward // 2,
                dropout=text_dropout,
            )
        else:
            text_enhance_layer = None

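The text enhancer is set up at lines 88-96 of transformer.py.

The TransformerEncoderLayer used here implements multi-head self-attention.

Note, however, that use_text_enhancer is False in the default settings. I could not confirm which value is used at inference time; if it really is disabled, my guess is that it was dropped for performance reasons.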

Feature Fusion

        if use_fusion_layer:
            feature_fusion_layer = BiAttentionBlock(
                v_dim=d_model,
                l_dim=d_model,
                embed_dim=dim_feedforward // 2,
                num_heads=nhead // 2,
                dropout=fusion_dropout,
                drop_path=fusion_droppath,
            )
        else:
            feature_fusion_layer = None

Feature fusion is done with bi-attention. BiAttentionBlock performs cross-attention in both directions, image-to-text and text-to-image.

The implementation can be found in fuse_modules.py.

attn_output_v = torch.bmm(attn_probs_v, value_l_states)
attn_output_l = torch.bmm(attn_probs_l, value_v_states)
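
Those two lines are the core of the bi-directional attention: one similarity matrix between image tokens and text tokens is normalized along each axis, and each modality then gathers values from the other. Below is a stripped-down sketch in plain Python for a single example and a single head. It leaves out the learned projections, masking, and any numerical clamping, so it illustrates the idea and is not the code in fuse_modules.py.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def bi_attention(v, l):
    """v: image tokens [n_img, d], l: text tokens [n_text, d]."""
    scale = 1.0 / math.sqrt(len(v[0]))
    # One shared similarity matrix: [n_img, n_text].
    scores = [[scale * sum(a * b for a, b in zip(vi, lj)) for lj in l] for vi in v]
    attn_v = [softmax(row) for row in scores]              # image attends to text
    attn_l = [softmax(list(col)) for col in zip(*scores)]  # text attends to image
    out_v = matmul(attn_v, l)  # [n_img, d], built from text values
    out_l = matmul(attn_l, v)  # [n_text, d], built from image values
    return out_v, out_l


v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
l = [[1.0, 0.0], [0.0, 1.0]]
out_v, out_l = bi_attention(v, l)
print(out_v)
print(out_l)
```

The image output keeps the number of image tokens and the text output keeps the number of text tokens, so both streams can be updated in place layer after layer.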

When the transformer encoder finishes, it returns both results through `return output, memory_text`.

output has shape `[bs, sum(hi*wi), 256]`. memory_text keeps the same shape as its input.

Transformer Decoder

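The decoder corresponds to everything from the language-guided query selection part of the figure above onward.

The decoder is built by repeating decoder_layer num_layer times.

Line 124 shows that this decoder_layer is a DeformableTransformerDecoderLayer.

DeformableTransformerDecoderLayer is declared at line 802. Its constructor defines the image cross-attention first, then the text cross-attention, the self-attention, and the FFN. Note that this is the declaration order; the execution order is set by forward, where self-attention is applied before the cross-attention modules.

So, to inspect an attention map, the place to look is the image attention in the last DeformableTransformerDecoderLayer.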

        # cross attention
        self.cross_attn = MSDeformAttn(
            embed_dim=d_model,
            num_levels=n_levels,
            num_heads=n_heads,
            num_points=n_points,
            batch_first=True,
        )
        self.dropout1 = nn.Dropout(dropout) if dropout > 0 else nn.Identity()
        self.norm1 = nn.LayerNorm(d_model)

        # cross attention text
        if use_text_cross_attention:
            self.ca_text = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
            self.catext_dropout = nn.Dropout(dropout) if dropout > 0 else nn.Identity()
            self.catext_norm = nn.LayerNorm(d_model)

        # self attention
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
        self.dropout2 = nn.Dropout(dropout) if dropout > 0 else nn.Identity()
        self.norm2 = nn.LayerNorm(d_model)

        # ffn
        self.linear1 = nn.Linear(d_model, d_ffn)
        self.activation = _get_activation_fn(activation, d_model=d_ffn, batch_dim=1)
        self.dropout3 = nn.Dropout(dropout) if dropout > 0 else nn.Identity()
        self.linear2 = nn.Linear(d_ffn, d_model)
        self.dropout4 = nn.Dropout(dropout) if dropout > 0 else nn.Identity()
        self.norm3 = nn.LayerNorm(d_model)

        self.key_aware_proj = None
        self.use_text_feat_guide = use_text_feat_guide
        assert not use_text_feat_guide
        self.use_text_cross_attention = use_text_cross_attention
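
The stacking described earlier (decoder_layer repeated num_layer times) can be sketched without any deep learning library. This assumes each repetition is an independent deep copy of the layer, so that layers do not share parameters, and that the decoder keeps the output of every layer, which is what makes per-layer auxiliary outputs possible. `ToyLayer`, `get_clones`, and `run_decoder` are stand-in names for illustration.

```python
import copy


class ToyLayer:
    """Stand-in for a decoder layer: counts its calls and adds 1 to the input."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return x + 1


def get_clones(layer, n):
    """Make n independent copies of a layer."""
    return [copy.deepcopy(layer) for _ in range(n)]


def run_decoder(layers, x):
    """Feed x through the layers in order, keeping every intermediate output."""
    intermediate = []
    for layer in layers:
        x = layer(x)
        intermediate.append(x)
    return intermediate


layers = get_clones(ToyLayer(), 6)
print(run_decoder(layers, 0))
```

The last element of the returned list plays the role of the final decoder output, so the "last layer" mentioned above is simply `layers[-1]`.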
