How can I reassemble 'patch embedded' data back into original data structure in Vision Transformer on DeepNetworkDesigner?

Question

Brian Park 2023년 6월 24일

0
링크

이 질문에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/1987523-how-can-i-reassemble-patch-embedded-data-back-into-original-data-structure-in-vision-transformer-o

답변: Samuel Somuyiwa 2023년 7월 24일

I'm trying to finetune the Vision Transformer model from 'Computer Vision Toolbox Model for Vision Transformer Network' for instance segmentation.

So I need to reassemble the SCB data back into SSCB data using position information in positional vector but cannot find the layer block that makes it work.

How can I seperate the positional vector from 577(S) and make 576(S) into (SS) format data?

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Answer 1

Samuel Somuyiwa 2023년 7월 24일

0
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/1987523-how-can-i-reassemble-patch-embedded-data-back-into-original-data-structure-in-vision-transformer-o#answer_1277773

MATLAB Online에서 열기

Assuming you are using the Vision Transformer model as a backbone/encoder, you can obtain the output embedding from the last block (the last layerNormalizationLayer, before indexing1dLayer) and write a function to remove the output embedding corresponding to the class token, and reshape the resulting embedding to SSCB format. See example in the code below.

The example assumes that your decoder is a model function. If the decoder is a dlnetwork or layer graph, you can do the following instead:

Remove all layers after the last layerNormalizationLayer in the Vision Transformer backbone.
Create a (formattable) custom layer and use the code in reshapePatchEmbedding function in the example below as its predict method. Alternatively, you can use a (formattable) functionLayer (using a handle to reshapePatchEmbedding).
Connect the last layerNormalizationLayer in the Vision Transformer backbone to the custom/function layer, and connect the custom/function layer to the input of the decoder network.

% Get Vision Transformer model
net = visionTransformer;
% Create dummy input
input = dlarray(rand(384,384,3),'SSCB');
% Obtain output embedding from last layerNormalizationLayer
out = forward(net, input, Outputs='encoder_norm');
% Reshape output patch embedding
out = reshapePatchEmbedding(out);
function out = reshapePatchEmbedding(in)
% Remove output embedding corresponding to class token from input
out = in(2:end,:,:);
% Reshape resulting embedding to input format
WH = sqrt(size(out, 1));
C = size(out, 2);
out = reshape(out, WH, WH, C, []); % Shape is W x H x C x N
out = permute(out, [2, 1, 3, 4]); % Shape is H x W x C x N
% Convert to formatted dlarray
out = dlarray(out, 'SSCB');
end

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글을 달려면 로그인하십시오.

How can I reassemble 'patch embedded' data back into original data structure in Vision Transformer on DeepNetworkDesigner?

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

채택된 답변

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

추가 답변 (0개)

참고 항목

카테고리

태그

제품

Community Treasure Hunt

How can I reassemble 'patch embedded' data back into original data structure in Vision Transformer on DeepNetworkDesigner?

댓글 수: 0 이전 댓글 -2개 표시이전 댓글 -2개 숨기기

채택된 답변

댓글 수: 0 이전 댓글 -2개 표시이전 댓글 -2개 숨기기

추가 답변 (0개)

참고 항목

카테고리

태그

제품

Community Treasure Hunt

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기