How can I make a vision transformer model that receives multiple images as input?

Question

수민 안, 26 Dec 2023

https://kr.mathworks.com/matlabcentral/answers/2064127-how-can-i-make-vision-transformer-model-that-recives-input-multiple-images

Edited: Debraj Maji, 27 Dec 2023

Is it possible to create or train a deep learning model in MATLAB that receives multiple images as input and produces one sequence as output?

For example, how can a model take 20 consecutive images as input and output a sequence such as '11153'?

Thanks for reading

Answer 1

Debraj Maji, 27 Dec 2023

https://kr.mathworks.com/matlabcentral/answers/2064127-how-can-i-make-vision-transformer-model-that-recives-input-multiple-images#answer_1378312

Edited: Debraj Maji, 27 Dec 2023

Hi @수민 안

I understand that you are trying to create a Vision Transformer (ViT) model that takes multiple images as input and generates a sequence.

Creating a Vision Transformer (ViT) model that receives multiple images as input in MATLAB involves adapting the ViT architecture to handle sequences of images. The original ViT architecture is designed for single-image classification tasks. To modify it for sequential multi-image input, you would treat each image as a token in the sequence and process these tokens in a way similar to how transformers process sequential data in natural language processing (NLP).

Here is a conceptual outline of how you could approach this:

Step 1: Preprocessing:

Resize all images to a fixed size.
Flatten each image into a 1D vector or use patches as tokens, as done in ViT.
Add positional encoding to retain the order of the images, since self-attention on its own does not distinguish token order.
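As a rough illustration of Step 1, here is a minimal sketch in plain Python (not MATLAB, and not any toolbox API). It assumes a grayscale image stored as a list of rows; the function names `image_to_patches`, `sinusoidal_position` and `add_positions` are made up for this example, and the fixed sinusoidal encoding is just one common choice of positional encoding.

```python
import math

def image_to_patches(image, patch):
    """Split a 2D grayscale image (list of rows) into flattened patch x patch tokens."""
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens

def sinusoidal_position(index, dim):
    """Fixed sinusoidal encoding for one position (sin on even slots, cos on odd)."""
    return [math.sin(index / 10000 ** (i / dim)) if i % 2 == 0
            else math.cos(index / 10000 ** ((i - 1) / dim))
            for i in range(dim)]

def add_positions(tokens):
    """Add a position vector to every token so order information is kept."""
    dim = len(tokens[0])
    return [[v + p for v, p in zip(tok, sinusoidal_position(i, dim))]
            for i, tok in enumerate(tokens)]

# Toy 4x4 "image" cut into 2x2 patches gives 4 tokens of length 4.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = image_to_patches(image, patch=2)
encoded = add_positions(tokens)
```

For the 20-image case, the same `add_positions` idea could be applied with one token per image (for example, a feature vector per frame) instead of one token per patch.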

Step 2: Transformer Encoder:

Use a series of transformer encoder layers to process the sequence of image tokens.
Each transformer encoder layer would include multi-head self-attention and feedforward neural networks.
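To make the self-attention part of Step 2 concrete, here is a small sketch in plain Python (again not MATLAB code). It is a single-head scaled dot-product attention with the learned query, key and value projections left out, which is a simplification I am assuming only to keep the example short; a real encoder layer also adds the feedforward network, residual connections and normalization.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections.

    A trained layer would first multiply the tokens by learned query, key and
    value matrices; they are omitted here.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Similarity of this token to every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of all tokens.
        outputs.append([sum(w * value[j] for w, value in zip(weights, tokens))
                        for j in range(dim)])
    return outputs

frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy image tokens
mixed = self_attention(frames)
```

Each row of `mixed` blends information from all three tokens, which is how every image in the sequence can influence the representation of every other image.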

Step 3: Sequence Decoder:

After processing the images through the transformer encoder, you need to decode the output into a sequence.
You can use an RNN, LSTM, or another transformer decoder to generate the output sequence.

Step 4: Output Layer:

The output layer would produce the final sequence, which could be a series of classification layers, one for each position in the output sequence.
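The output layer in Step 4 can be pictured as one score vector per output position, with the highest-scoring class picked at each position. The sketch below is plain Python with hand-made scores standing in for real network output; `decode_sequence` is a hypothetical helper name, and the digit alphabet is an assumption based on the '11153' example in the question.

```python
def decode_sequence(position_scores, classes="0123456789"):
    """Pick the highest-scoring class at each output position and join them."""
    symbols = []
    for scores in position_scores:
        best = max(range(len(scores)), key=lambda i: scores[i])
        symbols.append(classes[best])
    return "".join(symbols)

# Hand-made scores for five positions; a trained model would produce these.
scores = [[0.9 if str(d) == ch else 0.1 for d in range(10)] for ch in "11153"]
decoded = decode_sequence(scores)
print(decoded)  # prints 11153
```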

In MATLAB, you can use Deep Learning Toolbox to create custom layers and models. MATLAB currently does provide a predefined ViT; however, this scenario would require you to implement the transformer layers yourself. You can follow this documentation for the steps to define custom deep learning layers:

https://www.mathworks.com/help/deeplearning/ug/define-custom-deep-learning-layer.html

For the pretrained ViT available in MATLAB, refer to the following documentation: https://www.mathworks.com/help/vision/ref/visiontransformer.html

For more information on the predefined deep learning layers in MATLAB, refer to the following link:

https://www.mathworks.com/help/deeplearning/ug/list-of-deep-learning-layers.html

I hope this resolves your query.

With regards,

Debraj.

Answer 2

Shubham, 27 Dec 2023

https://kr.mathworks.com/matlabcentral/answers/2064127-how-can-i-make-vision-transformer-model-that-recives-input-multiple-images#answer_1378252

Hi 수민 안,

Yes, it is possible to create a deep learning model in MATLAB that takes multiple images as input and outputs a sequence of numbers. This can be done by using a Convolutional Neural Network (CNN) for image feature extraction combined with a Recurrent Neural Network (RNN) or Long Short-Term Memory (LSTM) network for sequence prediction.

Here's a high-level overview of how you might approach this:

1. Data Preparation:

Organize your images and corresponding sequence labels.
Preprocess the images (resizing, normalization, etc.).
Split the data into training, validation, and test sets.
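One detail worth showing for the data preparation step: the split should be done on whole samples (one group of 20 images plus its label), so frames from the same group never end up in different sets. Below is a minimal sketch in plain Python; the file names, the 70/15/15 ratio and the helper name `split_samples` are all assumptions made for illustration.

```python
import random

def split_samples(samples, train=0.7, val=0.15, seed=0):
    """Shuffle whole samples, then cut them into train/validation/test lists."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Each sample pairs 20 hypothetical image file names with one label string.
samples = [([f"clip{i:03d}_frame{j:02d}.png" for j in range(20)], "11153")
           for i in range(100)]
train_set, val_set, test_set = split_samples(samples)
```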

2. Model Architecture:

Use a CNN as the feature extractor for the images. You can use pre-trained networks like VGG, ResNet, or create your own.
Flatten the output of the CNN or use global pooling to reduce the dimensionality.
Feed the output into an RNN or LSTM layer(s) to handle the sequence prediction.
The final output layer should have one unit per class (for example, 10 units for the digits 0 to 9) with a softmax activation, applied at each position of the output sequence, if you are treating each position as a classification problem.
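The data flow of this architecture (CNN features, then global pooling, then a recurrent layer over the frames) can be sketched in plain Python as follows. This is not MATLAB code: the feature maps are made-up numbers standing in for CNN output, and the recurrent step uses scalar weights applied elementwise, which is a deliberate simplification of a real RNN or LSTM layer with weight matrices.

```python
import math

def global_average_pool(feature_maps):
    """Collapse each channel's 2D feature map to its mean, one value per channel."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]

def rnn_step(hidden, features, w_in, w_rec):
    """One simplified recurrent update: h' = tanh(w_in * x + w_rec * h), elementwise."""
    return [math.tanh(w_in * x + w_rec * h) for x, h in zip(features, hidden)]

# 20 frames, each with 2 channels of 3x3 hypothetical CNN feature maps.
frames = [[[[float(t)] * 3 for _ in range(3)],
           [[1.0] * 3 for _ in range(3)]] for t in range(20)]

hidden = [0.0, 0.0]
for fmaps in frames:
    pooled = global_average_pool(fmaps)  # one value per channel
    hidden = rnn_step(hidden, pooled, w_in=0.1, w_rec=0.5)
```

After the loop, `hidden` summarizes all 20 frames in order; a classification layer on top of it (or on the per-step hidden states) would produce the output sequence.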

3. Training:

Configure training with an appropriate loss function (e.g., categorical cross-entropy if you're treating the sequence prediction as a classification problem).
Train the model using the training data with validation data to monitor performance.
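For reference, the categorical cross-entropy mentioned above can be computed per output position and averaged over the sequence. The sketch below is plain Python with made-up scores; `sequence_cross_entropy` is a hypothetical helper name, and the digit alphabet is assumed from the '11153' example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_cross_entropy(position_scores, target, classes="0123456789"):
    """Mean categorical cross-entropy over the positions of one output sequence."""
    losses = []
    for scores, symbol in zip(position_scores, target):
        probs = softmax(scores)
        losses.append(-math.log(probs[classes.index(symbol)]))
    return sum(losses) / len(losses)

# An untrained model (all scores equal) versus one that favors the right digits.
uniform = [[0.0] * 10 for _ in "11153"]
confident = [[5.0 if str(d) == ch else 0.0 for d in range(10)] for ch in "11153"]
loss_uniform = sequence_cross_entropy(uniform, "11153")      # equals ln(10)
loss_confident = sequence_cross_entropy(confident, "11153")  # smaller than loss_uniform
```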

4. Evaluation and Testing:

Evaluate the model's performance on the test set.
Adjust hyperparameters or model architecture as needed based on performance.

5. Prediction:

Use the trained model to predict sequences from new sets of images.

I hope this helps!
