Overcoming VRAM limitations on Nvidia A100

Question

Christopher McCausland, 13 March 2023

Direct link to this question:

https://kr.mathworks.com/matlabcentral/answers/1928115-overcoming-vram-limitations-on-nvidia-a100

Commented: Joss Knight, 14 March 2023

I have access to a cluster with several Nvidia A100 40GB GPUs. I am training a deep learning network on these GPUs; however, using trainNetwork() only makes use of around 10GB of the GPU's vRAM. I believe this is a limitation of Nvidia CUDA, see here.

I have two related questions:

1. Other cluster users are writing in Python with the 'DistributedDataParallel' module in PyTorch and are able to load 40GB of data (over the CUDA limitation) onto the GPUs; is there a similar workaround for MATLAB?
2. If this isn't the case, is there any way to use Multi-Instance GPUs, essentially splitting the physical card into several smaller virtual GPUs and computing in parallel?

Ideally I would like to speed up computation, so having three quarters of the vRAM empty, which could otherwise be used for mini-batches, is a little heartbreaking.


Answer 1

Joss Knight, 14 March 2023

Direct link to this answer:

https://kr.mathworks.com/matlabcentral/answers/1928115-overcoming-vram-limitations-on-nvidia-a100#answer_1192575

Just increase the MiniBatchSize and it'll use more memory.

Comments: 6

Christopher McCausland, 14 March 2023

Hi Joss,

Thank you for the answer. This was my initial port of call.

My initial MiniBatchSize was 30, which used about 10GB of vRAM. As this is a 40GB card, I then doubled MiniBatchSize to 60, expecting around 20GB of vRAM usage. At that stage I get the dreaded:

Maximum variable size allowed on the device is exceeded.

However, when I monitor vRAM usage in real time I seem to 'max' out at 12GB. (Right-hand graph, red plot. A second, idle A100 can be seen in the background in blue.)

I can see GPU utilisation is around 90% for this GPU, which is good (left-hand graph, green plot), but also that there are regular dips, which I assume are pauses to load in more data. (There could also be spikes of vRAM usage which I don't see due to a slow polling rate.)

I am using the built-in trainNetwork with a datastore. I can share this code if needed; however, it is all very vanilla.

net = trainNetwork(dstrain,layers,options);

@Joss Knight, am I correct in saying that the expected behaviour is that increasing MiniBatchSize should allow me to use most of the vRAM on the card, say ~36GB (with some vRAM reserved for Nvidia tasks etc.)?

Christopher

Joss Knight, 14 March 2023

You'll have to ask a specific question about DispatchInBackground but I can certainly help.

GPU memory usage is determined by the number and size of GPU arrays that are allocated on the card (plus some other stuff). But the allowed size of each of those arrays is also limited by the number of elements that can be indexed with an int32, so you can't have more than 2147483647 elements.
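To make the element limit concrete, here is a small arithmetic sketch. A single-precision array at the int32 limit occupies roughly 8.6 GB, so one oversized array can trigger the error long before a 40GB card is full. The activation dimensions below are hypothetical and chosen only for illustration; they are not taken from the network in this thread.

```matlab
% Largest number of elements one GPU array may hold (int32 indexing limit).
maxElements = double(intmax('int32'));

% HYPOTHETICAL sizes: assume the largest per-observation activation in the
% network is 224-by-224 with 512 channels. Substitute your own network's values.
elementsPerObservation = 224 * 224 * 512;

% Largest mini-batch for which that single array stays under the element limit.
maxBatchByElements = floor(maxElements / elementsPerObservation);

% Memory that one array would occupy in single precision (4 bytes per element).
bytesAtLimit = maxBatchByElements * elementsPerObservation * 4;

fprintf('Max batch by element count: %d (about %.2f GB for that array)\n', ...
    maxBatchByElements, bytesAtLimit / 1e9);
```

Under these assumed sizes the array limit, not total memory, is what caps the batch size, which would be consistent with seeing the error while the card reports only around 12GB in use.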

Most people never hit the array size limit, so they can just increase their MiniBatchSize until they run out of memory. This increases the size of the arrays stored on the GPU during training.
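For the common case described here, raising the batch size is a one-line change in the training options. The following is a minimal sketch that assumes Deep Learning Toolbox is installed; the solver ('adam') and the value 48 are placeholders chosen for illustration, not settings from this thread.

```matlab
% Sketch only: the solver and batch size are ASSUMED placeholder values.
miniBatchSize = 48;   % somewhere between the working 30 and the failing 60

options = trainingOptions('adam', ...
    'MiniBatchSize', miniBatchSize, ...   % larger batches allocate larger GPU arrays
    'ExecutionEnvironment', 'gpu', ...    % train on a single GPU
    'Verbose', true);
```

The resulting options object is then passed to the same trainNetwork(dstrain,layers,options) call shown in the question.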

Once your card is running something on every available thread, it can't do more work in parallel, so it has to schedule to do it in chunks. So if you're using every available thread and passing it data as fast as you can, it's neither here nor there whether you're processing huge arrays in a single go or smaller arrays in multiple goes. You should worry less about whether your memory is full and more about whether your GPU is working flat out.

Christopher McCausland, 14 March 2023

Hi Joss,

That makes much more sense, thank you for the explanation too.

I will be able to eke out the final 10% of GPU utilisation by finding the exact MiniBatchSize that causes the failure. Regardless, as you mentioned, down-sampling the data should allow for a larger MiniBatchSize too.
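One way to find that boundary without trying every value is a bisection between the known-good size (30) and the known-bad size (60). This is a sketch under stated assumptions: fitsOnGpu is a hypothetical stand-in predicate that pretends everything above 52 fails. In practice it would need to wrap a short training run in try/catch and return false when the "Maximum variable size" error is thrown.

```matlab
% HYPOTHETICAL predicate: true if training at batch size b succeeds.
% The threshold 52 is made up; replace this with a real try/catch test run.
fitsOnGpu = @(b) b <= 52;

lo = 30;   % known to work (from the thread)
hi = 60;   % known to fail (from the thread)

% Bisection: keep lo as the largest size known to work, hi as the smallest known to fail.
while hi - lo > 1
    mid = floor((lo + hi) / 2);
    if fitsOnGpu(mid)
        lo = mid;
    else
        hi = mid;
    end
end

fprintf('Largest working MiniBatchSize: %d\n', lo);
```

This needs at most five trial runs for the 30 to 60 range. It only locates the largest size that runs, which is not necessarily the size with the best throughput.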

I will wait for an answer to https://uk.mathworks.com/matlabcentral/answers/1926685-deep-learning-with-partitionable-datastores-on-a-cluster?s_tid=srchtitle, as I'll need partition to be true before I can use DispatchInBackground. Ideally, I would like to distribute this over multiple GPU workers, so hopefully I can get partition working.

In the meantime I will mark this question as answered and will @ you in the next one if I can get partition behaving.

Thank you!

Christopher

Joss Knight, 14 March 2023

You may never get that 10% so don't get your hopes up! Also, the best utilization is not necessarily at the highest batch size.

Why not ask a new question where you show the code for your datastore, so one of us can help you make it partitionable?
