How can I fix the CUDNN errors when I'm running train with RTX 2080?

Question

Aydin Sümer on 5 Dec 2018

Direct link to this question:

https://kr.mathworks.com/matlabcentral/answers/433944-how-can-i-fix-the-cudnn-errors-when-i-m-running-train-with-rtx-2080

Edited: Sumaiya Ahmad on 6 Feb 2024

Hello,

I recently bought two RTX 2080 GPUs, so we now have two RTX 2080s and one TITAN Xp. I want these GPUs to work in parallel, but I keep getting the CUDNN_STATUS_EXECUTION_FAILED error. I have set the CUDA cache size to 512 MB in the system environment variables, but I still get the same error.
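For anyone checking the same cache setting, here is a minimal sketch. It assumes the variable meant is NVIDIA's CUDA_CACHE_MAXSIZE, which takes a size in bytes; the post does not name the variable. The code only checks whether the MATLAB session can see the value and changes nothing.

```matlab
% Assumption: the "CUDA cache" setting is the CUDA_CACHE_MAXSIZE environment
% variable; the original post does not name it.
expectedBytes = 512 * 1024^2;          % 512 MB expressed in bytes
raw = getenv('CUDA_CACHE_MAXSIZE');    % returns '' if the variable is not set

if isempty(raw)
    disp('CUDA_CACHE_MAXSIZE is not visible to this MATLAB session.');
elseif str2double(raw) == expectedBytes
    disp('CUDA_CACHE_MAXSIZE is set to 512 MB.');
else
    fprintf('CUDA_CACHE_MAXSIZE is "%s" (expected %d bytes).\n', raw, expectedBytes);
end
```

If the variable was added while MATLAB was already open, restarting MATLAB is probably needed before it shows up here.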

Training across multiple GPUs.
Initializing image normalization.
|=======================================================================================================================|
|     Epoch    |   Iteration  | Time Elapsed |  Mini-batch  |  Validation  |  Mini-batch  |  Validation  | Base Learning|
|              |              |  (seconds)   |     Loss     |     Loss     |     RMSE     |     RMSE     |     Rate     |
|=======================================================================================================================|
Error using trainNetwork (line 150)
Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED.
Error in segnet_deadpixel_train (line 86)
[net info] = trainNetwork(pximds,lgraph,options);
Caused by:
    Error using nnet.internal.cnn.ParallelTrainer/train (line 68)
    Error detected on worker 1.
        Error using nnet.internal.cnn.layer.util.Convolution2DGPUStrategy/forward (line 16)
        Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED.

Apart from this error, when I train on a single GPU the TITAN Xp works well, but the RTX 2080 runs slowly and gives the following warning.

Warning: GPU is low on memory, which can slow performance due to additional data transfers with main memory. Try reducing 
the 'MiniBatchSize' training option. This warning will not appear again unless you run the command :
warning('on','nnet_cnn:warning:GPULowMemory').

I've tried MATLAB 2018a and 2018b on 64-bit Windows 10. Which version of MATLAB should I use to resolve these issues? Which versions of CUDA and cuDNN support the RTX 2080? How can I fix these errors?

Comments: 1

Sumaiya Ahmad on 6 Feb 2024

Edited: Sumaiya Ahmad on 6 Feb 2024

I got this error:

"cuDNN failed to return a valid plan for cudnn Backend Execute for convolution."

I reduced my validation size and it worked! No more errors!


Answer 1

Joss Knight on 5 Dec 2018

Direct link to this answer:

https://kr.mathworks.com/matlabcentral/answers/433944-how-can-i-fix-the-cudnn-errors-when-i-m-running-train-with-rtx-2080#answer_350639

Edited: Joss Knight on 5 Dec 2018

This is a known issue. Before you start anything else, run

try
    % Warm-up call to an internal function; any error it throws is deliberately ignored.
    nnet.internal.cnngpu.reluForward(1);
catch ME
end

That should clear the issue.
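For multi-GPU training, each parallel worker is a separate MATLAB process, so one possible variant is to run the same warm-up call on every worker before training starts. This is only a sketch and is not confirmed anywhere in this thread. It assumes Parallel Computing Toolbox and uses the same internal, undocumented function as the snippet above.

```matlab
% Reuse the open parallel pool, or start one if automatic pool creation is enabled.
pool = gcp;

% Run the warm-up call once on every worker and ignore any error it throws.
spmd
    try
        nnet.internal.cnngpu.reluForward(1);   % internal, undocumented function
    catch
        % Intentionally empty: the first call is allowed to fail.
    end
end
```

Starting training after this means trainNetwork reuses the same pool rather than opening a new one.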

Comments: 5

Aydin Sümer on 5 Dec 2018

Thanks for your reply. I tried what you suggested, but the error persists. When the parallel pool starts, it gives the warning below. Could the problem be here?

Starting parallel pool (parpool) using the 'local' profile ...
connected to 3 workers.
Lab 1: 
  Warning: The CUDA driver must recompile the GPU libraries because your device is more recent than the libraries. Recompiling can take several minutes.
  > In parallel.internal.gpu.selectDevice
Lab 2: 
  Warning: The CUDA driver must recompile the GPU libraries because your device is more recent than the libraries. Recompiling can take several minutes.
  > In parallel.internal.gpu.selectDevice
Lab 1: 
    In parallel.gpu.GPUDevice.current (line 44)
    In gpuDevice (line 23)
    In nnet.internal.cnn.util.isGPUCompatible (line 10)
    In nnet.internal.cnn.assembler.setupExecutionEnvironment>iGetHostnameAndDeviceIndex (line 284)
    In spmdlang.remoteBlockExecution (line 50)
Lab 2: 
    In parallel.gpu.GPUDevice.current (line 44)
    In gpuDevice (line 23)
    In nnet.internal.cnn.util.isGPUCompatible (line 10)
    In nnet.internal.cnn.assembler.setupExecutionEnvironment>iGetHostnameAndDeviceIndex (line 284)
    In spmdlang.remoteBlockExecution (line 50)
Initializing image normalization.

After that, I get the CUDNN_STATUS_EXECUTION_FAILED error.

Aydin Sümer on 5 Dec 2018

Thank you so much. That code fixed my problem.

I'd like to ask a different question, because I don't understand why this happens. When I use a single GPU (the TITAN Xp), training iterates more quickly.

Training on single GPU.
Initializing image normalization.
|========================================================================================|
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Mini-batch  |  Base Learning  |
|         |             |   (hh:mm:ss)   |   Accuracy   |     Loss     |      Rate       |
|========================================================================================|
|       1 |           1 |       00:00:02 |       63.61% |       0.7633 |          0.0010 |
|       1 |           2 |       00:00:04 |       63.68% |       0.7207 |          0.0010 |
|       1 |           4 |       00:00:07 |       63.86% |       0.7527 |          0.0010 |
|       1 |           6 |       00:00:09 |       64.10% |       0.7661 |          0.0010 |
|       1 |           8 |       00:00:12 |       64.05% |       0.7409 |          0.0010 |
|       1 |          10 |       00:00:15 |       64.64% |       0.7467 |          0.0010 |

With multiple GPUs, each iteration takes longer.

Starting parallel pool (parpool) using the 'local' profile ...
connected to 3 workers.
Lab 1: 
  Warning: The CUDA driver must recompile the GPU libraries because your device is more recent than the libraries. Recompiling can take several minutes.
    In spmdlang.remoteBlockExecution (line 50)
Lab 2: 
  Warning: The CUDA driver must recompile the GPU libraries because your device is more recent than the libraries. Recompiling can take several minutes.
    In spmdlang.remoteBlockExecution (line 50)
Initializing image normalization.
|========================================================================================|
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Mini-batch  |  Base Learning  |
|         |             |   (hh:mm:ss)   |   Accuracy   |     Loss     |      Rate       |
|========================================================================================|
|       1 |           1 |       00:00:08 |       60.96% |       0.9061 |          0.0010 |
|       1 |           2 |       00:00:16 |       60.31% |       0.8097 |          0.0010 |
|       1 |           4 |       00:00:31 |       60.24% |       0.8561 |          0.0010 |
|       1 |           6 |       00:00:46 |       60.14% |       0.7761 |          0.0010 |
|       1 |           8 |       00:01:01 |       60.57% |       0.7926 |          0.0010 |
|       1 |          10 |       00:01:16 |       60.69% |       0.7847 |          0.0010 |
Lab 1: 
  Warning: GPU is low on memory, which can slow performance due to additional data transfers with main memory. 
  Try reducing the 'MiniBatchSize' training option. This warning will not appear again unless you run the command: warning('on','nnet_cnn:warning:GPULowOnMemory').

What is the reason for this?

Joss Knight on 5 Dec 2018

When your device has to start paging, it goes A LOT slower. This is just a backup to help your training finish, but it will slow things down.

Secondly, if you're on Windows, multi-GPU training is slower.

Thirdly, if the MiniBatchSize is 1, multi-GPU training is pointless because there is no way to divide the mini-batch between workers. Set the MiniBatchSize to 2. But the 2080 will still run out of memory.
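To make the third point concrete, here is a small arithmetic sketch of dividing a mini-batch across workers. The even split with the remainder going to the first workers is an assumption made for illustration; it is not necessarily how trainNetwork distributes the data internally.

```matlab
miniBatchSize = 1;   % the mini-batch size discussed in this thread
numWorkers    = 3;   % one worker per GPU, as in the pool output above

% Illustrative even split; any remainder goes to the first workers.
% Assumption: this is not necessarily trainNetwork's actual scheme.
perWorker = floor(miniBatchSize / numWorkers) + ...
    ((1:numWorkers) <= mod(miniBatchSize, numWorkers));

disp(perWorker)   % 1 0 0: two of the three workers receive no observations
```

With a mini-batch of one observation, only one worker has anything to compute, yet the run still pays the cost of communicating between workers.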

Aydin Sümer on 5 Dec 2018

Edited: Aydin Sümer on 5 Dec 2018

Thanks for your reply. I understand better now.


Answer 2

Joss Knight on 5 Dec 2018

Direct link to this answer:

https://kr.mathworks.com/matlabcentral/answers/433944-how-can-i-fix-the-cudnn-errors-when-i-m-running-train-with-rtx-2080#answer_350646

Regarding issues with memory, the Titan XP has 12GB of memory while the RTX 2080 has only 8GB. You'll need to reduce your MiniBatchSize further to train SegNet on the 2080.
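To see the memory difference on your own machine, a short sketch like the following lists each GPU with its total and currently available memory. It assumes Parallel Computing Toolbox is installed and uses only documented gpuDevice properties.

```matlab
% List every GPU MATLAB can see, with its total and currently free memory.
% Note: gpuDevice(idx) selects that device, and resets it if it was already
% selected, so run this before creating any gpuArray data you want to keep.
numGpus = gpuDeviceCount;
for idx = 1:numGpus
    d = gpuDevice(idx);
    fprintf('%d: %s, %.1f GB total, %.1f GB available\n', ...
        idx, d.Name, d.TotalMemory / 2^30, d.AvailableMemory / 2^30);
end
```

The device with the least memory is the one to size MiniBatchSize for, since every worker has to hold its share of the batch along with the network.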

Comments: 4

Aydin Sümer on 5 Dec 2018

I'm using this configuration; I think it's the original version.

lgraph = segnetLayers(imageSize,numClasses,'vgg16');

These are my training options.

options = trainingOptions('sgdm', ...
    'Momentum',0.9, ...
    'InitialLearnRate',1e-3, ...
    'L2Regularization',0.0005, ...
    'MaxEpochs',100, ...  
    'MiniBatchSize',1, ...
    'Shuffle','every-epoch', ...
    'CheckpointPath', ConvnetFolder, ...
    'VerboseFrequency',2,'ExecutionEnvironment','multi-gpu');

Joss Knight on 5 Dec 2018

I'm looking into this. It may be that the options to segnetLayers do not allow a small enough network for training on an 8GB GPU. You may have to edit the network manually to create a smaller one.
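One thing that might be worth trying before editing layers by hand is the encoder-depth form of segnetLayers, which builds a network that is not based on VGG-16. This is a sketch only: the image size, class count, depth and channel count are hypothetical values, and nothing in this thread confirms that the result fits in 8 GB for your data.

```matlab
% Hypothetical values for illustration; substitute your own image size and class count.
imageSize    = [360 480 3];
numClasses   = 2;
encoderDepth = 2;   % fewer encoder/decoder stages than the VGG-16 based network

% Conservative choice: keep height and width divisible by 2^encoderDepth so the
% pooling and unpooling stages line up.
assert(all(mod(imageSize(1:2), 2^encoderDepth) == 0));

% 'NumOutputChannels' sets the number of filters in each convolution layer.
lgraph = segnetLayers(imageSize, numClasses, encoderDepth, 'NumOutputChannels', 32);
```

A shallower network with fewer filters uses less GPU memory, usually at some cost in segmentation accuracy, so treat it as a starting point.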


Answer 3

Julian Beckmann on 7 Dec 2021

Direct link to this answer:

https://kr.mathworks.com/matlabcentral/answers/433944-how-can-i-fix-the-cudnn-errors-when-i-m-running-train-with-rtx-2080#answer_849475

I am getting the same error message (CUDNN_STATUS_EXECUTION_FAILED) when trying to train a network on a local RTX 3070 GPU (using MATLAB 2021a). The odd thing is that I trained the same network structure with identical hyperparameters yesterday and it worked perfectly fine. Since this morning, however, I get the CUDNN_STATUS_EXECUTION_FAILED error whenever I try to start training a network.

I literally did not change anything in my hardware or software. The only thing I did was shut my PC down yesterday evening, and when I tried to start training again this morning, I got the error message described above, which seems odd to me.

I have already tried Joss's proposals to fix the issue, but it still persists.
