How can I fix CUDNN_STATUS_EXECUTION_FAILED error while training a Faster RCNN on a laptop RTX 3070?

Question

Pedro Garcia on 29 Apr 2021

https://kr.mathworks.com/matlabcentral/answers/816915-how-can-i-fix-cudnn_status_execution_failed-error-while-training-a-faster-rcnn-on-a-laptop-rtx-3070

Answered: CANBERK TATLI on 24 Jul 2022

Hey everyone,

I am relatively new to deep learning and I've been trying to train a Faster RCNN for multi-class object detection on a custom dataset (3086 training images and 386 validation images of size [224 396 3]). The number of classes is 11.

Recently I started getting the warning "CUDA_ERROR_ILLEGAL_ADDRESS", which is followed by the error "Error using nnet.internal.cnngpu.reluForward: Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED".

My GPU device has the following properties (the available-memory line disappeared after this error popped up, but it usually shows 7.33 GB of available memory):

                      Name: 'NVIDIA GeForce RTX 3070 Laptop GPU'
                     Index: 1
         ComputeCapability: '8.6'
            SupportsDouble: 1
             DriverVersion: 11.3000
            ToolkitVersion: 11
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 8.5899e+09
       MultiprocessorCount: 40
              ClockRateKHz: 1620000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 1
          CanMapHostMemory: 1
           DeviceSupported: 1
           DeviceAvailable: 1
            DeviceSelected: 1
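
For reference, a minimal sketch of how this listing and the missing available-memory value can be queried again. It assumes Parallel Computing Toolbox is installed and that the device is still responsive; after a CUDA error the AvailableMemory query may itself fail until the device is reset.

```matlab
% Query the currently selected GPU (requires Parallel Computing Toolbox).
gpu = gpuDevice;

% Show the full property listing, as in the post.
disp(gpu);

% AvailableMemory and TotalMemory are reported in bytes; convert to GB.
fprintf('%s: %.2f GB free of %.2f GB\n', gpu.Name, ...
    gpu.AvailableMemory / 1e9, gpu.TotalMemory / 1e9);
```

Running this before training and again after the error makes it easier to tell whether the device ran out of memory or ended up in a failed state.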

And these are my training options (backbone network is a ResNet-50):

options = trainingOptions('sgdm', ...
    'MiniBatchSize', 1, ...
    'InitialLearnRate', 1e-3, ...
    'LearnRateSchedule', 'piecewise', ...
    'LearnRateDropFactor', 0.2, ...
    'LearnRateDropPeriod', 2, ...
    'MaxEpochs', 3, ...
    'ExecutionEnvironment','gpu',...
    'ValidationData', resizedDsVal, ...
    'Verbose',true);
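% Warm-up call to the internal ReLU routine; any error it throws is ignored.
% Note: the package name is nnet (as in the error message), not net.
try
    nnet.internal.cnngpu.reluForward(1);
catch ME
end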
% Train the Faster R-CNN detector    
fasterRCNN = trainFasterRCNNObjectDetector(augmentedresizedDsTrain, lgraph, options, ...
        'NegativeOverlapRange',[0 0.3], 'PositiveOverlapRange',[0.6 1]);

I should also add that this error appears in the middle of the training process: everything looks fine at the beginning, but then this appears:

Initializing input data normalization.
|==========================================================================================================================================================================================|
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Mini-batch  |  Mini-batch  |  RPN Mini-batch  |  RPN Mini-batch  |  Validation  |  Validation  |  Validation  |  Base Learning  |
|         |             |   (hh:mm:ss)   |     Loss     |   Accuracy   |     RMSE     |     Accuracy     |       RMSE       |     Loss     |   Accuracy   |     RMSE     |      Rate       |
|==========================================================================================================================================================================================|
|       1 |           1 |       00:03:26 |       3.1444 |       30.39% |         0.18 |           57.48% |             0.90 |       3.2009 |       34.06% |         0.16 |          0.0010 |
(...)
|       1 |        1150 |       01:32:11 |       0.0892 |       99.42% |         0.16 |          100.00% |             0.21 |       0.2707 |       99.32% |         0.17 |          0.0010 |
|       1 |        1200 |       01:36:02 |       0.9868 |       98.56% |         0.14 |           94.53% |             1.82 |       0.2551 |       99.44% |         0.17 |          0.0010 |
|       1 |        1250 |       01:39:53 |       0.1590 |       99.80% |         0.13 |           98.44% |             0.37 |       0.2649 |       99.49% |         0.16 |          0.0010 |
|       1 |        1300 |       01:43:18 |       2.0556 |      100.00% |              |           98.44% |             2.77 |       0.2512 |       99.19% |         0.15 |          0.0010 |
|       1 |        1350 |       01:46:36 |       0.0542 |      100.00% |         0.08 |          100.00% |             0.23 |       0.2476 |       99.45% |         0.14 |          0.0010 |
Warning: Encountered unexpected error during CUDA execution. The CUDA error was:
CUDA_ERROR_ILLEGAL_ADDRESS 
Warning: Encountered unexpected error during CUDA execution. The CUDA error was:
CUDA_ERROR_ILLEGAL_ADDRESS 
Warning: Encountered unexpected error during CUDA execution. The CUDA error was:
CUDA_ERROR_ILLEGAL_ADDRESS 
Error using nnet.internal.cnngpu.reluForward
Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED.
Error in nnet.internal.cnn.util.VectorReporter/computeAndReport (line 68)
                    feval( method, this.Reporters{i}, varargin{:} );
Error in nnet.internal.cnn.util.VectorReporter/computeIteration (line 24)
            computeAndReport( this, 'computeIteration', summary, network );
Error in nnet.internal.cnn.Trainer/train (line 144)
                    reporter.computeIteration( this.Summary, net );
Error in vision.internal.cnn.trainNetwork (line 110)
trainedNet = trainer.train(trainedNet, trainingDispatcher);
Error in trainFasterRCNNObjectDetector>iTrainEndToEnd (line 901)
    [net, info] = vision.internal.cnn.trainNetwork(...
Error in trainFasterRCNNObjectDetector (line 428)
    [detector, info] = iTrainEndToEnd(trainingData, fastRCNN, options, params, executionSettings, imageInfo);
Error in FasterRCNN_ResNet50 (line 106)
fasterRCNN = trainFasterRCNNObjectDetector(augmentedresizedDsTrain, lgraph, options, ...

My MATLAB version is R2021a. When I reboot my PC the issue seems to disappear temporarily, and I am able to restart the training process. However, after 1-2 hours the error pops up again and stops the training.
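
Since a reboot clears the problem temporarily, a lighter alternative that may be worth trying is resetting the GPU from within MATLAB. The sketch below assumes the device index 1 from the listing above. Whether a reset recovers the device after CUDA_ERROR_ILLEGAL_ADDRESS is not guaranteed, and it does not address the underlying cause.

```matlab
% Select GPU 1 (the index shown in the device listing) and reset it.
% Note: reset clears device memory, so existing gpuArray variables become invalid.
gpu = gpuDevice(1);
reset(gpu);

% Confirm the device is usable again before restarting training.
fprintf('Device available: %d, free memory: %.2f GB\n', ...
    gpu.DeviceAvailable, gpu.AvailableMemory / 1e9);
```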

Any information on how to fix this issue would be valuable. Thank you!