How can I fix CUDNN_STATUS_EXECUTION_FAILED error while training a Faster RCNN on a laptop RTX 3070?
Hey everyone,
I am relatively new to deep learning and I've been trying to train a Faster RCNN for multi-class object detection on a custom dataset (3086 training images and 386 validation images of size [224 396 3]). The number of classes is 11.
Recently I've started getting a "CUDA_ERROR_ILLEGAL_ADDRESS" warning, which is followed by the error "Error using nnet.internal.cnngpu.reluForward
Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED".
My GPU device has the following properties (the available memory line disappeared after this error popped up, but it usually shows 7.33 GB of available memory):
Name: 'NVIDIA GeForce RTX 3070 Laptop GPU'
Index: 1
ComputeCapability: '8.6'
SupportsDouble: 1
DriverVersion: 11.3000
ToolkitVersion: 11
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 8.5899e+09
MultiprocessorCount: 40
ClockRateKHz: 1620000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceAvailable: 1
DeviceSelected: 1
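For reference, a listing like the one above comes from gpuDevice. The snippet below is only a minimal sketch, assuming Parallel Computing Toolbox is installed and a supported GPU is selected. It prints the available memory and then resets the device, which clears GPU memory without restarting MATLAB. Whether a reset is enough to recover from CUDA_ERROR_ILLEGAL_ADDRESS on this hardware is an assumption, not something confirmed in this thread.

```matlab
% Minimal sketch: query the currently selected GPU and report its memory.
gpu = gpuDevice;
fprintf('Available memory: %.2f GB of %.2f GB\n', ...
    gpu.AvailableMemory / 1e9, gpu.TotalMemory / 1e9);

% Reset the device. This clears all gpuArray and CUDAKernel data on it,
% so any variables that live on the GPU become invalid afterwards.
reset(gpu);

% Query again to see whether the available memory is reported once more.
gpu = gpuDevice;
disp(gpu.AvailableMemory);
```

Comparing the available memory before training and after the error shows whether the device is still responding at all.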
And these are my training options (backbone network is a ResNet-50):
options = trainingOptions('sgdm', ...
'MiniBatchSize', 1, ...
'InitialLearnRate', 1e-3, ...
'LearnRateSchedule', 'piecewise', ...
'LearnRateDropFactor', 0.2, ...
'LearnRateDropPeriod', 2, ...
'MaxEpochs', 3, ...
'ExecutionEnvironment','gpu',...
'ValidationData', resizedDsVal, ...
'Verbose',true);
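% Call the internal cuDNN ReLU routine once before training.
% Any error it throws is caught and deliberately ignored.
try
    nnet.internal.cnngpu.reluForward(1);
catch ME
end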
% Train the Faster R-CNN detector
fasterRCNN = trainFasterRCNNObjectDetector(augmentedresizedDsTrain, lgraph, options, ...
'NegativeOverlapRange',[0 0.3], 'PositiveOverlapRange',[0.6 1]);
I should also add that this error appears in the middle of the training process. At the beginning everything looks fine, but then this happens:
Initializing input data normalization.
|==========================================================================================================================================================================================|
| Epoch | Iteration | Time Elapsed | Mini-batch | Mini-batch | Mini-batch | RPN Mini-batch | RPN Mini-batch | Validation | Validation | Validation | Base Learning |
| | | (hh:mm:ss) | Loss | Accuracy | RMSE | Accuracy | RMSE | Loss | Accuracy | RMSE | Rate |
|==========================================================================================================================================================================================|
| 1 | 1 | 00:03:26 | 3.1444 | 30.39% | 0.18 | 57.48% | 0.90 | 3.2009 | 34.06% | 0.16 | 0.0010 |
(...)
| 1 | 1150 | 01:32:11 | 0.0892 | 99.42% | 0.16 | 100.00% | 0.21 | 0.2707 | 99.32% | 0.17 | 0.0010 |
| 1 | 1200 | 01:36:02 | 0.9868 | 98.56% | 0.14 | 94.53% | 1.82 | 0.2551 | 99.44% | 0.17 | 0.0010 |
| 1 | 1250 | 01:39:53 | 0.1590 | 99.80% | 0.13 | 98.44% | 0.37 | 0.2649 | 99.49% | 0.16 | 0.0010 |
| 1 | 1300 | 01:43:18 | 2.0556 | 100.00% | | 98.44% | 2.77 | 0.2512 | 99.19% | 0.15 | 0.0010 |
| 1 | 1350 | 01:46:36 | 0.0542 | 100.00% | 0.08 | 100.00% | 0.23 | 0.2476 | 99.45% | 0.14 | 0.0010 |
Warning: Encountered unexpected error during CUDA execution. The CUDA error was:
CUDA_ERROR_ILLEGAL_ADDRESS
Warning: Encountered unexpected error during CUDA execution. The CUDA error was:
CUDA_ERROR_ILLEGAL_ADDRESS
Warning: Encountered unexpected error during CUDA execution. The CUDA error was:
CUDA_ERROR_ILLEGAL_ADDRESS
Error using nnet.internal.cnngpu.reluForward
Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED.
Error in nnet.internal.cnn.util.VectorReporter/computeAndReport (line 68)
feval( method, this.Reporters{i}, varargin{:} );
Error in nnet.internal.cnn.util.VectorReporter/computeIteration (line 24)
computeAndReport( this, 'computeIteration', summary, network );
Error in nnet.internal.cnn.Trainer/train (line 144)
reporter.computeIteration( this.Summary, net );
Error in vision.internal.cnn.trainNetwork (line 110)
trainedNet = trainer.train(trainedNet, trainingDispatcher);
Error in trainFasterRCNNObjectDetector>iTrainEndToEnd (line 901)
[net, info] = vision.internal.cnn.trainNetwork(...
Error in trainFasterRCNNObjectDetector (line 428)
[detector, info] = iTrainEndToEnd(trainingData, fastRCNN, options, params, executionSettings, imageInfo);
Error in FasterRCNN_ResNet50 (line 106)
fasterRCNN = trainFasterRCNNObjectDetector(augmentedresizedDsTrain, lgraph, options, ...
My MATLAB version is R2021a. When I reboot my PC, the issue seems to disappear temporarily and I am able to restart the training process. However, after 1-2 hours the error pops up again and stops my training.
Any information on how to fix this issue would be valuable, thank you!
Answers (1)
CANBERK TATLI
24 July 2022
Did you solve the problem? I'm encountering the same issue. I'm also using a 3070, with MATLAB R2019a.