CUDA_ERROR_LAUNCH_FAILED and CUDA_ERROR_ILLEGAL_ADDRESS on Quadro RTX in TCC mode

We have a Quadro RTX 6000 in TCC mode with a fresh installation of Windows 10 and MATLAB R2021a Update 5. The output of gpuDevice is:
CUDADevice with properties:
Name: 'Quadro RTX 6000'
Index: 1
ComputeCapability: '7.5'
SupportsDouble: 1
DriverVersion: 11.4000
ToolkitVersion: 11
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 2.3916e+10
AvailableMemory: 2.3584e+10
MultiprocessorCount: 72
ClockRateKHz: 1770000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 0
CanMapHostMemory: 1
DeviceSupported: 1
DeviceAvailable: 1
DeviceSelected: 1
If I run the following simple code:
gpu = gpuDevice;
n = 0;
while true
    A = rand(1000, 1000, 100, 'single');
    A_ = gpuArray(A);
    A_ = sum(A_ .^ 2, 2) .^ (1 ./ 2);
    reset(gpu)
    n = n + 1;
end
eventually (after roughly 10 iterations) I get a range of errors, from "unspecified launch failure" to "CUDA_ERROR_LAUNCH_FAILED" to "CUDA_ERROR_ILLEGAL_ADDRESS". I tried:
  • Clean installations of the latest and previous NVIDIA drivers: no luck.
  • Disabling Windows TDR: no luck.
  • Switching to WDDM mode. This seems to solve the issue, but it's not really a solution for us (Windows uses precious GPU memory).
The code above is just a test case. In reality we encounter these errors at random during the execution of large scripts. No custom kernels.
Any other suggestions?

Comments: 6

Is there any particular reason why you need to reset the device on every loop? I suspect the TCC driver is objecting to that for some reason (no doubt a bug in the driver but even so...you are repeatedly running kernels, then resetting, clearing the CUDA context, reinitializing...it's certainly a stress test for the CUDA runtime).
If you really have to do this, try sneaking in a wait(gpu) before you reset.
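As a rough sketch of that suggestion, the test loop could synchronize before each reset and record where it fails. The iteration cap `maxIter` and the error logging are additions for illustration and are not part of the original test case:

```matlab
gpu = gpuDevice;
maxIter = 50;   % hypothetical cap so the test terminates on its own

for n = 1:maxIter
    try
        A  = rand(1000, 1000, 100, 'single');
        A_ = gpuArray(A);
        A_ = sum(A_ .^ 2, 2) .^ (1 ./ 2);   % same computation as the original test
        wait(gpu);    % block until all queued GPU work has finished
        reset(gpu);   % then reset the device (invalidates existing gpuArrays)
    catch err
        % Record the iteration and the exact error before stopping
        fprintf('Failed at iteration %d: [%s] %s\n', n, err.identifier, err.message);
        break
    end
end
```

If the failures stop once `wait(gpu)` is in place, that would point toward the reset racing with outstanding kernels rather than toward the computation itself.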
The reset is only there to make the error reproducible. In practice the error happens with or without a GPU reset: I have longer scripts where I initialize the GPU once at the beginning and never reset it, but the error still pops up at seemingly random locations.
Can you try watching your GPU's resource usage, and particularly its temperature, in the Task Manager while you run this code? Perhaps it is overheating.
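Task Manager may not list a GPU that is running in TCC mode, so one alternative is to poll `nvidia-smi` from within MATLAB. This is a minimal sketch that assumes `nvidia-smi` is on the system PATH; note that the GPU numbering printed here follows the order of the `nvidia-smi` output, which is not guaranteed to match MATLAB's device index:

```matlab
% Query temperature (deg C), memory in use (MiB) and utilization (%) for each GPU.
cmd = ['nvidia-smi --query-gpu=temperature.gpu,memory.used,utilization.gpu ' ...
       '--format=csv,noheader,nounits'];
[status, out] = system(cmd);

if status == 0
    rows = regexp(strtrim(out), '\r?\n', 'split');   % one output line per GPU
    for k = 1:numel(rows)
        vals = str2double(strsplit(rows{k}, ','));
        fprintf('GPU %d: %g degC, %g MiB used, %g %% utilization\n', ...
                k - 1, vals(1), vals(2), vals(3));
    end
else
    warning('nvidia-smi could not be run (is it on the PATH?).');
end
```

Calling this once per iteration of the test loop would show whether the temperature or memory use climbs just before a failure.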
I checked the temperature and everything is within the normal range (maximum ~70°C). I recently tried newer NVIDIA drivers, but the problem is still there. MATLAB 2019 did not have this problem...
I can't think of any reason why this would happen. To rule out a driver issue:
  • Check whether it now reproduces in earlier versions of MATLAB.
  • Downgrade your drivers as far as you can (unfortunately I can only see drivers as far back as 461.40 for the RTX 6000).
To check it's not related to low memory issues, monitor GPU memory in the Task Manager while you run your code.
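Free memory can also be logged from inside MATLAB through the `AvailableMemory` and `TotalMemory` properties of the `gpuDevice` object. The following is only a sketch: the iteration count `nIter` is hypothetical, and since MATLAB may hold on to GPU memory between operations, the reported value is best read as a trend rather than an exact figure:

```matlab
gpu = gpuDevice;
nIter = 20;                    % hypothetical number of iterations to log
freeMem = zeros(nIter, 1);     % free device memory after each iteration, in bytes

for n = 1:nIter
    A_ = gpuArray(rand(1000, 1000, 100, 'single'));
    A_ = sum(A_ .^ 2, 2) .^ (1 ./ 2);
    wait(gpu);                              % make sure the work has completed
    freeMem(n) = gpu.AvailableMemory;       % bytes currently free on the device
    fprintf('iter %2d: %.2f GB free of %.2f GB\n', ...
            n, freeMem(n) / 1e9, gpu.TotalMemory / 1e9);
end
```

A steady decline in `freeMem` across iterations would suggest a memory issue; a flat curve followed by a sudden failure would not.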
It may unfortunately just be a faulty card, in which case it will probably reproduce in older versions of MATLAB too.
Hoping you get back to me in less than 7 months!
Haha, I will try! Thank you for the ideas!


Answers (0)

Release

R2021a

Asked: October 11, 2021

Commented: June 17, 2022
