CUDA_ERROR_LAUNCH_FAILED and CUDA_ERROR_ILLEGAL_ADDRESS on Quadro RTX in TCC mode
We have a Quadro RTX 6000 in TCC mode on a fresh installation of Windows 10 and MATLAB R2021a Update 5. The output of gpuDevice is:
  CUDADevice with properties:

                      Name: 'Quadro RTX 6000'
                     Index: 1
         ComputeCapability: '7.5'
            SupportsDouble: 1
             DriverVersion: 11.4000
            ToolkitVersion: 11
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 2.3916e+10
           AvailableMemory: 2.3584e+10
       MultiprocessorCount: 72
              ClockRateKHz: 1770000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 0
          CanMapHostMemory: 1
           DeviceSupported: 1
           DeviceAvailable: 1
            DeviceSelected: 1
If I run the following simple code:
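gpu = gpuDevice;
n = 0;                                     % iteration counter
while true
    A = rand(1000, 1000, 100, 'single');   % host data
    A_ = gpuArray(A);                      % copy to the GPU
    A_ = sum(A_ .^ 2, 2) .^ (1 ./ 2);      % Euclidean norm along dimension 2
    reset(gpu)                             % reset the device on every iteration
    n = n + 1;
end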
eventually (n ~ 10) I get a range of errors, from "unspecified launch failure" to "CUDA_ERROR_LAUNCH_FAILED" to "CUDA_ERROR_ILLEGAL_ADDRESS". I tried:
- Clean installations of the latest and previous NVIDIA drivers: no luck.
- Disabling Windows TDR (Timeout Detection and Recovery): no luck.
- Switching to WDDM mode. This seems to solve the issue, but it is not really a solution for us (Windows uses precious GPU memory).
The code above is just a test case. In reality we encounter these errors at random during the execution of large scripts. No custom kernels.
Any other suggestions?
Comments: 6
Joss Knight
24 Oct 2021
Edited: Joss Knight
24 Oct 2021
Is there any particular reason why you need to reset the device on every loop? I suspect the TCC driver is objecting to that for some reason (no doubt a bug in the driver but even so...you are repeatedly running kernels, then resetting, clearing the CUDA context, reinitializing...it's certainly a stress test for the CUDA runtime).
If you really have to do this, try sneaking in a wait(gpu) before you reset.
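A minimal sketch of that suggestion, reusing the loop from the question. The only functional change is the wait(gpu) call before reset(gpu); the iteration bound maxIter is a hypothetical addition so that the test terminates on its own. Whether this actually avoids the errors in TCC mode is untested here.

```matlab
% Sketch: same stress loop as in the question, but synchronizing before reset.
gpu = gpuDevice;
maxIter = 50;                              % hypothetical bound, not in the original
for n = 1:maxIter
    A  = rand(1000, 1000, 100, 'single');  % host data
    A_ = gpuArray(A);                      % copy to the GPU
    A_ = sum(A_ .^ 2, 2) .^ (1 ./ 2);      % queue the GPU computation
    wait(gpu);                             % block until all queued GPU work is done
    reset(gpu);                            % only then reset the device
end
fprintf('Completed %d iterations without error.\n', maxIter);
```

If the loop now runs to completion, that would point towards reset being called while kernels are still in flight.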
Massimiliano Zanoli
25 Oct 2021
Joss Knight
1 Nov 2021
Can you try watching your GPU's resource usage, and in particular its temperature, in the Task Manager while you run this code? Perhaps it is overheating.
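In case the Task Manager does not show the card while it is in TCC mode, the same readings can be polled from within MATLAB. This is a rough sketch that assumes the nvidia-smi command-line tool installed with the NVIDIA driver is on the system PATH; the variable names are illustrative only.

```matlab
% Sketch: query temperature, utilization and used memory through nvidia-smi.
% Assumption: nvidia-smi is on the system PATH.
query = ['nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,memory.used ' ...
         '--format=csv,noheader'];
[status, out] = system(query);
if status == 0
    % One line per GPU: temperature in degrees C, utilization, memory used
    fprintf('%s  %s\n', datestr(now, 'HH:MM:SS'), strtrim(out));
else
    warning('nvidia-smi call failed (status %d): %s', status, strtrim(out));
end
```

Calling this once per loop iteration of the test case gives a log that can be compared against the iteration at which the error appears.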
Massimiliano Zanoli
31 May 2022
Joss Knight
12 Jun 2022
I can't think of any reason why this would happen. To check that it's not a driver issue:
- Does it also reproduce in earlier versions of MATLAB?
- Downgrade your drivers as far as you can (unfortunately I can only see drivers as far back as 461.40 for the RTX 6000)
To check it's not related to low memory issues, monitor GPU memory in the Task Manager while you run your code.
It may unfortunately just be a faulty card, in which case it will probably also reproduce in older versions of MATLAB.
Hoping you get back to me in less than 7 months!
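As an alternative to the Task Manager for the memory check, free GPU memory can be logged from MATLAB itself through the AvailableMemory property shown in the gpuDevice output. The sketch below also records the iteration and error identifier when a failure occurs; maxIter and the message formats are hypothetical, and the loop body is just the test case from the question without the reset.

```matlab
% Sketch: log free GPU memory per iteration and capture the first failure.
gpu = gpuDevice;
maxIter = 50;                                   % hypothetical bound
for n = 1:maxIter
    try
        A_ = gpuArray(rand(1000, 1000, 100, 'single'));
        A_ = sum(A_ .^ 2, 2) .^ (1 ./ 2);
        wait(gpu);                              % surface asynchronous errors here
        fprintf('n = %d, free GPU memory: %.2f GB\n', ...
                n, gpu.AvailableMemory / 1e9);
    catch err
        % Report where and how the run failed, then stop
        fprintf('Failed at n = %d: [%s] %s\n', n, err.identifier, err.message);
        break
    end
end
```

A steady drop in free memory before the failure would suggest a low-memory problem; a failure with plenty of memory free would point elsewhere.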
Massimiliano Zanoli
17 Jun 2022
Answers (0)