Update: I can confirm that the first part of the issue, the cache rebuild, was solved by defining the following Windows system environment variables:
CUDA_CACHE_MAXSIZE=2147483648
CUDA_CACHE_DISABLE=0
The cache folder then grew from 256 MB to 440 MB, and the initial gpuDevice call now takes 4 s instead of 4 minutes. I found the solution in another MATLAB Central question: "parallel.gpu.CUDAKernel slow on GTX 1080".
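
In case it helps anyone checking the same thing, here is a rough sketch of how one might verify from inside MATLAB that the variables were picked up and time the first gpuDevice call. It assumes MATLAB was restarted after the system variables were changed and that this is the first GPU call of the session; the printed messages are only illustrative. Note that 2147483648 bytes is 2 GiB.

```matlab
% Sketch: confirm that this MATLAB session sees the cache settings.
% Assumption: MATLAB was restarted after the Windows variables were set.
maxSize  = getenv('CUDA_CACHE_MAXSIZE');   % value set above: '2147483648' (2 GiB in bytes)
disabled = getenv('CUDA_CACHE_DISABLE');   % value set above: '0' (cache enabled)
fprintf('CUDA_CACHE_MAXSIZE = %s\n', maxSize);
fprintf('CUDA_CACHE_DISABLE = %s\n', disabled);

% Time the first gpuDevice call of the session.
tic;
gpuDevice;
fprintf('First gpuDevice call took %.1f s\n', toc);
```

If getenv returns an empty string, the variable is not visible to the MATLAB process, which usually means MATLAB was started before the variable was defined.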
