MATLAB is column-major, but the algorithm may have been written and optimized for a row-major layout. In the generated code, if the fastest-changing dimension is not traversed by the innermost loop, memory accesses are not coalesced. Transposing the input matrices often fixes this problem.
Try transposing the data.
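The loop ordering described above can be sketched as follows. This is a minimal illustration, not output from GPU Coder: the function name scaleMatrix and the arithmetic are hypothetical, and whether the ordering changes the generated kernel depends on how the loops are mapped for your code.

```matlab
function out = scaleMatrix(in) %#codegen
% Hypothetical entry-point function (name and computation are illustrative).
% MATLAB stores matrices column by column, so the first index (the row)
% is the fastest-changing dimension in memory.
coder.gpu.kernelfun;                 % ask GPU Coder to map this function to the GPU
[rows, cols] = size(in);
out = zeros(rows, cols, 'like', in);
for j = 1:cols                       % slower-changing dimension in the outer loop
    for i = 1:rows                   % fastest-changing dimension in the innermost loop
        out(i, j) = 2 * in(i, j);
    end
end
end
```

If the algorithm is naturally written to walk along rows, one option is to pass the transposed matrix (in.') to the entry-point function and transpose the result back, so that the innermost loop again follows the first index.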
If the problem or data size is too small, the overhead of moving data to the GPU (even if only at the I/O boundary) can offset any performance gain from running on the GPU.
Try the algorithm with larger data sizes.
If you use only coder.gpu.kernel, then everything outside the loop goes to the CPU. To keep most of the code on the GPU, use both pragmas (coder.gpu.kernelfun and coder.gpu.kernel). Also, unsupported functions, or any function or statement that cannot run on the GPU, cause more cudaMemcpy calls to be generated.
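A minimal sketch of using the two pragmas together, assuming that the second pragma is the function-level coder.gpu.kernelfun. The function name scaleAndAdd and its arithmetic are hypothetical and serve only to show where each pragma is placed.

```matlab
function out = scaleAndAdd(a, b) %#codegen
% Hypothetical entry-point function (name and computation are illustrative).
coder.gpu.kernelfun;                 % function-level pragma: consider the whole function for GPU mapping
out = zeros(size(a), 'like', a);
coder.gpu.kernel;                    % loop-level pragma: placed immediately before the for-loop
for i = 1:numel(a)
    out(i) = 2 * a(i) + b(i);
end
end
```

With only the loop-level pragma, the statements before and after the loop would be left to the CPU, which is the situation the paragraph above warns about.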
If certain inputs of your entry-point function are constant, wrap them using coder.const. Using coder.const indicates that these variables are constant during code generation. Without it, GPU Coder™ treats these inputs as variables and therefore treats all matrices sized by them as variable-dimension matrices. GPU Coder does not create good kernels from variable-dimension matrices because there is currently no support for dynamic sizing of kernels or dynamic cudaMemcpy function calls.
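A minimal sketch of the constant-input idea. It assumes the codegen command-line workflow, in which a constant entry-point input is declared with coder.Constant in -args, and it uses coder.const inside the function to fold a size expression at code generation time. The function name fillRamp, the value 512, and the build commands in the comments are illustrative assumptions, not taken from the text above.

```matlab
% Assumed build commands (run separately at the MATLAB prompt):
%   cfg = coder.gpuConfig('mex');
%   codegen -config cfg fillRamp -args {coder.Constant(512), single(0)}
function out = fillRamp(n, scale) %#codegen
% Hypothetical entry-point function: n is intended to be a compile-time
% constant, scale is an ordinary run-time input.
coder.gpu.kernelfun;
len = coder.const(2 * n);            % folded to a constant when n is constant
out = zeros(1, len, 'single');       % fixed-size array when len is known at code generation time
for i = 1:len
    out(i) = scale * single(i);
end
end
```

Because len is known during code generation, out has a fixed size, which avoids the variable-dimension case described above.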
Using a large amount of stack memory inside kernels can reduce the performance of the generated code. In such cases, consider rewriting the algorithm or breaking it into smaller computations to reduce stack memory usage and improve performance.