Preconditioning for iterative solvers on GPU - Performance issues

Paulo Ribeiro on 14 Nov 2019
Commented: Joss Knight on 25 Nov 2019
Dear all,
I'm experimenting with some preconditioners for iterative solvers on the GPU for a linear system [A]{x}={B}. The problem is defined by this simple command:
sol=pcg(A_gpu,B_gpu,tol,maxit,P)
where A and B are gpuArrays and P is the preconditioner.
Some simple tests show that, with no preconditioner (P=[ ]), the GPU solution is faster than any iterative CPU solver, with speedups of up to 12x.
However, what I still can't figure out is why the performance drops whenever any type of preconditioner is selected. For instance, using incomplete Cholesky factorization:
L=ichol(A)
sol=pcg(A_gpu,B_gpu,tol,maxit,L*L')
This severely degrades performance on the GPU compared with using no preconditioner at all. The solution is even slower than the CPU version, where this same preconditioner improves CPU performance by 1.5x. That's really strange.
I've also tried passing A_gpu as preconditioner, but the solution takes forever:
sol=pcg(A_gpu,B_gpu,tol,maxit,A_gpu)
The same issue occurs with other iterative solvers, such as BICG and SYMMLQ.
Am I doing something wrong? It appears that any preconditioner on the GPU acts as a drawback, even when it is efficient for the CPU version.
Please share your thoughts and experiences. Thanks!
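
For readers comparing the variants above, here is a minimal CPU-only sketch of how the form of the preconditioner argument changes the work done per iteration. It assumes a stand-in test matrix (a 2-D Laplacian from delsq), not the system from this question, and the tolerance and iteration limit are arbitrary. As far as I understand, pcg applies a matrix preconditioner M by solving with it (M \ r) in every iteration, so passing the product L*L' asks for a general sparse solve each time, and passing A itself asks for a full solve with A each time. Passing the two triangular factors separately, as in the pcg documentation examples, asks for two triangular solves instead.

```matlab
% Sketch: three ways of calling pcg on the CPU with an ichol preconditioner.
% The matrix is a stand-in (2-D Laplacian), not the poster's system.
A = delsq(numgrid('S', 102));     % sparse SPD matrix, 10000-by-10000
b = ones(size(A, 1), 1);
tol = 1e-8;                       % assumed tolerance
maxit = 2000;                     % assumed iteration limit

L = ichol(A);                     % zero-fill incomplete Cholesky, A ~ L*L'

% (a) No preconditioner.
tic; [x0, flag0, ~, iter0] = pcg(A, b, tol, maxit); t0 = toc;

% (b) Preconditioner passed as the product M = L*L'.
%     Each iteration solves a general sparse system with M.
M = L * L';
tic; [x1, flag1, ~, iter1] = pcg(A, b, tol, maxit, M); t1 = toc;

% (c) Preconditioner passed as two triangular factors.
%     Each iteration performs two sparse triangular solves.
tic; [x2, flag2, ~, iter2] = pcg(A, b, tol, maxit, L, L'); t2 = toc;

fprintf('none      : flag %d, %4d iterations, %.3f s\n', flag0, iter0, t0);
fprintf('M = L*L''  : flag %d, %4d iterations, %.3f s\n', flag1, iter1, t1);
fprintf('M1=L,M2=L'': flag %d, %4d iterations, %.3f s\n', flag2, iter2, t2);
```

The iteration counts for (b) and (c) should be close, since both describe the same preconditioner; only the cost of applying it differs. Whether the two-factor form is accepted for sparse gpuArray inputs depends on the release, so check the GPU notes on the pcg reference page before trying it with A_gpu.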
  7 Comments
Paulo Ribeiro on 21 Nov 2019
Edited: Paulo Ribeiro on 22 Nov 2019
Thanks Joss. These are really impressive results on a Titan V. It's even faster than a backslash solver A\B on the CPU with an Intel i7 8700:
tic; A\B; toc
Elapsed time is 1.712258 seconds.
For this specific case it appears that the best option is to avoid preconditioning on the GPU.
Regards.
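
A note on timing such comparisons: GPU calls can return before the work has finished, so plain tic/toc may under-report GPU time unless the device is synchronized. The sketch below uses timeit on the CPU and gputimeit on the GPU, which handle repetition and synchronization. It assumes a stand-in Laplacian matrix rather than the system from this thread, arbitrary values for tol and maxit, and canUseGPU (available from R2019b) to skip the GPU part when no supported device is present.

```matlab
% Sketch: time a CPU direct solve against an unpreconditioned GPU pcg solve.
% Stand-in matrix; the GPU part needs Parallel Computing Toolbox and a GPU.
A = delsq(numgrid('S', 102));     % sparse SPD matrix, 10000-by-10000
b = ones(size(A, 1), 1);
tol = 1e-8;                       % assumed tolerance
maxit = 2000;                     % assumed iteration limit

% CPU direct solve, timed over several runs by timeit.
tCPU = timeit(@() A \ b);
fprintf('CPU backslash: %.4f s\n', tCPU);

if canUseGPU()                    % assumption: R2019b or later
    A_gpu = gpuArray(A);
    b_gpu = gpuArray(b);
    % Request two outputs so pcg returns a flag instead of printing messages.
    tGPU = gputimeit(@() pcg(A_gpu, b_gpu, tol, maxit), 2);
    fprintf('GPU pcg      : %.4f s\n', tGPU);
else
    fprintf('No supported GPU found; skipping the GPU timing.\n');
end
```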
Joss Knight on 25 Nov 2019
I investigated further and found that applying the preconditioner - not just decomposing it - does appear to be taking an unusually long time. This does warrant further investigation, since these two triangular solves should be fast, and your system matrix is band-diagonal. It does have quite a large bandwidth of 543 however, so that could be the issue.
Iterative solvers are generally faster than direct solves for large sparse matrices (assuming they have reasonable convergence properties). Direct solves are hugely memory intensive because there is a lot of fill-in during factorization.
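
The bandwidth and fill-in mentioned above can be measured directly. The following is a minimal sketch on a stand-in matrix (a 2-D Laplacian), not the system from this thread: it reports the bandwidth of A and compares the number of nonzeros in A with those in a complete Cholesky factor, with and without a fill-reducing reordering, and with the zero-fill incomplete factor from ichol.

```matlab
% Sketch: measure bandwidth and factorization fill-in for a sparse SPD matrix.
% Stand-in matrix (2-D Laplacian), not the system discussed in the thread.
A = delsq(numgrid('S', 102));         % 10000-by-10000, 5-point stencil

[lowerBW, upperBW] = bandwidth(A);    % lower and upper bandwidth of A

R  = chol(A);                         % complete Cholesky factor, natural ordering
p  = symamd(A);                       % fill-reducing permutation
Rp = chol(A(p, p));                   % complete Cholesky factor after reordering
L  = ichol(A);                        % zero-fill incomplete Cholesky factor

fprintf('bandwidth of A         : lower %d, upper %d\n', lowerBW, upperBW);
fprintf('nnz(triu(A))           : %d\n', nnz(triu(A)));
fprintf('nnz(chol(A))           : %d\n', nnz(R));
fprintf('nnz(chol(A(p,p)))      : %d\n', nnz(Rp));
fprintf('nnz(ichol(A))          : %d\n', nnz(L));
```

The gap between nnz(chol(A)) and nnz(triu(A)) is the fill-in that a direct solve has to store, whereas the zero-fill incomplete factor keeps the sparsity pattern of the original triangle.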

Answers (0)

Release

R2019a