Applying vectorization techniques to speedup the performance of dividing a 3D matrix by a 2D matrix

Question

Matthew Kehoe 2021년 7월 29일

0
링크

이 질문에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/888132-applying-vectorization-techniques-to-speedup-the-performance-of-dividing-a-3d-matrix-by-a-2d-matrix

댓글: Matthew Kehoe 2021년 7월 31일

I'm working on removing a for loop in my Matlab code to improve performance. My original code has one for loop (from j=1:Nx) that is harmful to performance (in my production code, this for loop is processed over 20 million times if I test large simulations). I am curious if I can remove this for loop through vectorization, repmat, or a similar technique. My original Matlab implementation is given below.

clc; clear all;
% Test Data 
% I'm trying to remove the for loop for j in the code below
N = 10; 
M = 10;
Nx = 32; % Ny=Nx=Nz
Nz = 32;
Ny = 32;
Fnmhat = rand(Nx,Nz+1);
Jnmhat = rand(Nx,1);
xi_n_m_hat = rand(Nx,N+1,M+1);
Uhat = zeros(Nx,Nz+1);
Uhat_2 = zeros(Nx,Nz+1);
identy = eye(Ny+1,Ny+1);
p = rand(Nx,1);
gammap = rand(Nx,1);
D = rand(Nx+1,Ny+1);
D2 = rand(Nx+1,Ny+1);
D_start = D(1,:);
D_end = D(end,:);
gamma = 1.5;
alpha = 0; % this could be non-zero
ntests = 100;
% Original Code (Partially vectorized)
tic
for n=0:N
  for m=0:M
    b = Fnmhat.';
    alphaalpha = 1.0;
    betabeta = 0.0;  % this could be non-zero
    gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
    d_min = 1.0;
    n_min = 0.0;  % this could be non-zero
    r_min = xi_n_m_hat(:,n+1,m+1);
    d_max = -1i.*gammap;
    n_max = 1.0;
    r_max = Jnmhat;
    A = alphaalpha*D2 + betabeta*D + permute(gammagamma,[3,2,1]).*identy;
    A(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
    b(end,:) = r_min;
      
    A(end,end,:) = A(end,end,:) + d_min;
    A(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
    A(1,1,:) = A(1,1,:) + permute(d_max,[2,3,1]);
    b(1,:) = r_max;
      
    % Non-vectorized code - can this part be vectorized?
    for j=1:Nx
      utilde = linsolve(A(:,:,j),b(:,j)); % A\b
      Uhat(j,:) = utilde.';
    end
  end
end
toc

Here is my attempt at vectorizing the code (and removing the for loop for j).

% Same test data as original code
% New Code (completely vectorized but incorrect)
tic
for n=0:N
  for m=0:M
    b = Fnmhat.';
    alphaalpha = 1.0;
    betabeta = 0.0;  % this could be non-zero
    gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
    d_min = 1.0;
    n_min = 0.0;  % this could be non-zero
    r_min = xi_n_m_hat(:,n+1,m+1);
    d_max = -1i.*gammap;
    n_max = 1.0;
    r_max = Jnmhat;
    A2 = alphaalpha*D2 + betabeta*D + permute(gammagamma,[3,2,1]).*identy;
    A2(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
    b(end,:) = r_min;
      
    A2(end,end,:) = A2(end,end,:) + d_min;
    A2(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
    A2(1,1,:) = A2(1,1,:) + permute(d_max,[2,3,1]);
    b(1,:) = r_max;
      
    % Non-vectorized code - can this part be vectorized?
    %for j=1:Nx
    %  utilde_2 = linsolve(A2(:,:,j),b(:,j)); % A2\b
    %  Uhat_2(j,:) = utilde_2.';
    %end
    
    % My attempt - this doesn't work since I don't loop through the index j
    % in repmat
    utilde_2 = squeeze(repmat(linsolve(A2(:,:,Nx),b(:,Nx)),[1,1,Nx]));
    utilde_2 = utilde_2(:,1);
    Uhat_2 = squeeze(repmat(utilde_2',[1,1,Nx]));
    Uhat_2 = Uhat_2';
  end
end
toc
diff = norm(Uhat - Uhat_2,inf); % is 0 if correct

I'm curious if repmat (or a different builtin Matlab function) can speed up this part of the code:

for j=1:Nx
      utilde = linsolve(A(:,:,j),b(:,j)); % A\b
      Uhat(j,:) = utilde.';
end

Is the for loop for j absolutely necessary or can it be removed?

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

Matthew Kehoe 2021년 7월 29일

@the cyclist: This question is a follow up question to an earlier question created this afternoon.

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Answer 1

Bruno Luong 2021년 7월 29일

1
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/888132-applying-vectorization-techniques-to-speedup-the-performance-of-dividing-a-3d-matrix-by-a-2d-matrix#answer_756701

If you have C compilers the fatest methods are perhaps mmx and MultipleQR avaikable on FEX

https://www.mathworks.com/matlabcentral/answers/776812-how-to-vectorize-a-b-operation-on-slices-of-3d-matrices?s_tid=srchtitle

댓글 수: 9
이전 댓글 7개 표시이전 댓글 7개 숨기기

Matthew Kehoe 2021년 7월 29일

편집: Matthew Kehoe 2021년 7월 29일

Oddly enough, it appears that the for loop is faster than MultipleQR (at least locally on my machine). I could be testing something wrong. I implemented the following test scenario:

clc; clear all;
% Test Data 
N = 10; 
M = 10;
Nx = 32; % Ny=Nx=Nz
Nz = 32;
Ny = 32;
Fnmhat = rand(Nx,Nz+1);
Jnmhat = rand(Nx,1);
xi_n_m_hat = rand(Nx,N+1,M+1);
Uhat = zeros(Nx,Nz+1);
Uhat_2 = zeros(Nx,Nz+1);
identy = eye(Ny+1,Ny+1);
p = rand(Nx,1);
gammap = rand(Nx,1);
D = rand(Nx+1,Ny+1);
D2 = rand(Nx+1,Ny+1);
D_start = D(1,:);
D_end = D(end,:);
gamma = 1.5;
alpha = 0; % this could be non-zero
ntests = 100;
% Original Code (Partially vectorized)
tic
for ii=1:ntests
for n=0:N
  for m=0:M
    b = Fnmhat.';
    alphaalpha = 1.0;
    betabeta = 0.0;  % this could be non-zero
    gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
    d_min = 1.0;
    n_min = 0.0;  % this could be non-zero
    r_min = xi_n_m_hat(:,n+1,m+1);
    d_max = -1i.*gammap;
    n_max = 1.0;
    r_max = Jnmhat;
    A = alphaalpha*D2 + betabeta*D + reshape(gammagamma,1,1,Nx).*identy;
    A(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
    b(end,:) = r_min;
      
    A(end,end,:) = A(end,end,:) + d_min;
    A(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
    A(1,1,:) = A(1,1,:) + reshape(d_max,1,1,Nx);
    b(1,:) = r_max;
    
    % Non-vectorized code
    for j=1:Nx
     utilde = linsolve(A(:,:,j),b(:,j)); % A\b
     Uhat(j,:) = utilde.';
    end
  end
end
end
toc
% New Code with MultipleQRSolve
tic
for ii=1:ntests
for n=0:N
  for m=0:M
    b = Fnmhat.';
    alphaalpha = 1.0;
    betabeta = 0.0;  % this could be non-zero
    gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
    d_min = 1.0;
    n_min = 0.0;  % this could be non-zero
    r_min = xi_n_m_hat(:,n+1,m+1);
    d_max = -1i.*gammap;
    n_max = 1.0;
    r_max = Jnmhat;
    A2 = alphaalpha*D2 + betabeta*D + reshape(gammagamma,1,1,Nx).*identy;
    A2(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
    b(end,:) = r_min;
      
    A2(end,end,:) = A2(end,end,:) + d_min;
    A2(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
    A2(1,1,:) = A2(1,1,:) + reshape(d_max,1,1,Nx);
    b(1,:) = r_max;
    
    % Non-vectorized code
    %for j=1:Nx
    %  utilde_2 = linsolve(A2(:,:,j),b(:,j)); % A\b
    %  Uhat_2(j,:) = utilde_2.';
    %end
    
    b = reshape(b,Nx+1,1,Nx);
    Uhat_2 = squeeze(MultipleQRSolve(A2,b,1e-12));
    Uhat_2 = Uhat_2';
  end
end
end
toc
diff = norm(Uhat - Uhat_2,inf); % returns 15.7722, I might have wrote something wrong above.

which returned

Elapsed time of method 1 (for loop) is 14.830669 seconds.
Elapsed time of method 2 (MultipleQRSolve) is 15.170965 seconds.

I'm also testing mmx (and am working through a setup error).

Matthew Kehoe 2021년 7월 29일

편집: Matthew Kehoe 2021년 7월 29일

@Bruno Luong: (This is a very long comment that I might want to put in an answer) I fixed the installation error with MMX. I tested against 4 different solutions:

The for loop A(:,:,j)\b(:,j) with Matlab divides replaced by linsolve (which is faster with my data).
The MultipleQRSolve located from the File Exchange.
The Multiple same size linear solver from the File Exchange.
MMX which is now moved to Github.

My test code is below.

clc; clear all;
% Test Data 
N = 8; 
M = 8;
Nx = 32; % Ny=Nx=Nz
Nz = 32;
Ny = 32;
Fnmhat = rand(Nx,Nz+1);
Jnmhat = rand(Nx,1);
xi_n_m_hat = rand(Nx,N+1,M+1);
Uhat = zeros(Nx,Nz+1);
Uhat_2 = zeros(Nx,Nz+1);
identy = eye(Ny+1,Ny+1);
p = rand(Nx,1);
gammap = rand(Nx,1);
D = rand(Nx+1,Ny+1);
D2 = rand(Nx+1,Ny+1);
D_start = D(1,:);
D_end = D(end,:);
gamma = 1.5;
alpha = 0; % this could be non-zero
ntests = 100;
% Method 1: Original Code (Partially vectorized)
tic
for ii=1:ntests
for n=0:N
  for m=0:M
    b = Fnmhat.';
    alphaalpha = 1.0;
    betabeta = 0.0;  % this could be non-zero
    gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
    d_min = 1.0;
    n_min = 0.0;  % this could be non-zero
    r_min = xi_n_m_hat(:,n+1,m+1);
    d_max = -1i.*gammap;
    n_max = 1.0;
    r_max = Jnmhat;
    A = alphaalpha*D2 + betabeta*D + reshape(gammagamma,1,1,Nx).*identy;
    A(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
    b(end,:) = r_min;
      
    A(end,end,:) = A(end,end,:) + d_min;
    A(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
    A(1,1,:) = A(1,1,:) + reshape(d_max,1,1,Nx);
    b(1,:) = r_max;
    
    % Non-vectorized code
    for j=1:Nx
     utilde = linsolve(A(:,:,j),b(:,j)); % A\b
     Uhat(j,:) = utilde.';
    end
  end
end
end
toc
% Method 2: New Code with MultipleQRSolve
tic
for ii=1:ntests
for n=0:N
  for m=0:M
    b = Fnmhat.';
    alphaalpha = 1.0;
    betabeta = 0.0;  % this could be non-zero
    gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
    d_min = 1.0;
    n_min = 0.0;  % this could be non-zero
    r_min = xi_n_m_hat(:,n+1,m+1);
    d_max = -1i.*gammap;
    n_max = 1.0;
    r_max = Jnmhat;
    A2 = alphaalpha*D2 + betabeta*D + reshape(gammagamma,1,1,Nx).*identy;
    A2(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
    b(end,:) = r_min;
      
    A2(end,end,:) = A2(end,end,:) + d_min;
    A2(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
    A2(1,1,:) = A2(1,1,:) + reshape(d_max,1,1,Nx);
    b(1,:) = r_max;
    
    % Non-vectorized code
    %for j=1:Nx
    %  utilde_2 = linsolve(A2(:,:,j),b(:,j)); % A\b
    %  Uhat_2(j,:) = utilde_2.';
    %end
    
    b = reshape(b,Nx+1,1,Nx);
    Uhat_2 = squeeze(MultipleQRSolve(A2,b,1e-12));
    Uhat_2 = Uhat_2';
  end
end
end
toc
% Method 3: New Code with MultiSolver
tic
for ii=1:ntests
for n=0:N
  for m=0:M
    b = Fnmhat.';
    alphaalpha = 1.0;
    betabeta = 0.0;  % this could be non-zero
    gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
    d_min = 1.0;
    n_min = 0.0;  % this could be non-zero
    r_min = xi_n_m_hat(:,n+1,m+1);
    d_max = -1i.*gammap;
    n_max = 1.0;
    r_max = Jnmhat;
    A3 = alphaalpha*D2 + betabeta*D + reshape(gammagamma,1,1,Nx).*identy;
    A3(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
    b(end,:) = r_min;
      
    A3(end,end,:) = A3(end,end,:) + d_min;
    A3(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
    A3(1,1,:) = A3(1,1,:) + reshape(d_max,1,1,Nx);
    b(1,:) = r_max;
    
    % Non-vectorized code
    %for j=1:Nx
    %  utilde_2 = linsolve(A3(:,:,j),b(:,j)); % A\b
    %  Uhat_2(j,:) = utilde_2.';
    %end
    
    Uhat_3 = squeeze(MultiSolver(A3, b));
    Uhat_3 = Uhat_3';
  end
end
end
toc
% Method 4: New Code with MMX
tic
for ii=1:ntests
for n=0:N
  for m=0:M
    b = Fnmhat.';
    alphaalpha = 1.0;
    betabeta = 0.0;  % this could be non-zero
    gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
    d_min = 1.0;
    n_min = 0.0;  % this could be non-zero
    r_min = xi_n_m_hat(:,n+1,m+1);
    d_max = -1i.*gammap;
    n_max = 1.0;
    r_max = Jnmhat;
    A4 = alphaalpha*D2 + betabeta*D + reshape(gammagamma,1,1,Nx).*identy;
    A4(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
    b(end,:) = r_min;
      
    A4(end,end,:) = A4(end,end,:) + d_min;
    A4(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
    A4(1,1,:) = A4(1,1,:) + reshape(d_max,1,1,Nx);
    b(1,:) = r_max;
    
    % Non-vectorized code
    %for j=1:Nx
    %  utilde_2 = linsolve(A4(:,:,j),b(:,j)); % A\b
    %  Uhat_2(j,:) = utilde_2.';
    %end
    b = reshape(b,Nx+1,1,Nx);
    Uhat_4 = squeeze(mmx('backslash',A4,b)); % this turns complex double to type double
    Uhat_4 = Uhat_4';
  end
end
end
toc
diff = norm(Uhat - Uhat_2,inf); % returns 15.7722, I might have wrote something wrong above.
diff2 = norm(Uhat - Uhat_3,inf); % returns 15.7722, I might have wrote something wrong above.
diff3 = norm(Uhat - Uhat_4,inf); % returns 474 since mmx turns the complex double to type double!

The timings are

Elapsed time of method 1 (for loop) is 10.034689 seconds.
Elapsed time of method 2 (MultipleQRSolve) is 10.101741 seconds.
Elapsed time of method 3 (MultiSolver) is 19.232889 seconds.
Elapsed time of method 4 (MMX) is 3.241713 seconds. Changes data from type complex double to type double. 

Therefore, MMX is faster. However, the values of diff, diff2, and diff3 are all nonzero. The mmx 'backslash' command changes my data in the matrix A from a complex double to a double (which will destroy my real test data). Through reading the documentation for MMX, complex numbers aren't supported. Inside the 'discussions part' of the URL on file exchange, the user Israel Vaughn suggests writing as new function called mmxc which I have copied below.

function [ C ] = mmxc( actionStr,A,B, transString )
% complex mmx
% not very optimal, but hey, it works
if (strcmp(actionStr,'square') & isempty(B))
    At = real(A);
    Ai = imag(A);
    clear A;
    C = mmx('square',At,[]);
    C = C - 1i.*mmx('mult',At,Ai,'nt');
    C = C + 1i.*mmx('mult',Ai,At,'nt');
    C = C + mmx('square',Ai,[]);
else if (exist('transString'))
    At = real(A);
    Ai = imag(A);
    clear A;
    Bt = real(B);
    Bi = imag(B);
    clear B;
    C = mmx(actionStr,At,Bt,transString);
    C = C + 1i.*mmx(actionStr,At,Bi,transString);
    clear At;
    C = C + 1i.*mmx(actionStr,Ai,Bt,transString);
    clear Bt;
    C = C - mmx(actionStr,Ai,Bi,transString);
else
    At = real(A);
    Ai = imag(A);
    clear A;
    Bt = real(B);
    Bi = imag(B);
    clear B;
    C = mmx(actionStr,At,Bt);
    C = C + 1i.*mmx(actionStr,At,Bi);
    clear At;
    C = C + 1i.*mmx(actionStr,Ai,Bt);
    clear Bt;
    C = C - mmx(actionStr,Ai,Bi);
end
end

This new function appears to be incorrect since replacing

Uhat_4 = squeeze(mmxc('backslash',A4,b)); % Changed to complex mmxc

still returns a non-zero diff3. Through reviewing this question on Stack Exchange, it looks like there is a potential "hack" in the answer provided by user Alec Jacobson:

% Method 5: Try the "hack" on Stack Exchange
tic
for ii=1:ntests
for n=0:N
  for m=0:M
    b = Fnmhat.';
    alphaalpha = 1.0;
    betabeta = 0.0;  % this could be non-zero
    gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
    d_min = 1.0;
    n_min = 0.0;  % this could be non-zero
    r_min = xi_n_m_hat(:,n+1,m+1);
    d_max = -1i.*gammap;
    n_max = 1.0;
    r_max = Jnmhat;
    A5 = alphaalpha*D2 + betabeta*D + reshape(gammagamma,1,1,Nx).*identy;
    A5(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
    b(end,:) = r_min;
      
    A5(end,end,:) = A5(end,end,:) + d_min;
    A5(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
    A5(1,1,:) = A5(1,1,:) + reshape(d_max,1,1,Nx);
    b(1,:) = r_max;
    
    % Non-vectorized code
    %for j=1:Nx
    %  utilde_2 = linsolve(A5(:,:,j),b(:,j)); % A\b
    %  Uhat_2(j,:) = utilde_2.';
    %end
    
    % Slightly esoteric solution from Alec Jacobson
    w = size(A5,1);
    k = size(A5,3);
    AS = reshape(permute(A,[1 3 2]),w*k,w);
    S = sparse( ...
      repmat(1:w*k,w,1)', ...
      bsxfun(@plus,reshape(repmat(1:w:w*k,w,1),[],1),0:w-1), ...
      AS,w*k,w*k);
    % The line below blows up since b is of size(Nx+1,Nx) and code assumes
    % it is of size(Nx,1)
    utilde_5 = reshape(S\repmat(b,k,1),w,k);
    utilde_5 = utilde_5';
  end
end
end
toc

I'm going to try and fix this. It seems that MMS backslash command would work for double data types (as opposed to complex doubles). Maybe I can get the hack above to work. I will also look at the parallel computing toolbox.

Matthew Kehoe 2021년 7월 30일

편집: Matthew Kehoe 2021년 7월 30일

I think that the performance of individual methods heavily depends on the size of the input matrices. For larger matrices, the for loop appears to win every time:

% These parameters mimic the real data in my code
m=33;
n=33;
p=1;
q=32; % don't run out of memory!
ntests = 100000; % One Hundred Thousand
% for-loop
tic
for ii=1:ntests
    Ac=randn(m,n,q)+1i*randn(m,n,q); % A is a complex double of size (33,33,32)
    Bc=randn(m,p,q)+1i*rand(m,p,q); % B is a double (not complex) of size (33,32). I resphape it to size (33,1,32) for the backslash/divide function.
    z=zeros(n,p,q);
    for k=1:q
        z(:,:,k) = Ac(:,:,k)\Bc(:,:,k);
    end
end
toc % Elapsed time is 236.284139 seconds.
% mmx
tic
for ii=1:ntests
    Ar = real(Ac);
    Ai = imag(Ac);
    Br = real(Bc);
    Bi = imag(Bc);
    AA = [Ar,-Ai;Ai,Ar];
    BB = [Br;Bi];
    zz = mmx('backslash', AA, BB);
    z2=zz(1:n,:,:)+1i*zz(n+1:end,:,:);
end
toc % Elapsed time is 331.318650 seconds.
% MultipleQRSolve
tic
for ii=1:ntests
    z3 = MultipleQRSolve(Ac,Bc);
end
toc % Elapsed time is 272.109689 seconds.

I'm on a slower computer and will be on a faster one tomorrow morning. I will do some more testing with real Matlab code. It is strange that the for loop appears to beat all of the other solutions with larger data (outside of the parallel computing toolbox, which I will investigate).

Bruno Luong 2021년 7월 30일

MMX and MultipleQRSolve make parallel loop on page, MATLAB for-loop make parallize on the algorithm of linsolve.

That's why the matrix size matters, and possibly the number of the physical processor cores.

MultipleQRSolve is less efficient for larg matrix because my implementation of QR is less efficient than the Lapack function called by MMX and MATLAB native for-loop.

Matthew Kehoe 2021년 7월 31일

I think that MMX can be optimized if both A and B are not complex doubles. In my data, B is not a complex double so it may be possible to speed up the MMX calculation. Here is how I would implement the three different methods in my real Matlab code.

% These parameters mimic the real data in my code
m=33;
n=33;
p=1;
q=32; 
ntests = 10000; 
% My code calculates Ac and Br before going into the for loop
Ac = rand(m,n,q)+1i*rand(m,n,q); % A is a complex double of size (33,33,32)
Br = rand(m,q); % B is a (real) double of size (33,32)
% Before I decide to use a for loop/mmx/MultipleQRSolve my code
% "understands" that A is a complex double of size (33,33,32) and B is a 
% (real) double of size (33,32). I don't need to calculate what A or B are inside
% the for loop. I only reshape B inside MMX and MultipleQRSolve because I
% have to for the divides operation.
% Here is how I would write the three functions below in my "real" code.
% for-loop
tic
for ii=1:ntests
    z1 = zeros(q,m);
    for j=1:q
      % This is how my code currently computes A\b
      utilde = linsolve(Ac(:,:,j),Br(:,j)); % A\b
      z1(j,:) = utilde.';
    end
end
toc % Elapsed time is 14.231135 seconds.
% mmx
tic
for ii=1:ntests
    Bnew = reshape(Br,m,1,q); % Make Br size(33,1,32) to apply MMX
    Ar = real(Ac);
    Ai = imag(Ac);
    Br = real(Bnew);
    Bi = imag(Bnew); % is zero as b is a real double
    
    % z_1 = Ar+Ai*i
    % z_2 = Br+Bi*i
    % z_1/z_2 = [(Ar*Br + Ai*Bi) + 1i*(Ai*Br - Ar*Bi)]/(Br^2 + Bi^2);
    % Since Bi == 0, this is simplified to
    % z_1/z_2 = [(Ar*Br) + 1i*(Ai*Br)]/(Br^2);
    
    % I think that this makes the code below
    %AA = [Ar,-Ai;Ai,Ar];
    %BB = [Br;Bi];
    %zz = mmx('backslash', AA, BB);
    %z2=zz(1:n,:,:)+1i*zz(n+1:end,:,:);
    
    % Into the faster version
    Num = mmx('mult', Ar, Br);
    Num = Num + 1i*mmx('mult', Ai, Br);
    Den = Br.^2; 
    z2 = mmx('backslash',Num,Den);
    z2 = permute(z2,[3 1 2]);
end
toc % Elapsed time is 2.441799 seconds.
% MultipleQRSolve
tic
for ii=1:ntests
    Bnew_2 = reshape(Br,m,1,q); % Make Br size(33,1,32) to apply MultipleQRSolve
    z3 = MultipleQRSolve(Ac,Bnew_2);
    z3 = permute(z3,[3 1 2]);
end
toc % Elapsed time is 25.991396 seconds.
diff = norm(z1-z2,inf); % Not zero since my code for z_1/z_2 isn't correct.
diff2 = norm(z1-z3,inf);

If the code for

AA = [Ar,-Ai;Ai,Ar];
BB = [Br;Bi];
zz = mmx('backslash', AA, BB);
z2=zz(1:n,:,:)+1i*zz(n+1:end,:,:);

isn't needed (as B is not a complex double) then MMX would "beat" the for loop. Thanks for all of your help with this question (and for writing the MultipleQRSolve).

댓글을 달려면 로그인하십시오.

Answer 2

Matt J 2021년 7월 29일

1
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/888132-applying-vectorization-techniques-to-speedup-the-performance-of-dividing-a-3d-matrix-by-a-2d-matrix#answer_756807

편집: Matt J 2021년 7월 29일

Another idea.

clc; clear all;
% Test Data 
% I'm trying to remove the for loop for j in the code below
N = 10; 
M = 10;
Nx = 32; % Ny=Nx=Nz
Nz = 32;
Ny = 32;
AA=kron(speye(Nx),ones(Nx+1));
map=logical(AA);
% Original Code (Partially vectorized)
tic
for n=0:N
  for m=0:M
      ....
          
    %Vectorized code
    AA(map)=A(:);
    Uhat=reshape(AA\b(:),Nx+1,Nx).';
    
    
  end
end
toc

댓글 수: 5
이전 댓글 3개 표시이전 댓글 3개 숨기기

Matthew Kehoe 2021년 7월 29일

편집: Matthew Kehoe 2021년 7월 29일

In a local test kron appears to be slower than writing for loops:

clc; clear all;
% Test Data 
N = 10; 
M = 10;
Nx = 32; % Ny=Nx=Nz
Nz = 32;
Ny = 32;
Fnmhat = rand(Nx,Nz+1);
Jnmhat = rand(Nx,1);
xi_n_m_hat = rand(Nx,N+1,M+1);
Uhat = zeros(Nx,Nz+1);
Uhat_2 = zeros(Nx,Nz+1);
identy = eye(Ny+1,Ny+1);
p = rand(Nx,1);
gammap = rand(Nx,1);
D = rand(Nx+1,Ny+1);
D2 = rand(Nx+1,Ny+1);
D_start = D(1,:);
D_end = D(end,:);
gamma = 1.5;
alpha = 0; % this could be non-zero
ntests = 100;
AA=kron(speye(Nx),ones(Nx+1));
map=logical(AA);
% Method 1: Original Code (Partially vectorized)
tic
for ii=1:ntests
for n=0:N
  for m=0:M
    b = Fnmhat.';
    alphaalpha = 1.0;
    betabeta = 0.0;  % this could be non-zero
    gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
    d_min = 1.0;
    n_min = 0.0;  % this could be non-zero
    r_min = xi_n_m_hat(:,n+1,m+1);
    d_max = -1i.*gammap;
    n_max = 1.0;
    r_max = Jnmhat;
    A = alphaalpha*D2 + betabeta*D + reshape(gammagamma,1,1,Nx).*identy;
    A(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
    b(end,:) = r_min;
      
    A(end,end,:) = A(end,end,:) + d_min;
    A(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
    A(1,1,:) = A(1,1,:) + reshape(d_max,1,1,Nx);
    b(1,:) = r_max;
    
    % Non-vectorized code
    for j=1:Nx
     utilde = linsolve(A(:,:,j),b(:,j)); % A\b
     Uhat(j,:) = utilde.';
    end
  end
end
end
toc
% Method 2: New Code with kron
tic
for ii=1:ntests
for n=0:N
  for m=0:M
    b = Fnmhat.';
    alphaalpha = 1.0;
    betabeta = 0.0;  % this could be non-zero
    gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
    d_min = 1.0;
    n_min = 0.0;  % this could be non-zero
    r_min = xi_n_m_hat(:,n+1,m+1);
    d_max = -1i.*gammap;
    n_max = 1.0;
    r_max = Jnmhat;
    A2 = alphaalpha*D2 + betabeta*D + reshape(gammagamma,1,1,Nx).*identy;
    A2(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
    b(end,:) = r_min;
      
    A2(end,end,:) = A2(end,end,:) + d_min;
    A2(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
    A2(1,1,:) = A2(1,1,:) + reshape(d_max,1,1,Nx);
    b(1,:) = r_max;
    
    % Non-vectorized code
    %for j=1:Nx
    %  utilde_2 = linsolve(A2(:,:,j),b(:,j)); % A\b
    %  Uhat_2(j,:) = utilde_2.';
    %end
    
    AA(map)=A2(:);
    Uhat_2=reshape(AA\b(:),Nx+1,Nx).';
  end
end
end
toc
diff = norm(Uhat - Uhat_2,inf); % is near 1e-14

The elapsed time is

Elapsed time for method 1 (for loops) is 16.629913 seconds.
Elapsed time for method 2 (kron) is 42.182836 seconds.

The line

AA(map)=A2(:);

appears to take more time than the for loop.

Bruno Luong 2021년 7월 29일

AA(map)=A2(:)

Well, I know someone who is wonder why (:) can be much slower than reshape. ;-)

Matt J 2021년 7월 29일

Yeah, I didn't see that A was complex-valued. So,

    AA(map)=rehape(A,[],1);
    Uhat=reshape(AA\b(:),Nx+1,Nx).';

댓글을 달려면 로그인하십시오.

Answer 3

Matt J 2021년 7월 29일

0
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/888132-applying-vectorization-techniques-to-speedup-the-performance-of-dividing-a-3d-matrix-by-a-2d-matrix#answer_756107

편집: Matt J 2021년 7월 29일

On the GPU (i.e. if A and b are gpuArrays), the for-loop can be removed:

Uhat = permute( pagefun(@mldivide,A,reshape(b,[],1,Nx)) ,[2,1,3]);

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

Matthew Kehoe 2021년 7월 29일

This approach requires the Parallel Computing Toolbox. I will investigate getting this toolbox. Is there another approach that doesn't require a separate toolbox?

댓글을 달려면 로그인하십시오.

Applying vectorization techniques to speedup the performance of dividing a 3D matrix by a 2D matrix

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

채택된 답변

댓글 수: 9
이전 댓글 7개 표시이전 댓글 7개 숨기기

추가 답변 (2개)

댓글 수: 5
이전 댓글 3개 표시이전 댓글 3개 숨기기

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

참고 항목

카테고리

태그

제품

릴리스

Community Treasure Hunt

Applying vectorization techniques to speedup the performance of dividing a 3D matrix by a 2D matrix

댓글 수: 1 이전 댓글 -1개 표시이전 댓글 -1개 숨기기

채택된 답변

댓글 수: 9 이전 댓글 7개 표시이전 댓글 7개 숨기기

추가 답변 (2개)

댓글 수: 5 이전 댓글 3개 표시이전 댓글 3개 숨기기

댓글 수: 1 이전 댓글 -1개 표시이전 댓글 -1개 숨기기

참고 항목

카테고리

태그

제품

릴리스

Community Treasure Hunt

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

댓글 수: 9
이전 댓글 7개 표시이전 댓글 7개 숨기기

댓글 수: 5
이전 댓글 3개 표시이전 댓글 3개 숨기기

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기