MEX implementation of tensor-product multiplications

Dear all,
I am looking for a more efficient way to compute tensor-product multiplications of arbitrary sizes. I have already searched the web for inspiration from different solutions, e.g. here Tensorproduct at stackexchange:
1.) External libraries, e.g. Eigen C++.
2.) A pure MATLAB solution based on reshape and shiftdim.
3.) The MATLAB Tensor Toolbox.
However, so far I have ended up with a MEX-file implementation with OpenMP parallelization.
--
In the following, I will give a short introduction to the math behind the implementation:
Suppose you have a matrix-vector multiplication where the matrix C is constructed as a Kronecker product of a matrix A of size (n x m) and a matrix B of size (p x q). More information can be found here Wikipedia.
In two dimensions this operation can be performed with O(npq+qnm) operations instead of O(mqnp) operations.
--
So far so good. In the following example the sizes of the arrays are
matrix_xi = 4x4,
matrix_eta = 4x4,
vector = 16x500000x5,
(new version: vector = 16x2500000)
and they may be initialized with ones or zeros. Moreover, rather than calculating only one Kronecker matrix-vector multiplication, I want to calculate e.g. 500000x5 such operations in one call (see the size of vector).
--
My current MEX C-file implementation:
/*************************************************
 * CALLING:
 *
 *   out = tensorproduct(A,B,vector)
 *
 *************************************************/
#include "mex.h"
#include <omp.h>

#define PP_A      prhs[0]
#define PP_B      prhs[1]
#define PP_vector prhs[2]
#define PP_out    plhs[0]

/* compile-time sizes: A is n-by-m, B is p-by-q */
#define n 4
#define m 4
#define p 4
#define q 4

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int i, j, k, s, l;
    double temp[n*q];
    double *A      = mxGetPr(PP_A);
    double *B      = mxGetPr(PP_B);
    double *vector = mxGetPr(PP_vector);
    const mwSize *dim = mxGetDimensions(PP_vector);
    l = dim[1];                                       /* number of slices X */
    PP_out = mxCreateDoubleMatrix(l*n*p, 1, mxREAL);  /* zero-initialized */
    double *out = mxGetPr(PP_out);
    #pragma omp parallel for private(i,j,k,s,temp)
    for (k = 0; k < l; k++) {
        /* clear the n-by-q scratch block */
        for (s = 0; s < q; s++)
            for (i = 0; i < n; i++)
                temp[i + s*n] = 0;
        /* TMP = A*X, with X = reshape(vector(:,k), m, q) */
        for (s = 0; s < q; s++)
            for (i = 0; i < n; i++)
                for (j = 0; j < m; j++)
                    temp[i + s*n] += A[i + j*n]*vector[j + s*m + k*m*q];
        /* OUT(:,k) = vec(TMP*B') */
        for (s = 0; s < n; s++)
            for (i = 0; i < p; i++)
                for (j = 0; j < q; j++)
                    out[i*n + s + k*n*p] += B[i + j*p]*temp[j*n + s];
    }
}
--
The code can be compiled with:
mex CFLAGS='$CFLAGS -fopenmp -Ofast -funroll-loops' ...
    COPTIMFLAGS='$COPTIMFLAGS -fopenmp -Ofast -funroll-loops' ...
    LDFLAGS='$LDFLAGS -fopenmp' ...
    LDOPTIMFLAGS='$LDOPTIMFLAGS -fopenmp' ...
    tensorproduct.c
--
Current timings on my notebook (Ubuntu 18.04, GCC 7.5.0, 4 cores):
matrix_xi = ones(4,4);
matrix_eta = ones(4,4);
vector = ones(16,500000,5);
n = 50;
tic
for i=1:n
vector_out = reshape(tensorproduct(matrix_xi,matrix_eta,vector),size(vector));
end
toc
% Elapsed time is 8.036490 seconds.
--
A quite simple MATLAB implementation without using the Kronecker property gives
matrix=kron(matrix_xi,matrix_eta);
tic
for i=1:n
vector_out = reshape(matrix*vector(:,:),size(vector));
end
toc
% Elapsed time is 10.489606 seconds.
--
With a different size n,m,p,q = 7 the speedup surprisingly does not improve and even gets worse (here you have to change the constants in the C file):
matrix_xi = ones(7,7);
matrix_eta = ones(7,7);
vector = ones(49,500000,5);
n = 50;
tic
for i=1:n
vector_out = reshape(tensorproduct(matrix_xi,matrix_eta,vector),size(vector));
end
toc
% Elapsed time is 31.436541 seconds.
matrix=kron(matrix_xi,matrix_eta);
tic
for i=1:n
vector_out = reshape(matrix*vector(:,:),size(vector));
end
toc
% Elapsed time is 35.829957 seconds.
Luckily I am still faster than MATLAB.
--
Note that only n,m,p,q are compile-time constants. To make the code usable for arbitrary sizes I have written a macro that generates many C files with different const n,m,p,q, which are later included via a header and a main C file. This helps the compiler unroll and inline the code.
This is the most efficient approach I have achieved so far. I hope there are some experts in this forum who can help optimize these few lines. I am also open to other solutions, e.g. MATLAB-based or based on external libraries.
Thank you!

4 Comments

Bruno Luong on 6 Dec 2020
Edited: 6 Dec 2020
"In two dimensions this operation can be performed with O(npq+qnm) operations instead of O(mqnp) operations. "
Which operation? Please define clearly which variable you want to compute from which input variables.
So far so good. In following example the sizes of the arrays are
matrix_xi = 4x4,
matrix_eta = 4x4,
vector = 16x500000x5,
What is the relation to A, B, C, X on Wikipedia? They are all matrices and none is a 3-D array. Please use consistent notation to describe the desired calculation.
ConvexHull on 6 Dec 2020
Edited: 6 Dec 2020
I refer to the matrix-vector multiplication where the matrix C is built up by kron. Commonly this is called the "vec trick".
C=kron(A,B);
With N ≡ n=m=p=q you can perform the matrix-vector product with O(2*N*N*N) instead of O(N*N*N*N) operations.
Here:
C = matrix
B = matrix_xi.';
A = matrix_eta;
vec(X) = vector (500000x5 times).
You can think of it in a simple way as applying matrix-vector multiplications line-wise in each direction. Here I have visualized X in matrix representation in order to apply the tensor-product multiplication line-wise. Without the Kronecker property, X would be handled as a vector.
  • A horizontal arrow represents a matrix-vector multiplication with e.g. "matrix_xi" on a small slice of X.
  • A vertical arrow represents a matrix-vector multiplication with e.g. "matrix_eta" on a small slice of X.
  • The naive implementation would be to unfold X into a vector representation and apply "matrix" to the whole of X.
Are you familiar with tensor products and their properties?
What exactly is unclear? I thought I had defined everything sufficiently.
Regards
Yes, I do know tensors and kron. But your notation is confusing. Actually your second statement
A = matrix_xi
B = matrix_eta
is wrong. Actually it is
B = matrix_xi.';
A = matrix_eta;
And your vector consists of a set of multiple X matrices.
Each reshape(vector(:,p,q), size(A,2), size(B,1)) is an X.
Your description is very confusing.
I knew what you wanted to compute, but for the general reader the description is careless and confusing.
ConvexHull on 6 Dec 2020
Edited: 6 Dec 2020
Sorry, I was a little bit inaccurate here. I hope you got it.
Summarizing: I simply want to use the Kronecker property to apply many (500000*5) efficient matrix-vector multiplications, using O(npq+qnm) operations each, with one call.
Thank you for your help.
Edit: My main concern is a more efficient C implementation.


Accepted Answer

Bruno Luong on 6 Dec 2020
Edited: 22 Aug 2021
I propose two other MATLAB methods.
The first simply replaces (:,:) by reshape: (:,:) does a data copy, reshape does not. It seems to be the fastest.
The second method uses MATLAB's pagemtimes, introduced in R2020b. (Edit: Jan corrected the release.)
function benchtensor
matrix_xi = rand(7,7);
matrix_eta = rand(7,7);
vector = rand(49,500000,5);
n = 50;
tic
for i=1:n
vector_out = reshape(tensorproduct(matrix_xi,matrix_eta,vector),size(vector));
end
toc % Elapsed time is 19.429296 seconds.
matrix=kron(matrix_xi,matrix_eta);
tic
for i=1:n
vector_out = reshape(matrix*vector(:,:),size(vector));
end
toc % Elapsed time is 28.08963 seconds.
tic
for i=1:n
vector_out = reshape(matrix*reshape(vector,49,[]),size(vector));
end
toc % Elapsed time is 14.30025 seconds.
B = matrix_xi.';
A = matrix_eta;
tic
for i=1:n
X = reshape(vector, size(A,2), size(B,1), []);
CX = pagemtimes(pagemtimes(A, X), B); % Reshape CX if needed
end
toc % Elapsed time is 29.296709 seconds.
end

17 Comments

ConvexHull on 6 Dec 2020
Edited: 6 Dec 2020
Thank you Bruno! I would like to understand the performance gain of your example:
tic
for i=1:n
vector_out = reshape(matrix*reshape(vector,49,[]),size(vector));
end
toc % Elapsed time is 14.45178 seconds.
On my computer it is nearly 3 times faster than:
tic
for i=1:n
vector_out = reshape(matrix*vector(:,:),size(vector));
end
toc % Elapsed time is 38.523173 seconds.
What's the reason?
I already explained it: "(:,:) does a data copy, reshape does not."
ConvexHull on 6 Dec 2020
Edited: 6 Dec 2020
OK, I got it. Does this also work for n~=m~=p~=q? The output would have a different dimension.
ConvexHull on 6 Dec 2020
Edited: 6 Dec 2020
Yes it does!
This will help me. Thank you for your help.
Edit: I have accepted your answer. However, perhaps there is also a faster solution by using the Kronecker property.
Bruno Luong on 6 Dec 2020
Edited: 6 Dec 2020
The Kronecker form collapses the two left/right products into a single matrix-vector product for each X.
If you look closely at the flops, they require 4*N^3 vs 2*N^3 flops respectively. The Kronecker form collapses three nested loops into two. The memory access of the Kronecker version might also be more consecutive, so the cache is used more efficiently.
It's all win/win for the Kronecker method. You might adapt your C code to preprocess with the Kronecker product rather than doing the left/right multiplication.
ConvexHull on 6 Dec 2020
Edited: 6 Dec 2020
What exactly do you mean? The C code already uses the line-wise multiplication property. Can the C code be improved in some way?
Your C code carries out
TMP=(X*B) % in the loop
OUT=A*TMP
and not
C=kron(B.',A); % once
OUT=C*vec(X); % in the loop
ConvexHull on 6 Dec 2020
Edited: 6 Dec 2020
Yes, this is correct. To my knowledge, the first one is the cheap one (in terms of flops).
Otherwise you would have to perform a 49 x 49 matrix-vector multiplication. My goal is to avoid building and using the matrix C.
No, Kronecker is twice cheaper, as I indicated with the flop counts in my earlier post. You multiply the matrix with a vector. Count again.
Bruno Luong on 6 Dec 2020
Edited: 6 Dec 2020
Sorry, I stand corrected: Kronecker requires 2*N^4 flops.
ConvexHull on 6 Dec 2020
Edited: 6 Dec 2020
I corrected an error in my previous post. It should be O(2*N*N*N), aka O(npq+qnm), with
TMP=(X*B) % in the loop
OUT=A*TMP
instead of O(N*N*N*N), aka O(n*m*p*q), with
C=kron(B.',A); % once
OUT=C*vec(X); % in the loop
Edit: Hopefully it is correct now.
Make it O(N*N*N+N*N*N) for
TMP=(X*B) % in the loop
OUT=A*TMP
The (m,n) x (n,p) matrix product requires more or less 2*m*n*p flops.
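To make these counts concrete for N = 7, using the convention above that an (m,n) x (n,p) product costs about 2*m*n*p flops:

TMP=X*B and OUT=A*TMP together cost 2*N^3 + 2*N^3 = 4*7^3 = 1372 flops per X, while OUT=C*vec(X) with the assembled C costs 2*N^4 = 2*7^4 = 4802 flops per vec(X), roughly 3.5 times more.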
ConvexHull on 6 Dec 2020
Edited: 6 Dec 2020
You mentioned careless and confusing ;-). Now I am fairly sure I have corrected the most incorrect parts, also in my previous answers.
Some suggestions you could try (depending on whether GCC already sees the optimization):
  • Merging the k/l loops.
  • Precomputing (k*vd+l*vd*vb) in the last nested loop computing OUT.
  • Using memset to initialize temp.
Jan on 22 Aug 2021
@Bruno Luong: pagemtimes was introduced in R2020b, not R2018b.
Thanks Jan, I fixed it.
ConvexHull on 23 Aug 2021
Edited: 23 Aug 2021
Note that I merged the k/l loops in the original post as suggested. For the C code you then have to use vector = size(49x2500000) instead of vector = size(49x500000x5).
The other two hints did not give me any speedup.


More Answers (0)


Asked: 6 Dec 2020
Edited: 23 Aug 2021
