MEX implementation of tensor-product multiplications

Dear all,
I am looking for a more efficient way to compute tensor-product multiplications of arbitrary sizes. I have already searched the web for inspiration from different solutions, e.g. here Tensorproduct at stackexchange:
1.) External libraries, e.g. Eigen C++.
2.) A pure MATLAB solution based on reshape and shiftdim.
3.) The MATLAB Tensor Toolbox.
However, so far I have ended up with a MEX-file implementation with OpenMP parallelization.
--
In the following, I will give a short introduction to the math behind the implementation:
Suppose you have a matrix-vector multiplication where the matrix C is constructed as a Kronecker product of a matrix A of size (n x m) and a matrix B of size (p x q). More information can be found here Wikipedia.
In two dimensions this operation can be performed with O(npq+qnm) operations instead of O(mqnp) operations.
--
So far so good. In the following example the sizes of the arrays are
matrix_xi = 4x4,
matrix_eta = 4x4,
vector = 16x500000x5,
(new version: vector = 16x2500000)
and they may be initialized with ones or zeros. Moreover, rather than calculating only one Kronecker matrix-vector multiplication, I want to calculate e.g. 500000x5 such operations in one call (see the size of vector).
--
My current MEX C-file implementation:
/*************************************************
 * CALLING:
 *
 *   out = tensorproduct(A,B,vector)
 *
 *************************************************/
#include "mex.h"
#include <omp.h>

#define PP_A      prhs[0]
#define PP_B      prhs[1]
#define PP_vector prhs[2]
#define PP_out    plhs[0]

/* compile-time sizes: A is n-by-m, B is p-by-q */
#define n 4
#define m 4
#define p 4
#define q 4

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int i, j, k, s, l;
    double temp[n*q];
    double *A      = mxGetPr(PP_A);
    double *B      = mxGetPr(PP_B);
    double *vector = mxGetPr(PP_vector);
    const mwSize *dim = mxGetDimensions(PP_vector);
    l = dim[1];                                       /* number of slices X */
    PP_out = mxCreateDoubleMatrix(l*n*p, 1, mxREAL);  /* zero-initialized */
    double *out = mxGetPr(PP_out);
    #pragma omp parallel for private(i,j,k,s,temp)
    for (k = 0; k < l; k++) {
        /* clear the n-by-q scratch block */
        for (s = 0; s < q; s++)
            for (i = 0; i < n; i++)
                temp[i + s*n] = 0;
        /* TMP = A*X, with X = reshape(vector(:,k), m, q) */
        for (s = 0; s < q; s++)
            for (i = 0; i < n; i++)
                for (j = 0; j < m; j++)
                    temp[i + s*n] += A[i + j*n]*vector[j + s*m + k*m*q];
        /* OUT(:,k) = vec(TMP*B') */
        for (s = 0; s < n; s++)
            for (i = 0; i < p; i++)
                for (j = 0; j < q; j++)
                    out[i*n + s + k*n*p] += B[i + j*p]*temp[j*n + s];
    }
}
--
The code can be compiled with:
mex CFLAGS='$CFLAGS -fopenmp -Ofast -funroll-loops' ...
    COPTIMFLAGS='$COPTIMFLAGS -fopenmp -Ofast -funroll-loops' ...
    LDFLAGS='$LDFLAGS -fopenmp' ...
    LDOPTIMFLAGS='$LDOPTIMFLAGS -fopenmp' ...
    tensorproduct.c
--
Current timings on my notebook (Ubuntu 18.04, GCC 7.5.0, 4 cores):
matrix_xi = ones(4,4);
matrix_eta = ones(4,4);
vector = ones(16,500000,5);
n = 50;
tic
for i=1:n
vector_out = reshape(tensorproduct(matrix_xi,matrix_eta,vector),size(vector));
end
toc
% Elapsed time is 8.036490 seconds.
--
A quite simple MATLAB implementation without using the Kronecker property gives
matrix=kron(matrix_xi,matrix_eta);
tic
for i=1:n
vector_out = reshape(matrix*vector(:,:),size(vector));
end
toc
% Elapsed time is 10.489606 seconds.
--
With a different size n,m,p,q = 7 the speedup surprisingly does not improve and even gets worse (here you have to change the constants in the C file):
matrix_xi = ones(7,7);
matrix_eta = ones(7,7);
vector = ones(49,500000,5);
n = 50;
tic
for i=1:n
vector_out = reshape(tensorproduct(matrix_xi,matrix_eta,vector),size(vector));
end
toc
% Elapsed time is 31.436541 seconds.
matrix=kron(matrix_xi,matrix_eta);
tic
for i=1:n
vector_out = reshape(matrix*vector(:,:),size(vector));
end
toc
% Elapsed time is 35.829957 seconds.
Luckily I am still faster than MATLAB.
--
Note that only n,m,p,q are compile-time constants. To make the code usable for arbitrary sizes I have written a macro that generates many C files with different const n,m,p,q, which are later included via a header and a main C file. This helps the compiler unroll and inline the code.
This is the most efficient approach I have achieved so far. I hope there are some experts in this forum who can help optimize these few lines. I am also open to other solutions, e.g. MATLAB-based or based on external libraries.
Thank you!

4 Comments

Bruno Luong on 6 Dec 2020
Edited: 6 Dec 2020
"In two dimensions this operation can be performed with O(npq+qnm) operations instead of O(mqnp) operations. "
Which operation? Please define clearly which variable you want to compute from which input variables.
So far so good. In following example the sizes of the arrays are
matrix_xi = 4x4,
matrix_eta = 4x4,
vector = 16x500000x5,
What is the relation to A, B, C, X on Wikipedia? They are all matrices and none is a 3-D array. Please use consistent notation to describe the desired calculation.
ConvexHull on 6 Dec 2020
Edited: 6 Dec 2020
I refer to the matrix-vector multiplication where the matrix C is built up by kron. Commonly this is called the "vec trick".
C=kron(A,B);
With N ≡ n=m=p=q you can perform the matrix-vector product with O(2*N*N*N) instead of O(N*N*N*N) operations.
Here:
C = matrix
B = matrix_xi.';
A = matrix_eta;
vec(X) = vector (500000x5 times).
You can think of it in a simple way as applying matrix-vector multiplications line-wise in each direction. Here I have visualized X in matrix representation in order to apply the tensor-product multiplication line-wise. Without the Kronecker property, X would be handled as a vector.
  • A horizontal arrow represents a matrix-vector multiplication with e.g. "matrix_xi" on a small slice of X.
  • A vertical arrow represents a matrix-vector multiplication with e.g. "matrix_eta" on a small slice of X.
  • The naive implementation would be to unfold X into a vector representation and apply "matrix" to the whole of X.
Are you familiar with tensor products and their properties?
What exactly is unclear? I thought I had defined everything sufficiently.
Regards
Yes, I do know tensors and kron. But your notation is confusing. Actually your second statement
A = matrix_xi
B = matrix_eta
is wrong. Actually it is
B = matrix_xi.';
A = matrix_eta;
And your vector consists of a set of multiple X matrices.
Each reshape(vector(:,p,q), size(A,2), size(B,1)) is an X.
Your description is very confusing.
I knew what you wanted to compute, but for the general reader the description is careless and confusing.
ConvexHull on 6 Dec 2020
Edited: 6 Dec 2020
Sorry, I was a little bit inaccurate here. I hope you got it.
Summarizing: I simply want to use the Kronecker property to apply many (500000*5) efficient matrix-vector multiplications, using O(npq+qnm) operations each, with one call.
Thank you for your help.
Edit: My main concern is a more efficient C implementation.


Accepted Answer

Bruno Luong on 6 Dec 2020
Edited: 22 Aug 2021
I propose two other MATLAB methods.
The first simply replaces (:,:) by reshape: (:,:) does a data copy, reshape does not. It seems to be the fastest.
The second method uses MATLAB's pagemtimes, introduced in R2020b. (Edit: Jan corrected the release.)
function benchtensor
matrix_xi = rand(7,7);
matrix_eta = rand(7,7);
vector = rand(49,500000,5);
n = 50;
tic
for i=1:n
vector_out = reshape(tensorproduct(matrix_xi,matrix_eta,vector),size(vector));
end
toc % Elapsed time is 19.429296 seconds.
matrix=kron(matrix_xi,matrix_eta);
tic
for i=1:n
vector_out = reshape(matrix*vector(:,:),size(vector));
end
toc % Elapsed time is 28.08963 seconds.
tic
for i=1:n
vector_out = reshape(matrix*reshape(vector,49,[]),size(vector));
end
toc % Elapsed time is 14.30025 seconds.
B = matrix_xi.';
A = matrix_eta;
tic
for i=1:n
X = reshape(vector, size(A,2), size(B,1), []);
CX = pagemtimes(pagemtimes(A, X), B); % Reshape CX if needed
end
toc % Elapsed time is 29.296709 seconds.
end

17 Comments

ConvexHull on 6 Dec 2020
Edited: 6 Dec 2020
Thank you Bruno! I would like to understand the performance gain of your example:
tic
for i=1:n
vector_out = reshape(matrix*reshape(vector,49,[]),size(vector));
end
toc % Elapsed time is 14.45178 seconds.
On my computer it is nearly 3 times faster than:
tic
for i=1:n
vector_out = reshape(matrix*vector(:,:),size(vector));
end
toc % Elapsed time is 38.523173 seconds.
What's the reason?
I already explained it: "(:,:) does a data copy, reshape does not."
ConvexHull on 6 Dec 2020
Edited: 6 Dec 2020
OK, I got it. Does this also work for n~=m~=p~=q? The output would have a different dimension.
ConvexHull on 6 Dec 2020
Edited: 6 Dec 2020
Yes it does!
This will help me. Thank you for your help.
Edit: I have accepted your answer. However, perhaps there is also a faster solution by using the Kronecker property.
Bruno Luong on 6 Dec 2020
Edited: 6 Dec 2020
The Kronecker form collapses the two left/right products into a single matrix-vector product for each X.
If you look closely at the flops, they require 4*N^3 vs 2*N^3 flops respectively. The Kronecker form collapses three nested loops into two. The memory access of the Kronecker version might also be more consecutive, so the cache is used more efficiently.
It's all win/win for the Kronecker method. You might adapt your C code to preprocess with the Kronecker product rather than doing the left/right multiplication.
ConvexHull on 6 Dec 2020
Edited: 6 Dec 2020
What exactly do you mean? The C code already uses the line-wise multiplication property. Can the C code be improved in some way?
Your C code carries out
TMP=(X*B) % in the loop
OUT=A*TMP
and not
C=kron(B.',A); % once
OUT=C*vec(X); % in the loop
ConvexHull on 6 Dec 2020
Edited: 6 Dec 2020
Yes, this is correct. To my knowledge, the first one is the cheap one (in terms of flops).
Otherwise you would have to perform a 49 x 49 matrix-vector multiplication. My goal is to avoid building and using the matrix C.
No, Kronecker is twice cheaper, as I indicated with the flop counts in my earlier post. You multiply the matrix with a vector. Count again.
Bruno Luong on 6 Dec 2020
Edited: 6 Dec 2020
Sorry, I stand corrected: Kronecker requires 2*N^4 flops.
ConvexHull on 6 Dec 2020
Edited: 6 Dec 2020
I corrected an error in my previous post. It should be O(2*N*N*N), aka O(npq+qnm), with
TMP=(X*B) % in the loop
OUT=A*TMP
instead of O(N*N*N*N), aka O(n*m*p*q), with
C=kron(B.',A); % once
OUT=C*vec(X); % in the loop
Edit: Hopefully it is correct now.
Make it O(N*N*N+N*N*N) for
TMP=(X*B) % in the loop
OUT=A*TMP
The (m,n) x (n,p) matrix product requires more or less 2*m*n*p flops.
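To make these counts concrete for N = 7, using the convention above that an (m,n) x (n,p) product costs about 2*m*n*p flops:

TMP=X*B and OUT=A*TMP together cost 2*N^3 + 2*N^3 = 4*7^3 = 1372 flops per X, while OUT=C*vec(X) with the assembled C costs 2*N^4 = 2*7^4 = 4802 flops per vec(X), roughly 3.5 times more.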
ConvexHull on 6 Dec 2020
Edited: 6 Dec 2020
You mentioned careless and confusing ;-). Now I am fairly sure I have corrected the most incorrect parts, also in my previous answers.
Some suggestions you could try (depending on whether GCC already sees the optimization):
  • Merging the k/l loops.
  • Precomputing (k*vd+l*vd*vb) in the last nested loop computing OUT.
  • Using memset to initialize temp.
Jan on 22 Aug 2021
@Bruno Luong: pagemtimes was introduced in R2020b, not R2018b.
Thanks Jan, I fixed it.
ConvexHull on 23 Aug 2021
Edited: 23 Aug 2021
Note that I merged the k/l loops in the original post as suggested. For the C code you then have to use vector = size(49x2500000) instead of vector = size(49x500000x5).
The other two hints did not give me any speedup.


More Answers (0)


Asked: 6 Dec 2020
Edited: 23 Aug 2021
