Major speed reduction (~50x) when doing multiple matrix multiply operations in a row. Is this a (major) bug? Any ideas for a fix?

Question

Anthony Barone 2018년 10월 5일

1
링크

이 질문에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/422376-major-speed-reduction-50x-when-doing-multiple-matrix-multiply-operations-in-a-row-is-this-a-maj

댓글: Anthony Barone 2018년 10월 8일

The code I am running more-or-less implements the following

 % PREALLOCATE. For the current tests I'm running: N=1200, M=2049, and L=256
 dataCur = <some [N x N x M] array>
 dataOut = zeros(L,N,N);
 % START PRIMARY / OUTER LOOP
 for kkOuter = 1:L
     % COMPUTE 'dataOut' SLICE FOR CURRENT OUTER LOOP ITERATION
     % NOTE: this is the same as doing "dataOut(kkOuter,:,:) = reshape(sum(dataCur,3),[1,N,N])"
     dataOut(kkOuter,:,:) = reshape((reshape(dataCur,[],M)*ones(M,1)).',[1,N,N]);    
     % UPDATE 'dataCur' BY LOOPING THROUGH IT ONE SLICE AT A TIME USING AN INNER LOOP
     for kkInner = 1:M
         W = someFunction(kkInner);  % this is a [N x N] array
        dataCur(:,:,kkInner) = transpose(W) * dataCur(:,:,kkInner) * W;
     end
 end

The evolution of the code's efficiency is shown here (note: the x axis is actually inner loop iterations / 10, since I collected time data every 10th iteration of the inner loop).

Clearly something is wrong.

Note that there is a "shift" where it goes from a "slow mode" of ~0.5 iterations per second to a "fast mode" of ~20 iterations per second. This can be seen in the upper image. The lower image shows overall average rate since the beginning of the current output loop iteration, so you dont see the sudden shift there as much.

.

WHAT IVE FIGURED OUT

The problem seems to be related to thread scheduling somehow or another. In that link the first 2 images show it running in "fast mode" (on linux mint), and the last 3 show it running in "slow mode".

Unfortunately, thats about all Ive got. Ive ruled a few things out (see below), but have no idea why this consistently keeps happening.

.

WHAT IVE RULED OUT

1) The OS. Unless Windows 10 and Linux Mint 19 both have this exact same issue.

2) MKL. Unless MKL 2017.0, 2018.0, 2018.3 and 2019.0 all have the exact same problem. 2018.3 and 2019.0 were used by launching MATLAB with the following command (this shows 2019.0 oon linux, using windows and using 2018.3 are fairly analogous)

/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/bin/mklvars.sh intel64 ilp64 && export "BLAS_VERSION=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_rt.so" && export "LAPACK_VERSION=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_rt.so" && export "MKL_THREADING_LAYER=INTEL" && export "MKL_INTERFACE_LAYER=ILP64" && "/usr/local/MATLAB/$(ls /usr/local/MATLAB | sort -g | tail -n 1)/bin/matlab"

So, unless the processor is straight ignoring the requested thread scheduling for a while (which idk if that is even possible), MATLAB seems the most likely culprit.

3) Memory fragmentation. I thought this might be some weird physical vs virtual memory issue where MATLAB's virtual addresses were sequential but the corresponding physical addresses got more and more fragmented, but ive tried a) implementing the inner loop in a separate function, and b) forcing the data into new memory addresses at every outer loop iteration using the following at the start of each outer loop iteration:

 clear dataOld; dataOld=zeros(size(dataCur),'like',dataCur)+ones(1,1,class(dataCur))-ones(1,1,class(dataCur)); dataOld=dataCur; clear dataCur; dataCur=zeros(size(dataOld),'like',dataOld);

If that doesnt force the data to be deep copied to new addresses then IDK what would.

.

OTHER INFO

CPU is a i9-7940x. This has AVX-512, which MATLAB doesnt seem to understand. For example, running `version -blas` tells me the CNR branch is unknown (this usually tells you if it uses AVX/AVX2/SSE/etc.) MKL is definitely using AVX512, but I figure this might possibly have something to do with it? (mainly because nothing else I try seems to have any effect)

.

Any ideas for a fix that dont involve "waiting for an undetermined number of MATLAB versions unless it works right out-of-the-bpx" would be much appreciated.

Thanks in advance.

댓글 수: 2
없음 표시없음 숨기기

Matt J 2018년 10월 5일

What version of Matlab is this?

Anthony Barone 2018년 10월 5일

편집: Anthony Barone 2018년 10월 5일

Happens on both R2018a and R2018b.

It is perhaps worth noting that the data are complex, so maybe it is something with the interleaved complex api? If this seems even remotely possible I can install an older version and try it. Id think this would make it easier, since MKL uses interleaved complex values, but who knows...

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Answer 1

James Tursa 2018년 10월 5일

0
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/422376-major-speed-reduction-50x-when-doing-multiple-matrix-multiply-operations-in-a-row-is-this-a-maj#answer_339955

편집: James Tursa 2018년 10월 5일

MATLAB Online에서 열기

You should not store your slices as follows, since this forces each slice to be scattered in memory and does not make efficient use of the cache when accessing the slices. It also forces a deep data copy every time you access the slices:

dataOut(kkOuter,:,:)

Rather, you should be storing your slices this way, since this forces each slice to be contiguous in memory and makes efficient use of the cache when accessing the slices:

dataOut(:,:,kkOuter)

Complex data comments:

The BLAS complex matrix multiply routines require interleaved data. Since R2018a and later also use this format, there will be no data copy required for this particular reason. However, there may be a data copy required when accessing slices of a matrix (rather than the entire matrix). There are indications in other Answers threads that MATLAB may implement methods that do not require data copies in some circumstances when the slices are in the first two dimensions as recommended above, but the rules for this are not published to my knowledge. You could of course write mex code that calls the BLAS routines directly to do the matrix multiplies and avoid any data copying in this case, but I don't think any of the FEX submissions that do this are updated for R2018a and later yet (including my own submission).

댓글 수: 9
이전 댓글 7개 표시이전 댓글 7개 숨기기

Bruno Luong 2018년 10월 5일

MATLAB Online에서 열기

To avoid data copying by addressing slice, one could use "shared data" array in 2018 with my MEX files.

The matlab code would goes as following.

One could do the same with LHS slice as well.

Beware: if you do it the Shared/Unshared not correctly, MATLAB will crash.

[m,n,p] = size(dataCur);
for k= 1:p
    W = someFunction(k);  % this is a [N x N] array
    slice_k = mxCreateSharedMatrix2018(dataCur, (k-1)*m*n, m, n);
    dataCur(:,:,k) = W.' * slice_k * W;
    mxUnshareMatrix2018(slice_k);
end
clear slide_k;

MEX files are:

/* File "mxInternals_R2018B.h"                      */
/* Hacked internal MXARRAY DEFINITION               */
/* Matlab version: R2018b                           */
#ifndef MXINTERNALS_H
#define MXINTERNALS_H
#include "mex.h"
#define NumericalMask       0x00000080
#define ComplexMask         0x00000800
#define SparseMask          0x00000010
#define MyShareMask      0x00008000
typedef struct mxArray_tag mxArray_tag;
struct mxArray_tag {
      mxArray_tag     *B_Xlink;     
      mxClassID       classID;            
      unsigned int    unknown1;           
      mxArray_tag     *F_Xlink;    
      mwSize          ndims;              
      unsigned int    refcount;                   /* Number of mxArray* objects that point to this structure */
      unsigned int    Mask;                       /* See some of the mask values defined above */     
      union { 
          mwSize          rows;                   /* Valid if the array has <= 2 dimensions */
          mwSize          *pSize;                 /* vector of size, in heap, Valid if the array has > 2 dimensions */
      } size;
      mwSize          cols;                       /* If the array has > 2 dimensions, this is actually numel(X) / size(X,1) */
      union { 
          struct {
              void            *Data;              
              unsigned int    unknown0;         
              unsigned int    unknown1;           
              unsigned int    unknown2;           
              unsigned int    unknown3;           
              unsigned int    unknown4;           
              unsigned int    unknown5;           
          } number_array;
          struct {
              void          * Data;                     
              mwIndex       * pRowIndex;          
              mwIndex       * pColFirstElement;           /* pColFirstElement[j] is the index of the first entry in column j (see mxSetJc).  The number of non-sparse elements is pColFirstElement[cols] */
              mwSize          nzmax;                      /* The number of non zero elements allocated for the sparse array */
          } sparse_array;
          struct {
              mxArray_tag  ** Data;              
              unsigned int    unknown1;           
              unsigned int    unknown2;           
              unsigned int    unknown3;           
              unsigned int    unknown4;           
              unsigned int    unknown5;           
              unsigned int    unknown6;           
              unsigned int    unknown7;           
          } cell_array;
          struct {
              mxArray_tag  ** Data;              
              void          * pFieldNames;                /* N.B. I have no idea how field names are stored */
              unsigned int    unknown1;           
              unsigned int    unknown2;           
              unsigned int    unknown3;           
              unsigned int    unknown4;           
              unsigned int    unknown5;           
          } struct_array;
      } data;
      unsigned int    unknown3;           
      unsigned int    unknown4;           
      unsigned int    unknown5;           
      unsigned int    unknown6;           
      unsigned int    unknown7;           
  };
#endif

and

    /*
     * B = mxCreateSharedMatrix2018(A)
     *
     * B = mxCreateSharedMatrix2018(A,offset,m,n)
     * Create array B that have data pointer shift by offset elements 
     * wrt to A date and have new size (m x n)
     * 
     * INPUT: A is numerical vector/matrix real or complex,
     *       (i.e., dim=2, requirement not checked)
     *
     * Create an array B (mxArray differents than A) but shares data pointer with A
     *
     * >> mex -R2018a mxCreateSharedMatrix2018.c
     */
    #include "mxInternals_R2018B.h"
    #define A_IN        prhs[0]
    #define OFFSET_IN   prhs[1]
    #define M_IN        prhs[2]
    #define N_IN        prhs[3]
    #define A_OUT       plhs[0] 
    void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
    {
        size_t m, n;
        mxClassID ClassID;
        mxComplexity ComplexFlag;
        mxArray_tag *Ain, *Aout;
        mwSize dims[2] = { 0, 0 };
      /* typedef enum {
        mxUNKNOWN_CLASS,
        mxCELL_CLASS,
        mxSTRUCT_CLASS,
        mxLOGICAL_CLASS,
        mxCHAR_CLASS,
        mxVOID_CLASS,
        mxDOUBLE_CLASS,
        mxSINGLE_CLASS,
        mxINT8_CLASS,
        mxUINT8_CLASS,
        mxINT16_CLASS,
        mxUINT16_CLASS,
        mxINT32_CLASS,
        mxUINT32_CLASS,
        mxINT64_CLASS,
        mxUINT64_CLASS,
        mxFUNCTION_CLASS
      } mxClassID;*/
      size_t bytesizetab[mxFUNCTION_CLASS+1] = {8,8,8,1,2,0,8,4,1,1,2,2,4,4,8,8,8};
      size_t offset, recbytesize;
      if (nrhs < 1) mexErrMsgTxt("mxCreateSharedMatrix2018: one input required.");
        if (nrhs >= 2)
        {
        recbytesize = bytesizetab[mxGetClassID(A_IN)];
        offset = recbytesize * (size_t)mxGetScalar(OFFSET_IN);
        }
        else offset = 0;
      if (nrhs >= 4)
      {
        m = (size_t)mxGetScalar(M_IN);
        n = (size_t)mxGetScalar(N_IN);
      }
      else
      {
        m = mxGetM(A_IN);
        n = mxGetN(A_IN);
      }
        ComplexFlag = mxIsComplex(A_IN) ? mxCOMPLEX : mxREAL;
        ClassID = mxGetClassID(A_IN);
        A_OUT = mxCreateNumericArray(2, dims, mxGetClassID(A_IN), ComplexFlag);
        mxSetM(A_OUT,m);
        mxSetN(A_OUT,n);
        Ain     = (mxArray_tag*)A_IN;
        Aout    = (mxArray_tag*)A_OUT;
        /* Update cross-link, insert Aout as back of Ain */
        Aout->B_Xlink = Ain->B_Xlink;
        Aout->F_Xlink = Ain;
        if (Aout->B_Xlink) Aout->B_Xlink->F_Xlink = Aout;
        else Aout->B_Xlink = Ain;
        if (Ain->F_Xlink==NULL) Ain->F_Xlink = Aout;
        Ain->B_Xlink = Aout;
        /* Shared data */
        Aout->data.number_array.Data = (char*)Ain->data.number_array.Data + offset;
      /* Mark it */
      // Aout->Mask |= MyShareMask;
        /* It seems I do not need to update refcount of Ain and Aout */
    }

and

/*
 * mxUnshareMatrix2018(A)
 *
 * INPUT: A is double matrix real or complex, (i.e., dim=2, not checked)
 *
 * Unshare mxArray differents than A) and Destroy it when it is the last 
 * reference to mxArray* 
 *
 * >> mex -R2018a mxUnshareMatrix2018.c
 */
#include "mxInternals_R2018B.h"
#define A_IN        prhs[0]
void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mxArray_tag *AB, *AF, *A, *Ain;
  unsigned char Found;
  int i;
  if (nrhs < 1) mexErrMsgTxt("mxUnshareMatrix2018: one input required.");
      Ain     = (mxArray_tag*)A_IN;
    A = Ain;
  AB = A->B_Xlink;
  AF = A->F_Xlink;
  /* Update cross-link, remove A from the double X-link chain */
  if (AB) AB->F_Xlink = AF;
  if (AF) AF->B_Xlink = AB;
  A->B_Xlink = A->F_Xlink = NULL;
  /* Avoid MATLAB Garbage collector to clean shared data */
  mxSetM(A, 0);
  mxSetN(A, 0);
  A->refcount = 0;
  A->data.number_array.Data = NULL;
}

Anthony Barone 2018년 10월 5일

Sorry, I meant that more as "why I set things up this way", not a "why I cant change them". I just meant that Initially it seemed like a better tradeoff to slice into non-sequential memory and not have a memory intensive (relatively speaking) operation at the end to transpose it. Using dataOut(:,:,n) isnt a huge problem, it just means at the end I need to reorder (output is to a standardized data format, so I dont have a choice but to order it a particular way).

That said I really dont think that particular operation is the problem. According to the profiler, virtually all of the compute time (minus what the `W = someFunction(kkInner);` call requires) is spent in the inner loop's `dataCur(:,:,kkInner) = transpose(W) * dataCur(:,:,kkInner) * W;` call. The recorded time in the call that slices into dataOut is a factor of 100 or more smaller. Granted sometimes MATLAB's time-keeping is a bit screwy (e.g., delayed allocation of variables), but still.

Anthony Barone 2018년 10월 5일

dataCur is symmetric, but not hermitian

Basically, dataCur is recorded ground motion from, a seismic wavefield where the (reordered) dimensions are [sourceLocationIndex, receiverLocationIndex, timeIndex]. The data are made symmetric in the source/receiver dimensions using an idea called reciprocity, which basically states that (in the high/infinite frequency limit) you record the same wavefield going from source@A-->receiver@B as from source@B-->receiver@A, since the path the wavefield takes is the same. This means data(a,b,c) and data(b,a,c) should be the same - if one is missing then the other uses the same value as the one you have, otherwise you average them.

Put another way, taking the transpose flips the positions in space, taking the conjugate flips the positions in time, and taking the hermition conjugate flips both. In this case I want the spatial positions flipped but I still want data from a wavefield moving forward in time.

Just to be 100% sure, I looked at the handful of data points where I actually had data in both data(a,b,:) and data(b,a,:). If these should be symmetric then their imaginary parts should subtract to zero. If these should be hermition then the imaginary parts should add to zero. This is what I found. Its real data (and not particularly good data) so things dont cancel perfect, but the residuals are substantially lower for the transpose case than the hermition conjugate case.

I know which MKL routines you are reffering to, but unfortunately, I dont think that MKL supports the accelerated BLAS routines you are talking about for symmetric complex values. It does for symmetric reals and hermition conjugates, but I dont think symmetric complexes will work. Ill double check though. Maybe 2019.0 added support for these...

.

As far as N and M: it depends, but my current test case is quite small by seismic data standards and has N=2000 and M=2049. This is probably standard for M (seismic data very frequently has 3000 time samples, so zero padding up to the next power of 2 and dropping the negative frequencies gives 2049). This is close to a lower limit for N though. Realistically, N will be as large as the machine can support. I only have 128 gigs in my personal rig that Im currently running this on, but dual-socket xeons with 512gb-1tb are fairly common, and there are clusters with significantly more than that. So, idk, anywhere in the range of M=1,000 to M=40,000?

Anthony Barone 2018년 10월 8일

MATLAB Online에서 열기

So, I think i may have figured out the problem. I noticed that a fair number of the environment variables that `mklvars.sh intel64 mod ilp64` set were being silently overwritten / removed during startup. If I re-set these using `setenv` things seem to be quite a bit more stable. I also added an enviroment variable that should dictate the threading affinity.

Specifically, Im using the following command.

 alias mymatlab='/opt/intel/compilers_and_libraries/linux/mkl/bin/mklvars.sh intel64 mod ilp64 verbose && export "BLAS_VERSION=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_rt.so" && export "LAPACK_VERSION=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_rt.so" && export "MKL_THREADING_LAYER=INTEL" && export "MKL_INTERFACE_LAYER=ILP64" && export "KMP_AFFINITY=granularity=fine,compact,1,0" && export "MKL_NUM_THREADS=14" && export "OMP_NUM_THREADS=14" && export && export "MKL_CBWT=AVX512" && cd ${HOME}/Documents/MATLAB && "/usr/local/MATLAB/$(ls /usr/local/MATLAB | sort -g | tail -n 1)/bin/matlab" -r "setenv('\''LD_LIBRARY_PATH'\'',['\''/opt/intel/compilers_and_libraries_2019.0.117/linux/tbb/lib/intel64_lin/gcc4.7:/opt/intel/compilers_and_libraries_2019.0.117/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin'\'','\'':'\''*~isempty(getenv('\''LD_LIBRARY_PATH'\'')),getenv('\''LD_LIBRARY_PATH'\'')]); setenv('\''LIBRARY_PATH'\'',['\''/opt/intel/compilers_and_libraries_2019.0.117/linux/tbb/lib/intel64_lin/gcc4.7:/opt/intel/compilers_and_libraries_2019.0.117/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin'\'','\'':'\''*~isempty(getenv('\''LIBRARY_PATH'\'')),getenv('\''LIBRARY_PATH'\'')]); setenv('\''NLSPATH'\'',['\''/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/locale/%l_%t/%N'\'','\'':'\''*~isempty(getenv('\''NLSPATH'\'')),getenv('\''NLSPATH'\'')]); setenv('\''CPATH'\'',['\''/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/include/intel64_lin/ilp64:/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/include'\'','\'':'\''*~isempty(getenv('\''CPATH'\'')),getenv('\''CPATH'\'')]); setenv('\''PKG_CONFIG_PATH'\'',['\''/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/bin/pkgconfig'\'','\'':'\''*~isempty(getenv('\''PKG_CONFIG_PATH'\'')),getenv('\''PKG_CONFIG_PATH'\'')]); setenv('\''MKL_NUM_THREADS'\'','\''14'\'')"'

That said ive made quite a few other changes to the code over the past few days, but this kind of makes sense that it might cause this (since I doubt MATLAB's build in version of libraries like libiomp5.so know how to deal with AVX-512), but thats just a guess on my part.

댓글을 달려면 로그인하십시오.

Major speed reduction (~50x) when doing multiple matrix multiply operations in a row. Is this a (major) bug? Any ideas for a fix?

댓글 수: 2
없음 표시없음 숨기기

답변 (1개)

댓글 수: 9
이전 댓글 7개 표시이전 댓글 7개 숨기기

참고 항목

카테고리

태그

제품

릴리스

Community Treasure Hunt

Major speed reduction (~50x) when doing multiple matrix multiply operations in a row. Is this a (major) bug? Any ideas for a fix?

댓글 수: 2 없음 표시없음 숨기기

답변 (1개)

댓글 수: 9 이전 댓글 7개 표시이전 댓글 7개 숨기기

참고 항목

카테고리

태그

제품

릴리스

Community Treasure Hunt

댓글 수: 2
없음 표시없음 숨기기

댓글 수: 9
이전 댓글 7개 표시이전 댓글 7개 숨기기