Distributed Computing with RamDisk

Yanir Hainick, 20 February 2017
Answered: Anthony Barone, 9 December 2017
Hello,
One of my projects involves heavy computational tasks which, luckily, can be parallelized. After optimizing the code for 'parfor' usage, I get a nice speed-up factor of ~Ncores/3. I'm almost sure the bottleneck preventing further gains is the following: although the code optimization reduced the amount of data sent to each worker, the calculation's output is large (100 MB–1 GB) and probably can't be dramatically reduced. So large overheads occur, stemming from data transfer (one indication of this is that the workers seem to be finished, but one processor still runs at full capacity for a long time before the parfor loop is exited). My questions:
1. Is my hypothesis correct? That is, is this due to parfor writing temporary data files to the hard disk?
2. If so, could a RamDisk prove beneficial (I have A LOT of RAM: 512 GB)? How do I 'tell' the distributed computing toolbox to use the RAM virtual drive?
Much appreciated, Yanir H.

Answers (2)

Anthony Barone, 9 December 2017
This may or may not be related and is more of a general "this might be worth looking into" suggestion than something specific, but...
In another reply you mention that you are using a 32-core machine, which means you are using a machine with NUMA (the largest UMA CPU I know of is the 28-core Skylake Xeon; Epyc chips can have 32 cores but are inherently NUMA, even though they only use a single socket). In my experience, MATLAB is completely blind to NUMA, and this can lead to a huge amount of overhead and slow things down quite a bit.
If I were you, I would try locking MATLAB to only the memory and CPU cores of a single NUMA node and see how fast it runs. If you have N nodes and it runs at 1/N of the speed you are currently getting, then NUMA isn't the problem. If it runs dramatically faster, then NUMA is (at least in part) to blame. On Linux this is easy to do by starting the MATLAB instance with "numactl" and setting both the CPU core and memory affinities to a single NUMA node; if it is a Windows or Mac server, I'm sure there are ways to do this as well. If the problem is NUMA, the easiest fix (if possible) is to break the data into N equal parts, run each on its own NUMA node, and then recombine them.
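On Linux, the pinning described above looks roughly like this (a sketch; node 0 is an arbitrary choice, and the `matlab` launch command may differ on your system):

```shell
# Inspect the machine's NUMA topology first
numactl --hardware

# Restrict MATLAB (and the workers it spawns) to the CPUs and memory of NUMA node 0
numactl --cpunodebind=0 --membind=0 matlab -nodesktop
```

If the single-node run is much faster than 1/N of the full-machine run, cross-node memory traffic is likely part of your overhead.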
For reference, I was using a machine with 2 NUMA nodes and 8 cores per node, and was only getting ~1.1x the performance on both nodes as on a single node. Assuming your 32 cores are a 4-node, 8-cores-per-node design, this is roughly in line with the Ncores/3 speed-up you were seeing.

Walter Roberson, 20 February 2017
Edited: 20 February 2017
Since you specifically said distributed computing rather than parallel computing, my understanding is that data is not written to disk but rather is transferred over TCP.
For parallel computing (same system) I do not have any information about whether more efficient transfer methods such as DMA are used.
I would tend to doubt DMA itself, directly, as that requires driver mode access. However, that would not rule out the use of shared memory segments, and a kernel implementation of those might use DMA. On the other hand, within a single system shared memory aligned on a page boundary can be transferred just by inserting the appropriate memory descriptor into the hardware virtual memory map.
I can say, though, that the programming model used between workers is the transfer of structured data, much like the serialization process used for writing to disk: an encapsulation that might use offsets but not addresses. The question is just whether that serialized data is always passed over TCP or by some other message-passing implementation, such as swapping control of buffers in shared memory. MATLAB is probably written not to care, handing the decision off to a layer that does the best it can.
Simply providing access into the virtual memory of the process is not done, and neither is the strategy of allocating a shared segment at a common address and having the C/C++ dynamic memory allocator work out of it so that the other process can use the very same pointers. I can say that because that strategy requires code written with a lot of attention to thread safety and a deliberate design for which process is responsible for deallocating the memory afterwards, but MATLAB lags a fair bit behind in thread safety, with memory allocation and deallocation between threads almost always causing problems.
Anyhow, in short, ramdisk probably will not help.
Comments (2)
Yanir Hainick, 20 February 2017
Thanks for the elaborate answer Walter!
I think I wasn't precise enough in my description; specifically, I made a mistake using the term 'distributed' instead of 'parallel' computing.
I use a strong server with 32 physical cores (64 logical) and a lot of RAM (512 GB, as I pointed out).
Next, I implement a 'parfor' loop instead of a serial 'for' loop. To that end, I made sure that as little data as possible is transferred to each core and as much work as possible is done by each core, but unavoidably the amount of data that is output is very large.
Does this change the answer?
Thanks, and apologies for the confusion...
Walter Roberson, 20 February 2017
I would continue to be certain that it is not writing to disk.
You should probably assume that MPI (Message Passing Interface) is being used, probably Open MPI.
"The principal MPI-1 model has no shared memory concept, and MPI-2 has only a limited distributed shared memory concept. Nonetheless, MPI programs are regularly run on shared memory computers, and both MPICH and Open MPI can use shared memory for message transfer if it is available. Designing programs around the MPI model (contrary to explicit shared memory models) has advantages over NUMA architectures since MPI encourages memory locality. Explicit shared memory programming was introduced in MPI-3."
MATLAB probably invokes (Open) MPI routines, which are responsible for doing the best they can on the target system, with the library having been compiled according to the operating system and hardware facilities.
RAMDISK is unlikely to help.
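One way to confirm that worker-to-client transfer, rather than disk I/O, dominates is to measure it: ticBytes/tocBytes (introduced in R2016b) report the bytes moved to and from each worker in a parfor run. A minimal sketch, where heavyComputation is a hypothetical stand-in for your actual loop body:

```matlab
pool = gcp;        % get (or start) the current parallel pool
ticBytes(pool);    % start counting bytes transferred to/from workers
results = cell(1, 8);
parfor k = 1:8
    results{k} = heavyComputation(k);   % hypothetical large-output task
end
tocBytes(pool)     % per-worker table of bytes sent and received
```

If the "BytesReceivedFromWorkers" column is in the gigabytes, the transfer itself, not disk, is where the tail-end wait is going.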
A relevant question is whether it is feasible to compress the data for transfer.
For example, if you use Fast Serialize/Deserialize from the File Exchange, apply a Java zip routine, transfer the result, unzip, and deserialize, it might be the case that you would reduce your transfer times enough to be worthwhile. This would tend to require large intermediate memory structures, but you have the memory.
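A sketch of that round-trip, assuming the undocumented builtins getByteStreamFromArray / getArrayFromByteStream (the routines Fast Serialize/Deserialize wraps) are available in your release; InterruptibleStreamCopier is a MathWorks-internal Java helper used by several File Exchange zip utilities, and myLargeResult is a hypothetical worker output:

```matlab
% --- serialize and compress (undocumented internals; may change between releases) ---
raw = getByteStreamFromArray(myLargeResult);     % array -> uint8 byte stream
bos = java.io.ByteArrayOutputStream();
zos = java.util.zip.DeflaterOutputStream(bos);   % zlib/deflate compressor
zos.write(raw);
zos.close();
compressed = typecast(bos.toByteArray(), 'uint8')';

% ... transfer 'compressed' between worker and client instead of the raw array ...

% --- decompress and deserialize on the receiving side ---
bis = java.io.ByteArrayInputStream(compressed);
zis = java.util.zip.InflaterInputStream(bis);
out = java.io.ByteArrayOutputStream();
isc = com.mathworks.mlwidgets.io.InterruptibleStreamCopier.getInterruptibleStreamCopier();
isc.copyStream(zis, out);                        % drain the inflater into 'out'
restored = getArrayFromByteStream(typecast(out.toByteArray(), 'uint8')');
```

Whether this wins depends on how compressible the output is: deflate on already-dense numeric data may cost more CPU than the transfer it saves, so time both paths.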

