Mapreduce does not seem to use all available cores

Question

Mehrdad Oveisi on 10 Nov 2014

https://kr.mathworks.com/matlabcentral/answers/162140-mapreduce-does-not-seem-to-use-all-available-cores

Answered: Rick Amos on 24 Nov 2014

Hello,

I am using mapreduce on a machine with 16 cores. I make a pool with 15 workers (cores) which works fine. When I run mapreduce though, it only utilizes one or two workers: sometimes one for the mapper and one for the reducer. This is how I check which worker is processing the data (in addition to using a system monitor to watch CPU/core activities):

tk=getCurrentTask();
disp(tk.ID)

There are tens of files to be processed and each mapper is called with one file to process. Each time a mapper is called it loads and processes one file. I expect that during the first call to the mapper and while it is loading and processing the first file on one worker (core), there are other parallel calls to mapper to process the next files on other workers. However, this is not how it happens; it just sequentially calls the mapper on the same worker. Sometimes it uses a second worker for the reducer calls. So at most it uses two workers, while there are 15 available in the pool.

What would be a simple piece of code to check whether mapreduce is making use of all the available cores?

EDIT: Actually now I can confirm that the mapper is always run by a single worker, but the reducer may be run by a few different workers, as expected.

Your help is appreciated, Mehrdad

Comments: 10

Mehrdad Oveisi on 13 Nov 2014

Edited: Mehrdad Oveisi on 13 Nov 2014

workers_test.m

Actually I have now come up with a simple example code to illustrate this problem (changing the example presented in Getting Started with MapReduce). Running the following code (also attached) on my system shows that there is only one worker for the mapper function. Note the single value 9 for the key 'MapperTaskID' in the output.

Output:

            Key           Value  
      _______________    ________
      'ReducerTaskID'    [     9]
      'Mean'             [702.16]
      'ReducerTaskID'    [     7]
      'MapperTaskID'     [     9]
      'MapperTaskID'     [     9]
      'MapperTaskID'     [     9]
      'MapperTaskID'     [     9]
      'MapperTaskID'     [     9]
      'MapperTaskID'     [     9]
      ...

The testing code:

function keyvalues = workers_test
    ds = datastore('airlinesmall.csv','TreatAsMissing','NA');
    ds.SelectedVariableNames = 'Distance';
    ds.RowsPerRead = 5000; % smaller values increase the number of mapper calls
    preview(ds)
    outds = mapreduce(ds, @MeanDistMapFun, @MeanDistReduceFun);
    keyvalues = readall(outds);
end

function MeanDistMapFun(data, info, intermKVStore)
    % Record the ID of the task running this mapper call
    tk = getCurrentTask();
    add(intermKVStore, 'MapperTaskID', tk.ID);
    % Emit the partial sum and count of the non-missing distances
    distances = data.Distance(~isnan(data.Distance));
    sumLenValue = [sum(distances)  length(distances)];
    add(intermKVStore, 'sumAndLength', sumLenValue);
end

function MeanDistReduceFun(intermKey, intermValIter, outKVStore)
    % Record the ID of the task running this reducer call
    tk = getCurrentTask();
    add(outKVStore, 'ReducerTaskID', tk.ID);
    if strcmp(intermKey, 'MapperTaskID')
        while hasnext(intermValIter)  % pass the same key/values along
            add(outKVStore, intermKey, getnext(intermValIter));
        end
        return
    end
    % Combine the partial sums and counts into the overall mean
    sumLen = [0 0];
    while hasnext(intermValIter)
        sumLen = sumLen + getnext(intermValIter);
    end
    add(outKVStore, 'Mean', sumLen(1)/sumLen(2));
end
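A minimal sketch of how the output above could be summarized, assuming keyvalues is the table returned by readall(outds) with a cell-array Key column and a cell-array Value column, as the printed output suggests. The sample table below is made up for illustration so that the snippet runs on its own; it is not real output.

```matlab
% Hypothetical stand-in for keyvalues = workers_test; (values are made up)
Key   = {'ReducerTaskID'; 'Mean'; 'ReducerTaskID'; ...
         'MapperTaskID'; 'MapperTaskID'; 'MapperTaskID'};
Value = {9; 702.16; 7; 9; 9; 9};
keyvalues = table(Key, Value);

% Select the rows recorded by the mapper and list the distinct task IDs
isMapper  = strcmp(keyvalues.Key, 'MapperTaskID');
mapperIDs = unique(cell2mat(keyvalues.Value(isMapper)));

fprintf('Mapper calls: %d, distinct mapper task IDs: %d\n', ...
    nnz(isMapper), numel(mapperIDs));
disp(mapperIDs)
```

If only one distinct ID is listed, every mapper call ran in the same task, which is the behavior described above.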

Mehrdad Oveisi on 13 Nov 2014

> This example hits a separate limitation that the input data currently needs to be "large" to provide meaningful parallelism.

I guess this limitation is behind the problem I am having. I have about 600 files to process. They average about 40 MB each (ranging from 5 MB to a maximum of 130 MB). All of them are .mat files containing exactly four structs, which hold the data, metadata, etc. So the actual "data" table in each file is inside a struct in that file. I wasn't sure whether it is possible to create datastores directly from tables that sit inside structs in the files. So instead I pass the datastore a text file containing the 600 .mat filenames as input (and set ds.RowsPerRead=1 to go through the filenames one by one).

Then as I mentioned in the original post "each time a mapper is called it loads and processes one file."

Given the limitation you are mentioning, since the input to the mapper is just a filename, it will not provide parallelism.

Is there any setting to change the assumption that a small input requires only a small amount of processing?
Or is there any way to make a datastore of tables that are inside structs in the input files?

Rick Amos on 17 Nov 2014

Currently, the one very specific form of MAT-file that datastore can read is the output of another mapreduce call. An unofficial shortcut that creates such a MAT-file is the following code:

data.Key = {'Test'};
data.Value = {struct('a', 'Hello World!', 'b', 42)};
save('myMatFile.mat', '-struct', 'data');
ds = datastore('myMatFile.mat');
readall(ds)
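A sketch of how the same unofficial shortcut might be applied to an existing MAT-file whose variables are structs. It assumes the Key/Value layout generalizes to one row per variable (N-by-1 cell arrays); that generalization is an assumption and is not confirmed in this thread. The file names and sample variables are hypothetical.

```matlab
% Create a small hypothetical source file holding two struct variables
srcFile = [tempname '.mat'];
meta = struct('subject', 'S01');     % made-up metadata
rec  = struct('x', [1 2 3]);         % made-up data
save(srcFile, 'meta', 'rec');

% Repack: one key per variable name, one value per variable
s = load(srcFile);                   % struct with one field per saved variable
data.Key   = fieldnames(s);          % N-by-1 cell array of variable names
data.Value = struct2cell(s);         % N-by-1 cell array of the variables
dstFile = [tempname '.mat'];
save(dstFile, '-struct', 'data');    % file now holds the variables Key and Value

% Assumption: dstFile can then be passed to datastore as in the snippet above
```

Whether a multi-row file built this way is accepted by datastore should be checked on the release in use, since the shortcut itself is described as unofficial.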

Mehrdad Oveisi on 19 Nov 2014

Thank you Rick! I found your reply here useful. So I thought it's good to have a separate thread for this tip.


Answer 1

Rick Amos on 24 Nov 2014

https://kr.mathworks.com/matlabcentral/answers/162140-mapreduce-does-not-seem-to-use-all-available-cores#answer_160012

In R2014b, there are some limitations with the minimum size of data that can be parallelized. To avoid this limitation, the input datastore must contain at least one of the following:

Multiple files, where each file will be handled in parallel.
Files that are larger than 32 MB, where each 32 MB will be handled in parallel.

If the input datastore contains a single small file, you will need to find a way to split that file into multiple files. For example, if the input datastore contains a single file listing many filenames (to the actual data), you can split this up into many files each containing a single or small number of filenames to ensure parallelism.
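The splitting step described above can be sketched as follows. This is only an illustration: the list file, its made-up contents, the output folder, and the part_NNNN.txt naming are all hypothetical, and namesPerPart is a parameter introduced here.

```matlab
% Hypothetical input: a text file listing one data filename per line
workDir  = tempname;  mkdir(workDir);
listFile = fullfile(workDir, 'filelist.txt');
fid = fopen(listFile, 'w');
fprintf(fid, 'file_%03d.mat\n', 1:5);    % made-up filenames
fclose(fid);

% Read all filenames into a cell array
names = {};
fid = fopen(listFile, 'r');
ln = fgetl(fid);
while ischar(ln)
    names{end+1} = ln; %#ok<SAGROW>
    ln = fgetl(fid);
end
fclose(fid);

% Write the names back out, namesPerPart per file, into a separate folder
namesPerPart = 1;
outDir = fullfile(workDir, 'parts');  mkdir(outDir);
numParts = ceil(numel(names) / namesPerPart);
for k = 1:numParts
    idx = (k-1)*namesPerPart + 1 : min(k*namesPerPart, numel(names));
    fid = fopen(fullfile(outDir, sprintf('part_%04d.txt', k)), 'w');
    fprintf(fid, '%s\n', names{idx});
    fclose(fid);
end
```

A datastore created over the files in outDir would then contain multiple files, which is the first of the two conditions listed above.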
