Mapreduce does not seem to use all available cores

Mehrdad Oveisi on 10 Nov 2014
Answered: Rick Amos on 24 Nov 2014
Hello,
I am using mapreduce on a machine with 16 cores. I make a pool with 15 workers (cores) which works fine. When I run mapreduce though, it only utilizes one or two workers: sometimes one for the mapper and one for the reducer. This is how I check which worker is processing the data (in addition to using a system monitor to watch CPU/core activities):
tk=getCurrentTask();
disp(tk.ID)
There are tens of files to be processed and each mapper is called with one file to process. Each time a mapper is called it loads and processes one file. I expect that during the first call to the mapper and while it is loading and processing the first file on one worker (core), there are other parallel calls to mapper to process the next files on other workers. However, this is not how it happens; it just sequentially calls the mapper on the same worker. Sometimes it uses a second worker for the reducer calls. So at most it uses two workers, while there are 15 available in the pool.
What would be a simple piece of code to check whether mapreduce is making use of all the available cores?
EDIT: Actually now I can confirm that the mapper is always run by a single worker, but the reducer may be run by a few different workers, as expected.
Your help is appreciated, Mehrdad
  10 Comments
Rick Amos on 17 Nov 2014
Currently, the only form of MAT-file that datastore can read is the output of another mapreduce call. An unofficial shortcut for creating such a MAT-file is the following code:
data.Key = {'Test'};
data.Value = {struct('a', 'Hello World!', 'b', 42)};
save('myMatFile.mat', '-struct', 'data');
ds = datastore('myMatFile.mat');
readall(ds)
Mehrdad Oveisi on 19 Nov 2014
Thank you Rick! I found your reply here useful. So I thought it's good to have a separate thread for this tip.


Accepted Answer

Rick Amos on 24 Nov 2014
In R2014b, there are some limitations with the minimum size of data that can be parallelized. To avoid this limitation, the input datastore must contain at least one of the following:
  1. Multiple files, where each file will be handled in parallel.
  2. Files that are larger than 32 MB, where each 32 MB chunk will be handled in parallel.
If the input datastore contains a single small file, you will need to find a way to split that file into multiple files. For example, if the input datastore contains a single file listing many filenames (pointing to the actual data), you can split it into many files, each containing one or a small number of filenames, to ensure parallelism.
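As a rough sketch of the splitting idea above, and of one way to check which workers end up running the mapper, something like the following might work. The names allFiles.txt, the splits folder, workerIdMapper and workerIdReducer are hypothetical and not from this thread. The sketch assumes one data filename per line in the list file, an open parallel pool, and that each function is saved in its own file on the path.

```matlab
% --- script: split one list file into many small files (hypothetical names) ---
txt   = fileread('allFiles.txt');              % assumed: one filename per line
names = strtrim(regexp(txt, '\r?\n', 'split'));
names = names(~cellfun(@isempty, names));      % drop blank lines

if ~exist('splits', 'dir'), mkdir('splits'); end
for k = 1:numel(names)
    fid = fopen(fullfile('splits', sprintf('part%04d.txt', k)), 'w');
    fprintf(fid, '%s\n', names{k});            % one filename per split file
    fclose(fid);
end

% Datastore over the many small files, so each file can go to its own worker.
ds = datastore(fullfile('splits', '*.txt'), 'ReadVariableNames', false);

% Run on the current pool and list the worker IDs that ran the mapper.
result = mapreduce(ds, @workerIdMapper, @workerIdReducer, mapreducer(gcp));
readall(result)

% --- workerIdMapper.m: records which worker handled each chunk ---
function workerIdMapper(data, info, intermKVStore)
    t = getCurrentTask();                      % empty when run on the client
    if isempty(t), id = 0; else, id = t.ID; end
    add(intermKVStore, 'workerID', id);
end

% --- workerIdReducer.m: collects the distinct worker IDs ---
function workerIdReducer(intermKey, intermValIter, outKVStore)
    ids = [];
    while hasnext(intermValIter)
        ids(end+1) = getnext(intermValIter);   %#ok<AGROW>
    end
    add(outKVStore, 'uniqueWorkerIDs', unique(ids));
end
```

If the value stored under uniqueWorkerIDs contains only one ID, the mapper is still being run by a single worker. A real mapper would read the filename from data and load the corresponding file instead of ignoring its input.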

More Answers (0)
