Mapreduce does not seem to use all available cores
Hello,
I am using mapreduce on a machine with 16 cores. I create a pool with 15 workers (cores), which works fine. When I run mapreduce, though, it only utilizes one or two workers: sometimes one for the mapper and one for the reducer. This is how I check which worker is processing the data (in addition to watching CPU/core activity in a system monitor):
tk = getCurrentTask();
disp(tk.ID)
There are tens of files to process, and each mapper call receives one file to load and process. I expected that while the first mapper call is loading and processing the first file on one worker (core), parallel mapper calls would be processing the next files on other workers. That is not what happens: the mapper is called sequentially on the same worker. Sometimes a second worker runs the reducer calls, so at most two workers are used while 15 are available in the pool.
What would be a simple code to check if mapreduce is making use of all the available cores?
EDIT: Actually now I can confirm that the mapper is always run by a single worker, but the reducer may be run by a few different workers, as expected.
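For completeness, the check above lives inside the mapper; a minimal sketch of the kind of mapper I am running (the function name idLogMapFun is just for illustration):

```matlab
function idLogMapFun(data, info, intermKVStore)
    % Record which pool worker executes this mapper call.
    tk = getCurrentTask();       % returns [] when running serially
    if isempty(tk)
        workerID = 0;            % no pool: serial execution
    else
        workerID = tk.ID;        % index of the worker within the pool
    end
    disp(workerID)
    add(intermKVStore, 'workerID', workerID);   % also collect the IDs as output
end
```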
Your help is appreciated, Mehrdad
10 Comments
Rick Amos
11 November 2014
The steps you describe should be sufficient for checking how many cores are being used. From what you say, the ID was the same for every map call. That is not the expected behavior: you should see every ID from 1 to 15 at least once when there is enough data (and having more than 15 files in the input is one of several ways of ensuring this).
How big are the input files, and roughly how long does the mapreduce call take to complete?
Mehrdad Oveisi
11 November 2014
Rick Amos
12 November 2014
A few further questions: does the function handle for the mapper have attached data? For example, is it constructed with:
mapFunction = @(data, info, output) myFunction(data, info, output, someOtherConstantData);
If so, how large is this attached data? One way to check how much data is attached to a handle is with the following code:
f = functions(mapFunction);
whos f
f.workspace{:}
If this last line errors, there is no attached data. Otherwise, the data in each of the fields shown in the output is attached to the function handle, and the output of 'whos' gives a rough estimate of its size.
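As a concrete (made-up) illustration, a handle that captures a large array shows that array in its workspace:

```matlab
% Hypothetical example: myFunction and someOtherConstantData are
% placeholder names, not from the question itself.
someOtherConstantData = rand(1e6, 1);   % ~8 MB captured by the handle
mapFunction = @(data, info, out) myFunction(data, info, out, someOtherConstantData);

f = functions(mapFunction);
whos f            % reported size includes the captured variables
f.workspace{:}    % lists someOtherConstantData, so data IS attached
```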
Also for clarification, when you say tens of files, I assume you mean a number in the range 10 to 100? Or is the number of files far larger, for example 1000+ files or 10000+ files?
Mehrdad Oveisi
13 November 2014
Edited: Mehrdad Oveisi
13 November 2014
Rick Amos
13 November 2014
This example hits a separate limitation: the input data currently needs to be "large" to provide meaningful parallelism. As of R2014b, "large" means either single files larger than 32 MB or a collection of files. If you change workers_test to the following, you should see many IDs in the output:
function keyvalues = workers_test
files = repmat({'airlinesmall.csv'}, 1, 15);
ds = datastore(files,'TreatAsMissing','NA');
ds.SelectedVariableNames = 'Distance';
ds.RowsPerRead = 5000; % smaller values increase the number of mapper calls
preview(ds)
outds = mapreduce(ds, @MeanDistMapFun, @MeanDistReduceFun);
keyvalues = readall(outds);
end
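(MeanDistMapFun and MeanDistReduceFun are the functions from the shipping airline-data mapreduce example, each in its own file; roughly, they compute the mean of the Distance variable:)

```matlab
function MeanDistMapFun(data, info, intermKVStore)
    % Sum the non-missing distances in this chunk and count them.
    distances = data.Distance(~isnan(data.Distance));
    sumLenValue = [sum(distances), length(distances)];
    add(intermKVStore, 'sumAndLength', sumLenValue);
end

function MeanDistReduceFun(intermKey, intermValIter, outKVStore)
    % Combine the per-chunk [sum, count] pairs into an overall mean.
    sumLen = [0, 0];
    while hasnext(intermValIter)
        sumLen = sumLen + getnext(intermValIter);
    end
    add(outKVStore, 'MeanDistance', sumLen(1) / sumLen(2));
end
```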
Mehrdad Oveisi
13 November 2014
Rick Amos
14 November 2014
It is not possible as of R2014b to remove this assumption, or to create a datastore from mat files containing structs (without modifying the files). Both of these items are on our radar.
The workaround here is to split the filenames across several text files. For example, if you create 15 text files each listing 60 mat filenames, a datastore of this should result in parallelism at the map stage.
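A sketch of that workaround (the variable matFiles and the function names loadAndProcessMapFun / myReduceFun are illustrative, not from any shipped example):

```matlab
% matFiles is a cell array of the mat filenames to process.
nLists = 15;
for k = 1:nLists
    fid = fopen(sprintf('filelist_%02d.txt', k), 'w');
    fprintf(fid, '%s\n', matFiles{k:nLists:end});   % round-robin split
    fclose(fid);
end

% A datastore over the list files hands each mapper call a block of
% filenames; the mapper then loads and processes those mat files itself.
ds = datastore('filelist_*.txt', ...
    'ReadVariableNames', false, 'VariableNames', {'FileName'});
outds = mapreduce(ds, @loadAndProcessMapFun, @myReduceFun);
```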
Mehrdad Oveisi
14 November 2014
Edited: Mehrdad Oveisi
14 November 2014
Rick Amos
17 November 2014
Currently, the one very specific form of mat file that can be read by datastore is the output of another mapreduce call. An unofficial shortcut that creates such a mat file is the following code:
data.Key = {'Test'};
data.Value = {struct('a', 'Hello World!', 'b', 42)};
save('myMatFile.mat', '-struct', 'data');
ds = datastore('myMatFile.mat');
readall(ds)
Mehrdad Oveisi
19 November 2014