Mapreduce on parallel cluster - Database or disk full - How control storage of intermediate files?

Question

0 개 추천

I wish to calculate several statistics (Spectra, Correlation Functions, etc.) of ~400 files with 6e6 doubles per file and afterwards average over all files to get average spectra, correlation functions, etc. To make things fast, I try to use mapreduce on a parallel cluster. This works like a charm as long as there are relatively few files (~100), but with a larger amount of files I get this error message:

Error using parallel.mapreduce.KeyValueOutputStore/addmulti (line 63)
Error in adding keys and values.
Error in Analysis20190708>Analysis (line 115)
    addmulti(intermKVStore, {'StatNames'}, {Stats});
Error in parallel.internal.pool.deserialize>@(data,info,intermKVStore)Analysis(data,Parameters,info,intermKVStore)
Error in mapreduce (line 116)
    outds = execMapReduce(mrcer, ds, mapfun, reducefun, parsedStruct);
Error in Analysis20190708 (line 72)
outDS = mapreduce(ds, mapper, @reduceAnalysis,inpool);
Caused by:
    The database /tmp/filename/TaskOutput7.db is full. (database or disk is full)

The message occurs after around 50% of the map phase is done, but later when I reduce the size of the result vectors (less frequently sampled spectra for example). I checked with an admin and the /tmp indeed has very limited free space.

The question is now: How do I tell MATLAB to store these intermediate(?) files to a different location with more storage