Problem with text output during parallel computation
I have been working for a year now to solve a puzzling problem with text output produced during parallel computation with Matlab.
I am using the parpool function from the Parallel Computing Toolbox (Matlab v. 2020b) to run a set of jobs at my university’s HPC system. The computation is a standard controlled search for optimization, where the goal is to find a solution vector xMin that minimizes an objective function S = f(x), where x is an arbitrary candidate solution, and S is the misfit value for that candidate solution. Parallel computing is great for this problem given that the objective function can be calculated independently for each candidate solution.
I have set up the computation in a way to avoid any obvious clashes between the workers and the master. I start Matlab on one of the 28 cores available on a single node, and I start a master program on that first core that initializes the parallel pool with a set of 27 jobs, using Matlab’s parpool function. The idea is to ensure that the master program and the job programs are each isolated to their own core.
The master then starts the 27 jobs with a different solution vector for each job. The master scans for finished jobs, and processes each one independently, which involves invoking fprintf to write a line of text (about 200 characters long) to a text file. The text states the solution vector x and associated objective value S for that job. At this point, the master program replaces the finished job with a new job. All of this is designed to occur sequentially within the master program, so there should be no collisions.
The problem is that there are rare instances where the text output fails. This is generally isolated to a single solution record and is marked by a long string of null characters, as illustrated by the following schematic example (null characters are replaced here by “?”):
112 58.1453 58.1453 152 11.4779 38.8290
114 13.4881 13.4881 30 19.1352 118.670
This error shows up at a rate of about 1 out of 5,000 solutions, which makes it very hard to isolate. I have consulted with our IT people and they have indicated that they have not seen this error with others who are running on our HPC. The string of nulls is not necessarily specific to a single solution record. For example, the nulls might start towards the end of the previous solution record. In addition, the number of nulls in each string can vary.
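Because the bad records are rare, it can help to scan each finished output file for null bytes automatically rather than by eye. The following is only a minimal sketch: the file name is hypothetical, and a small test file with a simulated string of nulls is created so the snippet runs on its own. For a real run, point fname at the actual output file and skip the first part.

```matlab
% Hypothetical file name; a small test file is written here so the
% sketch is self-contained. Replace with the real output file.
fname = fullfile(tempdir, 'null_scan_demo.txt');
fid = fopen(fname, 'w');
assert(fid ~= -1, 'Could not open %s for writing', fname);
fprintf(fid, '112 58.1453 58.1453\n');
fwrite(fid, zeros(1, 20, 'uint8'));      % simulate a run of null characters
fprintf(fid, '\n114 13.4881 13.4881\n');
fclose(fid);

% Read the file as raw bytes so that null characters are preserved.
fid = fopen(fname, 'r');
assert(fid ~= -1, 'Could not open %s for reading', fname);
bytes = fread(fid, Inf, 'uint8=>uint8')';   % row vector of bytes
fclose(fid);

% Locate null bytes and the text lines they fall on.
nullIdx    = find(bytes == 0);
lineOfByte = 1 + cumsum(bytes == 10);       % 1 + number of newlines up to each byte
nullLines  = unique(lineOfByte(nullIdx));

fprintf('%d null byte(s) found on line(s): %s\n', ...
        numel(nullIdx), mat2str(nullLines));
```

Recording the byte offsets in nullIdx, and not only the line numbers, may also show whether the strings of nulls start or end at a regular boundary, which would be one more piece of evidence to pass on to the HPC staff.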
I have tried to fix this problem by using the buffering option provided by fopen. More specifically, when the text file is first opened with fopen, one can use the -w option, which forces a “write to file” with each call of fprintf, and -W, which sets up a 4 kB buffer so that the file writes occur less frequently. Neither of these attempts has solved the problem.
My guess is that the write process in fprintf is suffering from some kind of timing problem. All of the computations are done using the default “multithreaded” mode, and maybe that mode is a factor.
I am hoping that others may have seen this problem, and might be able to provide evidence and/or ideas to fix it.
Raymond Norris, 24 Feb 2021
I'm gathering from above that none of the jobs fail -- that if you were to look at the output before it's written, you would see job 113, correct? It's not that the output is returning the null characters.
Looking at the doc for both fopen and fprintf, I don't see a -w switch. For fopen, I only see an option to opt out of automatic flushing:
'W' Open file for writing without automatic flushing of the current output buffer.
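To make the distinction concrete, here is a short sketch of how these permission strings are passed to fopen. The file name is hypothetical. Lowercase 'w' and 'a' open the file with automatic flushing, while uppercase 'W' and 'A' open it without automatic flushing.

```matlab
fname = fullfile(tempdir, 'fopen_flush_demo.txt');   % hypothetical file name

% 'w': open for writing (existing contents discarded), automatic flushing on.
fid = fopen(fname, 'w');
assert(fid ~= -1, 'Could not open %s', fname);
fprintf(fid, 'record %d\n', 1);
fclose(fid);

% 'A': open for appending without automatic flushing of the output buffer.
fid = fopen(fname, 'A');
assert(fid ~= -1, 'Could not open %s', fname);
fprintf(fid, 'record %d\n', 2);
fclose(fid);          % closing the file writes out any buffered output

disp(fileread(fname));
```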
By default, each worker is started in single-threaded mode, not multi-threaded.
I'm curious about something. MATLAB starts on Core #1 and then starts a parallel pool of 27 workers. Then each worker is given a "job". When the job finishes, MATLAB then retrieves the results and writes to the file. Then MATLAB submits another job. What parallel construct are you using to do this? parfor? parfeval? How do you ensure that each iteration is getting printed out in order?
Down the road, something to consider is running a larger pool by using the MATLAB Parallel Server, rather than just a local pool with PCT. It'd depend on how long it takes to run your sims.
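For reference, if the construct is parfeval, the workflow described in the question might look roughly like the sketch below. This is an assumption for illustration, not the poster's actual code: objectiveFcn, nParams, nTotal, the random candidate generation, and the output file name are all placeholders. Only the client calls fprintf, and records are written in completion order, tagged with the submission index.

```matlab
% Placeholders (assumptions, not from the original post).
objectiveFcn = @(x) sum(x.^2);     % stand-in for the real misfit S = f(x)
nParams = 6;                       % length of each candidate vector
nTotal  = 100;                     % total number of candidates to evaluate
outName = fullfile(tempdir, 'solutions_demo.txt');

pool = gcp;                        % current pool (started with the default profile if none exists)
fid  = fopen(outName, 'w');
assert(fid ~= -1, 'Could not open %s', outName);

% Fill the pool: one candidate per worker to start with.
nInitial = min(pool.NumWorkers, nTotal);
X = zeros(nTotal, nParams);
for k = 1:nInitial
    X(k, :)    = rand(1, nParams);
    futures(k) = parfeval(pool, objectiveFcn, 1, X(k, :));
end
nSubmitted = nInitial;

% Collect results one at a time; all file output happens on the client.
for nDone = 1:nTotal
    [idx, S] = fetchNext(futures);           % blocks until some job finishes
    fprintf(fid, '%d %s%.6g\n', idx, sprintf('%.6g ', X(idx, :)), S);

    % Replace the finished job with a new candidate, if any remain.
    if nSubmitted < nTotal
        nSubmitted = nSubmitted + 1;
        X(nSubmitted, :)    = rand(1, nParams);
        futures(nSubmitted) = parfeval(pool, objectiveFcn, 1, X(nSubmitted, :));
    end
end

fclose(fid);
```

With this pattern the workers never touch the file, so if nulls still appear, the cause is more likely to lie in the file system or in the buffering on the client side than in a collision between workers.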