Parallel computing with MDCS

I have some problems with MDCS.
After importing the appropriate configuration for a third-party scheduler, calling findResource, defining a job and a task, and submitting the job, I cannot read or retrieve the output.
Specifically:
- Working in the remote directory from which I submitted the job, getAllOutputArguments(job) returns an empty 1x0 cell array. Moreover, when I scp all files from this remote directory to my local directory and try to read the existing .mat files in a MATLAB session on my laptop (MacBook Pro), I get the error message
- To make things worse, when I start a new MATLAB session in the remote directory and use findJob to retrieve the job, I get the message <Undefined variable or function >
Can anyone help?
Thanks in advance,
Miltos
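For the findJob step, a minimal sketch follows. It uses the findResource interface from this thread and assumes the 'local' scheduler as a stand-in for the imported configuration; the job name 'Job6' is only illustrative. In a new session the scheduler object has to be recreated before findJob can be called, so a missing scheduler variable is one possible (unconfirmed) cause of an undefined-variable message.

```matlab
% Assumption: the 'local' scheduler stands in for the imported configuration.
sched = findResource('scheduler', 'type', 'local');

% List every job stored in this scheduler's data location.
allJobs = findJob(sched);
fprintf('Jobs found: %d\n', numel(allJobs));

% Look up a job by name ('Job6' is an illustrative name, not a required one).
job6 = findJob(sched, 'Name', 'Job6');
if isempty(job6)
    disp('No job named Job6 is stored for this scheduler.');
else
    fprintf('Job6 state: %s\n', get(job6(1), 'State'));
end
```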

Comments: 1

Miltos on 14 Dec 2012
I think I found what was happening after lots of trial and error!
Apparently job.PathDependencies needed the full path /home/username/filename, not just ~/filename (which I was using).


Accepted Answer

Jason Ross on 7 Dec 2012


What is the job state? Is it still running? You can use waitForState(job, 'finished') to wait for the job to complete, and job.State to get the current state of the job.
Can you validate the cluster successfully using the imported profile? (Parallel menu, configurations, and validate).
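The state check suggested here could look like the following sketch. It assumes the findResource interface used in this thread, with the 'local' scheduler standing in for the imported configuration; the @rand task and the 60-second timeout are arbitrary illustrative choices.

```matlab
% Assumption: the 'local' scheduler stands in for the imported configuration.
sched = findResource('scheduler', 'type', 'local');
job = createJob(sched);

% Illustrative task: return one 2-by-2 random matrix.
createTask(job, @rand, 1, {2});
submit(job);

% Wait at most 60 seconds; the return value is false if the timeout expired first.
finishedInTime = waitForState(job, 'finished', 60);

% Report the state the job is actually in.
fprintf('Job state: %s (finished within timeout: %d)\n', job.State, finishedInTime);
```

If the job stays in the queued or running state, reading outputs at that point would be premature.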

Comments: 10

Miltos on 7 Dec 2012
Hi Jason,
Many thanks. The job is finished (sorry for not clarifying it earlier). I haven't validated the imported profile because it is the profile given to me by the cluster administrator. I am trying to resolve some issues with display right now (as I have to use the Linux platform on the cluster), and so I cannot validate the profile using Parallel>Manage Configurations. Can I do it programmatically? I don't know of an .m function/script that does it.
Jason Ross on 7 Dec 2012
There isn't one. Validation basically runs a short trial of each type of job and collects the errors for you. If you can get the display sorted out, it's probably the best way to make sure that the basics are functional. It's possible that some issue with permissions or ssh setup is preventing the results from coming back.
Have you tried your code with the "local" scheduler (which should be present if you have Parallel Computing Toolbox on your machine)? Does it deliver results?
Miltos on 7 Dec 2012
Hi Jason, the code is correct. I have also tried the code with the "local" scheduler, and getAllOutputArguments again returns an empty 1x0 cell array. Thanks again for following up on this.
Jason Ross on 10 Dec 2012
Can you validate the local scheduler?
If you are getting an empty array both times, it seems something is going wrong somewhere along the way that's independent of the scheduler you select.
Miltos on 10 Dec 2012
Edited: Jason Ross on 10 Dec 2012
Hi Jason, I have resolved (some of) my display issues and have tried to validate both the local and the imported configuration. The first validation was successful, while the second failed at the Parallel Job stage. The output is:
Stage: Parallel Job
Status: Failed
Description: The given stage reached the default or user-specified timeout.
Command Line Output: (none)
Error Report: (none)
Debug Log:
Job Id: 1911494.blue30
Job_Name = Job6
Job_Owner = mm2a09@blue32
job_state = Q
queue = batch
server = blue30
Checkpoint = u
ctime = Mon Dec 10 19:09:27 2012
Error_Path = blue32:/home/mm2a09/Innovations/Job6.e1911494
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = a
mtime = Mon Dec 10 19:09:27 2012
Output_Path = blue32:/home/mm2a09/Innovations/matlab_temp/Job6/Job6.log
Priority = 0
qtime = Mon Dec 10 19:09:27 2012
Rerunable = True
Resource_List.nodect = 32
Resource_List.nodes = 32
Resource_List.pmem = 1840mb
Resource_List.walltime = 60:00:00
Variable_List = PBS_O_QUEUE=batch,PBS_O_HOST=blue32,
PBS_O_HOME=/home/mm2a09,PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=mm2a09,
PBS_O_PATH=/local/software/rh53/matlab/2011a/bin:/local/software/rh53
/matlab/2008b/bin:/usr/lib64/qt-3.3/bin:/local/software/rh53/acroread/
8.1.6/Adobe/Reader8/bin:/local/software/rh53/moab/6.1.9/sbin:/local/so
ftware/rh53/moab/6.1.9/bin:/usr/lpp/mmfs//bin:/local/software/torque/d
efault/sbin:/local/software/torque/default/bin:/usr/local/bin:/bin:/us
r/bin:/home/mm2a09/bin:/local/bin:.,PBS_O_MAIL=/var/spool/mail/mm2a09,
PBS_O_SHELL=/bin/bash,PBS_SERVER=blue30,
PBS_O_WORKDIR=/home/mm2a09/Innovations,
MDCE_DECODE_FUNCTION=decodePbsSimpleParallelTask,
MDCE_STORAGE_LOCATION=PC{}:UNIX{/home/mm2a09/Innovations/matlab_temp}
:,MDCE_STORAGE_CONSTRUCTOR=makeFileStorageObject,
MDCE_JOB_LOCATION=Job6,MDCE_CMR=/local/software/rh53/matlab/2011a,
MDCE_MATLAB_EXE=/local/software/rh53/matlab/2011a/bin/worker,
MDCE_MATLAB_ARGS= -parallel,MDCE_TOTAL_TASKS=32,
MDCE_REMSH=/usr/bin/rsh,MDCE_USE_ATTACH=off,MDCE_SCHED_TYPE=torque,
MDCE_DEBUG=true
etime = Mon Dec 10 19:09:27 2012
submit_args = -h -l nodes=32 -W x=GRES:MATLAB_Distrib_Comp_Engine+32 -l wa
lltime=60:00:00 -o /home/mm2a09/Innovations/matlab_temp/Job6/Job6.log
-N Job6 /home/mm2a09/Innovations/matlab_temp/Job6/pbsParallelWrapper.s
h
fault_tolerant = False
submit_host = blue32
init_work_dir = /home/mm2a09/Innovations
x = GRES:MATLAB_Distrib_Comp_Engine+32
In any case, I get an empty cell array from getAllOutputArguments!!
Alright. I think you have at least two things going on.
One is that your code has a problem that's independent of the cluster. If "local" validates successfully, then the code should run. I'd start by running your code with the local scheduler and looking for where things stop coming back properly. You might, for example, comment out part of the code and see where the answer falls over as you move down: start with half the code, then half again (kind of a Newton-Raphson approach), and see what's going on.
I'd expect that you have tried a trivial example, e.g.
spmd
matlabroot
end
The second thing is that your Torque setup doesn't validate. This could be a setup issue, or it could be that the cluster was busy and took longer than the timeout to process your job. I'd follow up with the sysadmin and ask whether the validation should work or not. It could be that they are OK with distributed jobs but not with parallel jobs, so they would not expect it to pass.
Miltos on 12 Dec 2012
Edited: Miltos on 12 Dec 2012
I am even more perplexed now:
I run my code in a MATLAB session on my laptop, and the code works fine.
I run my code in a MATLAB session on the local node of the remote cluster and the code works fine.
This tells me that my code works fine, but:
(as in my original post) if I run my code on the remote cluster with MDCS, getAllOutputArguments returns an empty cell array.
This tells me that there is an issue with MDCS in the cluster, but:
I run a trivial code (a function TEST(y) that gives output out=y+1) in the remote cluster with MDCS and getAllOutputArguments returns the correct output!
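A trivial test of this kind could be sketched as follows. The anonymous function stands in for TEST (an assumption, since the original function is not shown), and the 'local' scheduler stands in for the cluster configuration.

```matlab
% Assumption: the 'local' scheduler stands in for the cluster configuration.
sched = findResource('scheduler', 'type', 'local');
job = createJob(sched);

% Stand-in for TEST(y): one output, out = y + 1, called with y = 3.
createTask(job, @(y) y + 1, 1, {3});
submit(job);
waitForState(job, 'finished');

% One row per task, one column per requested output argument.
out = getAllOutputArguments(job);
disp(out);

% Remove the job's stored data once the result has been read.
destroy(job);
```

Because this task depends on nothing outside the function itself, a correct result here says little about path or file dependencies in the real code.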
Jason Ross on 12 Dec 2012
Edited: Jason Ross on 12 Dec 2012
Do you have any reliance on external data sources or resources? Examples might include
  • .mat files you load or save in your processing
  • Storing intermediate results in /tmp
  • Reliance on a path that doesn't exist for all cluster nodes
  • Some other customization that remains tied only to one node (path?)
That's the first thing that springs to mind.
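One way to look for dependency problems like those listed above is to read the error message stored on each task after the job finishes. The sketch below assumes the findResource interface from this thread and the 'local' scheduler; the task deliberately loads a .mat file that does not exist (the file name is hypothetical) so that there is an error to inspect.

```matlab
% Assumption: the 'local' scheduler stands in for the cluster configuration.
sched = findResource('scheduler', 'type', 'local');
job = createJob(sched);

% Hypothetical failing task: the named .mat file is assumed not to exist.
createTask(job, @load, 1, {'file_that_does_not_exist.mat'});
submit(job);
waitForState(job, 'finished');

% A job can reach the 'finished' state even when its tasks have errored,
% so check each task's ErrorMessage rather than relying on the job state.
tasks = get(job, 'Tasks');
for k = 1:numel(tasks)
    msg = tasks(k).ErrorMessage;
    if isempty(msg)
        fprintf('Task %d: no error recorded\n', k);
    else
        fprintf('Task %d failed: %s\n', k, msg);
    end
end

destroy(job);
```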
Miltos on 14 Dec 2012
Edited: Miltos on 14 Dec 2012
I think I found what was happening after lots of trial-and-error!
Apparently job.PathDependencies needed the full path /home/username/filename, not just ~/filename.
When I ran the trivial code, I entered PathDependencies directly from the Command Window and (for some reason unknown to me!) I added it with the full path, while in the script for my code I had job.PathDependencies={'~/filename'}.
Of course, when I was running the code from my laptop or the cluster login node, the script was not using job.PathDependencies and so it was running with no problems!
Conundrum solved!
Many thanks Jason for your help
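The fix described above can be sketched as follows: expand a leading ~/ to the home directory before assigning PathDependencies. This is a minimal illustration, assuming a Unix-like system where the HOME environment variable is set and using the 'local' scheduler as a stand-in; '~/filename' is the placeholder path from this thread.

```matlab
% Assumption: the 'local' scheduler stands in for the cluster configuration.
sched = findResource('scheduler', 'type', 'local');
job = createJob(sched);

relPath = '~/filename';   % placeholder path from the thread

% Replace a leading '~/' with the value of $HOME; leave other paths untouched.
if strncmp(relPath, '~/', 2)
    absPath = fullfile(getenv('HOME'), relPath(3:end));
else
    absPath = relPath;
end

% PathDependencies takes a cell array of strings.
set(job, 'PathDependencies', {absPath});
disp(get(job, 'PathDependencies'));
```

Building the path from HOME keeps the script portable between accounts, although the expanded path still has to exist on every worker node.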
Jason Ross on 14 Dec 2012
No problem, glad you were able to get it to work!


Asked: 7 Dec 2012
