Problem with parallel configuration. Parallel job test validation failed!!
We want to set up a cluster of two PCs (Intel Core i5, 4 cores per machine). We are using MATLAB R2009b and Admin Center to create a job manager with 4 workers, one core per worker (2 workers per machine). The mdce service is installed on both machines with the default mdce_def. This part works fine.
The problems appear when we try to validate a parallel configuration that uses this job manager with a minimum and maximum of 4 workers: the parallel job test fails.
The validation generates several error lines in mdce-service.log in the log folder:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:job aborted using terminate/kill:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:process: node: exit code: error message:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPI_Comm_connect(119).....................: MPI_Comm_connect(port="tag=0 port=28351 description=lp-apd12 ifname=172.22.4.92 ", MPI_INFO_NULL, root=0, comm=0x84000000, newcomm=0000000001023A60) failed
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:0: localhost: 1: Fatal error in MPI_Comm_connect: Other MPI error, error stack:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPID_Comm_connect(187)....................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIDI_Comm_connect(405)...................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIC_Sendrecv(126)........................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPI_Comm_connect(119).....................: MPI_Comm_connect(port="tag=0 port=28351 description=lp-apd12 ifname=172.22.4.92 ", MPI_INFO_NULL, root=0, comm=0x84000000, newcomm=0000000001023A60) failed
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIC_Wait(270)............................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPID_Comm_connect(187)....................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIDI_Comm_connect(405)...................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIC_Sendrecv(126)........................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIDI_CH3I_Progress_handle_sock_event(420):
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIC_Wait(270)............................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIDI_CH3I_Progress_handle_sock_event(420):
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:err:Fatal error in MPI_Intercomm_merge: Other MPI error, error stack:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:err:MPI_Intercomm_merge(284): MPI_Intercomm_merge(comm=0xc4000005, high=1, newintracomm=0000000001023A68) failed
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:job aborted using terminate/kill:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIDU_Sock_wait(2603).....................: The specified network name is no longer available. (errno 64)
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:process: node: exit code: error message:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:0: localhost: 1: Fatal error in MPI_Intercomm_merge: Other MPI error, error stack:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:MPI_Intercomm_merge(284): MPI_Intercomm_merge(comm=0xc4000005, high=1, newintracomm=0000000001023A68) failed
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:MPI_Intercomm_merge(262): Too many communicators
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIDU_Sock_wait(2603).....................: The specified network name is no longer available. (errno 64)
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:err:MPI_Intercomm_merge(262): Too many communicators
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-17:out:Warning: Unrecognized MATLAB option "cp".
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-17:out:Warning: Unrecognized MATLAB option "nodisplay".
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-17:out:Warning: Unrecognized MATLAB option "Djava.security.policy=C:\Program Files\MATLAB\R2009b\toolbox\distcomp\config\jsk-all.policy".
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-18:out:Warning: Unrecognized MATLAB option "cp".
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-18:out:Warning: Unrecognized MATLAB option "nodisplay".
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-18:out:Warning: Unrecognized MATLAB option "Djava.security.policy=C:\Program Files\MATLAB\R2009b\toolbox\distcomp\config\jsk-all.policy".
INFO | jvm 1 | 2011/08/18 16:09:05 | Thu Aug 18 16:09:05 CEST 2011:Group-18:out:Warning: Unable to locate a personal folder for $documents\MATLAB
INFO | jvm 1 | 2011/08/18 16:09:05 | Thu Aug 18 16:09:05 CEST 2011:Group-18:out:{Warning: Userpath must be an absolute path and must exist on disk.}
INFO | jvm 1 | 2011/08/18 16:09:05 | Thu Aug 18 16:09:05 CEST 2011:Group-17:out:Warning: Unable to locate a personal folder for $documents\MATLAB
INFO | jvm 1 | 2011/08/18 16:09:05 | Thu Aug 18 16:09:05 CEST 2011:Group-17:out:{Warning: Userpath must be an absolute path and must exist on disk.}
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-17:out:
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-17:out: To get started, type one of these: helpwin, helpdesk, or demo.
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-17:out: For product information, visit www.mathworks.com.
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-17:out:
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-18:out:
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-18:out: To get started, type one of these: helpwin, helpdesk, or demo.
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-18:out: For product information, visit www.mathworks.com.
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-18:out:
INFO | jvm 1 | 2011/08/18 16:09:07 | Thu Aug 18 16:09:07 CEST 2011:Group-17:out:» Thu Aug 18 16:09:07 CEST 2011 Worker started: pc-goba_worker02
INFO | jvm 1 | 2011/08/18 16:09:08 | Thu Aug 18 16:09:07 CEST 2011:Group-18:out:» Thu Aug 18 16:09:07 CEST 2011 Worker started: pc-goba_worker01
Thanks
Accepted Answer
Jason Ross
23 Aug 2011
It looks like your hosts can't resolve their IP addresses correctly. Check the networking setup very closely and make sure:
Hosts can ping each other by short name (yourhostname)
Hosts can ping each other by fully qualified name (yourhostname.yourdomain.com)
(you'll need to do this for both hosts in the cluster)
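The name checks above can also be scripted. The following is a minimal sketch, not part of the original answer, that reports whether each name resolves from the machine it runs on. It only tests name resolution, so it complements ping rather than replacing it. `check_names` is a hypothetical helper, and the `resolver` parameter exists only so the function can be tried without a network.

```python
import socket


def check_names(names, resolver=socket.gethostbyname):
    """Map each host name to its resolved IPv4 address, or None on failure.

    `resolver` is injectable so the function can be exercised offline.
    """
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError:  # socket.gaierror and socket.herror derive from OSError
            results[name] = None
    return results
```

Calling `check_names(["yourhostname", "yourhostname.yourdomain.com"])` on each machine, with the placeholder names replaced by the real ones, should return an address for every entry. A `None` value points at the name that fails to resolve.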
One of the most common things I've seen is that the DNS search order doesn't include the DNS domain of the host itself. For example, the fully qualified hostname is
myhost.desktops.mycorp.com
and the DNS search order is mycorp.com
So the host can't resolve "myhost" and then you get odd networking problems where things can't connect reliably. You can see what these settings are by running "ipconfig /all" at a command prompt, or by looking at the properties on the network connection.
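To make the search-order problem concrete, here is a small sketch under a simplifying assumption: a short name is resolved only by appending each search suffix in turn. Real resolvers may apply additional rules, so treat this as an illustration of the mismatch, not as a model of Windows DNS. The function names are hypothetical.

```python
def candidates(short_name, search_list):
    """Names tried for a short name: one per DNS search suffix, in order."""
    return [f"{short_name}.{suffix}" for suffix in search_list]


def search_list_covers(fqdn, search_list):
    """True if some search suffix turns the host's short name into its FQDN."""
    short_name = fqdn.split(".", 1)[0]
    wanted = fqdn.lower().rstrip(".")
    return any(c.lower() == wanted for c in candidates(short_name, search_list))
```

With the example from this answer, `search_list_covers("myhost.desktops.mycorp.com", ["mycorp.com"])` is `False`, because the only name tried is `myhost.mycorp.com`. Adding `desktops.mycorp.com` to the search list makes it `True`.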
I think Java is just reporting the error and is itself working OK.
More Answers (3)
Jason Ross
18 Aug 2011
In Admin Center, if you run the connectivity test (Hosts > Test Connectivity) are there any errors or warnings?
Comments: 1
Thomas O'Donnell
29 May 2013
Where is the Admin Center? I would like to view the health of my MATLAB R2012a parallel server.
Gonzalo Blanco
18 Aug 2011
Comments: 3
Jason Ross
19 Aug 2011
Is the firewall off on all of the machines in the cluster?
Do the errors/warnings persist in the Admin Center?
Are there other things running which might also be blocking communication? Virus scanners, malware scanners, etc -- they might block this kind of thing as "suspicious activity"
Jason Ross
19 Aug 2011
Other things you might want to look for:
From the "The specified network name is no longer available. (errno 64)" error message -- check that every host has correct forward and reverse DNS lookups in place, and that your DNS is reliable. Check the error logs on the host to see if something is going on here.
Check your system PATH to see if there are other MATLAB installs on the path. The error stack that starts with "Unrecognized MATLAB option "cp"." and then continues on with "nodisplay", "Djava.security.policy" and so on makes it look like something is starting MATLAB in a way that's not expected. If you haven't set ClusterMatlabRoot to the installation of MATLAB and are using "on path", you might want to try setting it to the MATLAB installation you want to use.
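The PATH check can be done by hand, but a short script makes it repeatable on every host. This is a minimal sketch, not something from MATLAB itself: `matlab_dirs_on_path` is a hypothetical helper, and the launcher file names `matlab.exe` and `matlab` are assumptions about what to look for.

```python
import os


def matlab_dirs_on_path(path_value=None, exe_names=("matlab.exe", "matlab")):
    """Return PATH directories, in search order, that contain a MATLAB launcher."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    found = []
    for entry in path_value.split(os.pathsep):
        entry = entry.strip('"')  # PATH entries are sometimes quoted on Windows
        if not entry or entry in found:
            continue
        if any(os.path.isfile(os.path.join(entry, exe)) for exe in exe_names):
            found.append(entry)
    return found
```

If `matlab_dirs_on_path()` returns more than one directory, the first entry is the installation that an "on path" launch would pick up, which may not be the one you intended.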
Comments: 7
Alexandre Malotchko
6 Apr 2016
On R2015b, the message "does not respond to java.net.InetAddress.isReachable()" means that all machines involved need to have the ECHO service running on port 7. On Windows 7, for instance, you need to install the Microsoft Simple TCP/IP Services feature and configure the firewalls to allow port 7 traffic.
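One way to confirm that the ECHO service is reachable through the firewall is to connect to port 7 and check that the bytes come back. The sketch below assumes a TCP echo service. `echo_service_responds` is a hypothetical helper, and the timeout and payload are arbitrary choices.

```python
import socket


def echo_service_responds(host, port=7, timeout=3.0, payload=b"ping"):
    """Return True if a TCP service at host:port echoes `payload` back unchanged."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(payload)
            received = b""
            while len(received) < len(payload):
                chunk = conn.recv(len(payload) - len(received))
                if not chunk:  # peer closed the connection early
                    break
                received += chunk
            return received == payload
    except OSError:  # refused, timed out, unreachable, or name not resolved
        return False
```

Running `echo_service_responds("otherhost")` from each machine against the other (with `otherhost` replaced by the real name) returns `False` when the service is not installed or the firewall blocks port 7.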