Bug #2541
closed
Added by Dmitry Lyumkis over 11 years ago.
Updated over 9 years ago.
Affected Version:
Appion/Leginon 2.1.0
Description
Over the last two days, I've noticed that a lot of Xmipp jobs have not been running. I've seen this in:
- xmipp protocol (run by Dipa)
- cl2d (run by me)
- xmipp_ml_tomo (run by Travis)
I think something is wrong with MPI, but I have not had a chance to track it down. We have a number of test scripts that we can use for troubleshooting. I think this should be treated as urgent, as a lot of people rely on these xmipp jobs.
Below is output from file xmipp.std in directory /ami/data00/appion/13sep27a/align/cl2d2:
[guppy-10:03937] plm:tm: failed to spawn daemon, error code = 15012
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
xmipp.std logs the xmipp CL2D run
- Status changed from New to In Code Review
- Status changed from In Code Review to Closed
- Status changed from Closed to In Test
- Assignee changed from Sargis Dallakyan to Arne Moeller
This seems to happen when guppy is heavily loaded. After some searching, I found the document linked below, which mentions that "it may be desirable to increase the pbs_tcp_timeout setting used by the pbs_mom daemon in MOM-to-MOM communication".
http://docs.adaptivecomputing.com/torque/help.htm#topics/12-appendices/otherConsiderations.htm
I have changed pbs_tcp_timeout in /home/export/src/torque-2.5.12/src/lib/Libifl/tcp_dis.c from 20 to 200, then rebuilt and reinstalled the binaries:
~/safadmin rpm -Uvh /home/export/src/torque-2.5.12/rpm/torque-client-2.5.12-1.cri.x86_64.rpm
~/safadmin rpm -Uvh --force /home/export/src/torque-2.5.12/rpm/torque-client-2.5.12-1.cri.x86_64.rpm
~/safadmin /etc/init.d/pbs_mom restart
Please let me know if this error happens again.
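To make it easier to tell whether the error comes back, a rough sketch of a log check is given here. It only searches for the "plm:tm: failed to spawn daemon" line quoted in the description. The function names has_spawn_failure and find_failed_runs are hypothetical helpers invented for this example, not part of Appion, Xmipp or Torque, and the sketch assumes that run logs are always named xmipp.std.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical and not shipped with Appion/Xmipp.

# Error signature copied from the xmipp.std output quoted in this report.
SPAWN_ERROR='plm:tm: failed to spawn daemon'

# Succeed (exit status 0) if the given log file contains the signature.
has_spawn_failure() {
    grep -q "$SPAWN_ERROR" "$1"
}

# Print every xmipp.std file under the given directory that contains the signature.
# Assumption: the run log is always named xmipp.std.
find_failed_runs() {
    find "$1" -type f -name 'xmipp.std' -exec grep -l "$SPAWN_ERROR" {} +
}
```

For example, calling find_failed_runs on an align directory would list the runs whose xmipp.std shows the daemon spawn failure, and an empty result would suggest the timeout change is holding.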
- Status changed from In Test to Closed