Bug #2894

maximum likelihood jobs won't run

Added by Melody Campbell over 10 years ago. Updated almost 10 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 08/14/2014
Due date:
% Done: 0%
Estimated time:
Affected Version: Appion/Leginon 3.1.0
Show in known bugs: No
Workaround:

Description

Both Yong Zi and I have encountered this issue with our most recent maximum likelihood jobs:

Using 32 processors!
... Running on host: guppy-10
Xmipp: /usr/bin/mpirun np 32 /opt/Xmipp/bin/xmipp_mpi_ml_align2d -i /ami/data15/appion/14jul24d/align/maxlike9/partlist.sel -nref 200 -iter 15 -o /ami/data15/appion/14jul24d/align/maxlike9/part14aug14o00 -psi_step 5 -fast -eps 5e-5 -mirror
[guppy-10:22681] plm:tm: failed to spawn daemon, error code = 17000
-------------------------------------------------------------------------

A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
... Alignment time: 0.5 sec

self.params['maxlikejobid'] 1
lines= ['\tlibmpi.so.1 => /usr/lib64/libmpi.so.1 (0x000000308f800000)\n', '\tlibmpi_cxx.so.1 => /usr/lib64/libmpi_cxx.so.1 (0x00007f443b752000)\n']
Traceback (most recent call last):
File "/opt/myami-3.0/bin/maxlikeAlignment.py", line 459, in <module>
maxLike.start()
File "/opt/myami-3.0/bin/maxlikeAlignment.py", line 452, in start
self.createReferenceStack()
File "/opt/myami-3.0/bin/maxlikeAlignment.py", line 341, in createReferenceStack
apDisplay.printError("Xmipp did not run")
File "/opt/myami-3.0/lib/appionlib/apDisplay.py", line 62, in printError
raise Exception, colorString("\n * FATAL ERROR *\n"+text+"\n\a","red")
Exception: * FATAL ERROR *
Xmipp did not run

Here is the job directory:
/ami/data15/appion/14jul24d/align/maxlike9

Please let me know if there is anything I can do to troubleshoot.
Thanks

Actions #1

Updated by Sargis Dallakyan over 10 years ago

I ran the MPI part by hand and got this:

[sargis@guppy-18 maxlike9]$ /usr/bin/mpirun -np 32 /opt/Xmipp/bin/xmipp_mpi_ml_align2d   -i /ami/data15/appion/14jul24d/align/maxlike9/partlist.sel -nref 200 -iter 15 -o /ami/data1
--------------------------------------------------------------------------
A deprecated MCA parameter value was specified in an MCA parameter
file.  Deprecated MCA parameters should be avoided; they may disappear
in future releases.

  Deprecated parameter: mpi_preconnect_all
--------------------------------------------

I replaced mpi_preconnect_all with mpi_preconnect_mpi in /etc/openmpi-mca-params.conf on the guppy head and work nodes. I'm now running the MPI part again, and so far it looks good:

[melody@guppy-18 maxlike9]$ /usr/bin/mpirun -np 32 /opt/Xmipp/bin/xmipp_mpi_ml_align2d -i /ami/data15/appion/14jul24d/align/maxlike9/partlist.sel -nref 200 -iter 15 -o /ami/data15/appion/14jul24d/align/maxlike9/part14aug14r59 -psi_step 5 -fast  -eps 5e-5  -mirror
 -----------------------------------------------------------------
 | Read more about this program in the following publications:   |
 |  Scheres ea. (2005) J.Mol.Biol. 348(1), 139-49                |
 |  Scheres ea. (2005) Bioinform. 21(suppl.2), ii243-4   (-fast) |
 |                                                               |
 |  *** Please cite them if this program is of use to you! ***   |
 -----------------------------------------------------------------
--> Maximum-likelihood multi-reference refinement 
  Input images            : /ami/data15/appion/14jul24d/align/maxlike9/partlist.sel (22638)
  Number of references:   : 200
  Output rootname         : /ami/data15/appion/14jul24d/align/maxlike9/part14aug14r59
  Stopping criterium      : 5e-05
  initial sigma noise     : 1
  initial sigma offset    : 3
  Psi sampling interval   : 5
  Check mirrors           : true
  -> Use fast, reduced search-space approach with C = 1e-12
 -----------------------------------------------------------------
  Generating initial references by averaging over random subsets
8.00/8.00 min ............................................................-
  Multi-reference refinement:  iteration 1 of 15
0.06/2.01 hours .----------------------------------------------------------
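
For anyone who has to repeat the parameter rename on other nodes, here is a minimal sketch of the edit in Python. It is not the exact procedure used on guppy. It assumes the file holds one `name = value` entry per line with `#` starting a comment, and the helper name `rename_mca_param` is made up for illustration.

```python
import re

# Names taken from the deprecation warning and the fix described above.
DEPRECATED = "mpi_preconnect_all"
REPLACEMENT = "mpi_preconnect_mpi"


def rename_mca_param(text, old=DEPRECATED, new=REPLACEMENT):
    """Return `text` with `old` renamed to `new` on active assignment lines.

    Assumption: entries look like "name = value". Lines where the name is
    preceded by '#' (comments) are left untouched.
    """
    pattern = re.compile(
        r"^([ \t]*)" + re.escape(old) + r"([ \t]*=)",
        re.MULTILINE,
    )
    # Keep the original indentation and spacing around '='.
    return pattern.sub(lambda m: m.group(1) + new + m.group(2), text)
```

To use it, read /etc/openmpi-mca-params.conf on each node, pass the contents through `rename_mca_param`, and write the result back after saving a backup copy. Running it a second time leaves the file unchanged.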

Actions #2

Updated by Anchi Cheng almost 10 years ago

  • Status changed from New to Closed