Bug #2880

closed

cl2d jobs won't run

Added by David Veesler almost 10 years ago. Updated over 8 years ago.

Status:
Closed
Priority:
Immediate
Assignee:
Category:
-
Target version:
-
Start date:
08/05/2014
Due date:
% Done:

0%

Estimated time:
Spent time:
Affected Version:
Appion/Leginon 3.0.0
Show in known bugs:
No
Workaround:

Description

Jobs are crashing on guppy with the following error message:

!!! WARNING: could not create stack average, average.mrc
... Inserting CL2D Run into DB
lines= ['\tlibmpi.so.1 => /usr/lib64/libmpi.so.1 (0x0000003913600000)\n', '\tlibmpi_cxx.so.1 => /usr
... (0x00007f7e8b70b000)\n']
/ami/data17/appion/14aug02d/align/cl2d5/alignedStack.hed
Traceback (most recent call last):
File "/opt/myamisnap/bin/runXmippCL2D.py", line 624, in
cl2d.start()
File "/opt/myamisnap/bin/runXmippCL2D.py", line 605, in start
self.insertAlignStackRunIntoDatabase("alignedStack.hed")
File "/opt/myamisnap/bin/runXmippCL2D.py", line 386, in insertAlignStackRunIntoDatabase
apDisplay.printError("could not find average mrc file: "+avgmrcfile)
File "/opt/myamisnap/lib/appionlib/apDisplay.py", line 65, in printError
raise Exception, colorString("\n * FATAL ERROR *\n"+text+"\n\a","red")
Exception: * FATAL ERROR *
could not find average mrc file: /ami/data17/appion/14aug02d/align/cl2d5/average.mrc

Files

Just now-correct.png (78.1 KB) Emily Greene, 09/24/2014 03:43 PM
2 hours ago-incorrect.png (78.2 KB) Emily Greene, 09/24/2014 03:43 PM

Related issues (0 open, 2 closed)

Related to Appion - Feature #2744: Change the default memory for Cl2d jobs submitted to garibaldi (Closed, Amber Herold, 04/25/2014)

Related to Appion - Bug #2873: Xmipp2 CL2D class averages summary page does not always correctly map the class averages to the underlying stack of raw particles (Closed, Ryan Hoffman, 07/30/2014)

Actions #1

Updated by Amber Herold almost 10 years ago

  • Related to Feature #2744: Change the default memory for Cl2d jobs submitted to garibaldi added
Actions #2

Updated by Amber Herold almost 10 years ago

Melody is reporting this issue as well on both guppy and garibaldi.

Actions #3

Updated by Amber Herold almost 10 years ago

Sargis, I just realized that this morning, when you asked if I was working on the bug that David reported, you probably thought I meant this one. I was working on #2744 this morning.

Sorry David and Melody, looks like our communication snafu prevented work on this urgent issue today...

Actions #4

Updated by Sargis Dallakyan almost 10 years ago

  • Status changed from New to In Test

I ran this job today at noon. It took 5.5 hours and finished successfully, producing 6 clusters: http://emportal.scripps.edu/myamiweb/processing/clusterlist.php?expId=13763

There might have been some other upstream problem that caused this error. Also, data15 is still having issues; I'm running xfs_repair on it...

Actions #5

Updated by David Veesler almost 10 years ago

I just tried again and cl2d jobs still don't work, either via the web interface or by submitting the job file manually.

Sargis, here is an example /ami/data15/appion/14aug02a/align/cl2d7

Thanks
David

Actions #6

Updated by Reginald McNulty almost 10 years ago

CL2D appears to be running fine right now. My clustering results are currently being uploaded to the database. This is the folder: /ami/data15/appion/14aug05b/align/cl2d2/

Actions #7

Updated by David Veesler almost 10 years ago

This still doesn't work for me.
When submitting jobs from the web interface, I still get the same error message.
When I delete all the files except the job file and submit it manually to guppy, the job crashes with an MPI error (/ami/data15/appion/14aug02a/align/cl2d8).

I have also been trying to run cl2d jobs for other projects on garibaldi, and although they run to completion, they yield only 2 class averages out of the 64 or 128 requested (depending on the job).

Actions #8

Updated by Sargis Dallakyan almost 10 years ago

Thanks for the update, Reggie. Sorry this still doesn't work for you, David. I ran /ami/data15/appion/14aug02a/align/cl2d7 as David and it worked for me too. Things for David to try: use betamyamiweb instead of myamiweb, or replace /opt/myami-3.0 with /opt/myamisnap in the job file.

Diffing the logs from a working run and a failing run shows that xmipp_mpi_class_averages works fine in both cases. The problem seems to be with proc2d, as shown in the diff snippet below.

96,99c77
< EMAN: proc2d /ami/data15/appion/14aug02a/align/cl2d7/part14aug08j36_level_05_.hed /ami/data15/appion/14aug02a/align/cl2d7/average.mrc average
< EMAN 1.9 Cluster ($Date: 2009/02/18 05:12:22 $).
< Run 'proc2d help' for detailed help.
1 images60 complete
---
> !!! WARNING: could not create stack average, average.mrc
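
Below is a minimal, hypothetical sketch (not the actual appionlib helper) of the averaging step this diff refers to, based only on the proc2d command visible in the log above: Appion shells out to EMAN 1's proc2d with the "average" option to write average.mrc; when that call fails, only the WARNING line is printed and the later database-insert check raises the fatal error.

import subprocess

# Sketch only: paths and invocation copied from the log above; the real code
# builds these from the run directory and the final class-average stack name.
cmd = ("proc2d /ami/data15/appion/14aug02a/align/cl2d7/part14aug08j36_level_05_.hed "
       "/ami/data15/appion/14aug02a/align/cl2d7/average.mrc average")
ret = subprocess.call(cmd, shell=True)
if ret != 0:
    print("!!! WARNING: could not create stack average, average.mrc")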

Actions #9

Updated by Melody Campbell almost 10 years ago

Hi All,

I'm still also having this problem about half the time I run. I ran from cronus3/betamyami and it has /opt/myamisnap in the job file.

Here is part of the job file

webcaller.py '/opt/myamisnap/bin/appion runXmippCL2D.py --stack=27 --lowpass=15 --highpass=2000 --num-part=22638 --num-ref=128 --bin=4 --max-iter=15 --nproc=24 --fast --classical_multiref --commit --description=correntropy --runname=cl2d11 --rundir=/ami/data15/appion/14jul24d/align/cl2d11 --projectid=450 --expid=13722 ' /ami/data15/appion/14jul24d/align/cl2d11/cl2d11.appionsub.log

And the error is:

Getting stack data for stackid=27
Old stack info: 'small ones ... 22638 particle substack with 0,1,2,3,4,5,7,8,10,13,15,16,18,19,29,25,23,22,21,20,30,31,33,38,39,45,50,68,66,70,71,78,80,90,93,98,106,103,100,114,127,126,123,125 classes included'
... Stack 27 pixel size: 1.364
... averaging stack for summary web page
!!! WARNING: could not create stack average, average.mrc
... Inserting CL2D Run into DB

lines= ['\tlibmpi.so.1 => /usr/lib64/libmpi.so.1 (0x000000372fa00000)\n', '\tlibmpi_cxx.so.1 => /usr/lib64/libmpi_cxx.so.1 (0x00007fdfffc0f000)\n']
/ami/data15/appion/14jul24d/align/cl2d11/alignedStack.hed
Traceback (most recent call last):
File "/opt/myamisnap/bin/runXmippCL2D.py", line 624, in <module>
cl2d.start()
File "/opt/myamisnap/bin/runXmippCL2D.py", line 605, in start
self.insertAlignStackRunIntoDatabase("alignedStack.hed")
File "/opt/myamisnap/bin/runXmippCL2D.py", line 386, in insertAlignStackRunIntoDatabase
apDisplay.printError("could not find average mrc file: "+avgmrcfile)
File "/opt/myamisnap/lib/appionlib/apDisplay.py", line 65, in printError
raise Exception, colorString("\n * FATAL ERROR *\n"+text+"\n\a","red")
Exception: * FATAL ERROR *
could not find average mrc file: /ami/data15/appion/14jul24d/align/cl2d11/average.mrc

This was running on guppy

Actions #10

Updated by Sargis Dallakyan almost 10 years ago

I've checked /var/log/messages on guppy-14 and saw this there:

Aug 12 11:54:07 guppy-14 abrt: detected unhandled Python exception in '/opt/myamisnap/bin/runXmippCL2D.py'
Aug 12 11:54:07 guppy-14 abrt: can't communicate with ABRT daemon, is it running? [Errno 2] No such file or directory
Aug 12 11:54:15 guppy-14 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/spool/torque/spool/670735.guppy.emg.nysbc.org.OU melody@guppy.emg.nysbc.org:/ami/data15/appion/14jul24d/align/cl2d11/cl2d11.appionsub.job.o670735' failed with status=1, giving up after 4 attempts

The error output in guppy-14:/var/spool/torque/undelivered/670741.guppy.emg.nysbc.org.OU reads:

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
PBS: chdir to /home/melody failed: No such file or directory
[guppy-18:28212] [[24343,0],4] ORTE_ERROR_LOG: Not found in file ess_tm_module.c at line 241
[guppy-18:28212] [[24343,0],4] -> [[24343,0],0] (node: NULL) oob-tcp: Number of attempts to create TCP connection has been exceeded.  Can not communicate with peer
[guppy-18:28212] [[24343,0],4] ORTE_ERROR_LOG: Not found in file ess_tm_module.c at line 241
[guppy-18:28212] [[24343,0],4] -> [[24343,0],0] (node: NULL) oob-tcp: Number of attempts to create TCP connection has been exceeded.  Can not communicate with peer
[guppy-16:19614] [[24343,0],2] ORTE_ERROR_LOG: Not found in file ess_tm_module.c at line 241
[guppy-16:19614] [[24343,0],2] -> [[24343,0],0] (node: NULL) oob-tcp: Number of attempts to create TCP connection has been exceeded.  Can not communicate with peer
[guppy-16:19614] [[24343,0],2] ORTE_ERROR_LOG: Not found in file ess_tm_module.c at line 241
[guppy-16:19614] [[24343,0],2] -> [[24343,0],0] (node: NULL) oob-tcp: Number of attempts to create TCP connection has been exceeded.  Can not communicate with peer
[guppy-15:18344] [[24343,0],1] ORTE_ERROR_LOG: Not found in file ess_tm_module.c at line 241
[guppy-15:18344] [[24343,0],1] -> [[24343,0],0] (node: NULL) oob-tcp: Number of attempts to create TCP connection has been exceeded.  Can not communicate with peer
[guppy-15:18344] [[24343,0],1] ORTE_ERROR_LOG: Not found in file ess_tm_module.c at line 241
[guppy-15:18344] [[24343,0],1] -> [[24343,0],0] (node: NULL) oob-tcp: Number of attempts to create TCP connection has been exceeded.  Can not communicate with peer
[guppy-19:08589] [[24343,0],5] ORTE_ERROR_LOG: Not found in file ess_tm_module.c at line 241
[guppy-19:08589] [[24343,0],5] -> [[24343,0],0] (node: NULL) oob-tcp: Number of attempts to create TCP connection has been exceeded.  Can not communicate with peer
[guppy-19:08589] [[24343,0],5] ORTE_ERROR_LOG: Not found in file ess_tm_module.c at line 241
[guppy-19:08589] [[24343,0],5] -> [[24343,0],0] (node: NULL) oob-tcp: Number of attempts to create TCP connection has been exceeded.  Can not communicate with peer

I've added "$usecp *:/ami/data15 /ami/data15" line to /var/spool/torque/mom_priv/config on all guppy work nodes, restarted pbs_server and pbs_sched on head node and pbs_mom on all work nodes. This is supposed to copy .OU file to /ami/data15. Please update this bug and past the rundir for the failed runXmippCL2D job.

Actions #11

Updated by Melody Campbell almost 10 years ago

Hi,

I moved the old directory to
/ami/data15/appion/14jul24d/align/OLD_cl2d11_OLD/

I re-created the directory and ran the same job using the exact same job file and got the same error as before:
... averaging stack for summary web page
!!! WARNING: could not create stack average, average.mrc
... Inserting CL2D Run into DB

lines= ['\tlibmpi.so.1 => /usr/lib64/libmpi.so.1 (0x0000003913600000)\n', '\tlibmpi_cxx.so.1 => /usr/lib64/libmpi_cxx.so.1 (0x00007f4abc346000)\n']
/ami/data15/appion/14jul24d/align/cl2d11/alignedStack.hed
Traceback (most recent call last):
File "/opt/myamisnap/bin/runXmippCL2D.py", line 624, in <module>
cl2d.start()
File "/opt/myamisnap/bin/runXmippCL2D.py", line 605, in start
self.insertAlignStackRunIntoDatabase("alignedStack.hed")
File "/opt/myamisnap/bin/runXmippCL2D.py", line 386, in insertAlignStackRunIntoDatabase
apDisplay.printError("could not find average mrc file: "+avgmrcfile)
File "/opt/myamisnap/lib/appionlib/apDisplay.py", line 65, in printError
raise Exception, colorString("\n * FATAL ERROR *\n"+text+"\n\a","red")
Exception: * FATAL ERROR *
could not find average mrc file: /ami/data15/appion/14jul24d/align/cl2d11/average.mrc

If the stack average is really just for the summary page, could we just comment out this part? It's not needed for the actual alignment or processing, and we would be very happy to proceed with cl2d without this average image on the Appion web page.

Actions #12

Updated by Amber Herold almost 10 years ago

  • Assignee changed from Sargis Dallakyan to Melody Campbell

I changed the error to a warning instead; Melody is testing. It may still be worth investigating what the root cause is, however.
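
A minimal sketch of the kind of change described above, assuming the check lives in insertAlignStackRunIntoDatabase() and that apDisplay.printWarning is what produces the "!!! WARNING" lines seen in the logs (the committed diff may differ):

import os
from appionlib import apDisplay  # same module as in the tracebacks above

# Sketch only: downgrade the missing-average check from a fatal error to a
# warning so the run can continue to the database insert step.
avgmrcfile = "/ami/data15/appion/14jul24d/align/cl2d11/average.mrc"  # example path
if not os.path.isfile(avgmrcfile):
    apDisplay.printWarning("could not find average mrc file: " + avgmrcfile)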

Actions #13

Updated by Melody Campbell almost 10 years ago

  • Assignee changed from Melody Campbell to Sargis Dallakyan

Hi,

I ran the new version in Amber's sandbox but it just crashes at the next step.
Here is the error:

!!! WARNING: could not create stack average, average.mrc
... Inserting CL2D Run into DB
!!! WARNING: could not find average mrc file: /ami/data00/appion/14jun19b/align/cl2d2/average.mrc

lines= ['\tlibmpi.so.1 => /usr/lib64/libmpi.so.1 (0x000000308f800000)\n', '\tlibmpi_cxx.so.1 => /usr/lib64/libmpi_cxx.so.1 (0x00007f48f47a9000)\n']
/ami/data00/appion/14jun19b/align/cl2d2/alignedStack.hed
Traceback (most recent call last):
File "/ami/data00/dev/amber/myami/appion/bin/runXmippCL2D.py", line 624, in <module>
cl2d.start()
File "/ami/data00/dev/amber/myami/appion/bin/runXmippCL2D.py", line 605, in start
self.insertAlignStackRunIntoDatabase("alignedStack.hed")
File "/ami/data00/dev/amber/myami/appion/bin/runXmippCL2D.py", line 389, in insertAlignStackRunIntoDatabase
apDisplay.printError("could not find reference stack file: "+refstackfile)
File "/ami/data00/dev/amber/myami/appion/appionlib/apDisplay.py", line 65, in printError
raise Exception, colorString("\n * FATAL ERROR *\n"+text+"\n\a","red")
Exception: * FATAL ERROR *
could not find reference stack file: /ami/data00/appion/14jun19b/align/cl2d2/part14aug13p38_level_-1_.hed

And here is the directory:
/ami/data00/appion/14jun19b/align/cl2d2

Actions #14

Updated by Sargis Dallakyan almost 10 years ago

I've researched the mpi error message and it seems that this can be fixed by setting mpi_preconnect_all to 1:
https://groups.google.com/forum/#!msg/mom-users/G4Nknz-Vp5I/1DIl7czOP1UJ

I've added mpi_preconnect_all = 1 to /etc/openmpi-mca-params.conf on guppy head and work nodes.

Please try again. If it fails, try running the same job and request one node instead of 4 nodes to see if that helps.

Actions #15

Updated by Jana Albrecht over 9 years ago

I just tested a small stack (~1200 particles) after Sargis's modification: it didn't run.

mpirun: killing job...

mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate

/ami/data15/appion/14aug12b/align/cl2d4

Actions #16

Updated by Sargis Dallakyan over 9 years ago

Thanks Jana. This job was terminated because it exceeded the walltime limit:

[root@guppy cl2d4]# more cl2d4.appionsub.job.o670882
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
=>> PBS: job killed: walltime 7240 exceeded limit 7200 = (2*60*60)
Terminated

I've increased the walltime for the guppy host on cronus3/betamyamiweb from 2 to 200 hours. Please try again.

Actions #17

Updated by David Veesler over 9 years ago

It still doesn't work, but now I'm getting a new error:

lines= ['\tlibmpi.so.1 => /usr/lib64/libmpi.so.1 (0x000000308f800000)\n', '\tlibmpi_cxx.so.1 => /usr/lib64/libmpi_cxx.so.1 (0x00007f59fe17a000)\n']
/ami/data15/appion/14aug07f/align/cl2d3/alignedStack.hed
Traceback (most recent call last):
File "/opt/myamisnap/bin/runXmippCL2D.py", line 624, in <module>
cl2d.start()
File "/opt/myamisnap/bin/runXmippCL2D.py", line 605, in start
self.insertAlignStackRunIntoDatabase("alignedStack.hed")
File "/opt/myamisnap/bin/runXmippCL2D.py", line 389, in insertAlignStackRunIntoDatabase
apDisplay.printError("could not find reference stack file: "+refstackfile)
File "/opt/myamisnap/lib/appionlib/apDisplay.py", line 65, in printError
raise Exception, colorString("\n * FATAL ERROR *\n"+text+"\n\a","red")
Exception: * FATAL ERROR *
could not find reference stack file: /ami/data15/appion/14aug07f/align/cl2d3/part14aug17k12_level_-1_.hed

This bug seems reproducible, as it has happened several times for me.

Actions #18

Updated by Sargis Dallakyan over 9 years ago

I'm now running /ami/data15/appion/14aug07f/align/cl2d3/cl2d3.appionsub.job as David. I'm using /ami/data00/dev/sargis/appion, which should keep intermediate results and print more debug messages.

Since this worked for me the last time I ran a similar job as David, please run env|grep PATH on guppy so we can compare our environment paths. Here is what I get:

[sargis@guppy cl2d2]$ env|grep PATH
PATH=.:/opt/Xmipp/bin:/opt/protomo/bin:/opt/protomo/x86_64/bin:/opt/protomo2/bin/linux/x86-64:/opt/phoelix/bin64:/usr/kerberos/bin:/opt/IMOD/bin:/usr/local/EMAN/bin:/opt/EMAN2/bin:/opt/myamisnap/bin:/opt/em_hole_finder:/usr/local/bin:/bin:/usr/bin:/usr/local/ccp4-6.3.0/share/xia2/xia2core//Test:/usr/local/ccp4-6.3.0/share/xia2/xia2//Applications:/usr/local/ccp4-6.3.0/etc:/usr/local/ccp4-6.3.0/bin:/usr/local/ccp4-6.3.0/share/ccp4i/bin:/usr/local/ccp4-6.3.0/share/dbccp4i/bin:/usr/local/relion/bin:/usr/local/SIMPLE/simple_linux_120521/apps:/usr/local/SIMPLE/simple_linux_120521/bin:/opt/suprim/bin64
MATLABPATH=/opt/myamisnap/ace
LD_LIBRARY_PATH=/opt/Xmipp/lib:/opt/suprim/lib64:/opt/suprim/lib64:/opt/phoelix/lib64:/opt/IMOD/lib:/opt/EMAN2/lib:/usr/lib64:/usr/local/matlab/bin/glnxa64:/usr/local/ccp4-6.3.0/lib:/usr/local/EMAN/lib:/opt/Imagic/lib:/opt/Imagic/fftw/lib:/opt/protomo2/lib/linux/x86-64
PYTHONPATH=/opt/EMAN2/lib:/opt/EMAN2/bin:/opt/myamisnap/lib::/usr/local/ccp4-6.3.0/share/python:/usr/local/EMAN/lib:/opt/protomo2/lib/linux/x86-64
MANPATH=/opt/IMOD/man:/usr/share/man:/usr/local/ccp4-6.3.0/man
CLASSPATH=.:/usr/local/ccp4-6.3.0/bin
DYLD_LIBRARY_PATH=/usr/local/ccp4-6.3.0/lib
XUSERFILESEARCHPATH=/usr/local/ccp4-6.3.0/lib/X11/app-defaults/%N:/usr/lib/X11/app-defaults
RASMOLPATH=/usr/local/ccp4-6.3.0/x-windows/RasMol/src
MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
SUPRIM_PATH=/opt/suprim

Also, has anyone tried running this on a single node instead of 4 nodes to see if that works?

Update: The job I ran as David is now in the MPI stage and seems to be working fine so far. I'll need to sit down with David or Melody so we can launch a modified job file together to debug this.

Actions #19

Updated by David Veesler over 9 years ago

Hi Sargis,

I tried running cl2d using version 3 of Xmipp as you suggested and got the following error:

/ami/data15/appion/14aug14a/align/cl2d6

Traceback (most recent call last):
File "/opt/myamisnap/bin/runXmipp3CL2D.py", line 647, in <module>
cl2d.start()
File "/opt/myamisnap/bin/runXmipp3CL2D.py", line 599, in start
star.read()
File "/opt/myamisnap/lib/appionlib/starFile.py", line 132, in read
raise Exception("Trying to read a star format file that does not exist: %s" % (self.location))
Exception: Trying to read a star format file that does not exist: images.xmd

Actions #20

Updated by Melody Campbell over 9 years ago

Hi,

I ran two more CL2D jobs unsuccessfully on guppy.

/ami/data15/appion/14aug18a/align/cl2d1
/ami/data15/appion/14aug18a/align/cl2d3

They both had the same error:
=>> PBS: job killed: mem job total 125840520 kb exceeded limit 125829120 kb
Terminated

This was a very small job and I'm very surprised it would need 125 GB of memory... is there 16 GB of memory per node on guppy?

Actions #21

Updated by Sargis Dallakyan over 9 years ago

Thanks for the update. 14aug07f/align/cl2d3 finished successfully. I've started a new Xmipp 3 cl2d job via http://longboard/betamyamiweb with the same parameters as in /ami/data15/appion/14aug14a/align/cl2d6.

Dmitry reported that he was getting "PBS: chdir to '/home/dlyumkis' failed" on guppy-17 yesterday. There was a similar error message in David's job file. The /home folder had somehow been unmounted from guppy-17; I have remounted it.

For Melody's CL2D jobs there were no PBS chdir errors, which is good news. I'm now working on the memory error.

Yes, there is 16 GB of memory per node on guppy. pbsnodes gives the following info: physmem=16331812kb, availmem=24125456kb, totmem=24564764kb. This is about the same for all guppy nodes.

[root@guppy align]# more */*.appionsub.job.o*
::::::::::::::
cl2d1/cl2d1.appionsub.job.o670982
::::::::::::::
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
=>> PBS: job killed: mem job total 18367888 kb exceeded limit 16777216 kb
Terminated
::::::::::::::
cl2d3/cl2d3.appionsub.job.o670989
::::::::::::::
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
=>> PBS: job killed: mem job total 125840520 kb exceeded limit 125829120 kb
Terminated
::::::::::::::
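
For reference, a quick conversion of the PBS memory figures quoted above (using 1 GiB = 1048576 kB); the figures themselves come from the logs, the conversion is just arithmetic:

# Convert the kB values reported by PBS and pbsnodes to GiB.
for label, kb in [("cl2d3 job limit", 125829120),
                  ("cl2d1 job limit", 16777216),
                  ("guppy node physmem", 16331812)]:
    print("%s: %.1f GiB" % (label, kb / 1048576.0))
# cl2d3 job limit: 120.0 GiB
# cl2d1 job limit: 16.0 GiB
# guppy node physmem: 15.6 GiB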
Actions #22

Updated by Reginald McNulty over 9 years ago

I am able to run CL2D on guppy without any error messages. I submitted the job via Appion and longboard/betamyamiweb.

The command and folder are below:

/opt/myamisnap/bin/runXmippCL2D.py \
--stack=190 --lowpass=10 --highpass=2000 --num-part=1349 --num-ref=20 \
--bin=7 --max-iter=15 --nproc=16 --fast --classical_multiref \
--correlation --commit --description='CL2D 26minus minus' --runname=cl2d3 \
--rundir=/ami/data15/appion/14jan24a/align/cl2d3 --projectid=367 \
--expid=12853

Actions #23

Updated by David Veesler over 9 years ago

Sargis,
The run /ami/data15/appion/14aug14a/align/cl2d7 using Xmipp 3 completed successfully.
However, the classes aren't uploaded properly: nothing shows up when clicking the "show composite page" link.

Actions #24

Updated by Sargis Dallakyan over 9 years ago

  • Assignee changed from Sargis Dallakyan to David Veesler

Thanks for the updates. I've committed r18534 to fix the problem with uploading the last Xmipp 3 run at /ami/data15/appion/14aug14a/align/cl2d7. I've verified that "show composite page" now shows the Xmipp 3 CL2D results. This will be available on guppy tomorrow after the nightly cron update.

Actions #25

Updated by Clint Potter over 9 years ago

  • Status changed from In Test to Assigned
  • Assignee changed from David Veesler to Sargis Dallakyan

Discussed during Appion developers call. CL2D still having problems where raw particles don't correspond to classes. Might be in upload code. Assigned to Sargis. Neil will help.

Actions #26

Updated by Melody Campbell over 9 years ago

Hi,

Just a note: on a cl2d run from August 19th, the particles were still correctly correlated with the 2D class averages. My guess is therefore that the problem comes from changes made within the last month. This was with Xmipp2.

Please let us know if there is anything more we can do to test this problem; we would really like to get it fixed ASAP. Jana and Yong Zi are currently running a small proteasome test with 3k particles, since its preferred orientation and high contrast make it easy to tell whether the particles are correlated with the 2D averages or not.

Actions #27

Updated by Emily Greene over 9 years ago

I made a stack with 3 really clear views (50S ribosome side view, 50S ribosome crown view, and 30S ribosome) and then classified them (/ami/data15/appion/14sep14a/align/cl2d9/). Level 2 uploaded properly, but levels 1 and 0 didn't. You can see that the number of particles in the level 1 classes is exactly the same as in the first four level 2 classes, and the particles no longer correspond.

Actions #28

Updated by Sargis Dallakyan over 9 years ago

I've looked at the code (PHP and Python) and can't figure out what might be causing this. There are no recent changes there or in Xmipp2, which makes it a bit hard to debug. In /ami/data15/appion/14sep14a/align/cl2d9/ there are .sel files that point to the aligned particles, and that part (View Aligned Particles) seems to be working OK. I'm now going through the View Raw Particles code (both the PHP and the upload part of runXmippCL2D.py) to see what can be done there. A smaller test case with fewer particles would make it easier to debug this.

Actions #29

Updated by Melody Campbell over 9 years ago

Hey Emily,

Bridget and I just looked through your 14sep14a cl2d9 on data15 for all the levels, and it seems like they all work... could you go back and double-check whether they are still wrong for you? (We're not ruling out the possibility that something has changed since you looked at them this morning...)

Thanks

Actions #30

Updated by Melody Campbell over 9 years ago

Hi Sargis,

I've made a small stack with ~3k particles and ran cl2d on guppy. The directory can be found here:
/ami/data22/appion/14aug07e/align/cl2d10-10im_guppy

If you look in Appion at the 64 classes (/ami/data22/appion/14aug07e/align/cl2d10-10im_guppy/part14sep24n47_level_05_.hed), you can see that for class #13, #23, or #41 the aligned particles are clearly not matched up correctly. The particles associated with class #57 might actually be the correct ones for class #13, #23, or #41.

Actions #31

Updated by Bridget Carragher over 9 years ago

Thanks Sargis. We are currently baffled. Emily did a small run (see above, /ami/data15/appion/14sep14a/align/cl2d9/) and reported problems, but then Melody and I looked at it and it looks perfect.
Melody did a small run (14aug07e, the latest run) and it looks wrong.
Emily ran on garibaldi and Melody ran on both, but so far we've only looked at guppy, and that one is wrong.

Actions #32

Updated by Emily Greene over 9 years ago

This is really weird. I displayed the particles for the same class about 2 hours ago and again just now when I saw the email, and they're different. Screenshots of the class averages are attached; you can see the number of particles in each class has changed. Maybe Sargis did something while looking around?

Actions #33

Updated by Bridget Carragher over 9 years ago

Maybe the classes can be displayed before the database is done updating?
Did the classes seem good to you in spite of the incorrect total particle counts? I.e., did the raw particles seem to belong to the right class? Right now they all look very good.
Did you always run on garibaldi, or sometimes on guppy before?

Actions #34

Updated by Emily Greene over 9 years ago

My test case was run on guppy, as Anchi requested since we have better control over it, and I am guessing you are right that it hadn't finished uploading - sorry for jumping the gun. That is still a mild problem, since Appion told me the run was done, but it's a different problem. At least one of my experimental runs is still incorrect and was run on garibaldi. The error persists through the different levels and affects both raw and aligned particles. I will try submitting the same 600-particle test job to garibaldi and see if it uploads correctly. I'll also try to make a 100-particle one so that Sargis has less to search through.

Actions #35

Updated by Melody Campbell over 9 years ago

Same problem on garibaldi:

/ami/data22/appion/14aug07e/align/cl2d10-10im_garibaldi

Actions #36

Updated by Bridget Carragher over 9 years ago

  • Priority changed from Urgent to Immediate

OK, this is a really big problem!!
Everyone who can, please escalate this to the highest priority and put everything else to one side until we solve it.
Right now I think we have no idea what the issue is.
Is it related to the number of nodes? The head node? The parallelization? The upload speed? Some random bug? Longboard vs. cronus?
Is anyone else out there (Scott? Neil?) experiencing the same issue? Let us know if you want more details on what we are facing.

Actions #37

Updated by Emily Greene over 9 years ago

I was waiting for Melody's test to finish, and now it has. I think we've figured out the problem: it started when we switched to using data21 and data22 after the data15 crash. Here are 2 runs with only 127 particles:
/ami/data15/appion/14sep14a/align/cl2d11/
/ami/data21/appion/14sep14a/align/cl2d12/

cl2d12 was run hours ago but still hasn't fully uploaded, in addition to having some incorrect particles. Usually the first class average is off, and after that it's sporadic.

I also tried it with a 650-particle stack and got the same problem (cl2d10 and cl2d10_data21), and Melody tried it with the proteasome. It seems to be consistent. The particles are much clearer for the proteasome, but that stack is a lot larger.

Hope this helps.

Actions #38

Updated by Melody Campbell over 9 years ago

It's a miracle!

Last night Emily emailed me pointing out that it seemed like her alignments worked, but only on data15. I re-ran this with the proteasome and found, at least for guppy on data15, that the particles are properly correlated (for this one run):

/ami/data15/appion/14aug07e/align/cl2d10-10im_guppy_data15

This seems like more than just a coincidence, as Emily has also seen it work properly on data15, and the 4 other runs I ran on data22 in the same session with the same parameters/stack have incorrect class average/particle correlation.

Anyway, could it be that something gets crossed when reading/writing to another disk?

Actions #39

Updated by Sargis Dallakyan over 9 years ago

I have rerun the last two jobs without clearing intermediate results, and the raw results produced by Xmipp2 (.spi, .xmp, .sel) look identical.

/ami/data15/appion/14aug07e/align/cl2d10-10im_guppy_data15
/ami/data22/appion/14aug07e/align/cl2d10-10im_guppy

I'll now go through the steps in createReferenceStack (https://emg.nysbc.org/projects/appion/repository/entry/trunk/appion/bin/runXmippCL2D.py#L192). I think something is happening when it creates the class averages and alignedStack that makes the results look different.

Actions #40

Updated by Sargis Dallakyan over 9 years ago

I found what's causing this. runXmippCL2D.py uses the glob function, and on data15 glob returned an ordered list of files; on data22, however, it returns the list in arbitrary order. I have made revision r18592, which sorts this list before making the imagic files. Please try running another cl2d job on data21 or data22 tomorrow after the nightly cron job updates Appion on guppy.

There is more info on how Python's glob.glob orders results at http://stackoverflow.com/questions/6773584/how-is-pythons-glob-glob-ordered.
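
A minimal sketch of the kind of fix r18592 describes; the glob pattern and loop below are illustrative only (the real call site is in createReferenceStack in runXmippCL2D.py):

import glob

# glob.glob() returns files in whatever order the filesystem lists them,
# which happened to be alphabetical on data15 but arbitrary on data22.
# Sorting makes the class-average order deterministic before the imagic
# stack is built, so classes and their particles stay matched.
classfiles = sorted(glob.glob("part*_level_??_.xmp"))
for classnum, fname in enumerate(classfiles):
    print(classnum, fname)  # append to the imagic stack in this order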

Actions #41

Updated by Sargis Dallakyan over 9 years ago

  • Status changed from Assigned to In Code Review
  • Assignee changed from Sargis Dallakyan to Anchi Cheng

I'm not sure why redmine stopped sending notifications for this bug; maybe because this bug is one month old. In any case, I've looked for other places where glob is used. It seems this sorting is either not required in those places, or, where it is required, the sorting is already done, as in the following examples:

https://emg.nysbc.org/projects/appion/repository/annotate/trunk/appion/bin/emanRefine2d.py#L95
https://emg.nysbc.org/projects/appion/repository/annotate/trunk/appion/bin/refBasedMaxlikeAlign.py#L242

Actions #42

Updated by Sargis Dallakyan over 9 years ago

After reading http://www.redmine.org/issues/8157 and going through the logs (Email delivery error: 550 5.1.1 <>... User unknown), I found out why redmine was not sending notification emails.

Actions #43

Updated by Emily Greene over 9 years ago

Hey everyone,

I finished my identical data15 and data21 runs on an experimental dataset, and things seem to have uploaded properly on both! I didn't check every single class, but I randomly opened classes of distinct particles in different levels and they were all correct. Runs are on 14sep17c (cl2d5 and cl2d5_data21) if anyone wants to take a look. I ran on garibaldi but, as Sargis made the same updates on both, I think it should be fine.

Happy classifying!

Actions #44

Updated by Bridget Carragher over 9 years ago

Thanks Emily and everyone else for the careful testing and bug reporting. And thanks Sargis for hunting this down and fixing it.

Actions #45

Updated by Sargis Dallakyan over 9 years ago

  • Related to Bug #2873: Xmipp2 CL2D class averages summary page does not always correctly map the class averages to the underlying stack of raw particles added
Actions #46

Updated by Anchi Cheng over 8 years ago

  • Status changed from In Code Review to Closed
