Bug #2575
closed
Template picking job hangs in Garibaldi due to FindEM hanging
Added by Saikat Chowdhury about 11 years ago.
Updated almost 7 years ago.
Affected Version:
Appion/Leginon 3.0.0
Description
I have been trying to run template picker since last weekend on session# 12587(13oct29b) and 12628(13nov05o). The name of the job file is "template.job" and the directories in garibaldi are :"/gpfs/home/saikat/xdynactin_tilt" and "/gpfs/home/saikat/xdynactin_tilt/tilt_pair".
Sometimes the job crashes stating that findem.exe has crashed. If I resubmit the job, it runs for some more micrographs and then gets stuck. The job neither crashes nor gives any error; it just appears stuck on one micrograph for hours. If I kill the job and resubmit it, it will proceed through some more micrographs and hang again. There is no definite number of micrographs the job processes before hanging, nor a specific step at which it hangs.
I waited for myami to be upgraded and then resubmitted the job today and I still have the same issue.
Unfortunately, this appears to be a problem with the Fortran findem.exe program itself, probably a memory leak. Natalia reported the same issue. I've asked around, and Arne said it has happened to him regularly, too. It is therefore not a garibaldi issue.
I can see if we can add a time-out and retry, but for the time being your only choice is to resubmit.
Just checked Saikat's job output. It looks like there is already a retry built in, and it is used on quite a few images. However, when the job really stalls, the pipe isn't broken, so the retry never triggers.
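A timeout-based retry (rather than one triggered only by a broken pipe) could be sketched like this, assuming the picker invokes findem.exe as a subprocess. The function and parameter names here are illustrative, not the actual Appion code:

```python
import subprocess

def run_with_retry(cmd, timeout_sec=600, max_retries=3):
    """Run an external command; if it produces no result within
    timeout_sec, kill it and try again (up to max_retries attempts).
    Illustrative sketch, not the actual Appion retry code."""
    for attempt in range(1, max_retries + 1):
        try:
            result = subprocess.run(cmd, capture_output=True,
                                    timeout=timeout_sec)
            if result.returncode == 0:
                return result.stdout
        except subprocess.TimeoutExpired:
            # Hung process: subprocess.run has already killed the child,
            # so just fall through to the next attempt.
            pass
    raise RuntimeError("%r failed after %d attempts" % (cmd, max_retries))
```

This catches the "silent hang" case L14 describes, where the child stays alive but never finishes, instead of waiting for a pipe error that never comes.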
In addition, I see that Saikat requested 8 nodes and 2 processors per node. As far as I know, FindEM's multi-threading only works within a single node, so spreading out this way only wastes resources and does not speed up the process. Try asking for 1 node and 8 processors per node.
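For reference, that resource request would look like the following PBS directive in the job file (a sketch; the actual contents of template.job may differ):

```shell
#PBS -l nodes=1:ppn=8
```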
It seems like this was first reported on September 3 2013 (Bug #2504)
FindEM is multithreaded and doesn't run across multiple nodes. I'm not sure why Saikat requested multiple nodes. We've tried in the past requesting 1 node with anywhere from 1 to 16 processors, as well as small and large amounts of memory, and nothing seemed to fix this issue.
Do we have the source code for the new FindEM2? I could try this out to see if it helps at all.
The attached file in #2115 should have the source or exe.
- Subject changed from Template picking job hangs in Garibaldi to Template picking job hangs in Garibaldi due to FindEM hanging
Gabe,
Any progress with trying out FindEM2 to see if it fixes this problem?
Sorry I dropped the ball - I never finished implementing this, since FindEM2 requires a custom mask for each template. I'll have the code generate a circular mask for all templates for now and do some testing.
I updated the parallelization code to limit the number of simultaneously running threads to the number of available CPUs. If launched on garibaldi, it will check the PBS_NODEFILE variable to get this number. I implemented FindEM2, but in my testing I didn't see much of an improvement over FindEM1. This is perhaps because I was using a circular mask instead of one specific to each template. I didn't experience any hangs/crashes with either FindEM1 or FindEM2 using the updated threading code, so I'm leaving FindEM1 in place. If we see any problems, I can change it back to use FindEM2.
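The PBS_NODEFILE check described above could look something like the following. On a PBS/Torque cluster the node file lists one line per allocated processor slot, so counting its lines gives the CPU budget; the fallback and function name are assumptions for illustration, not the actual Appion code:

```python
import multiprocessing
import os

def available_cpus():
    """Count processor slots for thread limiting: on a PBS cluster,
    PBS_NODEFILE lists one line per allocated processor; otherwise
    fall back to the local CPU count. Illustrative sketch only."""
    nodefile = os.environ.get("PBS_NODEFILE")
    if nodefile and os.path.exists(nodefile):
        with open(nodefile) as f:
            return sum(1 for line in f if line.strip())
    return multiprocessing.cpu_count()
```

The fallback branch also addresses the interactive, non-cluster case mentioned later in this thread, where PBS_NODEFILE is not set.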
I did NOT, however, change any of the appionweb code that generates the job. Should we have the user specify the number of CPUs? Should we hard-code it to 8? Base it on the number of templates?
Let me know what you think.
r18049
I did a quick test run to see how myamiweb behaves. Even though I picked 2 templates, it automatically selected nodes=1 and ppn=1 on guppy when I submitted through the interface. We could probably do some math and force that value to something reasonable, but we cannot force the values when people use the copy-paste command option, which is how most people ran into trouble in the first place. Therefore, your Python-side limitation is the best and only safeguard we can take.
Gabe, the way you have coded this up assumes it will always run on a node with PBS. If someone (like me) runs it interactively on a non-cluster computer, it gets hung up. I commented out your ppn lines to get it to run for me.
Thanks for catching that; the ppn bit of code was a remnant from some debugging. I deleted it.
r18050
- Status changed from New to In Code Review
- Assignee changed from Anchi Cheng to Dmitry Lyumkis
- Priority changed from High to Normal
Added code to parallelize the peak finding steps after FindEM runs, which can take a very long time if using many templates and processing large images.
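The parallelized peak finding mentioned above could be sketched with a worker pool capped at the available CPU count, processing one template's correlation map per worker. The peak-search function here is a toy stand-in (the real Appion routine works on 2-D cross-correlation maps), and all names are illustrative:

```python
from multiprocessing import Pool, cpu_count

def find_peaks(cc_map, threshold=0.5):
    """Toy stand-in for the per-template peak search: return indices
    of values at or above the threshold."""
    return [i for i, v in enumerate(cc_map) if v >= threshold]

def find_peaks_all_templates(cc_maps, nproc=None):
    """Run the peak search for each template's correlation map in
    parallel, capping workers at the CPU count so many templates on
    large images don't oversubscribe the node."""
    nproc = min(nproc or cpu_count(), len(cc_maps))
    with Pool(processes=nproc) as pool:
        return pool.map(find_peaks, cc_maps)
```

Because each template's map is independent, this step parallelizes cleanly, which is why it helps most with many templates and large images.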
- Status changed from In Code Review to Closed