Bug #2575

Template picking job hangs in Garibaldi due to FindEM hanging

Added by Saikat Chowdhury about 11 years ago. Updated almost 7 years ago.

Status: Closed
Priority: Normal
Category: -
Target version: -
Start date: 11/07/2013
Due date:
% Done: 0%
Estimated time:
Affected Version: Appion/Leginon 3.0.0
Show in known bugs: No
Workaround:

Description

I have been trying to run the template picker since last weekend on sessions #12587 (13oct29b) and #12628 (13nov05o). The name of the job file is "template.job" and the directories on garibaldi are "/gpfs/home/saikat/xdynactin_tilt" and "/gpfs/home/saikat/xdynactin_tilt/tilt_pair".
Sometimes the job crashes, stating that findem.exe has crashed. If I resubmit the job, it runs for some more micrographs and then gets stuck. The job neither crashes nor reports an error; it just seems to be stuck on a micrograph for hours. If I kill the job and resubmit it, it proceeds for some more micrographs and hangs again. There is no definite number of micrographs the job gets through, nor is there a specific step at which it hangs.
I waited for myami to be upgraded and then resubmitted the job today, and I still have the same issue.


Related issues 2 (1 open, 1 closed)

Related to Appion - Bug #2504: template picker stalls for unknown reason (New, 09/03/2013)

Related to Appion - Bug #2923: tiltaligner bug (Closed, Sargis Dallakyan, 09/08/2014)

Actions #1

Updated by Anchi Cheng about 11 years ago

Unfortunately, this appears to be a problem with the Fortran findem.exe program, probably a memory leak. Natalia reported the same. I've asked around, and Arne said that it has happened to him regularly, too. It is therefore not a garibaldi issue.

I can see if we can do a time-out and retry, but for the time being, your only choice is to resubmit.

Actions #2

Updated by Anchi Cheng about 11 years ago

Just checked Saikat's job output. It looks like there is already a retry built in, and it is used on quite a few images. However, when FindEM really stalls, the pipe isn't broken, so the retry is never triggered.
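A time-out based retry would cover the stall case that a broken-pipe check misses. The following is only a minimal sketch in modern Python 3, not the existing Appion retry code: the function name `run_with_timeout`, the default limits, and the idea that the program takes its parameters on stdin are all assumptions made for illustration.

```python
import subprocess
import sys


def run_with_timeout(cmd, input_text="", timeout_s=600, max_tries=3):
    """Run cmd, retrying if it stalls past timeout_s seconds or exits non-zero.

    Hypothetical helper: cmd would be the picker executable and input_text
    the parameters it reads from stdin (an assumption, not Appion's API).
    """
    for attempt in range(1, max_tries + 1):
        try:
            return subprocess.run(
                cmd,
                input=input_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
                check=True,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills and reaps the child before re-raising,
            # so a stalled process does not linger between attempts.
            print(f"attempt {attempt}: no result after {timeout_s}s", file=sys.stderr)
        except subprocess.CalledProcessError as err:
            print(f"attempt {attempt}: exit code {err.returncode}", file=sys.stderr)
    raise RuntimeError(f"{cmd[0]} failed or stalled {max_tries} times")
```

The time limit would need to be chosen well above the normal per-micrograph run time, otherwise slow but healthy runs would be killed and repeated.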

In addition, I see that Saikat requested 8 nodes and 2 processors per node. As far as I know, findem multithreading only works within a single node, so spreading the job out this way only wastes resources and does not speed up the process. Try asking for 1 node and 8 processors per node.

Actions #3

Updated by Gabriel Lander about 11 years ago

It seems like this was first reported on September 3, 2013 (Bug #2504).
FindEM is multithreaded and doesn't run on multiple nodes. I'm not sure why Saikat requested multiple nodes. In the past we've tried requesting 1 node with 1 through 16 processors, as well as small and large amounts of memory, and nothing seemed to fix this issue.
Do we have the source code for the new FindEM2? I could try this out to see if it helps at all.

Actions #4

Updated by Anchi Cheng about 11 years ago

The attached file in #2115 should have the source or exe.

Actions #5

Updated by Anchi Cheng almost 11 years ago

  • Subject changed from Template picking job hangs in Garibaldi to Template picking job hangs in Garibaldi due to FindEM hanging

Gabe,

Any progress with trying out FindEM2 to see if it fixes this problem?

Actions #6

Updated by Gabriel Lander almost 11 years ago

Sorry I dropped the ball - I never finished implementing this, since FindEM2 requests a custom mask for each template. I'll have the code generate a circular mask for now for all templates & do some testing.

Actions #7

Updated by Gabriel Lander almost 11 years ago

I updated the parallelization code to limit the number of simultaneously running threads to the number of available CPUs. If launched on garibaldi it will check the PBS_NODEFILE variable to get this number. I implemented FindEM2, but in my testing I didn't see much of an improvement over FindEM1. This is perhaps because I was using a circular mask instead of one that is specific to each template. I didn't experience any hangs/crashes with either FindEM1 or FindEM2 using the updated threading code, so I'm leaving FindEM1 in there. If we see any problems I can change it back to use FindEM2.
I DID NOT, however, change any of the appionweb code that generates the job. Should we have the user specify the number of CPUs? Should we hard-code it to 8? Have it based on the number of templates?
Let me know what you think.
r18049
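The thread limit described above could look roughly like the sketch below. It is not the code from r18049: the helper names `available_cpus` and `run_limited` are hypothetical, and it assumes a single-node job, where the file named by PBS_NODEFILE holds one line per allocated processor slot. Outside PBS it falls back to the local CPU count so that interactive runs still work.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def available_cpus(default=None):
    """Return the number of processor slots to use (hypothetical helper).

    Under PBS, $PBS_NODEFILE names a file with one line per allocated slot.
    For a multi-node job this counts slots on all nodes, so the sketch
    assumes the job was requested on a single node.
    """
    nodefile = os.environ.get("PBS_NODEFILE")
    if nodefile and os.path.isfile(nodefile):
        with open(nodefile) as handle:
            slots = sum(1 for line in handle if line.strip())
        if slots > 0:
            return slots
    # Not running under PBS: use the caller's default or the local CPU count.
    return default or os.cpu_count() or 1


def run_limited(func, items, max_workers=None):
    """Apply func to each item with at most max_workers running at once."""
    workers = max_workers or available_cpus()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))
```

Here `func` would be whatever launches one picker run per template; threads are sufficient for that because each one mostly waits on an external process.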

Actions #8

Updated by Anchi Cheng almost 11 years ago

I did a quick test run to see how myamiweb behaves. Even though I picked 2 templates, it automatically selected nodes=1 and ppn=1 on guppy when I submitted through the interface. We can probably do some math and force it to a reasonable value. We cannot force the values when people use the copy-paste command option, which is how most people ran into trouble in the first place. Therefore your limit on the Python side would be the best and only safeguard we can have.

Actions #9

Updated by Scott Stagg almost 11 years ago

Gabe, the way you have coded this up assumes that one will always run this on a node with PBS. If someone (like me) runs it interactively on a non-cluster computer, it gets hung up. I commented out your ppn lines to get it to run for me.

Actions #10

Updated by Gabriel Lander almost 11 years ago

Thanks for catching that. The ppn bit of code was a remnant from some debugging; I deleted it.
r18050

Actions #11

Updated by Gabriel Lander over 10 years ago

  • Status changed from New to In Code Review
  • Assignee changed from Anchi Cheng to Dmitry Lyumkis
  • Priority changed from High to Normal

Added code to parallelize the peak finding steps after FindEM runs, which can take a very long time if using many templates and processing large images.
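As an illustration of the idea only, one peak-finding task per template correlation map can be handed to a process pool. This is a sketch under stated assumptions, not Appion's peak finder: the maps are plain nested lists, a peak is taken to be a strict 3x3 local maximum at or above a threshold, and the names `find_peaks` and `find_peaks_all` are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor


def find_peaks(args):
    """Return (row, col, value) for strict local maxima in one map.

    args is a (ccmap, threshold) tuple so the function can be used with
    pool.map; border pixels are skipped for simplicity.
    """
    ccmap, threshold = args
    peaks = []
    rows, cols = len(ccmap), len(ccmap[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            value = ccmap[r][c]
            if value < threshold:
                continue
            neighbours = [
                ccmap[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            ]
            if all(value > n for n in neighbours):
                peaks.append((r, c, value))
    return peaks


def find_peaks_all(ccmaps, threshold, max_workers=None,
                   executor_cls=ProcessPoolExecutor):
    """Run find_peaks on every map in parallel, one task per map."""
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(find_peaks, [(m, threshold) for m in ccmaps]))
```

On platforms that start worker processes by spawning rather than forking, the call to `find_peaks_all` should sit under an `if __name__ == "__main__":` guard.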

Actions #12

Updated by Gabriel Lander over 10 years ago

  • Related to Bug #2923: tiltaligner bug added
Actions #13

Updated by Anchi Cheng almost 7 years ago

  • Status changed from In Code Review to Closed