Bug #2575
closed
Template picking job hangs in Garibaldi due to FindEM hanging
Added by Saikat Chowdhury about 11 years ago.
Updated almost 7 years ago.
Affected Version:
Appion/Leginon 3.0.0
Description
I have been trying to run template picker since last weekend on session# 12587(13oct29b) and 12628(13nov05o). The name of the job file is "template.job" and the directories in garibaldi are :"/gpfs/home/saikat/xdynactin_tilt" and "/gpfs/home/saikat/xdynactin_tilt/tilt_pair".
Sometimes the job crashes stating that findem.exe has crashed. If I resubmit the job, it runs for some more micrographs and then gets stuck. The job neither crashes nor gives any error; it just appears stuck on one micrograph for hours. If I kill the job and resubmit it, it will proceed through some more micrographs and hang again. There is no definite number of micrographs the job processes before hanging, nor a specific step at which it hangs.
I waited for myami to be upgraded and then resubmitted the job today and I still have the same issue.
Unfortunately, this appears to be a problem with the Fortran findem.exe program itself, probably a memory leak. Natalia reported the same issue. I've asked around, and Arne said it has happened to him regularly, too. It is therefore not a garibaldi issue.
I can see if we can add a time-out and retry, but for the time being your only choice is to resubmit.
Just checked Saikat's job output. It looks like there is already a retry built in, and it is used on quite a few images. However, when the job really stalls, the pipe isn't broken, so the retry never triggers.
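A timeout-based retry (rather than one triggered only by a broken pipe) could be sketched like this, assuming the picker invokes findem.exe as a subprocess. The function and parameter names here are illustrative, not the actual Appion code:

```python
import subprocess

def run_with_retry(cmd, timeout_sec=600, max_retries=3):
    """Run an external command; if it produces no result within
    timeout_sec, kill it and try again (up to max_retries attempts).
    Illustrative sketch, not the actual Appion retry code."""
    for attempt in range(1, max_retries + 1):
        try:
            result = subprocess.run(cmd, capture_output=True,
                                    timeout=timeout_sec)
            if result.returncode == 0:
                return result.stdout
        except subprocess.TimeoutExpired:
            # Hung process: subprocess.run has already killed the child,
            # so just fall through to the next attempt.
            pass
    raise RuntimeError("%r failed after %d attempts" % (cmd, max_retries))
```

This catches the "silent hang" case L14 describes, where the child stays alive but never finishes, instead of waiting for a pipe error that never comes.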
In addition, I see that Saikat requested 8 nodes and 2 processors per node. As far as I know, FindEM's multi-threading only works within a single node, so spreading out this way only wastes resources and does not speed up the process. Try asking for 1 node and 8 processors per node.
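For reference, that resource request would look like the following PBS directive in the job file (a sketch; the actual contents of template.job may differ):

```shell
#PBS -l nodes=1:ppn=8
```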
It seems like this was first reported on September 3 2013 (Bug #2504)
FindEM is multithreaded and doesn't run across multiple nodes. I'm not sure why Saikat requested multiple nodes. We've tried in the past requesting 1 node with anywhere from 1 to 16 processors, as well as small and large amounts of memory, and nothing seemed to fix this issue.
Do we have the source code for the new FindEM2? I could try this out to see if it helps at all.
The attached file in #2115 should have the source or exe.
- Subject changed from Template picking job hangs in Garibaldi to Template picking job hangs in Garibaldi due to FindEM hanging
Gabe,
Any progress with trying out FindEM2 to see if it fixes this problem?
Sorry I dropped the ball - I never finished implementing this, since FindEM2 requires a custom mask for each template. I'll have the code generate a circular mask for all templates for now and do some testing.
I updated the parallelization code to limit the number of simultaneously running threads to the number of available CPUs. If launched on garibaldi, it will check the PBS_NODEFILE variable to get this number. I implemented FindEM2, but in my testing I didn't see much of an improvement over FindEM1. This is perhaps because I was using a circular mask instead of one specific to each template. I didn't experience any hangs/crashes with either FindEM1 or FindEM2 using the updated threading code, so I'm leaving FindEM1 in place. If we see any problems, I can change it back to use FindEM2.
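The PBS_NODEFILE check described above could look something like the following. On a PBS/Torque cluster the node file lists one line per allocated processor slot, so counting its lines gives the CPU budget; the fallback and function name are assumptions for illustration, not the actual Appion code:

```python
import multiprocessing
import os

def available_cpus():
    """Count processor slots for thread limiting: on a PBS cluster,
    PBS_NODEFILE lists one line per allocated processor; otherwise
    fall back to the local CPU count. Illustrative sketch only."""
    nodefile = os.environ.get("PBS_NODEFILE")
    if nodefile and os.path.exists(nodefile):
        with open(nodefile) as f:
            return sum(1 for line in f if line.strip())
    return multiprocessing.cpu_count()
```

The fallback branch also addresses the interactive, non-cluster case mentioned later in this thread, where PBS_NODEFILE is not set.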
I did NOT, however, change any of the appionweb code that generates the job. Should we have the user specify the number of CPUs? Should we hard-code it to 8? Base it on the number of templates?
Let me know what you think.
r18049
I did a quick test run to see how myamiweb behaves. Even though I picked 2 templates, it automatically selected nodes=1 and ppn=1 on guppy when I submitted through the interface. We could probably do some math and force that value to something reasonable, but we cannot force the values when people use the copy-paste command option, which is how most people ran into trouble in the first place. Therefore, your Python-side limitation is the best and only safeguard we can take.
Gabe, the way you have coded this up assumes it will always run on a node with PBS. If someone (like me) runs it interactively on a non-cluster computer, it gets hung up. I commented out your ppn lines to get it to run for me.
Thanks for catching that; the ppn bit of code was a remnant from some debugging. I deleted it.
r18050
- Status changed from New to In Code Review
- Assignee changed from Anchi Cheng to Dmitry Lyumkis
- Priority changed from High to Normal
Added code to parallelize the peak finding steps after FindEM runs, which can take a very long time if using many templates and processing large images.
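The parallelized peak finding mentioned above could be sketched with a worker pool capped at the available CPU count, processing one template's correlation map per worker. The peak-search function here is a toy stand-in (the real Appion routine works on 2-D cross-correlation maps), and all names are illustrative:

```python
from multiprocessing import Pool, cpu_count

def find_peaks(cc_map, threshold=0.5):
    """Toy stand-in for the per-template peak search: return indices
    of values at or above the threshold."""
    return [i for i, v in enumerate(cc_map) if v >= threshold]

def find_peaks_all_templates(cc_maps, nproc=None):
    """Run the peak search for each template's correlation map in
    parallel, capping workers at the CPU count so many templates on
    large images don't oversubscribe the node."""
    nproc = min(nproc or cpu_count(), len(cc_maps))
    with Pool(processes=nproc) as pool:
        return pool.map(find_peaks, cc_maps)
```

Because each template's map is independent, this step parallelizes cleanly, which is why it helps most with many templates and large images.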
- Status changed from In Code Review to Closed