Feature #2346

closed

run multiple instances of AppionLoop script with each on different images

Added by Anchi Cheng over 11 years ago. Updated over 10 years ago.

Status:
Closed
Priority:
Urgent
Assignee:
Amber Herold
Category:
-
Target version:
-
Start date:
05/02/2013
Due date:
% Done:
0%
Estimated time:

Description

AppionLoop processes cannot be spread across different CPUs, each working on a different image. This is because an image is added to donedict only after its processing has completed successfully, so a second instance has no way to know that an image is already in progress. However, once the program is bug-free, it makes sense to let several processors operate on different images at the same time.

We want this to be more openmpi-like than mpi-like so that it can spread across more than one computer.

Some users have been doing this by specifying --mrclist in each instance so that they don't collide.

We need this urgently because K2 data come in too fast for gain correction and alignment of the frames to keep up.


Related issues 1 (0 open, 1 closed)

Related to Appion - Feature #2370: Parallel run of DD frame stack making and alignment (Closed, Amber Herold, 05/22/2013)

Actions #1

Updated by Anchi Cheng over 11 years ago

r17550 by Jim checks for and creates the lock file in as short a time as we can.
r17553 has the primary lock and unlock for AppionLoop2.
r17554 checks for the "_st.mrc" file of the image in case it escapes the lock. I am not sure how effective this secondary lock is if two processes are so closely in sync that they can fool the primary lock.
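For readers who want to see the idea, here is a minimal sketch of a check-and-create lock in Python 3. It is not the code from r17550 or r17553: the function names, the ".lock" suffix, and the lock directory are all hypothetical. It relies on os.open with O_CREAT | O_EXCL, which fails if the file already exists, so checking and creating happen in a single call.

```python
import os
import tempfile


def try_lock(lock_dir, image_name):
    """Return True if this process created the lock file, False if it already existed.

    Hypothetical helper: the ".lock" suffix and directory layout are assumptions.
    """
    lock_path = os.path.join(lock_dir, image_name + ".lock")
    try:
        # O_CREAT | O_EXCL: create the file, but fail if it is already there.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    # Record the owner's process id to help with manual cleanup later.
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))
    return True


def unlock(lock_dir, image_name):
    """Remove the lock file for image_name."""
    os.remove(os.path.join(lock_dir, image_name + ".lock"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as lock_dir:
        print(try_lock(lock_dir, "image001"))  # first instance gets the lock
        print(try_lock(lock_dir, "image001"))  # second instance is refused
        unlock(lock_dir, "image001")
```

An instance that gets False back would simply skip that image and move on to the next one.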

This method has a major drawback: if the image process dies or is stopped by the user without going through apDisplay.printError(), the lock remains in place, and the locked image will never be processed unless the script is run with the "parallel" option set to False, since I put a cleanParallelLock() function call in for that case. Suggestions on how to reset in this situation are welcome.

With this, I can launch multiple makeDDRawFrameStack.py instances with the exact same command on a single computer with multiple CPUs or on different computers. It plays well with the "wait" flag. My tests show that three of these are enough to make rawtransfer the new bottleneck for live frame gain correction and alignment.

Actions #2

Updated by Neil Voss over 11 years ago

Hi Anchi,

This is all probably not important, but I wanted to put in my 2 cents. If you are experiencing problems, they may be related to this.

I see you decided to use a lock file to specify which images are being processed. I have a few concerns (which may be unfounded), but anyway:

  • (1) The older network filesystems (NFS2 and NFS3) do not support file locks, and there may be delays in showing the lock file to other hosts, so two nodes could process the same image. This is supposedly better with NFS4, but when I left you were still using the older NFS. It is especially a concern if asynchronous mode is being used.

http://docstore.mik.ua/orelly/networking_2ndEd/nfs/ch11_02.htm
https://pypi.python.org/pypi/flufl.lock

  • (2) The bottleneck on my smaller setup is the file system; the database server is relatively available. So, would using a database table instead of lock files reduce load on the file system? Whatever table is used would have to be dumped occasionally, though, or sinedon would have to be allowed to delete items.
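
To make suggestion (2) concrete, here is a rough sketch of claiming an image through a database table. It uses Python's built-in sqlite3 purely as a stand-in so that the example runs anywhere; it does not use sinedon or the real Appion database, and the table and column names are invented for illustration. The point is only that a uniqueness constraint lets the database refuse a second claim on the same image.

```python
import sqlite3


def make_lock_table(conn):
    """Create a hypothetical lock table; the PRIMARY KEY enforces one claim per image."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS image_lock (image_name TEXT PRIMARY KEY)"
    )
    conn.commit()


def claim_image(conn, image_name):
    """Return True if this caller claimed the image, False if it was already claimed."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO image_lock (image_name) VALUES (?)", (image_name,)
            )
    except sqlite3.IntegrityError:
        # Another process inserted the same image_name first.
        return False
    return True


def release_image(conn, image_name):
    """Delete the claim so the table does not grow without bound."""
    with conn:
        conn.execute("DELETE FROM image_lock WHERE image_name = ?", (image_name,))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    make_lock_table(conn)
    print(claim_image(conn, "image001"))
    print(claim_image(conn, "image001"))
    release_image(conn, "image001")
```

The release step is the "allow items to be deleted" part; without it the table would need to be dumped occasionally.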
Actions #3

Updated by Anchi Cheng over 11 years ago

r17575 moved the parallel lock to the AppionScript level so that it can also be used in catchUpDDAlign.py. It is possible to set the name (prefix) of the lock so that more than one kind of script can be run, as in the case of CPU gain correction and GPU frame alignment of DD movies. Three good CUDA devices seem to keep up with the 40 s/image that Leginon can now do in collecting K2 data.
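
As an illustration of the lock name (prefix), the sketch below builds the lock file path from a prefix plus the image name. The function, the file naming pattern, and the prefixes "gaincorr" and "ddalign" are made up for this example and are not taken from r17575.

```python
import os


def lock_file_path(lock_dir, lock_name, image_name):
    """Build a per-script lock path; lock_name is the prefix that separates script types.

    Hypothetical naming scheme: "<lock_name>_<image_name>.lock".
    """
    return os.path.join(lock_dir, "%s_%s.lock" % (lock_name, image_name))


if __name__ == "__main__":
    # The same image gets two different lock files, one per kind of script,
    # so gain correction and frame alignment do not block each other.
    print(lock_file_path("locks", "gaincorr", "image001"))
    print(lock_file_path("locks", "ddalign", "image001"))
```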

Actions #4

Updated by Anchi Cheng over 11 years ago

r17616 removes the cleanParallelLock call on close used in makeDDRawFrameStack.py, because if several of these are running, the first one to finish would clean up locks belonging to the unfinished ones and cause them to error when unlocking.
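
The failure mode can be reproduced in a few lines. This is only a simulation with hypothetical helpers (lock, unlock, clean_all_locks) and a temporary directory, not the Appion functions themselves: instance A finishes first and removes every lock file, so instance B finds its own lock gone when it tries to unlock.

```python
import glob
import os
import tempfile


def lock(lock_dir, image_name):
    """Create a lock file for image_name and return its path (hypothetical helper)."""
    path = os.path.join(lock_dir, image_name + ".lock")
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.close(fd)
    return path


def unlock(path):
    """Remove one lock file; raises FileNotFoundError if it was already deleted."""
    os.remove(path)


def clean_all_locks(lock_dir):
    """Remove every lock file in the directory, regardless of who owns it."""
    for path in glob.glob(os.path.join(lock_dir, "*.lock")):
        os.remove(path)


def demo():
    with tempfile.TemporaryDirectory() as lock_dir:
        lock_a = lock(lock_dir, "image_a")  # held by instance A
        lock_b = lock(lock_dir, "image_b")  # held by instance B, still working
        unlock(lock_a)                      # A finishes its image
        clean_all_locks(lock_dir)           # A cleans up on close
        try:
            unlock(lock_b)                  # B finishes later
        except FileNotFoundError:
            return "instance B failed to unlock"
        return "instance B unlocked normally"


if __name__ == "__main__":
    print(demo())
```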

Actions #5

Updated by Anchi Cheng almost 11 years ago

  • Status changed from Assigned to In Code Review
  • Assignee changed from Anchi Cheng to Amber Herold

This has been used for makeDDRawFrameStack for a while now. It seems to work fine. No complaints.

Actions #6

Updated by Amber Herold almost 11 years ago

  • Status changed from In Code Review to Closed
Actions #7

Updated by Anchi Cheng over 10 years ago

Notes on how to add this feature in a subclass of appionLoop2:

Add a secondary lock like the one in r17554. It should check for the existence of the first file that the processImage function creates, so that processing stops even if the primary lock is not detected.
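
A sketch of what such a secondary check might look like, assuming the first output file is named after the image with a "_st.mrc" suffix as in the r17554 note. The helper names, the output directory argument, and the way the file name is built are assumptions for illustration, not the actual subclass code.

```python
import os


def already_started(output_dir, image_name, first_output_suffix="_st.mrc"):
    """Return True if the first file processImage would create already exists."""
    first_output = os.path.join(output_dir, image_name + first_output_suffix)
    return os.path.exists(first_output)


def process_if_unclaimed(output_dir, image_name, process_image):
    """Run process_image only if no other instance has started on this image.

    process_image is a stand-in callable for the subclass's processImage step.
    Returns True if the image was processed here, False if it was skipped.
    """
    if already_started(output_dir, image_name):
        return False
    process_image(image_name)
    return True
```

This only narrows the window; two instances that check at nearly the same moment can still both pass, which is why it is a secondary lock rather than a replacement for the primary one.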
