Feature #3381
appionPBS.py (open)
0%
Description
Scott created appionPBS.py to launch child cluster jobs per image. Its first use is in makeDEAlignedSum.py.
If we start using it, this will affect all appionLoops subclasses.
Updated by Anchi Cheng about 9 years ago
- Related to Feature #2847: add DE frame alignment script added
Updated by Anchi Cheng about 9 years ago
Transferring Scott's notes here (2015-05-01):
I just added the appionPBS.py module to the repository. I forgot to associate the revision with this thread. I don't know how to add it after the fact, but the revision is 18823. I'm pretty excited about the new module. I have it working on a test copy of makeDEAlignedSum.py, and it has increased the throughput for aligning frames by as many processors as I have access to on the cluster. The nice thing is that it launches single-processor jobs, so they start very quickly. I think we can use this to parallelize a lot of different processes.
Updated by Anchi Cheng about 9 years ago
Transferring my conversation with Scott in #2847 to here on 2015-08-18:
Anchi:
appionPBS.py needs a way of passing appionwrapper into the job script so that the proper modules are loaded. Scott, I can write that part and complete the myamiweb GUI as a standard appionloop if that is OK with you. I need to get it working here for the users waiting to process their images.
Scott:
Anchi, that is fine with me. There are still some things that need to be worked out with appionPBS, but I'm happy to see how you modify it.
Anchi:
Committed the wrapper prepend in r19079.
What does checkJob wait for? An output file named image.sh.xxxx that gets added to the scratch directory? Doesn't this depend on the job management system?
Scott:
It looks for the job outfile, which should appear in the directory from which the job was launched. Right now it is specific to PBS or MOAB clusters because those were all I could test on. I'm working on adding the SLURM option this week. I will likely change this part substantially in the coming weeks. Instead of checking for the job outfile, I'm thinking of having the job create a unique done file that checkJob would look for. I chose to have it look for a file instead of interacting with the resource manager because the manager sometimes gets overloaded, and that could cause the whole thing to fail.
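For illustration, the done-file idea could look roughly like the sketch below. This is not the code in appionPBS.py: the function name wait_for_done_files, its parameters, and the ".done" naming are all hypothetical, and it only assumes that each job touches one unique file when it finishes. It polls the filesystem and never queries the resource manager.

```python
import os
import time


def wait_for_done_files(done_paths, poll_interval=1.0, timeout=None):
    """Poll until every done file exists or the timeout (seconds) expires.

    Hypothetical helper, not part of appionPBS.py.
    Returns the set of done files still missing (empty if all jobs finished).
    """
    pending = set(done_paths)
    start = time.monotonic()
    while pending:
        # Keep only the jobs whose done file has not appeared yet.
        pending = {path for path in pending if not os.path.exists(path)}
        if not pending:
            break
        if timeout is not None and time.monotonic() - start >= timeout:
            break
        time.sleep(poll_interval)
    return pending
```

A caller would pass one done-file path per launched job, for example one per image, and treat a non-empty return value as jobs that have not finished within the timeout. Because only the filesystem is consulted, an overloaded resource manager would not break the check, although it assumes the launch directory is visible from the node doing the polling.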
Updated by Scott Stagg about 9 years ago
I added support for SLURM and changed the way job checking works. I tested here, but it needs to be checked at NYSBC too.
Updated by Anchi Cheng about 9 years ago
r19109 makes the loop continue if there is an error in generating the command. Not sure if this is desirable. A fatal error should be raised if the loop needs to be stopped.
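A rough sketch of that distinction, to make the trade-off concrete. The names build_commands, make_command, and FatalCommandError are hypothetical and do not come from appionPBS.py; the sketch only assumes that command generation is a per-image call that may raise.

```python
import logging

logger = logging.getLogger("appionPBS_sketch")


class FatalCommandError(Exception):
    """Hypothetical error type: command generation failed and the loop must stop."""


def build_commands(images, make_command):
    """Generate one command per image, skipping images whose command fails.

    Returns (commands, skipped): a dict of image -> command, and the list of
    images that were skipped because of a non-fatal error.
    """
    commands = {}
    skipped = []
    for image in images:
        try:
            commands[image] = make_command(image)
        except FatalCommandError:
            # Fatal errors propagate and stop the whole loop.
            raise
        except Exception as exc:
            # Non-fatal errors are logged and the loop moves to the next image.
            logger.warning("skipping %s: %s", image, exc)
            skipped.append(image)
    return commands, skipped
```

Returning the skipped images lets the caller decide afterwards whether too many failures should abort the run.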
Updated by Anchi Cheng about 9 years ago
r19352 and r19353 (debug) check if queue scratch exists.