Feature #3891
open
Distributed Resource Management Application API
0%
Description
Add an option to the Appion config file so that it can submit jobs using Slurm or other schedulers.
Updated by Sargis Dallakyan almost 9 years ago
- Description updated (diff)
Useful links related to this feature:
http://www.drmaa.org/
http://slurm.schedmd.com/
http://www.galaxyproject.org/
Updated by Sargis Dallakyan over 8 years ago
Adding a link to https://pegasus.isi.edu/, which Donny Shrum suggested as a workflow management system.
Updated by Donny Shrum over 8 years ago
I’ve added support for a parameter named destinationsURL in the format:
destinationsURL=http://127.0.0.1/job_api/handler.php
The workflow when this parameter exists in the .appion.cfg file is as follows:
In processingHost.py -
destinationsURL gets added to the confDict
If destinationsURL is in the confDict, the job headers are returned by a webservice using the headersFromWebSevice method in this class. If the parameter does not exist, the job headers are generated in torqueHost, sgeHost, etc., just as before.
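A minimal sketch of the check described above, assuming the config file holds plain key=value lines as in the destinationsURL example. The helper names parse_appion_cfg and header_source are hypothetical and are not part of Appion; the real parsing in processingHost.py may differ.

```python
def parse_appion_cfg(text):
    """Parse simple key=value lines into a dict (hypothetical helper)."""
    conf_dict = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values may contain "=".
        key, sep, value = line.partition("=")
        if sep:
            conf_dict[key.strip()] = value.strip()
    return conf_dict


def header_source(conf_dict):
    """Report which code path would produce the job headers."""
    return "webservice" if conf_dict.get("destinationsURL") else "scheduler"


cfg = parse_appion_cfg("destinationsURL=http://127.0.0.1/job_api/handler.php\n")
print(header_source(cfg))
```

With the parameter absent, header_source falls back to "scheduler", which mirrors the "just as before" behaviour described in the comment.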
I’m passing/exposing the command and jobType from Agent class to the processingHost class so that this information may be used in turn by the webservice to customize the job parameters.
I added imports for requests, json and pwd to processingHost.
This code will replace the generateHeaders method in slurmHost.py but I’ll leave that in place as it is consistent with the other scheduler classes. I anticipate we’ll also use this design to replace translateOutput and checkJobStatus in the scheduler class.
As a result of this change, we can set job parameters such as nodes, cores, memory and queue based on individual users and groups at the RCC (Scott Stagg doesn’t always use the same resources that grad students might be using). We’ve also been able to force some jobs to run locally instead of on our cluster.
I’ve also attached the changes in a zip file. If you’d prefer that I check with you prior to committing changes let me know and I’ll do that… I wasn’t sure what makes it easiest for you to take a look at the changes but I was guessing it’s easy for you to take a quick look after I check the code in.
Updated by Sargis Dallakyan over 8 years ago
- Related to Feature #3000: support for SGE cluster type added
Updated by Neil Voss over 8 years ago
- Blocks Feature #2836: bake appion processing recipes added
Updated by Neil Voss over 8 years ago
- Status changed from New to Assigned
- Assignee set to Donny Shrum
Putting Donny as the assignee.
Updated by Anchi Cheng over 8 years ago
Donny, I am using this for the first time. The way you pass command and jobtype to createProcessingHost seems to restrict each call to one command. Is that intended?
Updated by Donny Shrum over 8 years ago
Anchi Cheng wrote:
Donny, I am using this for the first time. The way you pass command and jobtype to createProcessingHost seems to restrict each call to one command. Is that intended?
Hi Anchi,
That is a result of how generateHeaders is called. The scheduler-specific classes (slurmHost.py, sgeHost.py, torqueHost.py) all have a method named generateHeaders that returns lines that are printed to the job file. Those lines are based on that method as well as the job parameters contained in the configDict that is built per job.
An example:
#!/usr/bin/sh
#SBATCH -n 8
#SBATCH -t 240:00:00
#SBATCH -p stagg_q
#SBATCH --mem 4GB
Since those headers are generated per job, that is where I inserted the webservice, as it seemed to be the least disruptive and most logical place based on the existing code. It works out that we would like to customize those parameters and likely ignore the user input, directing jobs to specific queues with specific settings based on the user submitting the job and the job being run (grad students go to backfill, for example). We have a json file on our end that maps settings per user/group/job, and Scott will be able to make quick edits to that.
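One way such a per-user/group/job mapping could be resolved is sketched below. The schema, the precedence order (default, then group, then user, then job type) and the function name resolve_options are all assumptions for illustration; the actual json file at FSU RCC is not shown in this thread. The option values reuse the example header above.

```python
import json

# Hypothetical mapping file contents; the real schema is not shown in the thread.
SETTINGS_JSON = """
{
  "default": {"-n": "8", "-t": "240:00:00", "-p": "stagg_q", "--mem": "4GB"},
  "groups": {"grad": {"-p": "backfill"}},
  "users": {"sstagg": {"-p": "stagg_q"}},
  "jobs": {"partalign": {"-t": "96:00:00"}}
}
"""


def resolve_options(settings, username, group, job_type):
    """Layer overrides on top of the defaults (assumed precedence order)."""
    options = dict(settings.get("default", {}))
    options.update(settings.get("groups", {}).get(group, {}))
    options.update(settings.get("users", {}).get(username, {}))
    options.update(settings.get("jobs", {}).get(job_type, {}))
    return options


settings = json.loads(SETTINGS_JSON)
# A grad student running a partalign job is sent to the backfill queue.
print(resolve_options(settings, "student1", "grad", "partalign"))
```

Keeping the mapping in a data file means a queue change is a one-line edit rather than a code change, which is the benefit described above.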
Changes to the web service I have on our end are easy so I'm open to any changes in how that external service is called.
--Donny
FSU RCC
Updated by Anchi Cheng over 8 years ago
Donny,
Should we pass the job parameter into generateHeader in a more specific way rather than the command? I was using generateHeader to make job files with more than one command which can be handled by jobObject by setting its attributes. Here you take one command, and it is not clear what I would need to include in the command for your webservice to parse correctly the job parameters.
A related question: What does jobObject do in your new code? It is totally ignored by headersFromWebService (By the way, I assume WebSevice is a typo ?)
Updated by Donny Shrum over 8 years ago
Hi Anchi,
Should we pass the job parameter into generateHeader in a more specific way rather than the command?
I might be missing something here as I'm not nearly as familiar with the code as I'd like to be :) I'm bypassing the generateHeaders method which appears in multiple scheduler specific classes in the createJobFile method in the processingHost class.
    header = None
    if (self.destinationsURL):
        header = self.headersFromWebSevice(currentJob)
    else:
        header = self.generateHeaders(currentJob)
I was using generateHeader to make job files with more than one command which can be handled by jobObject by setting its attributes. Here you take one command, and it is not clear what I would need to include in the command for your webservice to parse correctly the job parameters.
My thought is that the implementation of the webservice is site specific. Our webservice looks at the specific job, the user that submitted the job, and the various queues they can submit to, and we (Scott really) can edit a json file that sets submission parameters such as number of nodes, memory, and which queue (partition for slurm). The webservice receives json from appion that looks like this:
{"username": "sstagg", "command": "destinations", "jobType": "partalign", "script": ["maxlikeAlignment.py", "--description=test", "--stack=1", "--lowpass=10", "--highpass=2000", "--num-part=173", "--num-ref=2", "--bin=3", "--angle-interval=5", "--max-iter=15", "--fast", "--fast-mode=normal", "--mirror", "--savemem", "--commit", "--converge=normal", "--rundir=/lustre/cryo/lustre/appiondata/15nov02z/align/maxlike14", "--runname=maxlike14", "--projectid=461", "--expid=9681", "--jobtype=partalign", "--ppn=1", "--nodes=1", "--walltime=240", "--jobid=3241"]}
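A rough sketch of how a request body of that shape might be assembled. The function name build_request_body is hypothetical, getpass is used here as a portable stand-in for the pwd lookup mentioned earlier in the thread, and the "destinations" value is simply copied from the example above. The actual HTTP call (done with requests in processingHost, per the earlier comment) is left out.

```python
import getpass
import json


def build_request_body(job_type, script_args, username=None, command="destinations"):
    """Assemble the dict that would be serialized and sent to the webservice."""
    return {
        # Fall back to the current login name when no user is given.
        "username": username or getpass.getuser(),
        "command": command,
        "jobType": job_type,
        # The script name followed by its command-line arguments.
        "script": list(script_args),
    }


body = build_request_body(
    "partalign",
    ["maxlikeAlignment.py", "--nodes=1", "--ppn=1", "--walltime=240"],
    username="sstagg",
)
print(json.dumps(body))
```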
And the custom webservice on our end responds by sending back:
{"result":0,"customResponse":{"header":"#!\/usr\/bin\/sh","prefix":"#SBATCH","execCommand":"sbatch","statusCommand":"squeue","options":{"-N":"1","-n":"8","-t":"96:00:00","--mem":"6GB","-p":"condor"}}}
And the headersFromWebService method converts that json into a header that the rest of the code base is expecting. Something like this:
#!/usr/bin/sh
#SBATCH -N 1
#SBATCH -n 8
#SBATCH -t 96:00:00
#SBATCH -p condor
#SBATCH --mem 6GB
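The conversion from the webservice reply to header lines could look roughly like the sketch below. This is not the actual headersFromWebSevice code; the function name header_from_response is hypothetical, and treating "result": 0 as success is an assumption. The options are emitted in the order they appear in the JSON, so the line order can differ from the listing above.

```python
import json


def header_from_response(response_text):
    """Turn the webservice reply into job-file header text."""
    reply = json.loads(response_text)
    # Assumption: a result of 0 signals success.
    if reply.get("result") != 0:
        raise ValueError("webservice reported an error: %r" % reply.get("result"))
    custom = reply["customResponse"]
    lines = [custom["header"]]
    for flag, value in custom["options"].items():
        # e.g. "#SBATCH -n 8"
        lines.append("%s %s %s" % (custom["prefix"], flag, value))
    return "\n".join(lines) + "\n"


# The reply quoted in this thread; "\/" is a valid JSON escape for "/".
SAMPLE = (
    r'{"result":0,"customResponse":{"header":"#!\/usr\/bin\/sh",'
    r'"prefix":"#SBATCH","execCommand":"sbatch","statusCommand":"squeue",'
    r'"options":{"-N":"1","-n":"8","-t":"96:00:00","--mem":"6GB","-p":"condor"}}}'
)
print(header_from_response(SAMPLE))
```

Because the prefix comes from the reply, the same conversion would work for another scheduler's directive prefix without changes on the Appion side.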
That allows us to bury all our hairy site-specific code in that web service and send a reply to appion with just the header it expects. We can also decide whether to include or ignore the parameters that our users plug into the submit form regarding cores / memory, as not all our users would know what to use or even what is available to them.
A related question: What does jobObject do in your new code? It is totally ignored by headersFromWebService (By the way, I assume WebSevice is a typo ?)
The jobObject contains all the script parameters that are passed to the headersFromWebService method and ultimately the web service. I might need to look again but as I recall the jobObject is exposed to the Agent Class but not to the processingHost class so I passed the command and job type from the agent class over to the processingHost class so that it could be used to generate the headers.
That was all done in an effort to disturb the existing code as little as possible and allow us to make custom changes to job submissions in a code base that is all on our end.
Donny Shrum
FSU RCC
Updated by Donny Shrum over 8 years ago
A link to slides on the web service
https://docs.google.com/presentation/d/1gQXRfitpCNAuxTRFF8-2IViCC9a5yyOcxQNwTPddzYE/edit?usp=sharing