Feature #1918

add Moab Torque subsubmission class

Added by Anchi Cheng over 12 years ago. Updated over 11 years ago.

Status: Assigned
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 07/17/2012
Due date:
% Done: 0%
Estimated time:

Description

Used at FSU


Files

dot_appion.cfg (154 Bytes) dot_appion.cfg Anchi Cheng, 07/17/2012 07:55 PM
test.py (448 Bytes) test.py Anchi Cheng, 07/23/2012 10:34 PM
Actions #1

Updated by Anchi Cheng over 12 years ago

r16942 adds the first attempt, which allowed me to test job file generation with the attached .appion.cfg saved in my sandbox appion directory.

runJob.py wrappertest.py --expid=2 --session=12may08z --projectid=1 --rundir=/tmp/test/ --jobtype='wrappertest'

Be sure to use an expid and projectid that correspond to a real pair.

Scott, please fill in translateOutput in the new subclass, based on the instructions and example in the TorqueHost class.
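In case it helps, here is a rough sketch of the shape I have in mind (hypothetical only; the real signature and parsing rules are in torqueHost.py). The assumption here is that translateOutput receives the raw text printed by the submission command (msub for Moab) and must return the numeric job id, the way TorqueHost parses qsub output:

class MoabTorqueHost(TorqueHost):
    def translateOutput(self, output=""):
        # Assumption: msub prints the job id on a line by itself, sometimes
        # after a blank line (e.g. "\n12345\n"), whereas qsub prints
        # "12345.hostname".
        for line in output.splitlines():
            line = line.strip()
            if not line:
                continue
            try:
                # Take the numeric part before any '.' suffix.
                return int(line.split('.')[0])
            except ValueError:
                continue
        return False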

Actions #2

Updated by Scott Stagg over 12 years ago

  • Assignee changed from Scott Stagg to Anchi Cheng

I put dot_appion.cfg in my appion directory, and tried the above command. My .appion.cfg looks like this:

ProcessingHostType=MoabTorque
Shell=/bin/tcsh
ScriptPrefix=
ExecCommand=echo
StatusCommand=/opt/moab/bin/showq
AdditionalHeaders=
PreExecuteLines=

The output from my test looks like:

[sstagg@kriosdb ~]$ runJob.py wrappertest.py --expid=1 --session=12jul11a --projectid=18 --rundir==/tmp/test --jobtype='wrappertest'
/panfs/storage.local/imb/stagg/software/myami_dev/pyami/tifffile.py:903: UserWarning: Failed to import _tifffile.decodepackbits
  warnings.warn("Failed to import %s" % module_function)
/panfs/storage.local/imb/stagg/software/myami_dev/pyami/tifffile.py:903: UserWarning: Failed to import _tifffile.decodelzw
  warnings.warn("Failed to import %s" % module_function)
wrappertest ['wrappertest.py', '--expid=1', '--session=12jul11a', '--projectid=18', '--rundir==/tmp/test', '--jobtype=wrappertest']
Couldn't create output directory : [Errno 2] No such file or directory: ''Error: Could not execute job GenericJob

Actions #3

Updated by Anchi Cheng over 12 years ago

  • Assignee changed from Anchi Cheng to Scott Stagg

I think your cluster might not allow you to create /tmp/test, where the job file needs to be written.

Actions #4

Updated by Scott Stagg over 12 years ago

I think this is going to require a quick phone call. I can write to /tmp, and I tried my home directory too. Both give the same error. I don't really know what runJob.py is doing. Also, I don't know what expid is.

Actions #5

Updated by Anchi Cheng over 12 years ago

expid is the same as SessionData's DEF_id. The test script pretends that it is running wrappertest.py as an Appion Script, so it needs to know the expid and the matching projectid in order to commit the results to the processing database. I've always found it strange that we need both the name and the id of the session, but I never spent enough time to figure out why; I just know that it requires both.

runJob.py is supposed to recursively make the rundir (/tmp/test) and save the job script (GenericJob.job) there before running the script using the ExecCommand setting in your .appion.cfg. I get your error ([Errno 2] No such file or directory) in my csh when doing 'ls' on a non-existing directory.
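For reference, what I describe above amounts to roughly the following; this is a simplified sketch, not the actual runJob.py code, and the ExecCommand handling is only an assumption based on the .appion.cfg keys shown earlier:

import os
import subprocess

def make_rundir_and_launch(rundir, jobfile_name, job_text, exec_command):
    # Recursively create the rundir (like mkdir -p); an empty or malformed
    # rundir string makes this fail with "[Errno 2] No such file or
    # directory: ''", which matches the error above.
    if not os.path.isdir(rundir):
        os.makedirs(rundir)
    # Save the job script (GenericJob.job) in the rundir.
    jobpath = os.path.join(rundir, jobfile_name)
    with open(jobpath, 'w') as f:
        f.write(job_text)
    # Hand the job file to the ExecCommand from .appion.cfg
    # (qsub, msub, or echo while testing).
    return subprocess.call([exec_command, jobpath])

# Hypothetical usage:
# make_rundir_and_launch('/tmp/test', 'GenericJob.job', '#!/bin/tcsh\n...', 'echo')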

Actions #6

Updated by Scott Stagg over 12 years ago

Appion didn't always require expid as well as the session name. Someone must have added that at some point. It should probably be changed, but I imagine that is pretty low priority. Anyway, your test.py ran with no problems. I'm almost sure the problem is with my configuration file or the new MoabTorque class. I'll give you a call so we can figure it out.

Actions #7

Updated by Anchi Cheng over 12 years ago

Just for the record, we have resolved the error (it came from --rundir==/tmp/test instead of --rundir=/tmp/test).
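For anyone hitting the same thing: a typical optparse-based parser (used here purely for illustration) accepts the extra '=' silently and treats everything after the first '=' as the value, so the rundir quietly becomes an unusable string instead of a path:

from optparse import OptionParser

parser = OptionParser()
parser.add_option('--rundir', dest='rundir')

# The double '=' is not rejected; everything after the first '=' is the value.
options, _ = parser.parse_args(['--rundir==/tmp/test'])
print(repr(options.rundir))   # '=/tmp/test'  -- not a usable directory path

options, _ = parser.parse_args(['--rundir=/tmp/test'])
print(repr(options.rundir))   # '/tmp/test'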

About the duplicated attribute settings in torqueHost.py and .appion.cfg, I have confirmed that what is in .appion.cfg overwrites what is in TorqueHost and MoabTorqueHost, because of the self.configure(configDict) at the end of the __init__ function. I've changed torqueHost.py in r16961 to remove the duplicated initialization in the class. In a way, my change forces proper configuration of .appion.cfg, because its base class TorqueHost has the wrong values.
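To illustrate the override order (a sketch of the pattern only, not the literal torqueHost.py code): class defaults are assigned first, and self.configure(configDict) runs last in __init__, so whatever .appion.cfg provides always wins.

class ProcessingHostSketch(object):
    # Sketch of the configuration pattern described above; attribute and key
    # names are illustrative, not the real ones.
    def __init__(self, configDict=None):
        # Class defaults (the kind of values r16961 removed from TorqueHost):
        self.execCommand = 'qsub'
        self.statusCommand = 'qstat'
        self.shell = '/bin/sh'
        # Applied last, so .appion.cfg values overwrite the defaults above.
        if configDict:
            self.configure(configDict)

    def configure(self, configDict):
        mapping = {'ExecCommand': 'execCommand',
                   'StatusCommand': 'statusCommand',
                   'Shell': 'shell'}
        for key, attr in mapping.items():
            if configDict.get(key):
                setattr(self, attr, configDict[key])

# With the .appion.cfg above, execCommand becomes 'echo' and statusCommand
# becomes '/opt/moab/bin/showq', regardless of the class defaults.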

Actions #8

Updated by Scott Stagg over 12 years ago

After doing some tests, it appears that a current policy on our cluster is that the job file cannot be launched from the appiondata directory. For now, this means that I can't launch jobs like picking using runJob.py. I should be able to do refinement jobs, though, correct? I tried to launch a reconstruction job using the "Just Show Command" button, but it still seems to be trying to launch some remote process; I get this error:

ERROR: Error: Copying file (/lustre/cryo/lustre/appiondata/12jul11a/recon/eman_recon1/start.hed) to
~sstagg/appion/12jul11a/recon/eman_recon1/start.hed on krios-0-1 failed:
Error copying file over ssh on host: krios-0-
 local file: /lustre/cryo/lustre/appiondata/12jul11a/recon/eman_recon1/start.hed
remote file: ~sstagg/appion/12jul11a/recon/eman_recon1/start.hed username: sstagg
Error Message: Could not copy file to host krios-0-1. Check permissions on remote file or directory

Actions #9

Updated by Anchi Cheng over 12 years ago

From the host that serves appionweb, Appion needs to do an scp, using the login name and password, to send (push) the prepared particle stack and initial model to the remote cluster host, even if you only use the show-command option. The runJob.py created by the show command runs on the remote host and is not meant to pull these files from the localhost; it just checks that these files are available before it runs anything.

If you've got this far, execute_over_ssh must have worked, since making the remote rundir comes before this step, so I don't see why scp would not work if the local file is readable.

Can you try this directly from the appionweb host:

scp /lustre/cryo/lustre/appiondata/12jul11a/recon/eman_recon1/start.hed sstagg@krios-0-1:~sstagg/appion/12jul11a/recon/eman_recon1/start.hed

Or check your ssh2 php module with a simple php script like this, run from myamiweb:

<?php
$hostname = 'krios-0-1';
$username = 'sstagg';
$passwd = '_your password_';
$localfile = '/lustre/cryo/lustre/appiondata/12jul11a/recon/eman_recon1/start.hed';
$remotefile = '~sstagg/appion/12jul11a/recon/eman_recon1/start.hed';

$connection = ssh2_connect($hostname, 22);
ssh2_auth_password($connection, $username, $passwd);
ssh2_scp_send($connection, $localfile, $remotefile, 0644);
ssh2_exec($connection, 'exit');
?>

Actions #10

Updated by Anchi Cheng over 12 years ago

Is krios-0-1 the remote host that you want the job to start from? Interesting naming.

Actions #11

Updated by Scott Stagg over 12 years ago

The scp worked. I'm not sure how to do the php thing. Should I make that a php file that I run from my web browser? Also, how did it decide to write to my home directory? That is definitely not where I'd like jobs to go.

All my appion data is hosted and processed on a couple of highly souped up machines (called kriosdb and krios-0-x) at the FSU HPC. Those machines mount both the appiondata and a big scratch directory for processing. The scratch directory is the 'remote' directory. The krios-0-x machines are for preprocessing runs like picking. I also have access to the remainder of the HPC for refinements. The rest of the HPC only mounts the 'remote' scratch directory.

Actions #12

Updated by Scott Stagg over 12 years ago

OK, I figured out how to do the php thing. It doesn't work if I use the $remotefile with a '~' in the string. If I replace it with a full path, the transfer works. I don't understand where the path with the '~' came from. In the reconstruction job launching form, I put a completely different path in the 'Processing host output directory' field.
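For what it's worth, the transfer clearly does not expand a literal '~' in the remote path, so one workaround is to resolve the remote home directory first and always hand over an absolute path. A small sketch of the idea (plain ssh/scp subprocess calls in Python, purely illustrative; the actual Appion transfer code is the PHP ssh2_scp_send shown above):

import subprocess

def absolute_remote_path(user, host, tilde_path):
    # Let the remote login shell expand the '~user/...' prefix, then return
    # the absolute form for use with scp.  Illustrative helper only.
    expanded = subprocess.check_output(
        ['ssh', '%s@%s' % (user, host), 'echo %s' % tilde_path])
    return expanded.strip().decode()

# Hypothetical usage:
# path = absolute_remote_path('sstagg', 'krios-0-1',
#         '~sstagg/appion/12jul11a/recon/eman_recon1/start.hed')
# subprocess.call(['scp', localfile, 'sstagg@krios-0-1:' + path])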

Actions #13

Updated by Anchi Cheng over 12 years ago

Since you will be doing refinement on a host other than krios-x-x, that is the host the $PROCESSING_HOSTS entry needs to point to, so that Appion can send the stack and model over to it. Let's say the host on which you will do the refinement is FSU_HPC1 and there is a scratch directory called /scratch/sstagg/ that you can write to as the user sstagg; here is what you need to do in the config.php of myamiweb so that the files required for refinement are sent to the right place by default.

You need a section that defines the parameters for FSU_HPC1, similar to Appion_config_advanced, with these set specifically:

$PROCESSING_HOSTS[] = array(
'host' => 'FSU_HPC1.fsu.edu',
'baseoutdir' => '/scratch/_the_directory_writable_by_sstagg_',
'localhelperhost' => 'krios-0-1.fsu.edu',
// ... remaining cluster-specific items ...
);

The rest of the items in that array will depend on the cluster parameters.

The reason it tried to write the file to ~sstagg is that you left 'baseoutdir' as an empty string, or that it could not find the directory you assigned. It is likely that Appion does not create that directory if it does not exist; if so, it should be fixed.
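If it helps whoever gets to that fix, the defensive behaviour could look roughly like the sketch below (helper and parameter names are hypothetical; the real handling is split between the appionweb PHP config and the remote-job code):

import subprocess

def resolve_remote_outdir(baseoutdir, user, host, session, runname):
    # Refuse an empty 'baseoutdir' instead of silently falling back to the
    # user's home directory, and create the run directory on the remote host
    # if it does not already exist.
    if not baseoutdir:
        raise ValueError("'baseoutdir' is empty in $PROCESSING_HOSTS; "
                         "set it to a writable scratch path")
    rundir = '%s/%s/recon/%s' % (baseoutdir.rstrip('/'), session, runname)
    # mkdir -p is harmless if the directory already exists.
    subprocess.check_call(['ssh', '%s@%s' % (user, host), 'mkdir -p %s' % rundir])
    return rundir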

Actions #14

Updated by Amber Herold over 12 years ago

  • Target version changed from Appion/Leginon 2.3.0 to Appion/Leginon 3.0.0

Actions #15

Updated by Amber Herold over 11 years ago

  • Status changed from Assigned to Closed

Am I correct to assume that work on this is complete?

Actions #16

Updated by Scott Stagg over 11 years ago

If closed means that it's working, then no, it's not complete. I just didn't have time to work on it. I've been trying to update for the last couple of days, so it's possible that I will work on this again soon.

Actions #17

Updated by Amber Herold over 11 years ago

  • Status changed from Closed to Assigned

Oops, sorry about that, back to open.
