Bug #3646 (Closed)

maxlikelihood run cannot take a large number of processors

Added by Venkata Dandey about 9 years ago. Updated about 9 years ago.

Status: Closed
Priority: High
Category: -
Target version: -
Start date: 10/08/2015
Due date:
% Done: 0%
Estimated time:
Affected Version: Appion/Leginon 3.2
Show in known bugs:
Workaround:

Description

When submitting the maximum-likelihood job with 256 processors, the job is not submitted and an error is thrown (screenshot attached) saying there are no write permissions, even though I used the correct ones.


Files

Actions #1

Updated by Venkata Dandey about 9 years ago

  • Priority changed from Normal to High
Actions #2

Updated by Anchi Cheng about 9 years ago

  • Project changed from 138 to Appion
  • Assignee set to Sargis Dallakyan
  • Affected Version set to Appion/Leginon 3.2

Since a smaller number of processors works, this may be an error caught by myamiweb while testing the cluster rather than a real permission problem.

Actions #3

Updated by Yong Zi Tan about 9 years ago

The problem is that the --ppn set by the system is automatically 4, even though it is 24 for our cluster. Therefore, if you ask for, say, 240 processors, Appion divides that by 4 and ends up requesting 60 nodes, which is more than we have, hence the crash. One way to fix this could be to add an option in the ML2D Appion GUI that lets you pick the number of nodes and processors per node (see the sketch below).
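To make the arithmetic concrete, here is a minimal Python sketch (not the actual Appion code; the function name, the cluster size, and the print calls are illustrative assumptions) of how dividing the requested processors by a too-small ppn inflates the node request:

import math

def nodes_requested(nproc, ppn):
    # Size the request as enough whole nodes to hold nproc processors
    return math.ceil(nproc / ppn)

# With the default ppn of 4, a 240-processor request becomes 60 nodes;
# with the cluster's real ppn of 24 it is only 10 nodes.
print(nodes_requested(240, ppn=4))   # 60 -> more nodes than the cluster has
print(nodes_requested(240, ppn=24))  # 10 -> fits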

Actions #4

Updated by Yong Zi Tan about 9 years ago

  • Assignee changed from Sargis Dallakyan to Anchi Cheng

Dear Anchi, maybe you can take care of this? This does not seem to be an issue with the cluster, as I have successfully run 360-processor ML2D jobs just by submitting on the command line. Thank you!

Run that worked: /gpfs/appion/yztan/15oct15f/align/maxlike6/

Default run command:
runJob.py /opt/myamisnap/bin/appion maxlikeAlignment.py --description="ml2d of original stack" --stack=4 --lowpass=15 --highpass=600 --num-part=76029 --num-ref=300 --bin=4 --angle-interval=5 --max-iter=15 --nproc=360 --fast --fast-mode=normal --mirror --savemem --commit --converge=normal --rundir=/gpfs/appion/yztan/15oct15f/align/maxlike7 --runname=maxlike7 --projectid=124 --expid=853 --jobtype=partalign --ppn=4 --nodes=90 --walltime=240 --queue=longq --jobid=52

Altered run command:
runJob.py /opt/myamisnap/bin/appion maxlikeAlignment.py --description="ml2d of original stack" --stack=4 --lowpass=15 --highpass=600 --num-part=76029 --num-ref=300 --bin=4 --angle-interval=5 --max-iter=15 --nproc=360 --fast --fast-mode=normal --mirror --savemem --commit --converge=normal --rundir=/gpfs/appion/yztan/15oct15f/align/maxlike7 --runname=maxlike7 --projectid=124 --expid=853 --jobtype=partalign --ppn=15 --nodes=24 --walltime=999 --queue=longq --jobid=52

Actions #5

Updated by Anchi Cheng about 9 years ago

  • Status changed from New to In Test
  • Assignee changed from Anchi Cheng to Venkata Dandey

r19274 forces ppn to be assigned the maximum ppn of the cluster. This would not be a good thing if a variable number of processors per node is present on that cluster. A validation of the total number of nodes requested should also be added. This is a temporary fix; the new Appion cluster form really should be used for this. A rough sketch of the idea follows below.
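As an illustration only (this is not the r19274 change itself; the function and parameter names, and the validation step, are assumptions based on this comment), the temporary fix amounts to something like:

import math

def submit_params(nproc, cluster_max_ppn, cluster_max_nodes):
    # Temporary fix: always use the cluster's maximum processors per node
    ppn = cluster_max_ppn
    nodes = math.ceil(nproc / ppn)
    # Suggested validation: reject requests that exceed the cluster size
    if nodes > cluster_max_nodes:
        raise ValueError("requested %d nodes, but the cluster only has %d"
                         % (nodes, cluster_max_nodes))
    return nodes, ppn

# e.g. submit_params(360, cluster_max_ppn=24, cluster_max_nodes=24) -> (15, 24)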

Venkata, please see if this comes out right now.

Actions #6

Updated by Anchi Cheng about 9 years ago

Yong Zi and Venkata,

If you know of other places where defaulting to the maximum ppn would be better, add them here. My fix is general and can be applied to them.

Actions #7

Updated by Anchi Cheng about 9 years ago

  • Status changed from In Test to Closed

No feedback.
