Bug #1482
openspider consuming all the resources
0%
Description
Spider processes that are launched from appion are are expanding to take up all of the processors on the node from which they are launched. This is problematic for two reasons. First, it takes up all the processing capabilities, and second it actually slows spider down significantly when the process it is working on isn't very parallelizable. The two places I've noticed this behavior are 1) when it is filtering an image during picking and 2) when doing an alignment of particles to a template. The behavior is the same for both spider 15 and 19.
Updated by Scott Stagg over 12 years ago
- Priority changed from High to Immediate
OK. I figured this out, but don't have time to fix it properly. The problem is that the spider functions which are passed through spyder have an nproc attribute, and nproc is being set to apParam.getNumProcessors() instead of a reasonable value. See the below snippet from apSpider/backproject.py:
def backprojectCG(stackfile, eulerdocfile, volfile, numpart, pixrad, dataext=".spi"): """ inputs: stack, in spider format eulerdocfile outputs: volume """ ### setup starttime = time.time() stackfile = spyder.fileFilter(stackfile) eulerdocfile = spyder.fileFilter(eulerdocfile) volfile = spyder.fileFilter(volfile) if not os.path.isfile(stackfile+dataext): apDisplay.printError("stack file not found: "+stackfile+dataext) if not os.path.isfile(eulerdocfile+dataext): apDisplay.printError("euler doc file not found: "+eulerdocfile+dataext) apFile.removeFile(volfile+dataext) # nproc = apParam.getNumProcessors() nproc = 2 mySpider = spyder.SpiderSession(dataext=dataext, logo=True, nproc=nproc, log=False) mySpider.toSpider("BP CG",
You can see my workaround where I arbitrarily set nproc to 2, and it solved my problem. There are several places in apSpider where the same problem exists (alignment.py, backproject.py, filters.py, and maybe others). The way it is currently implemented is problematic for many reasons. 1) You cannot specify how many processors to use. 2) It only works for systems with small numbers of processors. I think y'all are not seeing the problem as I am because you are using a resource manager (PBS) which limits the resources that a given process consumes. I upgraded the priority for this bug because I think it is critical that y'all get this worked out before the next release because not everyone will be using a resource manager. It is a big problem for me here because we are not using a resource manager and our processing boxes have 16 processors each. The processes that use spider eat up all 16 procs. The reason I'm not just fixing it myself is that it's going to require a lot of changes. For instance, all the programs that use the apSpider functions will need nproc command line arguments and corresponding webpage options. Alternatively, we should just set nproc to 1 and lose threading for the next release.
Thoughts?
Updated by Gabriel Lander over 12 years ago
I think we need to figure out which functions don't work well with parallelization, and specifically set the "nproc" for those when we start the spider session.
So for Scott's example, it would be:
mySpider = spyder.SpiderSession(dataext=dataext, logo=True, nproc=1, log=False)
Whereas functions that are parallelizable would be use the nproc=apParam.getNumProcessors()
trouble is, how do we figure out which functions are parallelized?
Updated by Scott Stagg over 12 years ago
I agree that we should determine which functions do and don't work well with parallelization, but I think we really absolutely have to have users specify the number of processors requested for a run. Otherwise you end up in the situation like I'm having where one job that doesn't need it eats up all the processors on a machine. For a while here it was using up 16 processors just to do a high pass filter for a micrograph. Think of the situation there are even more cores on one machine . . . 32 processors to do a high pass filter.
Updated by Anchi Cheng over 12 years ago
Scott,
A point well made. I can think of several ways to patch this in the global settings of appionScript level. Give me a day to discuss with Jim and Amber before writing it in.
However, if it is set at appionScript, would it be enough for you now? some spyder function might need less than globally specified by appionScript, would that be a show stopper for you still? As Gabe said, a dictionary will be needed for each spyder call for the processor requirement. to have the internal job management as you are doing.
Updated by Scott Stagg over 12 years ago
Anchi, I think I see where you're going. Say, for instance you're doing a refinement and you want to use 64 processors for projection matching but you might not want to use all those processors for backprojection. It gets even more complicated. You might want to use MPI for projection matching but threading for backprojection.
I think setting nproc globally as an option at the appionScript level would be fine for now. You can leave it up to the user to be knowledgeable enough not to request an unreasonable number of procs.
BTW, I hardcoded nproc=2 for all occurances in backproject.py, and the time it takes to create an RCT volume went from >3 hours (using 16 procs) to ~20 minutes.
Updated by Anchi Cheng over 12 years ago
- Status changed from New to Assigned
- Assignee set to Scott Stagg
- Target version set to Appion/Leginon 2.2.0
r16466 and r16468 makes nproc a global option that can be also passed from apGenericJob used on qsub type of running.
Scott, please see if this idea would work for you. I have not changed all the spider functions yet. If you want to start on it, you are welcome to do it since I don't know which function should be changed right away. Unfortunately, I can't just change apParam.py to use self.params['nproc'] in getNumProcessors(). Wish we have done it in an object now....
Amber promised to do the web gui side once we have the python side ready.
Updated by Amber Herold over 12 years ago
Anchi,
I made an indentation syntax error fix in apAgent and also added a check in apGenericJob to avoid a NoneType + string error. I didn't look too closely why it is happening now, perhaps I need to add a new option to my commands? This is just a fix for now so I can run some things today, but please take a look at it to see if it should be removed once we have completed work on this issue.
Updated by Scott Stagg over 12 years ago
I can make the changes to the spider functions, but I'm not exactly sure when I'll get to it. I definitely won't do it until I am able to update to using the web to launch jobs that use the resource manager (qsub, msub, etc). Amber, when I visited, you said that I would be able to do this update relatively soon. Any ETA on this?
Updated by Anchi Cheng over 12 years ago
- Status changed from Assigned to In Code Review
- Assignee changed from Scott Stagg to Amber Herold
fixed the None problem in r16471 but found that if I do pass nproc according to nodes and ppn, appionScript will end up with those nproc values instead of their individual default, so I commented that part out in r16472 so not to break anything.
We will need to figure out how to sync the two sides some time so that we can truly use this function.
Also found the source of indentation error, Christopher uses spaces but I use tab.
You can assign this back to me now after review this change.
Updated by Amber Herold over 12 years ago
Scott, just a few more bugs to get to and tests to run before we release. I missed almost 2 weeks out sick, so I'm bit behind where I had hoped to be. Should be within the next 2 weeks.
Updated by Anchi Cheng over 12 years ago
Scott,
Talked to Jim a bit about this. Is it possible that the slow down by using more processors due to the memory allocation? Do you know if you assign more processors, each one was allowed (by spider?) to use not as much memory on you computer? Could it be that when you limit the nproc to 2 on this computer and because other processors are not used by another user, it gets to use all memory so that you get the big speed up? I got around large eman refinement on garibaldi before by doing something similar where I reserved, say, 10 nodes each at 8 ppn with memory restriction of 47gb (max on the nodes) and then in eman refine, I only asked for nproc=20.
Just trying to understand it better.
Updated by Scott Stagg over 12 years ago
No, the slow down is definitely not due to memory allocation. There are no competing jobs when I am running the spider commands, and there is plenty of available memory. I'm pretty sure it's due to the way that spider is doing the threading. In my mind it's analogous to something like washing a single dish. It is very efficient for 1 person to wash a single dish, but very inefficient for 16 people to wash a single dish all at the same time.
Updated by Amber Herold over 12 years ago
- Assignee changed from Amber Herold to Anchi Cheng
Anchi, just reviewed this.
regarding this:
if has_nproc and int(self.getNProc()) > nodeppn_nproc: raise Exception('You can not specify more nproc than nodes * ppn') else: # nproc need to be set according to the input nodes and ppn if not already in the command line self.setNProc(nodeppn_nproc) # We will not insert --nproc now so that those script that need different default can use # its own default. #if self.getNProc(): # newCommandLine.append('--nproc=%d' % (int(self.getNProc()))) return newCommandLine
When we get to this conditional block, if nproc has been passed as an option, it has already been appended. An invalid passed in nproc will raise an exception here, but it looks like if the nproc is valid, once the comments are removed, nproc will be added again but the value will be nodeppn_nproc rather than the passed in value. This does not seem desirable.
Updated by Anchi Cheng over 12 years ago
- Assignee changed from Anchi Cheng to Amber Herold
has_nproc will be true if --nproc is already in the command and it will not go to "else" part. Therefore, nothing will be appended.
Updated by Amber Herold over 12 years ago
Yes, the has_nproc will be true, but the second part of the conditional "int(self.getNProc()) > nodeppn_nproc" must also be true for the else to not execute.
Updated by Amber Herold over 12 years ago
- Assignee changed from Amber Herold to Anchi Cheng
Updated by Anchi Cheng over 12 years ago
- Assignee changed from Anchi Cheng to Amber Herold
r16555 should fix this logic once comments are taken off in the future.
Updated by Amber Herold about 10 years ago
- Assignee deleted (
Amber Herold) - Priority changed from Immediate to Low
- Affected Version changed from Appion/Leginon 2.1.0 to Appion/Leginon 2.2.0
For some reason, there is no option to close this bug. Moving the priority to low.