Bug #2295 (Closed): Make stack has frequent memory leaks which cause fatal errors
Added by Melody Campbell over 11 years ago. Updated over 11 years ago.
Description
About 50% of the stacks I have been creating lately hit a memory-leak fatal error and crash before finishing. This never happened before, but it has happened about 4 times in the past week. If I save the job file, delete the contents of the directory, and resubmit the exact same job file, the job runs to completion. The error is below. Thanks!
Traceback (most recent call last):
File "/opt/myamisnap/bin/makestack2.py", line 1182, in <module>
makeStack.run()
File "/opt/myamisnap/lib/appionlib/appionLoop2.py", line 62, in run
if not self._startLoop(imgdata):
File "/opt/myamisnap/lib/appionlib/appionLoop2.py", line 550, in _startLoop
self._checkMemLeak()
File "/opt/myamisnap/lib/appionlib/appionLoop2.py", line 661, in _checkMemLeak
apDisplay.printError("Memory leak of "+str(round(memleak,2))+"MB")
File "/opt/myamisnap/lib/appionlib/apDisplay.py", line 57, in printError
raise Exception, colorString("\n * FATAL ERROR *\n"+text+"\n\a","red")
Exception:
*** FATAL ERROR ***
Memory leak of 528.53MB
Fri Mar 15 14:46:41 2013: run start1
Fri Mar 15 14:46:51 2013: thread done
[melody@guppy stack-n-224box]$
Files
MemLeakLog.txt (21.1 KB) - /var/log/message file contents - Amber Herold, 03/25/2013 04:24 PM
Updated by Ryan Hoffman over 11 years ago
I also got this error with DoG picking, on Garibaldi. It ran to around 80% completion before dying this way. Both makestack2 and DoGpicker call appionLoop2.py:
Traceback (most recent call last):
  File "/opt/applications/myami/2.2/bin/dogPicker.py", line 78, in <module>
    imgLoop.run()
  File "/opt/applications/myami/2.2/lib/python2.6/site-packages/appionlib/appionLoop2.py", line 61, in run
    if not self._startLoop(imgdata):
  File "/opt/applications/myami/2.2/lib/python2.6/site-packages/appionlib/appionLoop2.py", line 548, in _startLoop
    self._checkMemLeak()
  File "/opt/applications/myami/2.2/lib/python2.6/site-packages/appionlib/appionLoop2.py", line 659, in _checkMemLeak
    apDisplay.printError("Memory leak of "+str(round(memleak,2))+"MB")
  File "/opt/applications/myami/2.2/lib/python2.6/site-packages/appionlib/apDisplay.py", line 57, in printError
    raise Exception, colorString("\n *** FATAL ERROR ***\n"+text+"\n\a","red")
Exception:
*** FATAL ERROR ***
Memory leak of 554.22MB
Updated by Amber Herold over 11 years ago
Oops, sorry guys, I meant to look into this last week. I've put it on my calendar to do this Monday.
Updated by Amber Herold over 11 years ago
- File MemLeakLog.txt added
Info from Dipa on another failed dog run:
Here are the directories for my two DoG picking runs that failed due to memory leaks:
/ami/data15/appion/13mar21a/extract/dogrun2
/ami/data15/appion/13mar21b/extract/dogrun2
They were running at the same time, and both failed at the same time (10:10 AM yesterday morning).
I checked the system log. Dmitry was running a frealign reconstruction during that time on the same guppy node. Also, someone ran xmipp cl2d at 6:30 in the morning and there was an out of memory event from xmipp. Attaching a copy of the log info.
I'm trying to reproduce the memory leak error by running DoG picking or makestack on the same node as a CL2D run at the same time, but no luck so far. I also tried running at the same time as a DD makestack using start frame = 2, as Anchi suggested, to make sure her recent changes were not at fault.
It would be really helpful if people could leave their run directory intact after seeing this error and add the run directory to this issue so I can get more info!
Thanks!
Updated by Ryan Hoffman over 11 years ago
Here's the directory for the crashed DoG Picker run I mentioned:
/ami/data15/appion/13mar14d/extract/dogrun1
Updated by Amber Herold over 11 years ago
Everyone, (Jim, Neil, Sargis - your input would be helpful)
It would be useful to have more examples of this bug happening on guppy so we can determine the exact cause. If you encounter a memory leak, please add it to this issue, noting the run directory. Then make sure to leave the directory intact and don't reuse the same run name for subsequent runs. Garibaldi cases are good too, but I may need to get JC involved to examine the logs.
- This error occurs because Appion checks the memory statistics of the processing node at /proc/meminfo. That information covers not just the Appion job that is running but all processes on the node, so the apparent memory leak could be caused by other jobs using the same node. For the failure I could trace (Dipa's), Frealign was also running on the node at the same time. A CL2D job had also been terminated a few hours earlier by the Linux Out-Of-Memory (OOM) killer, also during the Frealign run. Jobs being stopped by the OOM killer is supposed to be extremely rare as long as virtual memory is in use.
- This makes me think that perhaps Frealign is using memory in a way that looks like a memory leak to the Appion function that checks for one. Neil, this was your function, with lots of uncommented numbers and math. I wonder whether this function is actually finding a memory leak, or whether it is simply not designed for the way the node is currently being used and it only looks like a memory leak.
#=====================
def _checkMemLeak(self):
	"""
	unnecessary code for determining if the program is eating memory over time
	"""
	### Memory leak code:
	#self.stats['memlist'].append(mem.mySize()/1024)
	# mem.* reads node-wide values from /proc/meminfo, not per-process usage
	self.stats['memlist'].append(mem.active())
	memfree = mem.free()
	swapfree = mem.swapfree()
	minavailmem = 64*1024  # 64 MB, size of one image
	if(memfree < minavailmem):
		apDisplay.printError("Memory is low ("+str(int(memfree/1024))+"MB): there is probably a memory leak")
	if(self.stats['count'] > 15):
		# linear regression of memory use vs. image index over the last 15 images
		memlist = self.stats['memlist'][-15:]
		n = len(memlist)
		gain = (memlist[n-1] - memlist[0])/1024.0
		sumx = n*(n-1.0)/2.0
		sumxsq = n*(n-1.0)*(2.0*n-1.0)/6.0
		sumy = 0.0; sumxy = 0.0; sumysq = 0.0
		for i in range(n):
			value = float(memlist[i])/1024.0
			sumxy += float(i)*value
			sumy += value
			sumysq += value**2
		###
		stdx = math.sqrt(n*sumxsq - sumx**2)
		stdy = math.sqrt(n*sumysq - sumy**2)
		rho = float(n*sumxy - sumx*sumy)/float(stdx*stdy+1e-6)
		slope = float(n*sumxy - sumx*sumy)/float(n*sumxsq - sumx*sumx)
		memleak = rho*slope
		###
		if(self.stats['memleak'] > 3 and slope > 20 and memleak > 512 and gain > 2048):
			apDisplay.printError("Memory leak of "+str(round(memleak,2))+"MB")
		elif(memleak > 32):
			self.stats['memleak'] += 1
			apDisplay.printWarning("substantial memory leak "+str(round(memleak,2))+"MB")
			print "(",str(n),round(slope,5),round(rho,5),round(gain,2),")"
- Options to fix this:
- Run Frealign, Xmipp, and maybe DoG picker and makestack against a memory analyzer to see if there is actually a memory leak. For Xmipp, this might require recompiling in debug mode and installing analyzers. Jim, do you know how to analyze the Python programs for a memory leak? I have not looked into Frealign; is it written in C?
- Rewrite the mem.py function that reads info from /proc/meminfo to use job-specific memory data instead (see the sketch at the end of this comment). Neil, thoughts on that?
- Comment out the memory check and rely on the OOM killer to do its job. The downside is that there is no indication of what the error was when a job dies (without root access, anyway), which is what we currently see with Xmipp CL2D. The upside is that the OOM killer uses a highly tuned algorithm to choose which job dies, and our DoG runs and makestack runs would likely survive.
- On a side note, do I remember correctly from our cronus2 crisis meeting with JC that when we run CL2D we need to request all the processors on the node to make sure it has enough memory?
Any other thoughts from anyone? Any other ideas for debugging/fixing from Jim or Sargis?
Thanks!
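Regarding the mem.py option above, here is a minimal sketch (illustration only, not the existing appionlib mem.py interface, and the helper name is made up) of reading memory for the current process from /proc/self/status instead of node-wide totals from /proc/meminfo, so other jobs sharing the node cannot show up as a leak in this job:

	# Hypothetical helper, assuming a Linux /proc filesystem; the name and its
	# placement in mem.py are illustrative only.
	def process_memory_kb():
		"""Return the resident set size (VmRSS) of the current process, in kB."""
		with open("/proc/self/status") as statusfile:
			for line in statusfile:
				if line.startswith("VmRSS:"):
					# the line looks like: "VmRSS:     123456 kB"
					return int(line.split()[1])
		return 0

_checkMemLeak could then append this value to self.stats['memlist'] instead of mem.active(), and the regression it runs would track only this job's growth.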
Updated by Ryan Hoffman over 11 years ago
Quoting Amber: "On a side note, do I remember correctly from our cronus2 crisis meeting with JC that when we run CL2D we need to request all the processors on the node to make sure it has enough memory?"
This seems to help with my Garibaldi runs -- if I request the maximum 47 GB per node (even if only using one core) this excludes other processes from using the node and the memory leak never seems to happen.
Updated by Ryan Hoffman over 11 years ago
This is regarding the function _checkMemLeak.
The first message, an Appion error ("Memory is low ("+str(int(memfree/1024))+"MB): there is probably a memory leak"), never appears in my log files. So there is always more than 64 MB of free memory. Not a very useful finding, except that it implies mem.free() is reporting a reasonable value.
The third message, an Appion warning ("substantial memory leak "+str(round(memleak,2))+"MB"), appears very frequently in my log files, in runs that do not crash as well as in runs that do.
The second message, an Appion error ("Memory leak of "+str(round(memleak,2))+"MB"), only appears in the runs that actually crash. I find its conditional particularly opaque:
if(self.stats['memleak'] > 3 and slope > 20 and memleak > 512 and gain > 2048)
Hopefully someone can discern its meaning. I don't understand the motivation behind the definitions of rho, gain, and slope, but it seems that the condition (rho*slope) > 32 frequently holds without compromising the run.
Updated by Neil Voss over 11 years ago
It is a pretty crazy if statement, I agree:
- rho is the correlation coefficient
- slope is the calculated slope of a linear regression on the amount of memory used after each image
- gain is the total increase in memory used from the beginning of the run
- memleak = rho*slope
A large slope implies a memory leak (note this was a HUGE issue with sinedon bugs in 2007), but if the correlation coefficient is low, then it is likely an outlier. So multiplying rho*slope reduced the false positives for memory leaks. (A small worked example follows below.)
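To make those numbers concrete, here is a small standalone sketch (Python 2, matching the appionlib code; the sample values are made up) that runs the same arithmetic as _checkMemLeak over its 15-image window on a steadily growing series:

	import math

	# made-up memory samples in kB (what mem.active() returns), growing ~40 MB per image
	memlist = [1000*1024 + i*40*1024 for i in range(15)]

	n = len(memlist)
	gain = (memlist[n-1] - memlist[0])/1024.0               # total growth over the window, in MB
	sumx = n*(n-1.0)/2.0
	sumxsq = n*(n-1.0)*(2.0*n-1.0)/6.0
	sumy = 0.0; sumxy = 0.0; sumysq = 0.0
	for i in range(n):
		value = float(memlist[i])/1024.0                    # each sample converted to MB
		sumxy += float(i)*value
		sumy += value
		sumysq += value**2
	stdx = math.sqrt(n*sumxsq - sumx**2)
	stdy = math.sqrt(n*sumysq - sumy**2)
	rho = float(n*sumxy - sumx*sumy)/float(stdx*stdy+1e-6)          # correlation coefficient, ~1 here
	slope = float(n*sumxy - sumx*sumy)/float(n*sumxsq - sumx*sumx)  # ~40 MB gained per image
	memleak = rho*slope

	# a steady 40 MB/image climb gives rho ~1, slope ~40, memleak ~40: enough for the
	# "substantial memory leak" warning (memleak > 32) but far from the fatal error,
	# which also needs memleak > 512, gain > 2048 and more than 3 earlier warnings
	print n, round(slope, 2), round(rho, 2), round(gain, 2), round(memleak, 2)

A noisy series with no consistent trend drives rho toward zero, which suppresses memleak even when a single sample jumps; that is the false-positive reduction described above.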
Since we have not had memory leaks in a long time, we might as well disable this functionality unless a debug flag is used.
Updated by Neil Voss over 11 years ago
I've changed the Errors into Warnings, so all runs will continue despite having memory leaks.
Updated by Amber Herold over 11 years ago
Thanks Neil,
It sounds like this function can have false positives?
I was going to change the error to warning as well, but if there is truly a memory leak I'd like to know about it.
Do you think there is anything we should still be concerned about? Do you think Frealign or cl2d may have a memory leak based on what we've seen? Or do you think everything is probably working as it should be?
Ryan, the reason you see the program die for some of the messages and not others is that printWarning() just prints a warning, while printError() actually throws an exception in addition to printing the error. Yes, this is poorly named; the function should be named something like terminate() and take a termination message as a parameter.
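For what it's worth, a minimal sketch of that renaming (illustrative only, not the current apDisplay code) could look like:

	# hypothetical replacements, Python 2 style to match the codebase; names and
	# message formatting are illustrative only
	def printWarning(text):
		# report the problem and let the run continue
		print "!!! WARNING: " + text

	def terminate(text):
		# report the problem and stop the run by raising an exception
		raise Exception("*** FATAL ERROR ***\n" + text)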
Updated by Amber Herold over 11 years ago
- Category set to Python scripting
- Status changed from New to In Test
- Target version set to Appion/Leginon 3.0.0
Reviewed code and moving this to In Test. If we start to see Appion jobs die for no reason (killed by OOM), we may want to revisit this.
Updated by Dmitry Lyumkis over 11 years ago
- Status changed from In Test to Assigned
- Assignee changed from Amber Herold to Sargis Dallakyan
It is likely that the reason for these memory leaks is that our Xmipp parallelization is not set up correctly. When Amber and I checked what was happening the last time this issue occurred, we noticed that the node with the memory leak was running 16 instances of CL2D. I just ran CL2D and noticed that every CL2D parallel thread was being launched on guppy-11, whereas the rest of the nodes had nothing running on them. This also means that if we run on guppy, we are currently limited to 8 processors for CL2D, which is certainly not enough. It will be useful to check that our MPI implementation (/usr/lib64/openmpi/bin/mpirun) is compatible with Xmipp and is capable of spooling jobs across nodes.
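One quick way to check this (a generic MPI smoke test, not something that has been run here yet): from inside a multi-node Torque job, run something like mpirun -np 16 hostname. If every line printed shows the same hostname, mpirun is ignoring the scheduler's node list and all ranks are landing on a single node.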
Updated by Sargis Dallakyan over 11 years ago
- Status changed from Assigned to In Test
- Assignee changed from Sargis Dallakyan to Dmitry Lyumkis
The Open MPI that comes with CentOS 6 is not built with Torque support; that's why mpirun wasn't able to distribute jobs to different nodes. I found an FAQ at http://www.open-mpi.org/faq/?category=building#build-rte-tm and ran rpmbuild with openmpi-1.6.4-1.src.rpm to enable Torque support for Open MPI:
yum install torque-devel.x86_64
rpmbuild --rebuild openmpi-1.6.4-1.src.rpm --define 'configure_options --with-tm'
cd /root/rpmbuild/RPMS/x86_64
rpm -Uvh --force openmpi-1.6.4-1.x86_64.rpm
~/safadmin "rpm -Uvh --force openmpi-1.6.4-1.x86_64.rpm"
I have run a test job with xmipp_mpi_class_averages that Dmitry showed me and it seems that mpirun is now capable of spooling jobs across nodes.
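As an extra sanity check for whoever re-tests this (a suggestion, not something verified on guppy): running ompi_info | grep tm on the rebuilt installation should list the Torque (tm) components if --with-tm support was actually compiled in; if nothing shows up, the rebuild did not pick up Torque.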
If someone can independently test and close this Bug that would be great. Thank you all.
Updated by Dmitry Lyumkis over 11 years ago
- Status changed from In Test to Closed
Just tried an independent job with Xmipp CL2D, and it is spooling jobs correctly. We should keep an eye out for other types of runs.