Bug #2414
open
database goes away during stack making
Added by Scott Stagg over 11 years ago.
Updated about 7 years ago.
Assignee:
Sargis Dallakyan
Affected Version:
Appion/Leginon 2.2.0
Description
Hi all,
We're getting some strange Appion behavior at FSU, and I'm wondering if y'all have ever seen this or have a suggestion on how to fix it. During long stack making jobs, our jobs always crash now with this error:
sinedon.dbdatakeeper.DatabaseError: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
Exception _mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away') in <bound method Makestack2Loop.__del__ of <__main__.Makestack2Loop object at 0x19a63d0>> ignored
The amount of time it will run before crashing is random, but jobs will usually run between 4 and 8 hours. I've tried checking the network connection between the processing node and the database, and it is not the problem. I tried setting the MySQL connection timeouts in my.cnf to:
interactive_timeout = 432000
wait_timeout = 28800
but that didn't make any difference. I am using the trunk version of myami. Have y'all seen this error, and do you have any ideas how I can fix it?
Thanks,
Scott
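As a sanity check on the timeout settings above, it can help to ask the running server what values it is actually applying. A minimal sketch with the MySQLdb client (host and credentials here are placeholders, not the FSU setup):

import MySQLdb

# Hypothetical connection parameters; the point is reading the live settings.
conn = MySQLdb.connect(host="dbhost", user="appion", passwd="secret")
cur = conn.cursor()
cur.execute("SHOW SESSION VARIABLES LIKE 'wait_timeout'")
print(cur.fetchone())                  # e.g. ('wait_timeout', '28800')
cur.execute("SHOW GLOBAL VARIABLES LIKE '%timeout%'")
for name, value in cur.fetchall():
    print(name, value)                 # confirms interactive_timeout, wait_timeout, etc.
cur.close()
conn.close()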
- Status changed from New to Assigned
- Assignee set to Sargis Dallakyan
Hi Scott,
We seem to remember Sargis making some adjustments to avoid this error; however, he is out this week. He may have some input on this when he returns.
- Status changed from Assigned to Closed
So, it turns out that the problem was a network switch. The switch that the db computer is on was going bad, and somehow that was causing MySQL to restart periodically. I have no idea why MySQL was doing that, but replacing the switch seemed to fix the problem. Weird!
- Status changed from Closed to New
This bug has not been fixed after all. Our system admin thinks it has something to do with unclosed database connections. He ran the following query:
show status like '%onn%';
+-----------------------------------------------+-------+
| Variable_name | Value |
+-----------------------------------------------+-------+
| Aborted_connects | 0 |
| Connection_errors_accept | 0 |
| Connection_errors_internal | 0 |
| Connection_errors_max_connections | 0 |
| Connection_errors_peer_address | 0 |
| Connection_errors_select | 0 |
| Connection_errors_tcpwrap | 0 |
| Connections | 86749 |
| Max_used_connections | 16 |
| Performance_schema_session_connect_attrs_lost | 0 |
| Ssl_client_connects | 0 |
| Ssl_connect_renegotiates | 0 |
| Ssl_finished_connects | 0 |
| Threads_connected | 7 |
+-----------------------------------------------+-------+
He is concerned that Connections is so large because Python isn't closing connections after it is finished with them. Could this be the reason our database keeps randomly restarting?
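Without speaking for sinedon's internals, here is a hedged sketch of the pattern that concern describes: connections that are opened and then left for the server to time out, versus connections that are closed explicitly when the work is done (database name and credentials are hypothetical):

import MySQLdb

def query_leaky(sql):
    # Opens a connection per call and never closes it explicitly; the handle
    # lingers until garbage collection or the server's wait_timeout reaps it.
    conn = MySQLdb.connect(host="localhost", user="appion", passwd="secret", db="ap_fsu")
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()

def query_clean(sql):
    # Same query, but the connection is released as soon as the call finishes.
    conn = MySQLdb.connect(host="localhost", user="appion", passwd="secret", db="ap_fsu")
    try:
        cur = conn.cursor()
        try:
            cur.execute(sql)
            return cur.fetchall()
        finally:
            cur.close()
    finally:
        conn.close()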
I haven't seen this error with our database. For comparison, here is the same query run on ours:
mysql> show status like '%onn%';
+--------------------------+----------+
| Variable_name | Value |
+--------------------------+----------+
| Aborted_connects | 4430 |
| Connections | 14390632 |
| Max_used_connections | 193 |
| Ssl_client_connects | 0 |
| Ssl_connect_renegotiates | 0 |
| Ssl_finished_connects | 0 |
| Threads_connected | 39 |
+--------------------------+----------+
Our current database server (cronus4) is running on a PowerEdge R610 with 16 GB of RAM.
[cronus4]# more /etc/issue
CentOS release 5.9 (Final)
[cronus4]# mysql --version
mysql Ver 14.12 Distrib 5.0.95, for redhat-linux-gnu (x86_64) using readline 5.1
Try increasing max_connections to see if that fixes this DatabaseError.
[cronus4]# more /etc/my.cnf
...
# The MySQL server
[mysqld]
...
skip-locking
key_buffer = 512M
max_allowed_packet = 8M
table_cache = 512
sort_buffer_size = 8M
read_buffer_size = 8M
read_rnd_buffer_size = 8M
myisam_sort_buffer_size = 64M
thread_cache_size = 8
query_cache_size = 1G
query_cache_limit = 1G
max_connections = 1350
interactive_timeout = 864000
wait_timeout = 864000
# Try number of CPU's*2 for thread_concurrency
thread_concurrency = 16
Scott,
You may want to add the fix for issue #2653. It is something we should have done all these years ago, but we got away without it in most cases.
Scott,
What is your MySQL version? In r18205 to r18210 we just added a number of autocommit calls to the places in Appion where database connections are made directly, based on a problem with a database created on MySQL 5.5. I don't know if they will help you, too.
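This is not the literal r18205-r18210 diff, but as described above the change amounts to turning autocommit on wherever Appion opens a MySQLdb connection directly; a minimal sketch of that call (connection parameters are placeholders):

import MySQLdb

# Hypothetical connection parameters; the relevant line is the autocommit call.
conn = MySQLdb.connect(host="localhost", user="appion", passwd="secret", db="ap_fsu")

# MySQL 5.5 creates InnoDB tables by default, and MySQLdb disables autocommit at
# connect time, so writes made without an explicit commit can sit in an open
# transaction and never become visible to other connections.  Turning autocommit
# on sidesteps that.
conn.autocommit(True)

cur = conn.cursor()
cur.execute("SELECT VERSION()")        # also answers the version question above
print(cur.fetchone()[0])
cur.close()
conn.close()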
- Tracker changed from Support to Bug
- Status changed from New to In Test
We have seen this problem again recently with people running cl2d jobs that take too long. Looking at the way sinedon reconnects, I think it may not work right because the stats may not be available in the ping function if the connection is truly gone. 8c782ada attempts reconnection without using the existing self.db. I cannot test it properly, but I tried it in the case where access is denied by stopping the mysql server.
I also added a print statement that reports the time whenever this fix is used, so we will know for sure whether it is needed.
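I don't have the sinedon source in front of me, so the following is only a sketch of the idea described above: when a ping on the existing handle fails, discard it and open a brand-new connection rather than reusing self.db, and print the time so we can tell when the fallback was actually exercised (connection parameters are placeholders, not the real sinedon.dbdatakeeper code):

import time
import MySQLdb

class ReconnectingConnection(object):
    # Illustrative only; not the actual sinedon.dbdatakeeper implementation.
    def __init__(self, **kwargs):
        self.kwargs = kwargs                      # host/user/passwd/db placeholders
        self.db = MySQLdb.connect(**kwargs)

    def _reconnect(self):
        # Record when the fallback fires, mirroring the print statement above.
        print("reconnecting to MySQL at %s" % time.asctime())
        try:
            self.db.close()
        except MySQLdb.Error:
            pass                                  # the old handle may already be dead
        self.db = MySQLdb.connect(**self.kwargs)  # fresh connection, not the old self.db

    def execute(self, sql):
        try:
            self.db.ping()                        # raises OperationalError if gone away
        except MySQLdb.OperationalError:
            self._reconnect()
        cur = self.db.cursor()
        cur.execute(sql)
        rows = cur.fetchall()
        cur.close()
        return rows

A wrapper like db = ReconnectingConnection(host="dbhost", user="appion", passwd="secret", db="ap_fsu") would then survive the server going away between long-running calls to db.execute(...).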