Bug #2414

open

database goes away during stack making

Added by Scott Stagg about 11 years ago. Updated almost 7 years ago.

Status:
In Test
Priority:
Normal
Assignee:
Sargis Dallakyan
Category:
-
Target version:
-
Start date:
07/11/2013
Due date:
% Done:
0%
Estimated time:
Affected Version:
Appion/Leginon 2.2.0
Show in known bugs:
Workaround:

Description

Hi all,

We're getting some strange Appion behavior at FSU, and I'm wondering if y'all have ever seen this or have a suggestion on how to fix it. During long stack making jobs, our jobs always crash now with this error:

sinedon.dbdatakeeper.DatabaseError: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
Exception mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away') in <bound method Makestack2Loop.__del__ of <__main__.Makestack2Loop object at 0x19a63d0>> ignored

The run time before a crash is random, but jobs usually last between 4 and 8 hours. I've checked the network connection between the processing node and the database, and it is not the problem. I tried setting the MySQL connection timeouts in my.cnf to:

interactive_timeout = 432000
wait_timeout = 28800

but that didn't make any difference. I am using the trunk version of myami. Have y'all seen this error, and do you have any ideas how I can fix it?

Thanks,
Scott

Actions #1

Updated by Amber Herold about 11 years ago

  • Status changed from New to Assigned
  • Assignee set to Sargis Dallakyan

Hi Scott,
We seem to remember Sargis making some adjustments to avoid this error; however, he is out this week. He may have some input on this when he returns.

Actions #2

Updated by Scott Stagg about 11 years ago

  • Status changed from Assigned to Closed

So, it turns out that the problem was network switches. The switch that the db computer is on was going bad, and somehow that was causing mysql to restart periodically. I have no idea why mysql was doing that, but replacing the switch seemed to fix the problem. Weird!

Actions #3

Updated by Scott Stagg over 10 years ago

  • Status changed from Closed to New

This bug has not been fixed after all. Our system admin thinks it has something to do with unclosed database connections. He ran the following query:

show status like '%onn%';
+-----------------------------------------------+-------+
| Variable_name                                 | Value |
+-----------------------------------------------+-------+
| Aborted_connects                              | 0     |
| Connection_errors_accept                      | 0     |
| Connection_errors_internal                    | 0     |
| Connection_errors_max_connections             | 0     |
| Connection_errors_peer_address                | 0     |
| Connection_errors_select                      | 0     |
| Connection_errors_tcpwrap                     | 0     |
| Connections                                   | 86749 |
| Max_used_connections                          | 16    |
| Performance_schema_session_connect_attrs_lost | 0     |
| Ssl_client_connects                           | 0     |
| Ssl_connect_renegotiates                      | 0     |
| Ssl_finished_connects                         | 0     |
| Threads_connected                             | 7     |
+-----------------------------------------------+-------+


He is concerned that Connections is so large because Python isn't closing connections after it is finished with them. Could this be the reason our database keeps randomly restarting?
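
A note on reading this output: in MySQL, Connections is a cumulative count of connection attempts since the server started, while Threads_connected is the number of connections open right now and Max_used_connections is the peak number open at once. A large Connections value by itself therefore does not show that connections are being left open. The sketch below pulls those three counters out of the pasted table. The helper name parse_status_table is hypothetical and is not part of Appion or sinedon.

```python
def parse_status_table(text):
    """Parse the ASCII table printed by the mysql client into a dict.

    Hypothetical helper for illustration; not part of Appion or sinedon.
    """
    status = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip the +----+ border lines and anything that is not a table row.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header row and any malformed rows.
        if len(cells) != 2 or cells[0] == "Variable_name":
            continue
        status[cells[0]] = int(cells[1])
    return status


# Three rows copied from the output above.
sample = """
+-----------------------------------------------+-------+
| Variable_name                                 | Value |
+-----------------------------------------------+-------+
| Connections                                   | 86749 |
| Max_used_connections                          | 16    |
| Threads_connected                             | 7     |
+-----------------------------------------------+-------+
"""

status = parse_status_table(sample)
total_attempts = status["Connections"]       # cumulative since server start
peak_open = status["Max_used_connections"]   # most connections open at once
open_now = status["Threads_connected"]       # connections open right now

print("attempts=%d peak=%d open_now=%d" % (total_attempts, peak_open, open_now))
```

Because Connections restarts from zero with the server, comparing it with the server's uptime is one way to tell whether the server has restarted recently.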

Actions #4

Updated by Sargis Dallakyan over 10 years ago

I haven't seen this error with our database. Here is a similar result from our database:

mysql> show status like '%onn%';
+--------------------------+----------+
| Variable_name            | Value    |
+--------------------------+----------+
| Aborted_connects         | 4430     | 
| Connections              | 14390632 | 
| Max_used_connections     | 193      | 
| Ssl_client_connects      | 0        | 
| Ssl_connect_renegotiates | 0        | 
| Ssl_finished_connects    | 0        | 
| Threads_connected        | 39       | 
+--------------------------+----------+

Our current database (cronus4) is running on a PowerEdge R610 with 16GB of RAM.
[cronus4]# more /etc/issue
CentOS release 5.9 (Final)
[cronus4]# mysql --version
mysql  Ver 14.12 Distrib 5.0.95, for redhat-linux-gnu (x86_64) using readline 5.1

Try increasing max_connections to see if that fixes this DatabaseError.
[cronus4]# more /etc/my.cnf
...
# The MySQL server
[mysqld]
...
skip-locking
key_buffer = 512M
max_allowed_packet = 8M
table_cache = 512
sort_buffer_size = 8M
read_buffer_size = 8M
read_rnd_buffer_size = 8M
myisam_sort_buffer_size = 64M
thread_cache_size = 8
query_cache_size = 1G 
query_cache_limit = 1G 
max_connections = 1350
interactive_timeout = 864000
wait_timeout = 864000
# Try number of CPU's*2 for thread_concurrency
thread_concurrency = 16

Actions #5

Updated by Anchi Cheng over 10 years ago

Scott,

You may want to add the fix for issue #2653. It is something we should have done all these years, but we got away without it in most cases.

Actions #6

Updated by Anchi Cheng over 10 years ago

Scott,

What is your MySQL version? In r18205 to r18210 we added autocommit in the places where Appion makes database connections directly, in response to a problem with a database created on MySQL 5.5. I don't know if those changes will help you, too.

Actions #7

Updated by Anchi Cheng almost 7 years ago

  • Tracker changed from Support to Bug
  • Status changed from New to In Test

We have seen this problem again recently with people running cl2d jobs that take too long. Looking at the way sinedon reconnects, I think it may not work right because the stats may not be available in the ping function if the server has truly gone away. 8c782ada attempts reconnection without using the existing self.db. I cannot test it properly, but I tried it in the case where access is denied by stopping the mysql server.

I also added a print statement that reports the time whenever this fix is used, so we will know for sure whether it is needed.
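
The general pattern described above, which is to discard the dead connection, open a fresh one, and log the time, can be sketched as follows. This is only an illustration under stated assumptions and is not the code in 8c782ada: ConnectionLost, run_with_reconnect, connect, and run_query are hypothetical names, and a real version would catch the driver's own "server has gone away" error instead.

```python
import time


class ConnectionLost(Exception):
    """Hypothetical stand-in for the driver's 'server has gone away' error."""


def run_with_reconnect(connect, run_query, retries=3, delay=0.0):
    """Call run_query(conn); on ConnectionLost, reconnect and try again.

    connect   -- callable returning a new connection (assumed interface)
    run_query -- callable taking a connection and returning a result
    The old connection object is dropped rather than pinged, since a
    connection that is truly gone may not be usable even for a ping.
    """
    conn = connect()
    for attempt in range(retries + 1):
        try:
            return run_query(conn)
        except ConnectionLost:
            if attempt == retries:
                raise  # out of retries, let the caller see the error
            # Report when the reconnect happened, as the comment describes.
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print("connection lost at %s, reconnecting" % stamp)
            time.sleep(delay)
            conn = connect()  # fresh connection; the old one is not reused
```

This sketch does not handle a failure inside connect() itself, which would propagate to the caller. A long-running job might want a nonzero delay and more retries so that a brief server restart does not end the job.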
