转自:http://blog.csdn.net/zengxuewen2045/article/details/51412663 作为案例收集
一、基本环境信息
select version(); +---------------------+ | version() | +---------------------+ | 10.0.20-MariaDB-log | +---------------------+ 1 row in set (0.00 sec) uname -a Linux MCBL051 2.6.32-504.el6.x86_64 #1 SMP Tue Sep 16 01:56:35 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux select @@tx_isolation; +----------------+ | @@tx_isolation | +----------------+ | READ-COMMITTED | +----------------+ 1 row in set (0.00 sec)
二、问题描述
应用人员反应某系统生产环境,有一支SQL跑了很久,被kill了,但一直没有kill掉,后面多支类似SQL执行时被前面这支sql堵住,请dba支援。
三、问题分析
从processlist来查看,TIME为52113秒,SQL执行大约14小时没有完成,而手工kill,无法释放线程资源,而堵住后面执行的多个类似SQL,应用人员将所有内外应用停了后,也没结束被kill的线程。下面显示SQL的command一直处于killed,没有真正完成。
show processlist\G *************************** 3. row *************************** Id: 38023534 User: lotsprd Host: 172.16.12.155:44805 db: lots Command: Killed Time: 52113 State: Sending data Info: SELECT DISTINCT o.CN, o.WNE, o.CON, o.ON Progress: 0.000
查看事务信息,trx_mysql_thread_id为38023534的事务正处于RUNNING,对应同一支SQL,没有其它事务了,说明这个线程没有被其它线程锁住,事务中的表也没有上锁。
select * from innodb_trx\G *************************** 1. row *************************** trx_id: 636733127 trx_state: RUNNING trx_started: 2016-05-13 20:57:54 trx_requested_lock_id: NULL trx_wait_started: NULL trx_weight: 0 trx_mysql_thread_id: 38023534 trx_query: SELECT DISTINCT o.CN, o.WNE, o.CON, o.ON, o.A_NAME, o.A_NO, o.CREATE_DATE receipt_time, o.delivery_date delivery_time FROM t o WHERE o.CREATE_DATE BETWEEN '2016-05-02 20:39:23' AND '2016-05-12 17:00:23' AND o.business_type = 'B2C' AND o.is_delivery = 1 and o.order_type = 'PO' AND o.WNe = 'xxx小电仓' AND o.A_CODE NOT IN (38549, 44926) AND o.A_NO != '' AND o.is_got = 0 AND o.IS_SIGN = 0 AND NOT EXISTS ( SELECT id FROM t_carrier WHERE ON = o.ON AND remark NOT IN ( '开始处理', '待配货', '开始处理', '信息审核通过', '已通知配货', '待配货', '您的包裹已打包' ) ) ORDER BY o.WNE, o.CREATE_DATE limit 20 trx_operation_state: NULL trx_tables_in_use: 2 trx_tables_locked: 0 trx_lock_structs: 0 trx_lock_memory_bytes: 360 trx_rows_locked: 0 trx_rows_modified: 0 trx_concurrency_tickets: 2256 trx_isolation_level: READ COMMITTED trx_unique_checks: 1 trx_foreign_key_checks: 1 trx_last_foreign_key_error: NULL trx_adaptive_hash_latched: 0 trx_adaptive_hash_timeout: 10000 trx_is_read_only: 0 trx_autocommit_non_locking: 0 1 row in set (0.00 sec)
查看show engine innodb status的事务部分信息,发现跑该SQL的事务也没有被锁住,这样看来,这个应该就只是一个查询SQL,但select这种为什么会记录innodb_trx表中,有待进一步验证
mysql> show engine innodb status \G; ------------ TRANSACTIONS ------------ Trx id counter 637134193 Purge done for trx's n:o < 636733130 undo n:o < 0 state: running but idle History list length 84354 LIST OF TRANSACTIONS FOR EACH SESSION: ---TRANSACTION 0, not started MySQL thread id 38193498, OS thread handle 0x2b73c5442700, query id 2807386390 localhost dba init show engine innodb status ---TRANSACTION 0, not started MySQL thread id 38168362, OS thread handle 0x2b739680a700, query id 2807382298 10.63.32.108 lotsprd cleaning up ---TRANSACTION 3843, not started MySQL thread id 1, OS thread handle 0x2b7396253700, query id 0 Waiting for background binlog tasks ---TRANSACTION 636733127, ACTIVE 63973 sec, thread declared inside InnoDB 2256 mysql tables in use 2, locked 0 MySQL thread id 38023534, OS thread handle 0x2b7396e99700, query id 2805177701 172.16.12.155 lotsprd Sending data SELECT DISTINCT o.CN, o.WNE, o.CON, o.ON, o.A_NAME, o.A_NO, o.CREATE_DATE receipt_time, o.delivery_date delivery_time FROM t o WHERE o.CREATE_DATE BETWEEN '2016-05-02 20:39:23' AND '2016-05-12 17:00:23' AND o.business_type = 'B2C' AND o.is_delivery = 1 and o.order_type = 'PO' AND o.WNe = 'xxx小电仓' AND o.A_CODE NOT IN (38549, 44926) AND o.A_NO != '' AND o.is_got = 0 AND o.IS_SIGN = 0 AND NOT EXISTS ( SELECT id FROM t_carrier Trx read view will not see trx with id >= 636733128, sees < 636709213 Trx #rec lock waits 0 #table lock waits 0 Trx total rec lock wait time 0 SEC Trx total table lock wait time 0 SEC
又查看主库processlist情况,显示已将所有binlog发到了salve上:
-----+ | Id | User | Host | db | Command | Time | State | Info | Progress | +----------+-----------------+---------------------+------+-------------+----------+----------------------------------------------------------------------- +------------------------------------------------------------------------------------------------------+----------+ | 2 | event_scheduler | localhost | NULL | Daemon | 18320596 | Waiting on empty queue | NULL | 0.000 | | 12575445 | rep | 10.16.19.88:30630 | NULL | Binlog Dump | 12434220 | Master has sent all binlog to slave; waiting for binlog to be updated | NULL | 0.000 | | 38023534 | lotsprd | 172.16.12.155:44805 | lots | Killed | 65293 | Sending data | SELECT DISTINCT o.CN, o.WNE, o.CON, o.ON | 0.000 | | 38168361 | lotsprd | 10.63.32.108:10310 | NULL | Sleep | 15127 | | NULL | 0.000 | | 38168362 | lotsprd | 10.63.32.108:10311 | lots | Sleep | 2185 | | NULL | 0.000 | | 38171800 | lotsprd | 10.63.32.108:11787 | lots | Sleep | 665 | | NULL | 0.000 | | 38189047 | lotsprd | 10.63.32.108:5627 | lots | Sleep | 1412 | | NULL | 0.000 | | 38200552 | dba | localhost | NULL | Query | 0 | init | show processlist | 0.000 | +----------+-----------------+---------------------+------+-------------+----------+----------------------------------------------------------------------- +------------------------------------------------------------------------------------------------------+----- 从库状态显示主从数据是同步的: *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: 10.16.19.87 Master_User: rep Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mysql-bin.000267 Read_Master_Log_Pos: 476541394 Relay_Log_File: relay-bin.000819 Relay_Log_Pos: 476541681 Relay_Master_Log_File: mysql-bin.000267 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 476541394 Relay_Log_Space: 476542015 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 0 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 19873306 Master_SSL_Crl: Master_SSL_Crlpath: Using_Gtid: No Gtid_IO_Pos: 1 row in set (0.00 sec)
四、问题处理
从网上查找资料,显示该问题只能重启mysql或是进行主从切换,由于上面查看相关信息有确认主从库的数据是同步的,而在应用人员已将所有应用都停了情况下,应用人员同意重启mysql
,用msyqladmin shutdown执行关闭时,一直hang住很久,查看错误日志显示如下:
160514 15:10:16 [Note] /apps/svr/mariadb10/bin/mysqld: Normal shutdown 160514 15:10:16 [Note] Event Scheduler: Killing the scheduler thread, thread id 2 160514 15:10:16 [Note] Event Scheduler: Waiting for the scheduler thread to reply 160514 15:10:16 [Note] Event Scheduler: Stopped 160514 15:10:16 [Note] Event Scheduler: Purging the queue. 0 events 160514 15:10:36 [Warning] /apps/svr/mariadb10/bin/mysqld: Forcing close of thread 38023534 user: 'lotsprd' 160514 15:10:36 [ERROR] mysqld: Got an error writing communication packets shutdow 已hang住,此时主库也无法再提供对外服务,前面主从数据已确认同步,将从库停止slave。现在只能对mysql进程进行kill操作了,如下: [apps@MCBL051 bin]$ ps -ef|grep mysql apps 2584 1 0 2015 ? 00:00:00 /bin/sh /apps/svr/mariadb10/bin/mysqld_safe --defaults-file=/apps/conf/mysql/mariadb10_3306.cnf apps 3121 2584 17 2015 ? 36-11:01:53 /apps/svr/mariadb10/bin/mysqld --defaults-file=/apps/conf/mysql/mariadb10_3306.cnf --basedir=/apps/svr/mariadb10 -- datadir=/apps/dbdat/mariadb10_data3306 --plugin-dir=/apps/svr/mariadb10/lib/plugin --log-error=/apps/logs/mysql/error3306.log --open-files-limit=8192 --pid- file=/apps/dbdat/mariadb10_data3306/mysql.pid --socket=/tmp/mysql3306.sock --port=3306 apps 16651 22150 0 15:26 pts/0 00:00:00 grep mysql [apps@MCBL051 bin]$ kill -9 2584 [apps@MCBL051 bin]$ ps -ef|grep mysql apps 3121 1 17 2015 ? 36-11:01:53 /apps/svr/mariadb10/bin/mysqld --defaults-file=/apps/conf/mysql/mariadb10_3306.cnf --basedir=/apps/svr/mariadb10 -- datadir=/apps/dbdat/mariadb10_data3306 --plugin-dir=/apps/svr/mariadb10/lib/plugin --log-error=/apps/logs/mysql/error3306.log --open-files-limit=8192 --pid- file=/apps/dbdat/mariadb10_data3306/mysql.pid --socket=/tmp/mysql3306.sock --port=3306 apps 17837 22150 0 15:27 pts/0 00:00:00 grep mysql [apps@MCBL051 bin]$ kill -9 3121 [apps@MCBL051 bin]$ ps -ef|grep mysql apps 17947 22150 0 15:27 pts/0 00:00:00 grep mysql 再启动mysql主库正常,启动从库slave正常,检查主从数据同步正常。
后续同样的问题出现,继续寻找原因
show processlist\G 跑最长时间线程处于killed状态,一直未释放 *************************** 6. row *************************** Id: 5449173 User: lotsprd Host: 10.16.26.199:51821 db: lots Command: Killed Time: 84026 State: Sending data Info: SELECT DISTINCT o.CUSTOMER_NAME, o.WAREHOUSE_NAME, o.CUSTOMER_ORDER_NO, o.ANNTO_ORDER_NO Progress: 0.000
很多正在跑sql的线程都处于Waiting for table flush状态
*************************** 10. row ***************************
Id: 5557272
User: lotsprd
Host: 172.16.12.155:54883
db: lots
Command: Query
Time: 70
State: Waiting for table flush
Info: SELECT
o.ANNTO_ORDER_NO,
ct.CARRIER_CODE,
ct.EXPRESS_NO,
ct.NODE_ACTION,
ct.ACCEPT_TI
Progress: 0.000
*************************** 11. row ***************************
Id: 5557273
User: lotsprd
Host: 172.16.12.155:54884
db: lots
Command: Query
Time: 9
State: Waiting for table flush
Info: SELECT annto_order_no,express_no,express_code,EXPRESS_NAME,source_express_code FROM (
SELECT
二.问题分析
1.检查binlog中是否有手工执行的flush table、alter table和analyze table命令
[apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep FLUSH [apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep flush [apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep analze [apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep ANALYZE [apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep alter [apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep ALTER 查询都没有手工执行的flush table、alter table和analyze table命令。
2.检查备份,发现在1点时的备份脚本中有在备份前kill掉长时间查询的线程
crontab -l 0 1 * * * /apps/sh/cron/new_backup_mysql.sh --socket=/tmp/mysql3306.sock >> /apps/tmp/dump_3306.log 2>&1 mydumper --user=${db_user} --password=${db_pass} --long-query-guard=${v_long_query_guard} --socket=${socket} --database=${dbname} --outputdir=${this_back_p} --compress --verbose=3 --logfile=${base_p}/${logfilename} ${kill_long_query_option} v_backup_type=0 #mydumper 备份 #log_mysql_start_status backup_mydumper if [ $try_times -eq 1 ] then kill_long_query_option="--kill-long-queries"
3.查看mydumper备份的日志,确认有在01:01:22时kill了运行超过54999s的线程
cd /apps/mysql_backup tail -300 new_backup_mysql3306.log.archive|more 2016-07-01 01:01:22 [WARNING] - Killed a query that was running for 54999s 2016-07-01 01:11:22 [ERROR] - Couldn't acquire global lock, snapshots will not be consistent: Lock wait timeout exceeded; try restar ting transaction 2016-07-01 01:21:24 [ERROR] - Error dumping table (lots.t_carrier_tracking) data: Lock wait timeout exceeded; try restarting transac tion 2016-07-01 01:21:24 [INFO] - Empty table lots.t_carrier_tracking 2016-07-01 01:21:42 [ERROR] - Error dumping table (lots.t_order) data: Lock wait timeout exceeded; try restarting transaction 2016-07-01 01:21:42 [INFO] - Empty table lots.t_order 2016-07-01 01:24:57 [ERROR] - Error dumping table (lots.t_warehouse) data: Lock wait timeout exceeded; try restarting transaction 2016-07-01 01:24:57 [INFO] - Empty table lots.t_warehouse 2016-07-01 01:31:24 [ERROR] - Error dumping schemas (lots.t_carrier_tracking): Lock wait timeout exceeded; try restarting transactio n 2016-07-01 01:31:24 [INFO] - Thread 2 shutting down 2016-07-01 01:31:42 [ERROR] - Error dumping schemas (lots.t_order): Lock wait timeout exceeded; try restarting transaction 2016-07-01 01:31:42 [INFO] - Thread 4 shutting down 2016-07-01 01:31:57 [ERROR] - Error dumping schemas (lots.t_warehouse): Lock wait timeout exceeded; try restarting transaction 2016-07-01 01:31:57 [INFO] - Thread 1 shutting down 但从上面报错来看,线程被kill,但线程和锁没有真正释放,影响到的表为T_ORDER、 t_warehouse、t_carrier_tracking。 在flush tables with read lock成功获得锁之前,必须等待所有语句执行完成(包括SELECT)。所以如果有个慢查询在执行,或者一个打开的事务,或者其他进程拿着表锁,flush tables with read lock就会被阻塞,直到所有的锁被释放。 这三个表:T_ORDER、 t_warehouse、t_carrier_tracking由于慢sql没有执行执行完成,导致加锁没有成功。所以下次再有新产生对三个表的select 或dml的sql执行时,线程就会出现waiting for table flush。我们mysql的lock_wait_timeout时间为600 秒,上面备份加锁时,等待锁超过了10分钟后,就会超时报错。 此次问题产生原因是慢SQL和备份操作杀进程引起的,主要原因还是由于慢sql长时间执行,性能太差。
应该有写锁或者非共享锁存在
三.解决方法
关闭mysql,mysql长时间未有关闭的话,使用kill -9来杀掉mysql进程。
最后将引起问题的慢SQL发给开发人员优化,优化人员反馈已优化业务逻辑,观察问题暂未出现。