hangkk2008

  博客园  :: 首页  :: 新随笔  :: 联系 :: 订阅 订阅  :: 管理

转自:http://blog.csdn.net/zengxuewen2045/article/details/51412663 作为案例收集

一、基本环境信息

select version();
+---------------------+
| version()           |
+---------------------+
| 10.0.20-MariaDB-log |
+---------------------+
1 row in set (0.00 sec)
uname -a
Linux MCBL051 2.6.32-504.el6.x86_64 #1 SMP Tue Sep 16 01:56:35 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux

select @@tx_isolation;
+----------------+
| @@tx_isolation |
+----------------+
| READ-COMMITTED |
+----------------+
1 row in set (0.00 sec)

二、问题描述

     应用人员反应某系统生产环境,有一支SQL跑了很久,被kill了,但一直没有kill掉,后面多支类似SQL执行时被前面这支sql堵住,请dba支援。

三、问题分析

     从processlist来查看,TIME为52113秒,SQL执行大约14小时没有完成,而手工kill,无法释放线程资源,而堵住后面执行的多个类似SQL,应用人员将所有内外应用停了后,也没结束被kill的线程。下面显示SQL的command一直处于killed,没有真正完成。

show processlist\G
*************************** 3. row ***************************
      Id: 38023534
    User: lotsprd
    Host: 172.16.12.155:44805
      db: lots
 Command: Killed
    Time: 52113
   State: Sending data
    Info: SELECT DISTINCT
                        o.CN,
                        o.WNE,
                        o.CON,
                        o.ON
Progress: 0.000

查看事务信息,trx_mysql_thread_id为38023534的事务正处于RUNNING,对应同一支SQL,没有其它事务了,说明这个线程没有被其它线程锁住,事务中的表也没有上锁。

select * from innodb_trx\G
*************************** 1. row ***************************
                    trx_id: 636733127
                 trx_state: RUNNING
               trx_started: 2016-05-13 20:57:54
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 0
       trx_mysql_thread_id: 38023534
                 trx_query: SELECT DISTINCT
                        o.CN,
                        o.WNE,
                        o.CON,
                        o.ON,
                        o.A_NAME,
                        o.A_NO,
                        o.CREATE_DATE receipt_time,
                        o.delivery_date delivery_time
                FROM t o
                WHERE o.CREATE_DATE BETWEEN '2016-05-02 20:39:23' AND '2016-05-12 17:00:23'
                AND o.business_type = 'B2C'
                AND o.is_delivery = 1
                and o.order_type = 'PO'
                 
                        AND o.WNe = 'xxx小电仓'
                 
                 
                AND o.A_CODE NOT IN (38549, 44926)
                AND o.A_NO != ''
                AND o.is_got = 0
                AND o.IS_SIGN = 0
                AND NOT EXISTS (
                        SELECT id FROM t_carrier
                        WHERE ON = o.ON
                        AND remark NOT IN (
                                '开始处理',
                                '待配货',
                                '开始处理',
                                '信息审核通过',
                                '已通知配货',
                                '待配货',
                                '您的包裹已打包'
                        )
                )
                ORDER BY o.WNE, o.CREATE_DATE limit 20
       trx_operation_state: NULL
         trx_tables_in_use: 2
         trx_tables_locked: 0
          trx_lock_structs: 0
     trx_lock_memory_bytes: 360
           trx_rows_locked: 0
         trx_rows_modified: 0
   trx_concurrency_tickets: 2256
       trx_isolation_level: READ COMMITTED
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 10000
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
1 row in set (0.00 sec)

查看show engine innodb status的事务部分信息,发现跑该SQL的事务也没有被锁住,这样看来,这个应该就只是一个查询SQL,但select这种为什么会记录innodb_trx表中,有待进一步验证

mysql> show engine innodb status \G;
------------
TRANSACTIONS
------------
Trx id counter 637134193
Purge done for trx's n:o < 636733130 undo n:o < 0 state: running but idle
History list length 84354
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
MySQL thread id 38193498, OS thread handle 0x2b73c5442700, query id 2807386390 localhost dba init
show engine innodb status
---TRANSACTION 0, not started
MySQL thread id 38168362, OS thread handle 0x2b739680a700, query id 2807382298 10.63.32.108 lotsprd cleaning up
---TRANSACTION 3843, not started
MySQL thread id 1, OS thread handle 0x2b7396253700, query id 0 Waiting for background binlog tasks
---TRANSACTION 636733127, ACTIVE 63973 sec, thread declared inside InnoDB 2256
mysql tables in use 2, locked 0
MySQL thread id 38023534, OS thread handle 0x2b7396e99700, query id 2805177701 172.16.12.155 lotsprd Sending data
SELECT DISTINCT
                        o.CN,
                        o.WNE,
                        o.CON,
                        o.ON,
                        o.A_NAME,
                        o.A_NO,
                        o.CREATE_DATE receipt_time,
                        o.delivery_date delivery_time
                FROM t o
                WHERE o.CREATE_DATE BETWEEN '2016-05-02 20:39:23' AND '2016-05-12 17:00:23'
                AND o.business_type = 'B2C'
                AND o.is_delivery = 1
                and o.order_type = 'PO'
                 
                        AND o.WNe = 'xxx小电仓'
                 
                 
                AND o.A_CODE NOT IN (38549, 44926)
                AND o.A_NO != ''
                AND o.is_got = 0
                AND o.IS_SIGN = 0
                AND NOT EXISTS (
                        SELECT id FROM t_carrier
Trx read view will not see trx with id >= 636733128, sees < 636709213
Trx #rec lock waits 0 #table lock waits 0
Trx total rec lock wait time 0 SEC
Trx total table lock wait time 0 SEC

又查看主库processlist情况,显示已将所有binlog发到了salve上:

-----+
| Id       | User            | Host                | db   | Command     | Time     | State                                                                 | Info      
                                                                                          | Progress |
+----------+-----------------+---------------------+------+-------------+----------+-----------------------------------------------------------------------
+------------------------------------------------------------------------------------------------------+----------+
|        2 | event_scheduler | localhost           | NULL | Daemon      | 18320596 | Waiting on empty queue                                                | NULL      
                                                                                          |    0.000 |
| 12575445 | rep             | 10.16.19.88:30630   | NULL | Binlog Dump | 12434220 | Master has sent all binlog to slave; waiting for binlog to be updated | NULL      
                                                                                          |    0.000 |
| 38023534 | lotsprd         | 172.16.12.155:44805 | lots | Killed      |    65293 | Sending data                                                          | SELECT
DISTINCT
                        o.CN,
                        o.WNE,
                        o.CON,
                        o.ON |    0.000 |
| 38168361 | lotsprd         | 10.63.32.108:10310  | NULL | Sleep       |    15127 |                                                                       | NULL      
                                                                                          |    0.000 |
| 38168362 | lotsprd         | 10.63.32.108:10311  | lots | Sleep       |     2185 |                                                                       | NULL      
                                                                                          |    0.000 |
| 38171800 | lotsprd         | 10.63.32.108:11787  | lots | Sleep       |      665 |                                                                       | NULL      
                                                                                          |    0.000 |
| 38189047 | lotsprd         | 10.63.32.108:5627   | lots | Sleep       |     1412 |                                                                       | NULL      
                                                                                          |    0.000 |
| 38200552 | dba             | localhost           | NULL | Query       |        0 | init                                                                  | show
processlist                                                                                     |    0.000 |
+----------+-----------------+---------------------+------+-------------+----------+-----------------------------------------------------------------------
+------------------------------------------------------------------------------------------------------+-----
从库状态显示主从数据是同步的:
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.16.19.87
                  Master_User: rep
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000267
          Read_Master_Log_Pos: 476541394
               Relay_Log_File: relay-bin.000819
                Relay_Log_Pos: 476541681
        Relay_Master_Log_File: mysql-bin.000267
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 476541394
              Relay_Log_Space: 476542015
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 19873306
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: No
                  Gtid_IO_Pos: 
1 row in set (0.00 sec)

四、问题处理
     从网上查找资料,显示该问题只能重启mysql或是进行主从切换,由于上面查看相关信息有确认主从库的数据是同步的,而在应用人员已将所有应用都停了情况下,应用人员同意重启mysql

,用msyqladmin shutdown执行关闭时,一直hang住很久,查看错误日志显示如下:

160514 15:10:16 [Note] /apps/svr/mariadb10/bin/mysqld: Normal shutdown
160514 15:10:16 [Note] Event Scheduler: Killing the scheduler thread, thread id 2
160514 15:10:16 [Note] Event Scheduler: Waiting for the scheduler thread to reply
160514 15:10:16 [Note] Event Scheduler: Stopped
160514 15:10:16 [Note] Event Scheduler: Purging the queue. 0 events
160514 15:10:36 [Warning] /apps/svr/mariadb10/bin/mysqld: Forcing close of thread 38023534  user: 'lotsprd'
160514 15:10:36 [ERROR] mysqld: Got an error writing communication packets
shutdow 已hang住,此时主库也无法再提供对外服务,前面主从数据已确认同步,将从库停止slave。现在只能对mysql进程进行kill操作了,如下:
[apps@MCBL051 bin]$ ps -ef|grep mysql
apps      2584     1  0  2015 ?        00:00:00 /bin/sh /apps/svr/mariadb10/bin/mysqld_safe --defaults-file=/apps/conf/mysql/mariadb10_3306.cnf
apps      3121  2584 17  2015 ?        36-11:01:53 /apps/svr/mariadb10/bin/mysqld --defaults-file=/apps/conf/mysql/mariadb10_3306.cnf --basedir=/apps/svr/mariadb10 --
datadir=/apps/dbdat/mariadb10_data3306 --plugin-dir=/apps/svr/mariadb10/lib/plugin --log-error=/apps/logs/mysql/error3306.log --open-files-limit=8192 --pid-
file=/apps/dbdat/mariadb10_data3306/mysql.pid --socket=/tmp/mysql3306.sock --port=3306
apps     16651 22150  0 15:26 pts/0    00:00:00 grep mysql
[apps@MCBL051 bin]$ kill -9 2584
[apps@MCBL051 bin]$ ps -ef|grep mysql
apps      3121     1 17  2015 ?        36-11:01:53 /apps/svr/mariadb10/bin/mysqld --defaults-file=/apps/conf/mysql/mariadb10_3306.cnf --basedir=/apps/svr/mariadb10 --
datadir=/apps/dbdat/mariadb10_data3306 --plugin-dir=/apps/svr/mariadb10/lib/plugin --log-error=/apps/logs/mysql/error3306.log --open-files-limit=8192 --pid-
file=/apps/dbdat/mariadb10_data3306/mysql.pid --socket=/tmp/mysql3306.sock --port=3306
apps     17837 22150  0 15:27 pts/0    00:00:00 grep mysql
[apps@MCBL051 bin]$ kill -9 3121
[apps@MCBL051 bin]$ ps -ef|grep mysql
apps     17947 22150  0 15:27 pts/0    00:00:00 grep mysql
再启动mysql主库正常,启动从库slave正常,检查主从数据同步正常。

 

后续同样的问题出现,继续寻找原因

show processlist\G
跑最长时间线程处于killed状态,一直未释放
*************************** 6. row ***************************
      Id: 5449173
    User: lotsprd
    Host: 10.16.26.199:51821
      db: lots
 Command: Killed
    Time: 84026
   State: Sending data
    Info: SELECT DISTINCT
                        o.CUSTOMER_NAME,
                        o.WAREHOUSE_NAME,
                        o.CUSTOMER_ORDER_NO,
                        o.ANNTO_ORDER_NO
Progress: 0.000
很多正在跑sql的线程都处于Waiting for table flush状态
*************************** 10. row ***************************
      Id: 5557272
    User: lotsprd
    Host: 172.16.12.155:54883
      db: lots
 Command: Query
    Time: 70
   State: Waiting for table flush
    Info: SELECT
                        o.ANNTO_ORDER_NO,
                        ct.CARRIER_CODE,
                        ct.EXPRESS_NO,
                        ct.NODE_ACTION,
                        ct.ACCEPT_TI
Progress: 0.000
*************************** 11. row ***************************
      Id: 5557273
    User: lotsprd
    Host: 172.16.12.155:54884
      db: lots
 Command: Query
    Time: 9
   State: Waiting for table flush
    Info: SELECT annto_order_no,express_no,express_code,EXPRESS_NAME,source_express_code FROM (
                        SELECT

二.问题分析
1.检查binlog中是否有手工执行的flush table、alter table和analyze table命令

[apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep FLUSH
[apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep flush
[apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep analze
[apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep ANALYZE
[apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep alter
[apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep ALTER
查询都没有手工执行的flush tablealter table和analyze table命令。

2.检查备份,发现在1点时的备份脚本中有在备份前kill掉长时间查询的线程

crontab -l
0 1 * * *  /apps/sh/cron/new_backup_mysql.sh   --socket=/tmp/mysql3306.sock >> /apps/tmp/dump_3306.log 2>&1
mydumper --user=${db_user} --password=${db_pass} --long-query-guard=${v_long_query_guard} --socket=${socket} --database=${dbname} --outputdir=${this_back_p} --compress
--verbose=3 --logfile=${base_p}/${logfilename} ${kill_long_query_option}
 v_backup_type=0  #mydumper  备份
            #log_mysql_start_status
            backup_mydumper
            if [ $try_times -eq 1 ]
            then
                kill_long_query_option="--kill-long-queries"

3.查看mydumper备份的日志,确认有在01:01:22时kill了运行超过54999s的线程

cd /apps/mysql_backup
tail -300 new_backup_mysql3306.log.archive|more
2016-07-01 01:01:22 [WARNING] - Killed a query that was running for 54999s
2016-07-01 01:11:22 [ERROR] - Couldn't acquire global lock, snapshots will not be consistent: Lock wait timeout exceeded; try restar
ting transaction
2016-07-01 01:21:24 [ERROR] - Error dumping table (lots.t_carrier_tracking) data: Lock wait timeout exceeded; try restarting transac
tion 
2016-07-01 01:21:24 [INFO] - Empty table lots.t_carrier_tracking
2016-07-01 01:21:42 [ERROR] - Error dumping table (lots.t_order) data: Lock wait timeout exceeded; try restarting transaction 
2016-07-01 01:21:42 [INFO] - Empty table lots.t_order
2016-07-01 01:24:57 [ERROR] - Error dumping table (lots.t_warehouse) data: Lock wait timeout exceeded; try restarting transaction 
2016-07-01 01:24:57 [INFO] - Empty table lots.t_warehouse
2016-07-01 01:31:24 [ERROR] - Error dumping schemas (lots.t_carrier_tracking): Lock wait timeout exceeded; try restarting transactio
n
2016-07-01 01:31:24 [INFO] - Thread 2 shutting down
2016-07-01 01:31:42 [ERROR] - Error dumping schemas (lots.t_order): Lock wait timeout exceeded; try restarting transaction
2016-07-01 01:31:42 [INFO] - Thread 4 shutting down
2016-07-01 01:31:57 [ERROR] - Error dumping schemas (lots.t_warehouse): Lock wait timeout exceeded; try restarting transaction
2016-07-01 01:31:57 [INFO] - Thread 1 shutting down
但从上面报错来看,线程被kill,但线程和锁没有真正释放,影响到的表为T_ORDER、 t_warehouse、t_carrier_tracking。
在flush tables with read lock成功获得锁之前,必须等待所有语句执行完成(包括SELECT)。所以如果有个慢查询在执行,或者一个打开的事务,或者其他进程拿着表锁,flush tables
with read lock就会被阻塞,直到所有的锁被释放。
这三个表:T_ORDER、 t_warehouse、t_carrier_tracking由于慢sql没有执行执行完成,导致加锁没有成功。所以下次再有新产生对三个表的select 或dml的sql执行时,线程就会出现waiting
for table flush。我们mysql的lock_wait_timeout时间为600 秒,上面备份加锁时,等待锁超过了10分钟后,就会超时报错。

此次问题产生原因是慢SQL和备份操作杀进程引起的,主要原因还是由于慢sql长时间执行,性能太差。
应该有写锁或者非共享锁存在

 

三.解决方法
  关闭mysql,mysql长时间未有关闭的话,使用kill -9来杀掉mysql进程。
  最后将引起问题的慢SQL发给开发人员优化,优化人员反馈已优化业务逻辑,观察问题暂未出现。

posted on 2016-07-10 10:07  鱼儿也疯狂  阅读(2683)  评论(0编辑  收藏  举报