show status和set gtid_mode 导致线程死锁案例

时间:2022-06-23
本文章向大家介绍show status和set gtid_mode 导致线程死锁案例,主要内容包括其使用实例、应用技巧、基本知识点总结和需要注意事项,具有一定的参考价值,需要的朋友可以参考一下。

作者简介

杨奇龙,网名“北在南方”,7年DBA老兵,目前任职于杭州有赞科技DBA,主要负责数据库架构设计和运维平台开发工作,擅长数据库性能调优、故障诊断。

特别感谢远在重庆的高鹏和我们组的小伙伴王航威一起分析本次案例,推荐横屏查看

一 现象

我们数据库组今年上半年的计划之一是将所有数据库实例打开GTID特性。在线上进行灰度开启GITD过程中遇到数据库hang。具体表现是执行如下命令时:

set global gtid_mode=OFF_PERMISSIVE;

数据库hang住,阻塞读写请求,新建连接失败,业务不可用。

二 复现

高频率执行show global status 和 set gtid_mode ,MySQL实例很容易就hang住了。

for i in `seq  1 50000`;
do
 my 3307 -e 'show  global status' > /tmp/1 ;
 sleep 1
done
for i in `seq  1 50000`;
do
  my 3307 -e 'SET @@GLOBAL.GTID_MODE = OFF_PERMISSIVE;';
  my 3307 -e 'SET @@GLOBAL.GTID_MODE = ON_PERMISSIVE;';
  my 3307 -e 'SET @@GLOBAL.GTID_MODE = ON;';
  my 3307 -e 'SET @@GLOBAL.GTID_MODE = ON_PERMISSIVE;';
  my 3307 -e 'SET @@GLOBAL.GTID_MODE = OFF_PERMISSIVE;';
  my 3307 -e 'SET @@GLOBAL.GTID_MODE = OFF;';
done

三 分析

3.1 获取数据库hang的时候的堆栈

pstack $pidofmysql

核心堆栈信息分别是 thread 2 执行show global status 命令 thread 4 执行set global gtid_mode 命令

3.2 使用gdb 分析当时阻塞到什么地方

注意: 堆栈信息的执行顺序实际上是从下往上的,#n大的先执行,我们分析的时候也需要向上到这来,比如thread 4 从#23开始

3.3 show global status 的堆栈

(gdb) thread 4
[Switching to thread 4 (Thread 0x7f8675f0b700 (LWP 49624))]#0  0x00007f86758ef334 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb)
[Switching to thread 4 (Thread 0x7f8675f0b700 (LWP 49624))]#0  0x00007f86758ef334 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f86758ef334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f86758ea60e in _L_lock_995 () from /lib64/libpthread.so.0
#2  0x00007f86758ea576 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00000000019219b9 in native_mutex_lock (mutex=0x2e20e48) at /root/percona-server-5.7.22-22/include/thr_mutex.h:84
#4  0x0000000001921c21 in safe_mutex_lock (mp=0x2e20e20, try_lock=0 '00', file=0x21f2168 "/root/percona-server-5.7.22-22/sql/sql_show.cc", line=3127)
    at /root/percona-server-5.7.22-22/mysys/thr_mutex.c:109
#5  0x000000000162e890 in my_mutex_lock (mp=0x2e20e20, file=0x21f2168 "/root/percona-server-5.7.22-22/sql/sql_show.cc", line=3127)
    at /root/percona-server-5.7.22-22/include/thr_mutex.h:180
#6  0x000000000162e98b in inline_mysql_mutex_lock (that=0x2e20e20, src_file=0x21f2168 "/root/percona-server-5.7.22-22/sql/sql_show.cc", src_line=3127)
    at /root/percona-server-5.7.22-22/include/mysql/psi/mysql_thread.h:730
#7  0x0000000001636798 in show_status_array (thd=0x7f83a7c19000, wild=0x0, variables=0x2da0780, value_type=OPT_GLOBAL, status_var=0x7f8675f07ef0,
    prefix=0x7f8675f07dc0 "Innodb", tl=0x7f83a7c27c58, ucase_names=false, cond=0x0) at /root/percona-server-5.7.22-22/sql/sql_show.cc:3127
#8  0x0000000001636700 in show_status_array (thd=0x7f83a7c19000, wild=0x0, variables=0x7f8673c60420, value_type=OPT_GLOBAL, status_var=0x7f8675f07ef0, prefix=0x21f2f84 "",
    tl=0x7f83a7c27c58, ucase_names=false, cond=0x0) at /root/percona-server-5.7.22-22/sql/sql_show.cc:3116
#9  0x0000000001648deb in fill_status (thd=0x7f83a7c19000, tables=0x7f83a7c27c58, cond=0x0) at /root/percona-server-5.7.22-22/sql/sql_show.cc:8084
#10 0x000000000164b2e9 in do_fill_table (thd=0x7f83a7c19000, table_list=0x7f83a7c27c58, qep_tab=0x7f83a7c28d78) at /root/percona-server-5.7.22-22/sql/sql_show.cc:8791
#11 0x000000000164b7f3 in get_schema_tables_result (join=0x7f83a7c28698, executed_place=PROCESSED_BY_JOIN_EXEC) at /root/percona-server-5.7.22-22/sql/sql_show.cc:8921
#12 0x000000000161fc39 in JOIN::prepare_result (this=0x7f83a7c28698) at /root/percona-server-5.7.22-22/sql/sql_select.cc:909
#13 0x000000000157cb71 in JOIN::exec (this=0x7f83a7c28698) at /root/percona-server-5.7.22-22/sql/sql_executor.cc:124
#14 0x000000000161e7ff in handle_query (thd=0x7f83a7c19000, lex=0x7f83a7c1b630, result=0x7f83a7c285d0, added_options=0, removed_options=0)
    at /root/percona-server-5.7.22-22/sql/sql_select.cc:185
#15 0x00000000015d044f in execute_sqlcom_select (thd=0x7f83a7c19000, all_tables=0x7f83a7c27c58) at /root/percona-server-5.7.22-22/sql/sql_parse.cc:5445
#16 0x00000000015c8704 in mysql_execute_command (thd=0x7f83a7c19000, first_level=true) at /root/percona-server-5.7.22-22/sql/sql_parse.cc:2892
#17 0x00000000015d14b6 in mysql_parse (thd=0x7f83a7c19000, parser_state=0x7f8675f0a5c0) at /root/percona-server-5.7.22-22/sql/sql_parse.cc:5901
#18 0x00000000015c504a in dispatch_command (thd=0x7f83a7c19000, com_data=0x7f8675f0ad30, command=COM_QUERY) at /root/percona-server-5.7.22-22/sql/sql_parse.cc:1490
#19 0x00000000015c3dd7 in do_command (thd=0x7f83a7c19000) at /root/percona-server-5.7.22-22/sql/sql_parse.cc:1021
#20 0x000000000170c998 in handle_connection (arg=0x7f83ad7ff460) at /root/percona-server-5.7.22-22/sql/conn_handler/connection_handler_per_thread.cc:312
#21 0x0000000001943acc in pfs_spawn_thread (arg=0x7f8673d87120) at /root/percona-server-5.7.22-22/storage/perfschema/pfs.cc:2190
#22 0x00007f86758e8aa1 in start_thread () from /lib64/libpthread.so.0
#23 0x00007f8674852bcd in clone () from /lib64/libc.so.6
(gdb) f 7
#7  0x0000000001636798 in show_status_array (thd=0x7f83a7c19000, wild=0x0, variables=0x2da0780, value_type=OPT_GLOBAL, status_var=0x7f8675f07ef0,
    prefix=0x7f8675f07dc0 "Innodb", tl=0x7f83a7c27c58, ucase_names=false, cond=0x0) at /root/percona-server-5.7.22-22/sql/sql_show.cc:3127
3127            mysql_mutex_lock(&LOCK_global_system_variables);

分析

根据线程执行的顺序来分析:

关键函数 fill_status 先获取mutex锁 mysql_mutex_lock(&LOCK_status)

代码位置 sql_show.cc:8019

int fill_status(THD *thd, TABLE_LIST *tables, Item *cond)
{
  if (thd->fill_status_recursion_level++ == 0)
    mysql_mutex_lock(&LOCK_status);
  if (thd->fill_status_recursion_level-- == 1)
    mysql_mutex_unlock(&LOCK_status);
}

通 执行show global status的线程阻塞在申请mutex锁 mysql_mutex_lock(&LOCK_global_system_variables) 的地方。

关键函数:show_status_array: (代码片段) 每次获取一个系统状态,就请求一次mysql_mutex_lock(&LOCK_global_system_variables)锁。

 if (!(wild && wild[0] && wild_case_compare(system_charset_info,
                                                 name_buffer, wild)) &&
          (!partial_cond || partial_cond->val_int()))
      {
        const char *pos;
        size_t length;

        mysql_mutex_lock(&LOCK_global_system_variables);
        pos= get_one_variable(thd, var, value_type, show_type, status_var,
                              &charset, buff, &length);
        table->field[1]->store(pos, (uint32) length, charset);
        thd->count_cuted_fields= CHECK_FIELD_IGNORE;
        table->field[1]->set_notnull();
        mysql_mutex_unlock(&LOCK_global_system_variables);

而 LOCK_global_system_variables 则是 mysql_mutex_t 结构。代码位置 mysqld.cc:686

mysql_mutex_t
  LOCK_status, LOCK_uuid_generator,
  LOCK_crypt,
  LOCK_global_system_variables,
  LOCK_user_conn, LOCK_slave_list,
  LOCK_error_messages;

3.4 修改gtid_mode的线程调用栈

gdb) thread 2
[Switching to thread 2 (Thread 0x7f8675ea9700 (LWP 27921))]#0  0x00007f86758ef334 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f86758ef334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f86758ea60e in _L_lock_995 () from /lib64/libpthread.so.0
#2  0x00007f86758ea576 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00000000019219b9 in native_mutex_lock (mutex=0x2e20cc8) at /root/percona-server-5.7.22-22/include/thr_mutex.h:84
#4  0x0000000001921c21 in safe_mutex_lock (mp=0x2e20ca0, try_lock=0 '00', file=0x22c7b38 "/root/percona-server-5.7.22-22/sql/binlog.cc", line=10145)
    at /root/percona-server-5.7.22-22/mysys/thr_mutex.c:109
#5  0x0000000001873fa9 in my_mutex_lock (mp=0x2e20ca0, file=0x22c7b38 "/root/percona-server-5.7.22-22/sql/binlog.cc", line=10145)
    at /root/percona-server-5.7.22-22/include/thr_mutex.h:180
#6  0x000000000187427e in inline_mysql_mutex_lock (that=0x2e20ca0, src_file=0x22c7b38 "/root/percona-server-5.7.22-22/sql/binlog.cc", src_line=10145)
    at /root/percona-server-5.7.22-22/include/mysql/psi/mysql_thread.h:730
#7  0x000000000188a903 in MYSQL_BIN_LOG::publish_coordinates_for_global_status (this=0x2e68720) at /root/percona-server-5.7.22-22/sql/binlog.cc:10145
#8  0x0000000001885115 in MYSQL_BIN_LOG::rotate (this=0x2e68720, force_rotate=true, check_purge=0x7f8675ea6aff) at /root/percona-server-5.7.22-22/sql/binlog.cc:7777
#9  0x00000000016ba82c in Sys_var_gtid_mode::global_update (this=0x2e52820, thd=0x7f83a8419000, var=0x7f83a8427ce8) at /root/percona-server-5.7.22-22/sql/sys_vars.h:2951
#10 0x00000000014f848d in sys_var::update (this=0x2e52820, thd=0x7f83a8419000, var=0x7f83a8427ce8) at /root/percona-server-5.7.22-22/sql/set_var.cc:184
#11 0x00000000014f9d0c in set_var::update (this=0x7f83a8427ce8, thd=0x7f83a8419000) at /root/percona-server-5.7.22-22/sql/set_var.cc:836
#12 0x00000000014f94d7 in sql_set_variables (thd=0x7f83a8419000, var_list=0x7f83a841ba08, free_joins=true) at /root/percona-server-5.7.22-22/sql/set_var.cc:672
#13 0x00000000015cb410 in mysql_execute_command (thd=0x7f83a8419000, first_level=true) at /root/percona-server-5.7.22-22/sql/sql_parse.cc:3858
#14 0x00000000015d14b6 in mysql_parse (thd=0x7f83a8419000, parser_state=0x7f8675ea85c0) at /root/percona-server-5.7.22-22/sql/sql_parse.cc:5901
#15 0x00000000015c504a in dispatch_command (thd=0x7f83a8419000, com_data=0x7f8675ea8d30, command=COM_QUERY) at /root/percona-server-5.7.22-22/sql/sql_parse.cc:1490
#16 0x00000000015c3dd7 in do_command (thd=0x7f83a8419000) at /root/percona-server-5.7.22-22/sql/sql_parse.cc:1021
#17 0x000000000170c998 in handle_connection (arg=0x7f835efff220) at /root/percona-server-5.7.22-22/sql/conn_handler/connection_handler_per_thread.cc:312
#18 0x0000000001943acc in pfs_spawn_thread (arg=0x7f8673d87320) at /root/percona-server-5.7.22-22/storage/perfschema/pfs.cc:2190
#19 0x00007f86758e8aa1 in start_thread () from /lib64/libpthread.so.0
#20 0x00007f8674852bcd in clone () from /lib64/libc.so.6
(gdb) f 7
#7  0x000000000188a903 in MYSQL_BIN_LOG::publish_coordinates_for_global_status (this=0x2e68720) at /root/percona-server-5.7.22-22/sql/binlog.cc:10145
10145     mysql_mutex_lock(&LOCK_status);
(gdb) f 10
#10 0x00000000014f848d in sys_var::update (this=0x2e52820, thd=0x7fb63c019000, var=0x7fb63c027ce8) at /root/percona-server-5.7.22-22/sql/set_var.cc:184
184       (on_update && on_update(this, thd, OPT_GLOBAL));
(gdb) l
179       to make it safe we'll need value_ptr_unlock().
180     */
181     AutoWLock lock1(&PLock_global_system_variables);
182     AutoWLock lock2(guard);
183     return global_update(thd, var) ||
184       (on_update && on_update(this, thd, OPT_GLOBAL));
185   }
186   else
187   {
188     bool locked= false;

分析

根据线程执行的顺序来分析: sys_var::update 函数执行的时候需要申请 AutoWLock lock1(&PLock_global_system_variables);

代码位置 set_var.cc :32行

static PolyLock_mutex PLock_global_system_variables(&LOCK_global_system_variables);

其实PLock_global_system_variables 也是需要申请mutex锁 &LOCK_global_system_variables。

代码位置 binlog.cc:10145

/** Copy the current binlog coordinates to the variables used for the
not-in-consistent-snapshot case of SHOW STATUS */

void MYSQL_BIN_LOG::publish_coordinates_for_global_status(void) const
{
  mysql_mutex_assert_owner(&LOCK_log);

  mysql_mutex_lock(&LOCK_status);
  strcpy(binlog_global_snapshot_file, log_file_name);
  binlog_global_snapshot_position=
      my_b_inited(&log_file) ? my_b_tell(&log_file) : 0;
  mysql_mutex_unlock(&LOCK_status);
}

设置gtid_mode的线程等待在第7个步骤,进行binlog切换时,此时需要申请mysql_mutex_lock(&LOCK_status)。然而&LOCK_status已经被show globa status的 fill_status 函数先申请获得。show global status需要申请mutex锁&LOCK_global_system_variables 被set gtid_mode的sys_var::update函数持有。于是相互等待死锁。

一图胜千言:

3.4 解释为什么会hang住,其他新的请求无法建立连接。

连接的初始化需要LOCK_global_system_variables, 此mutex被占用,导致新连接无法初始化

代码位置 sql_class.cc:1629

void THD::init(void)
{
  mysql_mutex_lock(&LOCK_global_system_variables);
  ...
  mysql_mutex_unlock(&LOCK_global_system_variables);

3.5 SQL查询时也需要LOCK_global_system_variables, 此mutex被占用,导致查询被阻塞.

代码位置 item_func.cc:7275

void Item_func_get_system_var::fix_length_and_dec()
{
  ...
  mysql_mutex_lock(&LOCK_global_system_variables);
}

四 总结

首先 再次感谢远在重庆的高鹏不辞辛苦的帮忙分析定位以及给出各种解释。通过这次分析我们对gdb和MySQL 源码又有进一步了解。

其次如何规避,我们目前的规避方式是减少show 命令的执行,以后做gtid的变更的时候 会关闭监控以及有赞专门开发的数据库性能快照系统以便和 set gtid_mode 命令冲突。

参考文章

1 深度分析 | GDB定位MySQL5.7特定版本hang死的故障分析

2 使用gdb调试程序完全教程

https://blog.csdn.net/gatieme/article/details/51671430

3 一个 Linux 上分析死锁的简单方法

https://www.ibm.com/developerworks/cn/linux/l-cn-deadlock/index.html