MySQL反连接的优化总结（r10笔记第51天）

今天同事有一个环境发现一条语句执行时间很长，感到非常奇怪。刚好有些时间，就抽空琢磨了下这个问题。总体来看这个环境还是相对比较繁忙的，线程大概是200多个。 # mysqladmin pro|less|wc -l 235 带着好奇查看慢日志，马上定位到这个语句，已做了脱敏处理。 # Time: 161013 9:51:45 # User@Host: root[root] @ localhost [] # Thread_id: 24630498 Schema: test Last_errno: 1160 Killed: 0 # Query_time: 61213.561106 Lock_time: 0.000082 Rows_sent: 7551 Rows_examined: 201945890920 Rows_affected: 0 Rows_read: 7551 # Bytes_sent: 0 Tmp_tables: 1 Tmp_disk_tables: 0 Tmp_table_sizes: 0 # InnoDB_trx_id: 2F8E5A82 SET timestamp=1476323505; select account from t_fund_info where money >=300 and account not in (select distinct(login_account) from t_user_login_record where login_time >='2016-06-01') into outfile '/tmp/data.txt'; 从慢日志来看，执行时间达61213s,这个是相当惊人了，也就意味着这个语句跑了一整天。这引起了我的好奇和兴趣，这个问题有得搞头了。 表t_fund_info数据量近200万，存在一个主键在id列，唯一性索引在account上。 CREATE TABLE `t_fund_info` 。。。 PRIMARY KEY (`id`), UNIQUE KEY `account` (`account`) ) ENGINE=InnoDB AUTO_INCREMENT=1998416 DEFAULT CHARSET=utf8 表t_user_login_record数据量2千多万，存在主键列id CREATE TABLE `t_user_login_record` 。。。 PRIMARY KEY (`id`) ) ENGINE=InnoDB AUTO_INCREMENT=22676193 DEFAULT CHARSET=utf8 从语句可以看出，是在做一个批量的大查询，是希望把查询结果生成一个文本文件来，但是过滤条件很有限。目前根据查询来看肯定是全表扫描。先简单看了下过滤条件，从t_fund_info这个表中，根据一个过滤条件能过滤掉绝大多数的数据，得到1万多数据，还是比较理想的。 > select count(*)from t_fund_info where money >=300; +----------+ | count(*) | +----------+ | 13528 | +----------+ 1 row in set (0.99 sec) 那问题的瓶颈看来是在后面的子查询了。把下面的语句放入一个SQL脚本query.sql select distinct(login_account) from t_user_login_record where login_time >='2016-06-01'; 导出数据，大概耗时1分钟。 time mysql test < query.sql > query_rt.log real 0m59.149s user 0m0.394s sys 0m0.046s 过滤后的数据有50多万，相对还是比较理想的过滤情况。 # less query_rt.log|wc -l 548652 我们来解析一下这个语句，看看里面的Not in的条件是怎么解析的。 explain extended select account from t_fund_info where money >=300 and account not in (select distinct(login_account) from t_user_login_record where login_time >='2016-06-01')； show warnings; 结果如下： Message: select `test`.`t_fund_info`.`account` AS `account` from `test`.`t_fund_info` where ((`test`.`t_fund_info`.`money` >= 300) and (not(<in_optimizer>(`test`.`t_fund_info`.`account`,<exists>(select distinct 1 from `test`.`t_user_login_record` where ((`test`.`t_user_login_record`.`login_time` >= '2016-06-01') and ((<cache>(`test`.`t_fund_info`.`account`) = `test`.`t_user_login_record`.`login_account`) or isnull(`test`.`t_user_login_record`.`login_account`))) having <is_not_null_test>(`test`.`t_user_login_record`.`login_account`)))))) 可以看到整个解析的过程非常复杂，原本简单的一个语句，经过解析，竟然变得如此复杂。因为MySQL里面的优化改进空间相比Oracle还是少很多，我决定循序渐进来尝试优化。因为这个环境还是很重要的，所以我在从库端使用mysqldump导出数据，导入到另外一个测试环境，放开手脚来测试了。首先对于not in的部分，是否是因为生成临时表的消耗代价太高导致，所以我决定建立一个临时表来缓存子查询的数据。 > create table test_tab as select distinct(login_account) login_account from t_user_login_record where login_time >='2016-06-01'; Query OK, 548650 rows affected (1 min 3.78 sec) Records: 548650 Duplicates: 0 Warnings: 0 这样查看这个临时表就很轻松了，不到1秒就出结果。 > select count(*)from test_tab; +----------+ | count(*) | +----------+ | 548650 | +----------+ 1 row in set (0.38 sec) 然后再次查看使用临时表后的查询是否解析会有改善。 explain extended select account from t_fund_info where money >=300 and account not in(select login_account from test_tab); show warnings; 发现还是一样，可见临时表的改进效果不大。 | Note | 1003 | select `test`.`t_fund_info`.`account` AS `account` from `test`.`t_fund_info` where ((`test`.`t_fund_info`.`money` >= 300) and (not(<in_optimizer>(`test`.`t_fund_info`.`account`,<exists>(select 1 from `test`.`test_tab` where ((<cache>(`test`.`t_fund_info`.`account`) = `test`.`test_tab`.`login_account`) or isnull(`test`.`test_tab`.`login_account`)) having <is_not_null_test>(`test`.`test_tab`.`login_account`)))))) | 是否是因为子查询中的数据量太大导致整个反连接的查询过程中回表太慢，那我缩小一下子查询的数据条数。 select account from t_fund_info where money >=300 and not exists (select login_account from test_tab where login_account=t_fund_info.account limit 1,10); 这种方式依旧很卡，持续了近半个小时还是没有反应，所以果断放弃。是不是t_fund_info的过滤查询导致了性能问题，我们也创建一个临时表 > create table test_tab1 as select account from t_fund_info -> where money >=300; Query OK, 13528 rows affected (1.38 sec) Records: 13528 Duplicates: 0 Warnings: 0 再次查询效果依旧很不理想。 select account from test_tab1 where not exists (select login_account from test_tab where login_account=test_tab1.account limit 1,10); 持续了20多分钟还是没有反应，所以还是果断放弃。这个时候能想到就是索引了，我们在临时表test_tab上创建索引。 > create index ind_tmp_login_account on test_tab(login_account); Query OK, 0 rows affected (4.31 sec) Records: 0 Duplicates: 0 Warnings: 0 在临时表test_tab1上也创建索引。 > create index ind_tmp_account on test_tab1(account); Query OK, 0 rows affected (0.18 sec) Records: 0 Duplicates: 0 Warnings: 0 再次查看性能就变得很好了，运行时间0.15秒，简直不敢相信。 explain select account from test_tab1 where not exists (select login_account from test_tab where login_account=test_tab1.account ); 11364 rows in set (0.15 sec) 执行计划如下：

可见通过这种拆分，不断的猜测和排除，已经找到了一些思路。我们开始抓住问题的本质。首先删除test_tab1上的索引，看看执行效果如何。 > alter table test_tab1 drop index ind_tmp_account; Query OK, 0 rows affected (0.00 sec) Records: 0 Duplicates: 0 Warnings: 0 语句如下，执行时间0.15秒 select account from test_tab1 where not exists (select login_account from test_tab where login_account=test_tab1.account ); +--------------------------------+ 11364 rows in set (0.15 sec) 是否not in的方式会有很大的差别呢，持续0.18秒，有差别，但差别不大。 select account from test_tab1 where account not in (select login_account from test_tab ); +--------------------------------+ 11364 rows in set (0.18 sec) 我们逐步恢复原来的查询，去除临时表test_tab1,整个查询持续了1.12秒。 select account from t_fund_info where money >=300 and account not in(select login_account from test_tab); +--------------------------------+ 11364 rows in set (1.12 sec) 这个时候，问题已经基本定位了。在反连接的查询中，在这个问题场景中，需要对子查询的表添加一个索引基于login_account，可以和外层的查询字段映射，提高查询效率。当然在一个数据量庞大，业务相对繁忙的系统中，添加一个临时需求的索引可能不是一个很好的方案。不过我们还是在测试环境体验一下。 > create index ind_tmp_account1 on t_user_login_record(login_account); Query OK, 0 rows affected (4 min 45.48 sec) Records: 0 Duplicates: 0 Warnings: 0 添加索引的过程持续了近4分钟，在这个时候我们使用最开始的查询语句，性能如何呢。 select account from t_fund_info where money >=300 and account not in (select distinct(login_account) from t_user_login_record where); +--------------------------------+ 11364 rows in set (2.52 sec) 只要2.52秒就可以完成之前20多个小时查询结果，性能简直就是天壤之别。不过话说回来，跑批查询可以在从库上执行，从库上创建一个这样的索引，用完再删掉也是不错的选择，要么就是创建一个临时表，在临时表上创建索引，临时表的意义就在于此，不是为了做查询结果缓存而是创建索引来提高数据过滤效率。在此有个问题就是临时表只有一个字段，创建索引的意义在哪里呢。我画一个图来解释一下。

首先这个查询的数据是以t_fund_info的过滤条件为准，从200万数据中过滤得到1万条数据，然后两个字段通过account=login_account的条件关联，而不是先关联子查询的过滤条件 login_time,过滤完之后account的值之后再过滤login_time，最后根据not in的逻辑来取舍数据，整个数据集就会大大减少。如此一来，子查询的表千万行，性能的差别就不会是指数级的。