MySQL 5.7中锁的一个通用问题

前几天分析了一个死锁的问题，有一个网友看了以后，就发了邮件给我问一个问题。一般来说，能够发送邮件提出问题的同学，都是很认真的，因为他要准备好日志，准备好操作过程，准备好他已经在做的事情。所以这类问题，我都会认真的分析一下，如果没有结果，那就继续分析再等等，掐指一算，有很多问题已经拖了好久了。

这位网友提的一个问题，我看了以后感觉很是奇怪，因为有些颠覆我对MySQL锁的一些认识。这该如何是好。

这个环境的事务隔离级别是RR,存在主键，存在范围查询。

如何复现这个问题，网友提供了信息。

创建表
mysql> create table tt(a int not null primary key) engine=innodb;
mysql> insert into tt values(10),(20),(30),(40),(50);

复现这个问题可以参考：
session1:
mysql> set session tx_isolation='repeatable-read';
mysql> begin;
mysql> select * from tt where a > 15 and a < 35 for update;
+----+
| a  |
+----+
| 20 |
| 30 |
+----+

session2:
mysql>  insert into tt select 1;
此时这个操作会被阻塞，如果你按照这个思路来看，总是会感觉不对劲。
怎么MySQL这么矫情了。
我带着疑问在新搭建的一套MySQL 5.7环境上做了测试，结果还真是。
接下来的任务就是如何说服我，然后我理解了来说服这个网友。

结果这样一个操作下来，我连连测试了5个场景，如何SQL稍作改变，结果又会大大不同。

#for update的场景1

先来做一个基于主键的操作。先来验证一个最基本的情况，稳定下自己的情绪。

#session1

mysql> begin;
Query OK, 0 rows affected (0.00 sec)
 
mysql> select *from tt;
+----+
| a  |
+----+
| 10 |
| 20 |
| 30 |
| 40 |
| 50 |
+----+
5 rows in set (0.00 sec)

mysql> select * from tt where a =10 for update;
+----+
| a  |
+----+
| 10 |
+----+
1 row in set (0.00 sec)

#session2
mysql> insert into tt select 1;
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0

这是一个最为保守的使用方法，如果这个还有问题，那就明显证明数据库有问题了，基于主键，去掉范围扫描，肯定妥妥的。

#for update的场景2

这个场景里面我们修改下范围，原来的(15,35)修改为(10,30),结果差别就很大了。有些阻塞的语句我直接就手工取消了。由此也可以看出其中的差别来，不过可能会看得有点懵了。

#session1
mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from tt where a <30 and a>10 for update;
+----+
| a  |
+----+
| 20 |
+----+
1 row in set (0.00 sec)

session2:
mysql> insert into tt select 35;
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> insert into tt select 31;
Query OK, 1 row affected (0.00 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> insert into tt select 30;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted

mysql> insert into tt select 5;
Query OK, 1 row affected (0.00 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> insert into tt select 10;
ERROR 1062 (23000): Duplicate entry '10' for key 'PRIMARY'
mysql> insert into tt select 11;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted

这里可以明显感觉到和当前环境的数据分布关系很微妙，范围只有一个20，但是似乎和0有着一定的联系，至少，我不能保证我的查询一定得按照这个精确的范围。

#for update 场景3

这个场景我把最开始碰到的问题做了一些扩展，看看其它范围的数据是否也有类似的情况。我扩大了数据范围，结果很明显的，结果让我有些意料之外。

session1:
mysql> select *from tt;
+----+
| a  |
+----+
| 10 |
| 20 |
| 30 |
| 40 |
| 50 |
+----+
5 rows in set (0.00 sec)
mysql> begin;select * from tt where a <35 and a>15 for update;
Query OK, 0 rows affected (0.00 sec)

+----+
| a  |
+----+
| 20 |
| 30 |
+----+
2 rows in set (0.00 sec)

session2:
mysql>  insert into tt select 1;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql>  insert into tt select 9;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql>  insert into tt select 10;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql>  insert into tt select 35;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql>  insert into tt select 36;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql>  insert into tt select 40;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql>  insert into tt select 50;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted

#for update 场景4

尽管这个时候我已经有些乱了，但是我还是耐着性子测试了另外几个场景。我把范围有（15,35）修改为（15,30），结果让我很意外。原本阻塞的insert就可以了。


session1
mysql> begin;select * from tt where a <30 and a>15 for update;
Query OK, 0 rows affected (0.00 sec)

+----+
| a  |
+----+
| 20 |
+----+
1 row in set (0.00 sec)

session2
mysql>  insert into tt select 1;
Query OK, 1 row affected (0.00 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql>  insert into tt select 15;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql>  insert into tt select 15;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql>  insert into tt select 16;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql>  insert into tt select 14;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql>  insert into tt select 13;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql>  insert into tt select 11;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql>  insert into tt select 10;
ERROR 1062 (23000): Duplicate entry '10' for key 'PRIMARY'
mysql>  insert into tt select 9;
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0
mysql>  insert into tt select 30;
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
mysql>  insert into tt select 31;
Query OK, 1 row affected (0.00 sec)
Records: 1  Duplicates: 0  Warnings:

这一切的对比，从直观感受来看是和表里的数据分布是有一定的关系的。

比如场景4，如果我把范围由（15,35）修改为（15,30),这个数据的情况有什么特别之处吗，从我的猜测来看，应该是和里面的索引存储有一定的关系，我查看了Information_schema.innodb_trx,innodb_locks的细节，里面都是指向了同一行。

mysql> select * from INFORMATION_SCHEMA.innodb_locksG;
*************************** 1. row ***************************
    lock_id: 4081:36:3:2
lock_trx_id: 4081
  lock_mode: X,GAP
  lock_type: RECORD
 lock_table: `test`.`tt`
 lock_index: PRIMARY
 lock_space: 36
  lock_page: 3
   lock_rec: 2
  lock_data: 10
*************************** 2. row ***************************
    lock_id: 4078:36:3:2
lock_trx_id: 4078
  lock_mode: X
  lock_type: RECORD
 lock_table: `test`.`tt`
 lock_index: PRIMARY
 lock_space: 36
  lock_page: 3
   lock_rec: 2
  lock_data: 10
2 rows in set, 1 warning (0.00 sec)

通过上面的信息可以看到，都是只想了页面3的数据第2行，这个明显就不对应啊。

但是MySQL 5.7中出现这个问题，自己还是带着一丝的侥幸心理，在MGR上测试了一把，能够复现，结果今天继续耐着性子看了下这个问题，在5.6上模拟了一下，5.6全然没有这个问题，问题到了这里，就有了柳暗花明的一面，能够肯定的是这个问题在MySQL 5.7中可以复现，在MySQL 5.6中是正常的。

如此一来，问题的定论就有了方向，很快就在bugs.mysql.com里面找到了一个相关的bug(85749)

里面也做了类似的测试，能够复现，MySQL官方做了确认。

[31 Mar 18:10] Sinisa Milivojevic

Hi!
I have run your test case and got the same results as you have.
Upon further analysis, I concluded that this is a bug.  A small bug , but a bug.
Verified.
而有看点的是问题的提出者定位到了相关的代码，还是希望文档的部分能够把间隙锁的部分补充一下。
No locks are released in this case, but we do request X lock on the gap before the next, non-matching record when non-unique secondary index is used. Check code starting from this line (https://github.com/mysql/mysql-server/blob/71f48ab393bce80a59e5a2e498cd1f46f6b43f9a/storag...):
			/* Try to place a gap lock on the next index record
			to prevent phantoms in ORDER BY ... DESC queries */
			const rec_t*	next_rec = page_rec_get_next_const(rec);

			offsets = rec_get_offsets(next_rec, index, offsets,
						  ULINT_UNDEFINED, &heap);
			err = sel_set_rec_lock(pcur,
					       next_rec, index, offsets,
					       prebuilt->select_lock_type,
LOCK_GAP, thr, &mtr);

in row_search_mvcc(). See the (potential) reason to set this gap lock in the comment above.
Maybe there is another reason for the behavior we see. Then it should be also documented.