Learning Redis Cluster: Hands-On Practice (Part 2)

Date: 2022-06-20

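1 Introduction

The previous article covered how to build a Redis Cluster and manage its slots. This article uses the addition of a cluster node as a running example to introduce the cluster management tool, and then covers failover based on the master-replica structure.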

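2 Cluster Operations in Practice

2.1 Introduction to the redis-trib.rb tool

The examples in this article test the cluster operation and management features on Redis 4.0.4. Redis 5.x integrates cluster management into redis-cli, and redis-trib.rb is no longer recommended there. Interested readers can start directly with Redis 5.0.

Let's first look at the help output of the management tool redis-trib.rb (the # comments are annotations):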

redis-trib.rb 
Usage: redis-trib <command> <options> <arguments ...>
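create          host1:port1 ... hostN:portN  # create a cluster
                  --replicas <arg> # create replicas as well; arg is the number of replicas per master
check           host:port # check the cluster: nodes, master-replica relationships, slot allocation, and so on
info            host:port # show cluster information
fix             host:port # repair the cluster; run this when a slot migration has gone wrong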
                --timeout <arg>
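# Online slot migration
reshard         host:port # required; the node used as the entry point for fetching cluster information
                --from <arg>  # node IDs of the source nodes to migrate slots from
                --to <arg>    # node ID of the destination node
                --slots <arg> # number of slots to migrate; if omitted, the tool prompts for it during the migration
                --yes         # skip the interactive yes confirmation that is otherwise requested after the reshard plan is printed
                --timeout <arg>  # timeout for the MIGRATE command
                --pipeline <arg> # number of keys fetched by each CLUSTER GETKEYSINSLOT call; defaults to 10 if omitted
# Balance the number of slots across cluster nodes
rebalance       host:port
                --weight <arg>  # node weight; defaults to 1 if not specified
                --auto-weights
                --use-empty-masters # by default, masters with no slots assigned do not take part in rebalance;
                                    # set --use-empty-masters to include them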
                --timeout <arg>
                --simulate
                --pipeline <arg>
                --threshold <arg>

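# Add a new node to the cluster
add-node          new_host:new_port existing_host:existing_port
                  --slave   # add the new node as a replica of a node already in the cluster
                  --master-id <arg>

# Remove a node from the cluster; the node must be empty, with no slots assigned
del-node        host:port node_id

# Set the timeout for heartbeat connections between cluster nodes
set-timeout     host:port milliseconds

# Run a command on every node in the cluster
call            host:port command arg arg .. arg

# Import data from an external Redis instance into the cluster
import          host:port
                  --from <arg>
                  --copy
                  --replace

For the check, fix, reshard, del-node, and set-timeout commands, the host:port argument can be the ip:port of any node in the cluster; it serves as the entry point for fetching the overall cluster information.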

2.2 Viewing key counts, slot allocation, and replica status

# redis-trib.rb info 10.215.20.7:7001
10.215.20.7:7001 (143f4d2d...) -> 99975 keys | 5461 slots | 0 slaves.
10.215.20.7:7003 (d54a286d...) -> 99956 keys | 5461 slots | 0 slaves.
10.215.20.7:7002 (ae0e1276...) -> 100069 keys | 5462 slots | 0 slaves.
[OK] 300000 keys in 3 masters.
18.31 keys per slot on average.
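
The summary lines can be cross-checked from the per-node lines. The following is a minimal Python sketch that assumes the info output format shown above; the `parse_info` helper and its regular expression are illustrative and not part of redis-trib.rb.

```python
import re

INFO_OUTPUT = """\
10.215.20.7:7001 (143f4d2d...) -> 99975 keys | 5461 slots | 0 slaves.
10.215.20.7:7003 (d54a286d...) -> 99956 keys | 5461 slots | 0 slaves.
10.215.20.7:7002 (ae0e1276...) -> 100069 keys | 5462 slots | 0 slaves.
"""

# Hypothetical pattern for one per-master line of the info output.
LINE_RE = re.compile(
    r"^(\S+) \(\w+\.\.\.\) -> (\d+) keys \| (\d+) slots \| (\d+) slaves\.$"
)

def parse_info(text):
    """Return one (address, keys, slots, replicas) tuple per master line."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            rows.append((m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4))))
    return rows

rows = parse_info(INFO_OUTPUT)
total_keys = sum(r[1] for r in rows)
total_slots = sum(r[2] for r in rows)
print(f"{total_keys} keys in {len(rows)} masters, {total_slots} slots")
print(f"{total_keys / total_slots:.2f} keys per slot on average")
```

The totals come to 300000 keys over 16384 slots, which matches the 18.31 average reported by the tool.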

2.3 Creating a cluster

redis-trib.rb create --replicas 0 10.215.20.7:7001 10.215.20.7:7002 10.215.20.7:7003

This joins the listed nodes into one cluster and assigns slots to them.
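
To see where the slot ranges reported later in this article (0-5460, 5461-10922, 10923-16383) come from, here is a minimal Python sketch that splits the 16384 slots into contiguous ranges. The `split_slots` function is illustrative. It reproduces the ranges seen in this article, but it is an assumption that redis-trib.rb computes them in exactly this way.

```python
TOTAL_SLOTS = 16384

def split_slots(master_count):
    """Split slots 0..16383 into contiguous ranges, one per master."""
    per_node = TOTAL_SLOTS / master_count
    ranges = []
    first = 0
    cursor = 0.0
    for i in range(master_count):
        # Round the fractional boundary to the nearest slot (half rounds up).
        last = int(cursor + per_node - 1 + 0.5)
        # The last master always takes whatever remains.
        if i == master_count - 1 or last > TOTAL_SLOTS - 1:
            last = TOTAL_SLOTS - 1
        ranges.append((first, last))
        first = last + 1
        cursor += per_node
    return ranges

for (first, last), port in zip(split_slots(3), (7001, 7002, 7003)):
    print(f"10.215.20.7:{port} slots:{first}-{last} ({last - first + 1} slots)")
```

Because 16384 is not divisible by 3, one master ends up with 5462 slots and the other two with 5461.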

2.4 Adding a node

Add the new node 10.215.20.7:7004 to the cluster.

# redis-trib.rb add-node 10.215.20.7:7004 10.215.20.7:7001
>>> Adding node 10.215.20.7:7004 to cluster 10.215.20.7:7001
>>> Performing Cluster Check (using node 10.215.20.7:7001)
M: 143f4d2d8788125e62a46ccea2663dcb42847e76 10.215.20.7:7001
   slots:0-5460 (5461 slots) master
   0 additional replica(s)
M: d54a286d7d55bef355173a04eb8e75692c09a06d 10.215.20.7:7003
   slots:10923-16383 (5461 slots) master
   0 additional replica(s)
M: ae0e1276b6af4850f5143aea9a4a6297f2719160 10.215.20.7:7002
   slots:5461-10922 (5462 slots) master
   0 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
>>> Send CLUSTER MEET to node 10.215.20.7:7004 to make it join the cluster.
[OK] New node added correctly.

The node has now been added. Next, check the cluster information and slot allocation.

2.5 Checking slot allocation

# redis-trib.rb info 10.215.20.7:7001
10.215.20.7:7001 (143f4d2d...) -> 99975 keys | 5461 slots | 0 slaves.
10.215.20.7:7004 (cbedf1e1...) -> 0 keys | 0 slots | 0 slaves.
10.215.20.7:7003 (d54a286d...) -> 99956 keys | 5461 slots | 0 slaves.
10.215.20.7:7002 (ae0e1276...) -> 100069 keys | 5462 slots | 0 slaves.
[OK] 300000 keys in 4 masters.
18.31 keys per slot on average.

At this point the new node 10.215.20.7:7004 has no slots. Next, use the reshard command to assign slots to it.

2.6 redis-trib.rb reshard: online slot migration

# redis-trib.rb reshard 10.215.20.7:7001
>>> Performing Cluster Check (using node 10.215.20.7:7001)
M: 143f4d2d8788125e62a46ccea2663dcb42847e76 10.215.20.7:7001
   slots:0-5460 (5461 slots) master
   0 additional replica(s)
M: cbedf1e1d789ee926fb3b85dd3df575e96228cc4 10.215.20.7:7004
   slots: (0 slots) master
   0 additional replica(s)
M: d54a286d7d55bef355173a04eb8e75692c09a06d 10.215.20.7:7003
   slots:10923-16383 (5461 slots) master
   0 additional replica(s)
M: ae0e1276b6af4850f5143aea9a4a6297f2719160 10.215.20.7:7002
   slots:5461-10922 (5462 slots) master
   0 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.

#### Number of slots to migrate to the destination node
How many slots do you want to move (from 1 to 16384)? 15
### Node ID of the destination node
What is the receiving node ID? cbedf1e1d789ee926fb3b85dd3df575e96228cc4
Please enter all the source node IDs.
  Type 'all' to use all the nodes as source nodes for the 
  hash slots.
  Type 'done' once you entered all the source nodes IDs.
### Which nodes to migrate from: type all to use every master in the cluster,
### or enter the source node IDs one per prompt and finish with done.
Source node #1:all
Ready to move 15 slots.
  Source nodes:
    M: 143f4d2d8788125e62a46ccea2663dcb42847e76 10.215.20.7:7001
   slots:0-5460 (5461 slots) master
   0 additional replica(s)
    M: d54a286d7d55bef355173a04eb8e75692c09a06d 10.215.20.7:7003
   slots:10923-16383 (5461 slots) master
   0 additional replica(s)
    M: ae0e1276b6af4850f5143aea9a4a6297f2719160 10.215.20.7:7002
   slots:5461-10922 (5462 slots) master
   0 additional replica(s)
  Destination node:
    M: cbedf1e1d789ee926fb3b85dd3df575e96228cc4 10.215.20.7:7004
   slots: (0 slots) master
   0 additional replica(s)
  Resharding plan:
  Moving slot 5461 from ae0e1276b6af4850f5143aea9a4a6297f2719160
  Moving slot 5462 from ae0e1276b6af4850f5143aea9a4a6297f2719160
  Moving slot 5463 from ae0e1276b6af4850f5143aea9a4a6297f2719160
  Moving slot 5464 from ae0e1276b6af4850f5143aea9a4a6297f2719160
  Moving slot 5465 from ae0e1276b6af4850f5143aea9a4a6297f2719160
  Moving slot 5466 from ae0e1276b6af4850f5143aea9a4a6297f2719160
  Moving slot 0 from 143f4d2d8788125e62a46ccea2663dcb42847e76
  Moving slot 1 from 143f4d2d8788125e62a46ccea2663dcb42847e76
  Moving slot 2 from 143f4d2d8788125e62a46ccea2663dcb42847e76
  Moving slot 3 from 143f4d2d8788125e62a46ccea2663dcb42847e76
  Moving slot 10923 from d54a286d7d55bef355173a04eb8e75692c09a06d
  Moving slot 10924 from d54a286d7d55bef355173a04eb8e75692c09a06d
  Moving slot 10925 from d54a286d7d55bef355173a04eb8e75692c09a06d
  Moving slot 10926 from d54a286d7d55bef355173a04eb8e75692c09a06d
#### Confirm whether to execute the plan above: yes proceeds, anything else aborts.
Do you want to proceed with the proposed reshard plan (yes/no)? no
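
The plan above lists 14 slots even though 15 were requested. One plausible explanation is that each source's share is proportional to the number of slots it owns, with the largest source rounded up and the others rounded down. This is an assumption about how redis-trib.rb apportions the request, not something the output states. The Python sketch below uses an illustrative `reshard_shares` function to show that rule.

```python
import math

def reshard_shares(sources, requested):
    """sources: {address: owned_slot_count}; returns {address: slots_to_move}."""
    total = sum(sources.values())
    # Largest source first; sorted() is stable, so ties keep insertion order.
    ordered = sorted(sources.items(), key=lambda kv: kv[1], reverse=True)
    shares = {}
    for i, (addr, owned) in enumerate(ordered):
        exact = requested / total * owned
        # Assumed rule: round up for the largest source, down for the rest.
        shares[addr] = math.ceil(exact) if i == 0 else math.floor(exact)
    return shares

sources = {
    "10.215.20.7:7001": 5461,
    "10.215.20.7:7003": 5461,
    "10.215.20.7:7002": 5462,
}
shares = reshard_shares(sources, 15)
for addr, count in shares.items():
    print(f"{addr}: move {count} slots")
print("total:", sum(shares.values()))
```

Under this rule 7002 gives up 6 slots and 7001 and 7003 give up 4 each, which agrees with the 6 + 4 + 4 slots in the plan.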

We can also pass --from source_id --to dest_id to name the source and destination nodes directly on the command line, for example:

redis-trib.rb reshard --from d54a286d7d55bef355173a04eb8e75692c09a06d --to cbedf1e1d789ee926fb3b85dd3df575e96228cc4 --slots 2 10.215.20.7:7003

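Note:

--from accepts the node IDs of one or more source nodes, separated by commas, or --from all to use every node in the cluster. If it is omitted, the tool prompts for the source nodes during the migration. --to names the destination node and accepts exactly one node ID. If it is omitted, the tool prompts for it as well.

The output is as follows: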

>>> Performing Cluster Check (using node 10.215.20.7:7003)
... (omitted) ...
Ready to move 2 slots.
  Source nodes:
    M: d54a286d7d55bef355173a04eb8e75692c09a06d 10.215.20.7:7003
   slots:10927-16383 (5457 slots) master
   0 additional replica(s)
  Destination node:
    M: cbedf1e1d789ee926fb3b85dd3df575e96228cc4 10.215.20.7:7004
   slots:0-3,5461-5466,10923-10926 (14 slots) master
   0 additional replica(s)
  Resharding plan:
    Moving slot 10927 from d54a286d7d55bef355173a04eb8e75692c09a06d
    Moving slot 10928 from d54a286d7d55bef355173a04eb8e75692c09a06d

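2.7 Rebalancing slots

rebalance spreads the cluster's slots evenly across the nodes. reshard also migrates slots, but it accepts more parameters, which makes it the better fit for targeted slot migration.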

# redis-trib.rb rebalance 10.215.20.7:7004
>>> Performing Cluster Check (using node 10.215.20.7:7004)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
>>> Rebalancing across 4 nodes. Total weight = 4
Moving 1361 slots from 10.215.20.7:7001 to 10.215.20.7:7004
###########################################################
Moving 1360 slots from 10.215.20.7:7002 to 10.215.20.7:7004
###########################################################
Moving 1359 slots from 10.215.20.7:7003 to 10.215.20.7:7004
############################################################
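
The three move counts can be reproduced from the slot counts before the rebalance. The Python sketch below derives those counts from the logs above, assuming the 14-slot plan and the 2-slot reshard were both carried out, and compares each node with its expected share of 16384 / 4 = 4096 slots. The variable names are illustrative.

```python
TOTAL_SLOTS = 16384

# Slot counts before the rebalance, derived from the earlier logs
# (assumption: the 14-slot and 2-slot reshard plans were both executed).
owned = {
    "10.215.20.7:7001": 5461 - 4,
    "10.215.20.7:7002": 5462 - 6,
    "10.215.20.7:7003": 5461 - 4 - 2,
    "10.215.20.7:7004": 14 + 2,
}
weights = {addr: 1 for addr in owned}  # default weight of 1 per node

total_weight = sum(weights.values())
balances = {}
for addr, count in owned.items():
    expected = TOTAL_SLOTS * weights[addr] // total_weight
    # Positive balance: slots to give away. Negative: slots to receive.
    balances[addr] = count - expected
    print(f"{addr}: owns {count}, expected {expected}, balance {balances[addr]:+d}")
```

The surpluses of 1361, 1360, and 1359 slots on 7001, 7002, and 7003 match the three "Moving ... slots" lines, and together they cover the 4080 slots that 7004 is missing.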

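2.8 Removing a node

Redis Cluster only allows removing a node that has no slots assigned, and the node is shut down once it has been removed. On a running cluster, we must therefore first use reshard to migrate the slots on that node to other nodes, and only then remove it.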

# redis-trib.rb del-node 10.215.20.7:7004 cbedf1e1d789ee926fb3b85dd3df575e96228cc4
>>> Removing node 8cc4 from cluster 10.215.20.7:7004
[ERR] Node 10.215.20.7:7004 is not empty! 
Reshard data away and try again.

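We can migrate the slots using the reshard steps described earlier. In this example, the slot allocation is restored to what it was before the node was added.

slots:0-1364       migrate to 10.215.20.7:7001
slots:5461-6826    migrate to 10.215.20.7:7002
slots:10923-12287  migrate to 10.215.20.7:7003

Note: in production, if the key-value pairs are large, it is best to add --timeout N when migrating, setting the MIGRATE command timeout to a relatively large value so that the migration does not fail.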
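
The three reshard runs can be prepared from the ranges above. The Python sketch below only builds the command lines and does not run anything. It uses the node IDs shown earlier in this article and the --from, --to, --slots, and --timeout options from the help text. The timeout value is hypothetical, and it is an assumption that reshard with a single source moves the lowest-numbered slots first, so the runs have to be executed in this order to hit exactly these ranges.

```python
SOURCE_7004 = "cbedf1e1d789ee926fb3b85dd3df575e96228cc4"
RESTORE_PLAN = [
    # (slot range currently on 7004, node ID of the original owner)
    ((0, 1364), "143f4d2d8788125e62a46ccea2663dcb42847e76"),       # 10.215.20.7:7001
    ((5461, 6826), "ae0e1276b6af4850f5143aea9a4a6297f2719160"),    # 10.215.20.7:7002
    ((10923, 12287), "d54a286d7d55bef355173a04eb8e75692c09a06d"),  # 10.215.20.7:7003
]
MIGRATE_TIMEOUT = 120000  # hypothetical value, assumed to be in milliseconds

counts = []
commands = []
for (first, last), owner_id in RESTORE_PLAN:
    count = last - first + 1  # ranges are inclusive
    counts.append(count)
    commands.append(
        f"redis-trib.rb reshard --from {SOURCE_7004} --to {owner_id} "
        f"--slots {count} --timeout {MIGRATE_TIMEOUT} 10.215.20.7:7001"
    )

for cmd in commands:
    print(cmd)
print("slots to move in total:", sum(counts))
```

The ranges hold 1365, 1366, and 1365 slots, 4096 in total, which is everything 7004 owns after the rebalance.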

After the three reshard runs, run the command to remove node 10.215.20.7:7004 again.

# redis-trib.rb del-node 10.215.20.7:7004 cbedf1e1d789ee926fb3b85dd3df575e96228cc4
>>> Removing node cbedf1e1d789ee926fb3b85dd3df575e96228cc4 from cluster 10.215.20.7:7004
>>> Sending CLUSTER FORGET messages to the cluster...
>>> SHUTDOWN the node.

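2.9 Failover drill

Redis Cluster supports failover in a master-replica setup: when a master fails, the cluster automatically promotes one of its replicas to master and continues serving requests. Here we run cluster failover on a replica node to rehearse such a switch.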

redis-cli -h 10.215.20.13 -p 6381 -c
10.215.20.13:6381> cluster failover
OK

Watch the log of the replica node:

31906:S 31 Mar 21:04:58.847 # Manual failover user request accepted.
31906:S 31 Mar 21:04:58.880 # Received replication offset for paused master manual failover: 9907573
31906:S 31 Mar 21:04:58.922 # All master replication stream processed, manual failover can start.
31906:S 31 Mar 21:04:58.922 # Start of election delayed for 0 milliseconds (rank #0, offset 9907573).
31906:S 31 Mar 21:04:59.022 # Starting a failover election for epoch 7.
31906:S 31 Mar 21:04:59.025 # Currently unable to failover: Waiting for votes, but majority still not reached.
31906:S 31 Mar 21:04:59.029 # Failover election won: I'm the new master. ### the replica is elected as the new master
31906:S 31 Mar 21:04:59.029 # configEpoch set to 7 after successful failover
31906:M 31 Mar 21:04:59.029 # Setting secondary replication ID to 16718fda793b9fdb4964eb71f2858d3e18129a45, valid up to offset: 9907574. New replication ID is 1d950330b6a027fe34cc40bfc5e61e17c1a7671d ### the replication relationship is reset
31906:M 31 Mar 21:04:59.029 # Connection with master lost.
31906:M 31 Mar 21:04:59.029 * Caching the disconnected master state.
31906:M 31 Mar 21:04:59.029 * Discarding previously cached master state.
31906:M 31 Mar 21:04:59.280 * Slave 10.215.20.7:6381 asks for synchronization
31906:M 31 Mar 21:04:59.280 * Partial resynchronization request from 10.215.20.7:6381 accepted. Sending 0 bytes of backlog starting from offset 9907574.
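
For a drill it is useful to know how long the switch took. The Python sketch below reads three of the log lines above and computes the elapsed time from their timestamps. The `timestamp_of` helper and the line pattern are illustrative and assume the log format shown here.

```python
import re
from datetime import datetime, timedelta

LOG = """\
31906:S 31 Mar 21:04:58.847 # Manual failover user request accepted.
31906:S 31 Mar 21:04:59.022 # Starting a failover election for epoch 7.
31906:S 31 Mar 21:04:59.029 # Failover election won: I'm the new master.
"""

# pid:role, day month time, level marker, message
LINE_RE = re.compile(r"^\d+:[A-Z] (\d+ \w+ \d+:\d+:\d+\.\d+) [#*] (.*)$")

def timestamp_of(log, needle):
    """Return the timestamp of the first log line whose message contains needle."""
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if m and needle in m.group(2):
            # The log carries no year, so strptime falls back to its default year.
            return datetime.strptime(m.group(1), "%d %b %H:%M:%S.%f")
    raise ValueError(f"no log line contains {needle!r}")

accepted = timestamp_of(LOG, "Manual failover user request accepted")
started = timestamp_of(LOG, "Starting a failover election")
won = timestamp_of(LOG, "Failover election won")

one_ms = timedelta(milliseconds=1)
total_ms = (won - accepted) // one_ms
election_ms = (won - started) // one_ms
print(f"request accepted to election won: {total_ms} ms")
print(f"election start to election won: {election_ms} ms")
```

In this log the election itself took 7 ms, and the whole manual failover took 182 ms from the accepted request to the replica becoming master.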

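We can of course also shut down the master directly to simulate a master outage. Note, however, that when the old master is restarted after the shutdown, it rejoins the cluster automatically as a replica and re-establishes replication with the new master.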

3 Problems Encountered

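3.1 Running redis-trib.rb reshard may fail with the following error: [ERR] Calling MIGRATE ERR Syntax error, try CLIENT (LIST | KILL | GETNAME | SETNAME | PAUSE | REPLY)

Solution: use RubyGems to install a version of the redis library below 4.x. Do not use the latest 4.0 release, otherwise redis-trib.rb reshard reports this error when resharding.

  1. Uninstall the latest redis library: gem uninstall redis
  2. Install a 3.x version: gem install redis -v 3.3.5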

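3.2 The check reports "has slots in importing state"

Solution: log in to each of the two nodes named in the warnings and run cluster setslot 0 stable to clear the migration state. Here 0 is the open slot reported by the check.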

redis-trib.rb check 10.215.20.7:7004
>>> Check for open slots...
[WARNING] Node 10.215.20.7:7004 has slots in importing state (0).
[WARNING] Node 10.215.20.7:7001 has slots in migrating state (0).
[WARNING] The following slots are open: 0
>>> Check slots coverage...
[OK] All 16384 slots covered.


# redis-cli -h 10.215.20.7 -p 7001 -c
10.215.20.7:7001>
10.215.20.7:7001> cluster setslot  0 stable
OK

# redis-cli -h 10.215.20.7 -p 7004 -c
10.215.20.7:7004> cluster setslot  0 stable
OK

Checking again shows that the problem is resolved.

# redis-trib.rb check 10.215.20.7:7004
>>> Performing Cluster Check (using node 10.215.20.7:7004)
......
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
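
When several slots are open, the commands can be derived from the check output instead of being typed by hand. The Python sketch below parses the warning lines shown earlier and prints one cluster setslot ... stable command per node and slot. The `stable_commands` helper is illustrative, and it is an assumption that multiple slot numbers appear comma-separated inside the parentheses. It only prints the commands. The fix command listed in 2.1 is the tool's own way to repair a cluster after a failed migration.

```python
import re

CHECK_OUTPUT = """\
>>> Check for open slots...
[WARNING] Node 10.215.20.7:7004 has slots in importing state (0).
[WARNING] Node 10.215.20.7:7001 has slots in migrating state (0).
[WARNING] The following slots are open: 0
"""

WARN_RE = re.compile(
    r"\[WARNING\] Node (\S+):(\d+) has slots in (importing|migrating) state \(([\d,]+)\)\."
)

def stable_commands(text):
    """Build one redis-cli command per (node, open slot) pair found in text."""
    cmds = []
    for host, port, _state, slots in WARN_RE.findall(text):
        for slot in slots.split(","):
            cmds.append(f"redis-cli -h {host} -p {port} cluster setslot {slot} stable")
    return cmds

for cmd in stable_commands(CHECK_OUTPUT):
    print(cmd)
```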

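4 Summary

Operating a Redis Cluster covers adding and removing nodes and migrating and rebalancing slots. For a production cluster, operations center on slots, and every action builds on them. So what exactly is a slot, and how does slot migration work? The upcoming theory article will cover the mechanisms of Redis Cluster and the elements involved.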

5 References

[1] https://redis.io/topics/cluster-tutorial

[2] https://www.cnblogs.com/zhoujinyi/p/6477133.html