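Real-world data center networks are rarely calm; small problems crop up all the time. Network jitter is a very common one: some connections suddenly become unreachable and then quickly return to normal.
To address
To begin, first reproduce a master/slave failover
First, look at the current cluster node information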
[root@localhost redis-cluster]# /usr/local/redis-cluster/bin/redis-cli -c -h 192.168.0.240 -p 8001
192.168.0.240:8001> cluster nodes
7953c14cc80d3e88265d41bfcd55ac5a77122cca 192.168.0.240:8003@18003 master - 0 1555486298000 3 connected 10923-16383
4ce3c04b61228ffb25f71b43c6a6691bf85780e1 192.168.0.240:8001@18001 myself,master - 0 1555486298000 8 connected 0-5460
53e4735bc35c1baa5375eab58ad279f1a74abec7 192.168.0.240:8006@18006 slave 2eb9606aa417c269771da4ebea81f5ee1dc5df61 0 1555486297581 6 connected
c0a062244a7b8be436e1b9dd045115c5ed5cd33b 192.168.0.240:8004@18004 slave 7953c14cc80d3e88265d41bfcd55ac5a77122cca 0 1555486298791 4 connected
79bd4b3013f0ca02d27543983539c3e2a6c41c5e 192.168.0.240:8005@18005 slave 4ce3c04b61228ffb25f71b43c6a6691bf85780e1 0 1555486298590 8 connected
2eb9606aa417c269771da4ebea81f5ee1dc5df61 192.168.0.240:8002@18002 master - 0 1555486298000 2 connected 5461-10922
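Each line of the `cluster nodes` output above is a space-separated record: node ID, address, flags, master ID (or `-`), ping-sent, pong-received, config epoch, link state, and then any slot ranges. As a rough sketch of how to read it programmatically, the Python snippet below parses that text into small records. The names `ClusterNode`, `parse_cluster_nodes`, and `SAMPLE` are made up for this illustration and are not part of Redis or any client library.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ClusterNode:
    """One line of `cluster nodes` output (hypothetical helper type)."""
    node_id: str
    host: str
    port: int
    flags: List[str]
    master_id: Optional[str]          # None when the field is "-"
    config_epoch: int
    link_state: str
    slots: List[Tuple[int, int]] = field(default_factory=list)

    @property
    def is_master(self) -> bool:
        return "master" in self.flags


def parse_cluster_nodes(text: str) -> List[ClusterNode]:
    """Parse the raw text returned by `cluster nodes`."""
    nodes = []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) < 8:            # not a node record, skip it
            continue
        node_id, addr, flags, master, _ping, _pong, epoch, link = parts[:8]
        # The address looks like ip:port@cluster-bus-port.
        host, port = addr.split("@")[0].rsplit(":", 1)
        slots = []
        for token in parts[8:]:
            if token.startswith("["):  # slot being migrated/imported, ignored here
                continue
            start, _, end = token.partition("-")
            slots.append((int(start), int(end or start)))
        nodes.append(ClusterNode(
            node_id, host, int(port), flags.split(","),
            None if master == "-" else master,
            int(epoch), link, slots,
        ))
    return nodes


# Two lines copied from the output above.
SAMPLE = """\
4ce3c04b61228ffb25f71b43c6a6691bf85780e1 192.168.0.240:8001@18001 myself,master - 0 1555486298000 8 connected 0-5460
79bd4b3013f0ca02d27543983539c3e2a6c41c5e 192.168.0.240:8005@18005 slave 4ce3c04b61228ffb25f71b43c6a6691bf85780e1 0 1555486298590 8 connected
"""

if __name__ == "__main__":
    for node in parse_cluster_nodes(SAMPLE):
        role = "master" if node.is_master else "slave"
        print(node.port, role, node.slots)
```

With records like these it is easy to check, for example, which slave follows which master before and after a failover.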
Now kill one of the masters (8001) and check the node information again
[root@localhost redis-cluster]# /usr/local/redis-cluster/bin/redis-cli -c -h 192.168.0.240 -p 8002
# before killing master node 8001
192.168.0.240:8001> cluster nodes
7953c14cc80d3e88265d41bfcd55ac5a77122cca 192.168.0.240:8003@18003 master - 0 1555486298000 3 connected 10923-16383
4ce3c04b61228ffb25f71b43c6a6691bf85780e1 192.168.0.240:8001@18001 myself,master - 0 1555486298000 8 connected 0-5460
53e4735bc35c1baa5375eab58ad279f1a74abec7 192.168.0.240:8006@18006 slave 2eb9606aa417c269771da4ebea81f5ee1dc5df61 0 1555486297581 6 connected
c0a062244a7b8be436e1b9dd045115c5ed5cd33b 192.168.0.240:8004@18004 slave 7953c14cc80d3e88265d41bfcd55ac5a77122cca 0 1555486298791 4 connected
79bd4b3013f0ca02d27543983539c3e2a6c41c5e 192.168.0.240:8005@18005 slave 4ce3c04b61228ffb25f71b43c6a6691bf85780e1 0 1555486298590 8 connected
2eb9606aa417c269771da4ebea81f5ee1dc5df61 192.168.0.240:8002@18002 master - 0 1555486298000 2 connected 5461-10922
# after killing master node 8001
192.168.0.240:8002> cluster nodes
7953c14cc80d3e88265d41bfcd55ac5a77122cca 192.168.0.240:8003@18003 master - 0 1555486404381 3 connected 10923-16383
4ce3c04b61228ffb25f71b43c6a6691bf85780e1 192.168.0.240:8001@18001 master,fail - 1555486393779 1555486392000 8 disconnected
2eb9606aa417c269771da4ebea81f5ee1dc5df61 192.168.0.240:8002@18002 myself,master - 0 1555486404000 2 connected 5461-10922
53e4735bc35c1baa5375eab58ad279f1a74abec7 192.168.0.240:8006@18006 slave 2eb9606aa417c269771da4ebea81f5ee1dc5df61 0 1555486404000 6 connected
c0a062244a7b8be436e1b9dd045115c5ed5cd33b 192.168.0.240:8004@18004 slave 7953c14cc80d3e88265d41bfcd55ac5a77122cca 0 1555486405000 4 connected
79bd4b3013f0ca02d27543983539c3e2a6c41c5e 192.168.0.240:8005@18005 master - 0 1555486405391 9 connected 0-5460
At this point 8001 is shown as fail, and 8005 has been elected master. Now restart 8001 and check the node information again
192.168.0.240:8002> cluster nodes
7953c14cc80d3e88265d41bfcd55ac5a77122cca 192.168.0.240:8003@18003 master - 0 1555486650000 3 connected 10923-16383
4ce3c04b61228ffb25f71b43c6a6691bf85780e1 192.168.0.240:8001@18001 slave 79bd4b3013f0ca02d27543983539c3e2a6c41c5e 0 1555486651000 9 connected
2eb9606aa417c269771da4ebea81f5ee1dc5df61 192.168.0.240:8002@18002 myself,master - 0 1555486648000 2 connected 5461-10922
53e4735bc35c1baa5375eab58ad279f1a74abec7 192.168.0.240:8006@18006 slave 2eb9606aa417c269771da4ebea81f5ee1dc5df61 0 1555486650501 6 connected
c0a062244a7b8be436e1b9dd045115c5ed5cd33b 192.168.0.240:8004@18004 slave 7953c14cc80d3e88265d41bfcd55ac5a77122cca 0 1555486650000 4 connected
79bd4b3013f0ca02d27543983539c3e2a6c41c5e 192.168.0.240:8005@18005 master - 0 1555486651814 9 connected 0-5460
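8001 is now a slave, and its master is 79bd4b3013f0ca02d27543983539c3e2a6c41c5e, that is, 8005. So Redis Cluster's high availability does hold up in practice.
How it works
When a slave
1. The slave finds that its master has entered the FAIL state
2. The slave takes its recorded cluster currentEp
3. The other nodes receive this message, and only
4. The slave attempting the failover collects FAILOVER_AUTH_ACK replies
5. Once it has replies from more than half, it becomes the new master
6. It broadcasts a Pong to notify the other cluster nodes.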
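The vote counting in steps 4 and 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: that the replies counted are those from masters, that "more than half" means more than half of all masters in the cluster (including the failed one), and that each master counts at most once. The function names `needed_quorum` and `wins_election` are hypothetical and do not come from the Redis source.

```python
from typing import Iterable


def needed_quorum(master_count: int) -> int:
    """Smallest number of votes that is strictly more than half (assumption:
    the failed master still counts toward the cluster size)."""
    return master_count // 2 + 1


def wins_election(ack_senders: Iterable[int], master_ports: Iterable[int]) -> bool:
    """Return True if the FAILOVER_AUTH_ACK replies form a majority.

    ack_senders  -- ports of the nodes that replied with an ACK
    master_ports -- ports of all masters in the cluster
    Duplicates and replies from non-masters are ignored.
    """
    masters = set(master_ports)
    valid_votes = set(ack_senders) & masters
    return len(valid_votes) >= needed_quorum(len(masters))


if __name__ == "__main__":
    masters = [8001, 8002, 8003]      # 8001 has failed but is still counted
    print(wins_election([8002, 8003], masters))
    print(wins_election([8002], masters))
```

In the three-master cluster used here, this would mean the slave on 8005 needs replies from both surviving masters, 8002 and 8003, before it can take over slots 0-5460.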
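A slave does not try to start an election the moment its master enters the FAIL state; it waits for a certain delay first. The delay gives the FAIL state time to propagate through the cluster. If the slave tried to get elected immediately, the other masters might not yet be aware of the FAIL state and could refuse to vote.
Delay formula: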
DELAY = 500ms + random(0 ~ 500ms) + SLAVE_RANK * 1000ms
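SLAVE_RANK is this slave's rank by the total amount of data it has replicated from the master. The smaller the rank, the more recent the replicated data. This way the slave holding the most recent data will start the election first (in theory).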
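To get a feel for the numbers, here is a small Python sketch of the delay formula. The function name `failover_delay_ms` and the `rng` parameter are made up for this example; only the formula itself comes from the text above.

```python
import random


def failover_delay_ms(slave_rank: int, rng: random.Random) -> int:
    """DELAY = 500ms + random(0 ~ 500ms) + SLAVE_RANK * 1000ms."""
    fixed_ms = 500
    jitter_ms = rng.randint(0, 500)   # inclusive on both ends
    return fixed_ms + jitter_ms + slave_rank * 1000


if __name__ == "__main__":
    rng = random.Random()
    for rank in range(3):
        print(f"rank {rank}: waits {failover_delay_ms(rank, rng)} ms")
```

A slave with rank 0 waits between 500 and 1000 ms, while a slave with rank 1 waits between 1500 and 2000 ms, so the two ranges never overlap and the better-ranked slave always gets to ask for votes first.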
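Redirection
When a client sends a command to the wrong node, that node sees that the slot holding the command's key is not one it manages. It then sends the client a special redirection instruction carrying the address of the target node, telling the client to connect to that node to get the data. After receiving the instruction, besides jumping to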
192.168.0.240:8002> set name redis5
OK
192.168.0.240:8002> keys *
1) "name"
[root@localhost bin]# /usr/local/redis-cluster/bin/redis-cli -c -h 192.168.0.240 -p 8003
192.168.0.240:8003> keys *
1) "8888"
192.168.0.240:8003> get name
-> Redirected to slot [5798] located at 192.168.0.240:8002
"redis5"
192.168.0.240:8002>
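The slot number in the redirect message is derived from the key. Redis Cluster hashes the key with CRC16 (the XMODEM variant) and takes the result modulo 16384; if the key contains a non-empty `{...}` hash tag, only the part inside the braces is hashed. The Python sketch below reproduces that calculation with the standard library's `binascii.crc_hqx`. The helper names `key_slot` and `owner_port`, and the `SLOT_OWNERS` table (copied from the `cluster nodes` output after the failover), are assumptions made for this illustration.

```python
import binascii

SLOT_COUNT = 16384

# Slot ranges taken from the `cluster nodes` output after the failover.
SLOT_OWNERS = [
    ((0, 5460), 8005),
    ((5461, 10922), 8002),
    ((10923, 16383), 8003),
]


def key_slot(key: str) -> int:
    """Compute the cluster slot for a key: CRC16(key) mod 16384."""
    data = key.encode("utf-8")
    # Hash tag rule: hash only the text between the first "{" and the
    # first "}" after it, provided that text is not empty.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # crc_hqx with an initial value of 0 is CRC16/XMODEM.
    return binascii.crc_hqx(data, 0) % SLOT_COUNT


def owner_port(slot: int) -> int:
    """Look up which master port serves a slot in this example cluster."""
    for (first, last), port in SLOT_OWNERS:
        if first <= slot <= last:
            return port
    raise ValueError(f"slot {slot} is not covered")


if __name__ == "__main__":
    slot = key_slot("name")
    print(slot, owner_port(slot))
```

This is the same lookup that node 8003 performs before answering `get name`: the key hashes to a slot in the 5461-10922 range, which belongs to 8002, so the client is sent there.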