Before adopting Redis Cluster, the first task is to automate its deployment. The basic flows are as follows.
Standard installation flow:
a. Prepare the hardware for all nodes.
b. Delete old configuration (such as nodes.conf) and persistence files (such as RDB/AOF).
c. Automatically install all Redis packages, along with Ruby (redis-trib.rb needs it).
d. Change every node's configuration to enable cluster mode.
e. Start all nodes.
f. Use redis-trib.rb to create the cluster.
Cluster rebuild flow:
a. Flush the data on all nodes.
b. Run cluster reset on all nodes.
c. Run redis-trib.rb to create the cluster.
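The key points for automating a Redis Cluster deployment are listed below.
(1) Authentication:
The official cluster management tool, /opt/redis/bin/redis-trib.rb, does not support passwords. The approach most articles describe is therefore to prepare a batch of Redis nodes without authentication, create the cluster, and only then set a password on each node, one by one, and persist it:
    config set
    config rewrite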
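As a rough sketch of that last step, the helper below builds the redis-cli invocations for each node. It assumes the password is applied through the `requirepass` and `masterauth` directives (the text only names `config set` and `config rewrite`), and `build_auth_commands` with its parameters is a hypothetical name used for illustration.

```python
def build_auth_commands(nodes, password, cli="/opt/redis/bin/redis-cli"):
    """Return redis-cli argv lists that set and persist a password on each node.

    nodes: iterable of (host, port) tuples. Assumption: requirepass/masterauth
    are the directives being set; the source text only names the two commands.
    """
    commands = []
    for host, port in nodes:
        base = [cli, "-h", host, "-p", str(port)]
        # masterauth first, so replicas can still authenticate to their master
        commands.append(base + ["config", "set", "masterauth", password])
        # after this call the node requires AUTH on new connections
        commands.append(base + ["config", "set", "requirepass", password])
        # so the rewrite must pass -a; it persists both values to the config file
        commands.append(base + ["-a", password, "config", "rewrite"])
    return commands


if __name__ == "__main__":
    for argv in build_auth_commands([("10.224.2.141", 8690)], "P@ss123"):
        print(" ".join(argv))
```

Returning argv lists rather than shell strings keeps passwords with special characters from being reinterpreted by a shell when the commands are later executed.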
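The problem with automating it this way is that on a redeployment, when redis-trib.rb has to run again, the passwords must be removed first. Other subcommands such as fix and check also require the password to be removed, which is tedious. For automation, this step can instead follow the conventional order: set the password first, then call redis-trib.rb to create the cluster. There are two ways to make that work:
a) Modify redis-trib.rb directly to add password support.
Reference:
https://trodzen.wordpress.com/2017/02/09/redis-cluster-with-passwords/
b) Modify the Ruby library that redis-trib.rb calls so that it supplies the password by default, which also solves the problem once and for all:
/usr/lib/ruby/gems/1.8/gems/redis-3.3.3/lib/redis/client.rb
    DEFAULTS = {
      :url => lambda { ENV["REDIS_URL"] },
      :scheme => "redis",
      :host => "127.0.0.1",
      :port => 6379,
      :path => nil,
      :timeout => 5.0,
      :password => "{{password}}",
      :db => 0,
      :driver => nil,
      :id => nil,
      :tcp_keepalive => 0,
      :reconnect_attempts => 1,
      :inherit_socket => false
    }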
(2) Rebuilding the cluster
Creating a cluster directly on top of an existing one fails:
    echo yes | /opt/redis/bin/redis-trib.rb create --replicas 1 10.224.2.141:8690 10.224.2.142:8690 10.224.2.143:8690 10.224.2.144:8690 10.224.2.145:8690 10.224.2.146:8690
    [ERR] Node 10.224.2.141:8690 is not empty. Either the node already knows other nodes (check with CLUSTER NODES) or contains some key in database 0.
As the message says, there are two cases:
a) The node already holds data. In this case the data must be cleared: /opt/redis/bin/redis-cli -p 8690 -a P@ss123 flushall
b) The node is already part of a cluster. In this case the cluster state must be reset: /opt/redis/bin/redis-cli -p 8690 -a P@ss123 cluster reset
You may also run into this error:
    /opt/redis/bin/redis-cli -h 10.224.2.146 -p 8690 -a P@ss123 cluster reset
    ["ERR CLUSTER RESET can't be called with master nodes containing keys\n", '\n']
The two commands above must therefore be run in that order: flushall first, then cluster reset.
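(3) Cleaning up data on redeployment:
If the package is reinstalled into the same directory with the same configuration and Redis is started right away, old data will still be there. Redis Cluster may have left RDB/AOF files on disk, and it reads them at startup. That data has to be removed. Since this is a redeployment, the nodes.conf file that stores the cluster state must be removed as well:
    rm -rf /etc/redis/nodes.conf
    rm -rf /opt/redis/dump.rdb
    rm -rf /opt/redis/appendonly.aof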
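A minimal sketch of the same cleanup in Python, assuming the three paths used in the rm commands above. The function name `remove_stale_files` is hypothetical, and files that do not exist are simply skipped.

```python
from pathlib import Path

# Paths taken from the rm commands above; adjust them to the actual layout.
STALE_FILES = [
    "/etc/redis/nodes.conf",
    "/opt/redis/dump.rdb",
    "/opt/redis/appendonly.aof",
]


def remove_stale_files(paths=STALE_FILES):
    """Delete leftover cluster state and persistence files; return those removed."""
    removed = []
    for name in paths:
        path = Path(name)
        if path.is_file():
            path.unlink()
            removed.append(str(path))
    return removed
```

It is safest to run this while the Redis process is stopped, since a running server may write these files again.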
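(4) Log rotation
An automated deployment is meant to run for a long time, so the logs need rotation to keep them from growing without bound.
1) Specify the log file in the Redis configuration:
    # Do not set the level too high or the log grows too large; the default is fine:
    loglevel verbose
    logfile "/var/redis/log/redis.log"
2) Create the rotation configuration:
Create a file under /etc/logrotate.d/, for example redis_log_rotate:
    # Rotate daily and keep 15 days.
    /var/redis/log/redis*.log {
        daily
        rotate 15
        copytruncate
        delaycompress
        compress
        notifempty
        missingok
    }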
(5) Scripts for starting, stopping, and checking the Redis service, and automatic restart
A single management script is needed to handle starting, stopping, and checking Redis, for example:
    #!/bin/sh
    # Central management script: start, stop, or check the Redis server.
    for ARG in "$@" $ARGS
    do
        case $ARG in
        start)
            echo "##################begin to start redis server##################"
            # setting the value of os parameter
            # sh /opt/redis/bin/set_os_parms.sh
            # start redis server
            /opt/redis/bin/redis-server /etc/redis/wbx-redis.conf
            echo "##################complete to start redis server##################"
            ;;
        stop)
            echo "##################begin to stop redis server##################"
            # collect the PIDs of any running redis-server processes on one line
            dtpid=$(ps -efw --width 1024 | grep redis-server | grep -v grep | awk '{print $2}')
            dtpid=$(echo $dtpid)
            if [ "x$dtpid" = "x" ]
            then
                echo "INFO: Redis Server is not running."
                echo "##################complete to stop redis server##################"
                exit 0
            else
                /opt/redis/bin/redis-shutdown wbx-redis
                echo "##################complete to stop redis server##################"
            fi
            ;;
        status)
            echo "##################begin to check redis server status##################"
            dtpid=$(ps -efw --width 1024 | grep redis-server | grep -v grep | awk '{print $2}')
            dtpid=$(echo $dtpid)
            if [ "x$dtpid" != "x" ]
            then
                echo "[INFO] Redis Server($dtpid) is started."
                echo "##################complete to check redis server status ##################"
            else
                echo "[INFO] Redis Server is not running."
                echo "##################complete to check redis server status ##################"
                exit 1
            fi
            ;;
        *)
            echo "Usage: $0 (start|stop|status)"
            cat <<EOF
    start - start Redis Server
    stop - stop Redis Server
    status - check Redis Server status
    EOF
            ;;
        esac
    done
Once the script is written, it can be hooked up to a watchdog daemon so that the Redis service is brought back up automatically after it dies. This is most practical for a Redis Cluster used purely as a cache.
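One way to picture such a watchdog is the loop below. It is only a sketch: `/opt/redis/bin/redis-ctl.sh` is a hypothetical location for the management script above, and the check and restart callables are injected so the loop can be exercised without a real server.

```python
import subprocess
import time

CTL = "/opt/redis/bin/redis-ctl.sh"  # hypothetical path of the management script


def is_running():
    """The management script's status action exits non-zero when Redis is down."""
    return subprocess.call([CTL, "status"]) == 0


def start():
    """Ask the management script to start the server."""
    subprocess.call([CTL, "start"])


def watchdog(check=is_running, restart=start, interval=10, max_checks=None,
             sleep=time.sleep):
    """Poll the server and restart it whenever the check fails.

    max_checks=None loops forever; a number bounds the loop (useful for tests).
    Returns how many restarts were triggered.
    """
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if not check():
            restart()
            restarts += 1
        checks += 1
        sleep(interval)
    return restarts
```

In practice an init system or a process supervisor could play this role just as well; the loop only illustrates the check-then-restart cycle.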
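(6) Building the create cluster command:
In the end we need a cluster create command, but because the deployment is automated, the command has to be assembled dynamically, for example:
    /opt/redis/bin/redis-trib.rb create --replicas 1 10.224.2.141:8690 10.224.2.142:8690 10.224.2.143:8690 10.224.2.144:8690 10.224.2.145:8690 10.224.2.146:8690
The number of machines is not necessarily known in advance, and ideally the automation should not care how many nodes there are. The only requirement is that the node count is divisible by the group size implied by the replica ratio (for example, with one replica per master, the machine count must be a multiple of 2). A script such as the following can assemble the create cluster command dynamically: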
    #!/usr/bin/env python3
    # Deployment template: {{system.boxList}}, {{port}} and {{password}} are
    # placeholders filled in by the deployment system before the script runs.
    import os
    import sys

    BOX_LIST = {{system.boxList}}
    print(BOX_LIST)


    def run(command):
        """Print a shell command, run it, and return its output lines."""
        print(command)
        result = os.popen(command).readlines()
        print(result)
        return result


    def check_cluster(host_port):
        result = run("/opt/redis/bin/redis-trib.rb check " + host_port)
        return bool(result) and result[-1] == "[OK] All 16384 slots covered.\n"


    def destroy_cluster():
        for box in BOX_LIST:
            cli = "/opt/redis/bin/redis-cli -h " + box["ip"] + " -p {{port}} -a {{password}} "
            # Retry: new keys may arrive between flushall and cluster reset.
            for i in range(100):
                run(cli + "flushall")
                result = run(cli + "cluster reset")
                if "containing keys" not in " ".join(result):
                    break
                print("##########try again....times: " + str(i))


    def stop_servers():
        print("##########stop_servers....")
        for box in BOX_LIST:
            stop_command = ""  # the stop command needs to be filled in here
            run(stop_command)


    def start_servers():
        print("##########start_servers....")
        for box in BOX_LIST:
            start_command = ""  # the start command needs to be filled in here
            run(start_command)


    def clean_servers():
        print("##########clean servers's dump file....")
        for box in BOX_LIST:
            # NOTE: as written this runs on the local host; adapt it to reach each box.
            clean_command = "rm -rf /opt/redis/*.rdb"
            run(clean_command)


    def create_cluster():
        new_box_list = []
        for box in BOX_LIST:
            # A node that already reports a healthy cluster means nothing is left to do.
            if check_cluster(box["ip"] + ":{{port}}"):
                return True
            new_box_list.append(box["ip"] + ":{{port}}")
        print("##########check complete....")
        print("##########begin to execute create cluster command....")
        create_command = "echo yes | /opt/redis/bin/redis-trib.rb create --replicas 1 " + " ".join(new_box_list)
        result = run(create_command)
        return bool(result) and "ERR" not in result[-1]


    print("##########clean all servers...")
    stop_servers()
    clean_servers()
    start_servers()
    print("##########destroy old cluster...")
    destroy_cluster()
    print("##########create new cluster....")
    if create_cluster():
        print("##########success to complete create cluster....")
    else:
        print("##########fail to complete create cluster....")
        sys.exit(1)
For interactive commands, the confirmation can be piped in with echo yes |, for example: create_command = "echo yes | /opt/redis/bin/redis-trib.rb create --replicas 1 " + " ".join(new_box_list)
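The node-count rule stated above can be checked before the command is assembled. The sketch below is an illustration and not part of the original script: `build_create_command` is a hypothetical helper, and the three-master minimum reflects what redis-trib.rb enforces when creating a cluster.

```python
def build_create_command(hosts, port, replicas=1,
                         trib="/opt/redis/bin/redis-trib.rb"):
    """Build the non-interactive create command for the given host IPs.

    Enforces the rule from the text: the node count must be a multiple of
    (replicas + 1). Also rejects layouts with fewer than 3 masters.
    """
    group = replicas + 1
    if len(hosts) % group != 0:
        raise ValueError("node count %d is not a multiple of %d" % (len(hosts), group))
    if len(hosts) // group < 3:
        raise ValueError("Redis Cluster needs at least 3 master nodes")
    nodes = " ".join("%s:%d" % (host, port) for host in hosts)
    return "echo yes | %s create --replicas %d %s" % (trib, replicas, nodes)


if __name__ == "__main__":
    ips = ["10.224.2.%d" % n for n in range(141, 147)]
    print(build_create_command(ips, 8690))
```

Failing early with a clear message is cheaper than letting the create step abort halfway through a deployment.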
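Note also that flushall blocks when there is a lot of data, which can make the cluster fail over to a replica; once that replica becomes the master, the same thing repeats. The script therefore stops all machines first and deletes the RDB files to make sure all data is gone, then starts them again. This guarantees both that the data is cleared and that every machine is up.
In addition, in case a client happens to insert new data right after the flush, cluster reset is attempted up to 100 times so that it does not end with:
    ERR CLUSTER RESET can't be called with master nodes containing keys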
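The retry idea can be isolated into a small function. This is a sketch under assumptions: `run` is an injected callable that executes one redis-cli subcommand against a node and returns its output as text, and `reset_node` is a hypothetical name. Neither comes from the original script.

```python
def reset_node(run, max_attempts=100):
    """Flush a node and reset its cluster state, retrying on late writes.

    run(subcommand) must execute the redis-cli subcommand on the target node
    and return its output as a string. Returns the number of attempts used,
    or raises RuntimeError if the node still holds keys after max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        run("flushall")
        output = run("cluster reset")
        # Redis refuses the reset while a master still contains keys
        if "containing keys" not in output:
            return attempt
    raise RuntimeError("node still contains keys after %d attempts" % max_attempts)
```

Unlike a loop that silently gives up, this version raises once the attempts are exhausted, so the deployment can stop instead of trying to create a cluster on a dirty node.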
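(7) Deciding what should be configurable:
Redis has a great many settings, and a few of them are worth exposing as deployment options, for example:
a) port and password: for security reasons.
b) loglevel: production and test environments can use different levels.
c) metrics: monitoring, where present, is usually implemented through the info command; either collect every item or make the monitored items configurable.
d) maxmemory: machines have different amounts of memory, so this value needs to differ per machine.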
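These options can be kept in one place and rendered into the per-node configuration. The sketch below assumes they map to the standard redis.conf directives `port`, `requirepass`, `masterauth`, `loglevel`, and `maxmemory`. `render_overrides` and the sample values are hypothetical, and the metric selection is left out.

```python
def render_overrides(port, password, loglevel="notice", maxmemory=None):
    """Render the deployment-specific part of a redis.conf as text."""
    lines = [
        "port %d" % port,
        "requirepass %s" % password,
        # replicas need the same password to authenticate to their master
        "masterauth %s" % password,
        "loglevel %s" % loglevel,
    ]
    if maxmemory is not None:
        # e.g. "4gb"; left out entirely when no limit is wanted
        lines.append("maxmemory %s" % maxmemory)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_overrides(8690, "P@ss123", loglevel="verbose", maxmemory="4gb"))
```

A password containing spaces or quote characters would need quoting in the rendered file, which this sketch does not handle.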
When everything finally succeeds:
    /opt/redis/bin/redis-trib.rb create --replicas 1 10.224.2.141:8690 10.224.2.142:8690 10.224.2.143:8690 10.224.2.144:8690 10.224.2.145:8690 10.224.2.146:8690
    [OK] All 16384 slots covered.