redis analyst (3)- auto deploy redis cluster

Before adopting redis cluster, the first thing to do is automate its deployment. The basic flow is as follows:

General installation flow:

a. prepare the hardware for all nodes
b. delete all old configuration (such as nodes.conf) and persistence files (such as rdb/aof)
c. auto install all redis packages, as well as ruby
d. change every node's configuration to enable cluster mode
e. start up all nodes
f. use redis-trib.rb to create the cluster

Flow for rebuilding a cluster:

a. flush the data on all nodes
b. run cluster reset on all nodes
c. run redis-trib.rb to create the cluster

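Below are several key points for deploying redis cluster automatically:

(1) Authentication:
The official cluster management tool /opt/redis/bin/redis-trib.rb does not support passwords. The approach most articles describe is therefore to prepare a batch of redis nodes without authentication, create the cluster, and then set the password on every node one by one and persist it: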

config set 
config rewrite
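
The two commands above are abbreviated. A minimal sketch of what the per-node step might look like is below, assuming the password is applied through the standard `requirepass` and `masterauth` directives. The helper name `password_commands` and the sample address are illustrative only; the binary path is the one used elsewhere in this post.

```python
REDIS_CLI = "/opt/redis/bin/redis-cli"  # path used elsewhere in this post


def password_commands(host, port, password):
    """Return the redis-cli invocations that protect one node with a password.

    masterauth lets a replica authenticate to a password-protected master.
    Once requirepass is set, every new connection has to pass -a.
    """
    base = [REDIS_CLI, "-h", host, "-p", str(port)]
    return [
        base + ["config", "set", "masterauth", password],
        base + ["config", "set", "requirepass", password],
        # persist both settings into the node's config file
        base + ["-a", password, "config", "rewrite"],
    ]


if __name__ == "__main__":
    for cmd in password_commands("10.224.2.141", 8690, "P@ss123"):
        print(" ".join(cmd))
```

Returning argument lists rather than one shell string keeps a password containing special characters from being mangled by the shell when the commands are passed to something like `subprocess.call`.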

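The problem with automating it this way is that whenever you redeploy and need to rerun redis-trib.rb, the passwords have to be removed again. The same applies to its other subcommands, such as fix and check, which is tedious. So for automation this step can keep the traditional order instead: set the password first, then call redis-trib.rb to create the cluster. There are two ways to do this:
a) Modify redis-trib.rb directly to add password support.
Reference: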
https://trodzen.wordpress.com/2017/02/09/redis-cluster-with-passwords/

b) Modify the ruby library that redis-trib.rb calls so that it supplies the password, which also solves the problem once and for all:

/usr/lib/ruby/gems/1.8/gems/redis-3.3.3/lib/redis/client.rb

    DEFAULTS = {
      :url => lambda { ENV["REDIS_URL"] },
      :scheme => "redis",
      :host => "127.0.0.1",
      :port => 6379,
      :path => nil,
      :timeout => 5.0,
      :password => "{{password}}",
      :db => 0,
      :driver => nil,
      :id => nil,
      :tcp_keepalive => 0,
      :reconnect_attempts => 1,
      :inherit_socket => false
    }

(2) Rebuilding the cluster
Running create again on top of an existing cluster fails:

echo yes | /opt/redis/bin/redis-trib.rb create --replicas 1 10.224.2.141:8690 10.224.2.142:8690 10.224.2.143:8690 10.224.2.144:8690 10.224.2.145:8690 10.224.2.146:8690
[ERR] Node 10.224.2.141:8690 is not empty. Either the node already knows other nodes (check with CLUSTER NODES) or contains some key in database 0.

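As the message says, there are two cases:
a) The node already holds data. In this case clear the data: /opt/redis/bin/redis-cli -p 8690 -a P@ss123 flushall
b) The node is already part of a cluster. In this case reset the cluster state: /opt/redis/bin/redis-cli -p 8690 -a P@ss123 cluster reset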

You may also run into this error:

/opt/redis/bin/redis-cli -h 10.224.2.146 -p 8690 -a P@ss123 cluster reset
["ERR CLUSTER RESET can't be called with master nodes containing keys\n", '\n']

So the two commands above must be run in that order: flushall first, then cluster reset.

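(3) Cleaning up data when redeploying:
If you reinstall the package and start redis straight away, some data will still be there. A redis cluster node may have rdb/aof files on disk and reads them at startup, so reinstalling into the same directory with an unchanged configuration loads the old data. These files need to be removed. Since this is a redeployment, the nodes.conf file that stores the cluster information has to be removed as well: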

rm -rf /etc/redis/nodes.conf
rm -rf /opt/redis/dump.rdb
rm -rf /opt/redis/appendonly.aof

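(4) Log rotation
An automated deployment is meant to run for a long time, so the logs need to be rotated to keep them from growing without bound.

1) Specify the log file in the redis configuration: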

# Do not make the level too detailed, otherwise there will be too many logs; the default (notice) is fine:
loglevel notice
logfile "/var/redis/log/redis.log"

2) Create the rotate configuration:
Create a file under the /etc/logrotate.d/ directory, for example redis_log_rotate

# Rotate daily and keep 15 days of logs.
/var/redis/log/redis*.log {
    daily
    rotate 15  
    copytruncate
    delaycompress
    compress
    notifempty
    missingok
}

(5) Start/stop/status script for the redis service, and automatic restart

A central management script is needed to start, stop, and check redis, for example:


#!/bin/sh
# Manage the redis server: start | stop | status

for ARG in "$@" $ARGS
do
	case $ARG in
	start)
		echo "##################begin to start redis server##################"

		# set the OS parameters before starting
		#sh /opt/redis/bin/set_os_parms.sh

		# start redis server
		/opt/redis/bin/redis-server /etc/redis/wbx-redis.conf
		echo "##################complete to start redis server##################"
		;;
	stop)
		echo "##################begin to stop redis server##################"
		# collect the pid(s) of any running redis-server onto one line
		dtpid=$(ps -efw --width 1024 | grep redis-server | grep -v grep | awk '{print $2}')
		dtpid=$(echo $dtpid)
		if [ -z "$dtpid" ]
		then
			echo "INFO: Redis Server is not running."
			echo "##################complete to stop redis server##################"
			exit 0
		else
			/opt/redis/bin/redis-shutdown wbx-redis
			echo "##################complete to stop redis server##################"
		fi
		;;
	status)
		echo "##################begin to check redis server status##################"
		dtpid=$(ps -efw --width 1024 | grep redis-server | grep -v grep | awk '{print $2}')
		dtpid=$(echo $dtpid)
		if [ -n "$dtpid" ]
		then
			echo "[INFO] Redis Server($dtpid) is started."
			echo "##################complete to check redis server status ##################"
		else
			echo "[INFO] Redis Server is not running."
			echo "##################complete to check redis server status ##################"
			# non-zero exit lets a caller detect that redis is down
			exit 1
		fi
		;;
	*)
		echo "Usage: $0 (start|stop|status)"
		cat <<EOF

start		- start Redis Server
stop		- stop  Redis Server
status		- check Redis Server status

EOF
		;;
	esac
done

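Once the script is written, it can be tied to a daemon that automatically brings the redis service back up after it dies. This is quite practical for a redis cluster used purely as a cache.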
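
One way to sketch such a guard is a small loop that polls the status action of the script above and calls start when the check fails. This is only an illustration under assumptions: the script path `/opt/redis/bin/redis-ctl.sh` is a made-up name, the function names are hypothetical, and it relies on the status action exiting non-zero when redis is down. In practice a process supervisor may be a better fit.

```python
import subprocess
import time

# Hypothetical location of the management script shown above.
CTL_SCRIPT = "/opt/redis/bin/redis-ctl.sh"


def ensure_running(run=subprocess.call, script=CTL_SCRIPT):
    """Start redis if the status check fails.

    Returns True if a restart was attempted, False if redis was already up.
    `run` is injectable so the logic can be exercised without a real server.
    """
    if run(["sh", script, "status"]) == 0:
        return False
    run(["sh", script, "start"])
    return True


def watch(interval_seconds=10):
    """Poll forever, pulling the service back up whenever it is found down."""
    while True:
        ensure_running()
        time.sleep(interval_seconds)
```

Calling `watch()` from a long-running process gives the "pull it back up" behaviour; the polling interval is an arbitrary choice.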

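(6) Building the create cluster command:

The end goal is a cluster create command, but because the deployment is automated, the command has to be assembled dynamically, for example:

/opt/redis/bin/redis-trib.rb create --replicas 1 10.224.2.141:8690 10.224.2.142:8690 10.224.2.143:8690 10.224.2.144:8690 10.224.2.145:8690 10.224.2.146:8690

The number of machines is not necessarily known in advance, or rather, it is better not to care how many nodes there are. The only requirement is that the node count divides evenly by the replicas ratio (for example, with 1 master and 1 slave, keep the machine count a multiple of 2). A script like the following can assemble the create cluster command dynamically: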

#!/usr/bin/python
# Template: {{system.boxList}}, {{port}} and {{password}} are placeholders that
# are expected to be substituted before this script runs.
import os
import sys

print({{system.boxList}})


def check_cluster(host_port):
    check_command = "/opt/redis/bin/redis-trib.rb check " + host_port
    result = os.popen(check_command).readlines()
    print(result)
    return bool(result) and result[-1] == "[OK] All 16384 slots covered.\n"


def destroy_cluster():
    box_list = {{system.boxList}}
    for box in box_list:
        # flushall, then cluster reset; retry up to 100 times in case new keys
        # arrive between the two commands
        i = 0
        while i < 100:
            flush_command = "/opt/redis/bin/redis-cli -h " + box["ip"] + " -p {{port}} -a {{password}} flushall"
            print(flush_command)
            result = os.popen(flush_command).readlines()
            print(result)
            cluster_reset_command = "/opt/redis/bin/redis-cli -h " + box["ip"] + " -p {{port}} -a {{password}} cluster reset"
            print(cluster_reset_command)
            result = os.popen(cluster_reset_command).readlines()
            print(result)
            if "containing keys" not in " ".join(str(x) for x in result):
                break
            print("##########try again....times: " + str(i))
            i = i + 1


def stop_servers():
    print("##########stop_servers....")
    box_list = {{system.boxList}}
    for box in box_list:
        stop_command = ""  # stop command needs to be filled in here
        print(stop_command)
        result = os.popen(stop_command).readlines()
        print(result)


def start_servers():
    print("##########start_servers....")
    box_list = {{system.boxList}}
    for box in box_list:
        start_command = ""  # start command needs to be filled in here
        print(start_command)
        result = os.popen(start_command).readlines()
        print(result)


def clean_servers():
    print("##########clean servers' dump files....")
    box_list = {{system.boxList}}
    for box in box_list:
        # as written this runs locally; adapt it so it runs on each box
        clean_command = "rm -rf /opt/redis/*.rdb"
        print(clean_command)
        result = os.popen(clean_command).readlines()
        print(result)


def create_cluster():
    box_list = {{system.boxList}}
    new_box_list = []
    for box in box_list:
        # if any node already reports a healthy cluster, there is nothing to do
        if check_cluster(box["ip"] + ":{{port}}"):
            return True
        new_box_list.append(box["ip"] + ":{{port}}")

    print("##########check complete....")
    print("##########begin to execute create cluster command....")

    create_command = "echo yes | /opt/redis/bin/redis-trib.rb create --replicas 1 " + " ".join(new_box_list)
    print(create_command)
    lines = os.popen(create_command).readlines()
    if not lines:
        return False
    result = lines[-1]
    print(result)
    return "ERR" not in result


print("##########clean all servers...")
stop_servers()
clean_servers()
start_servers()
print("##########destroy old cluster...")
destroy_cluster()
print("##########create new cluster....")
if create_cluster():
    print("##########success to complete create cluster....")
else:
    print("##########fail to complete create cluster....")
    sys.exit(1)

For interactive commands, the answer can be piped in with echo yes |, for example: create_command = "echo yes | /opt/redis/bin/redis-trib.rb create --replicas 1 " + " ".join(new_box_list)
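
The node-count rule mentioned earlier can be checked before the command is assembled. The following is a minimal sketch and not part of the original script: `build_create_command` is a hypothetical helper, the multiple-of-(replicas + 1) check is the rule stated in this post, and the minimum of three masters is a general Redis Cluster requirement.

```python
TRIB = "/opt/redis/bin/redis-trib.rb"  # path used elsewhere in this post


def build_create_command(nodes, replicas=1):
    """Build the redis-trib.rb create command for a list of "ip:port" strings.

    Raises ValueError when the node count does not fit the replicas ratio.
    """
    group = replicas + 1  # one master plus its replicas
    if len(nodes) % group != 0:
        raise ValueError(
            "node count %d is not a multiple of %d" % (len(nodes), group))
    if len(nodes) // group < 3:
        raise ValueError("a cluster needs at least 3 masters")
    return "echo yes | %s create --replicas %d %s" % (
        TRIB, replicas, " ".join(nodes))


if __name__ == "__main__":
    sample = ["10.224.2.%d:8690" % i for i in range(141, 147)]
    print(build_create_command(sample))
```

Failing early here avoids stopping and wiping every node only to find out that the create step cannot succeed.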

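Also consider that flushall blocks when there is a lot of data, which can make the cluster fail over to a slave; once the slave becomes master, the same thing repeats. So the script first stops all machines and deletes the rdb files to make sure all data is gone, then starts them again. This both guarantees the data is cleared and ensures every machine is up.

In addition, in case a user happens to insert new data right after the flush, cluster reset can be retried up to 100 times so that the reset does not fail with: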


ERR CLUSTER RESET can't be called with master nodes containing keys

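(7) Settings that should be configurable:
redis has a great many settings; a few of them should end up exposed as deployment parameters, for example:

a) port and password: for security
b) loglevel: production and test environments can use different levels
c) metrics: monitoring is usually implemented through the info command; either collect every item or make the list configurable
d) maxmemory: machines have different amounts of memory, so this needs to be set per machine.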
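
A minimal sketch of how these exposed settings might be rendered into redis.conf lines follows. The function name `render_overrides` and the sample values are illustrative; `port`, `requirepass`, `masterauth`, `loglevel`, and `maxmemory` are standard redis.conf directives. Metric selection (item c) is left out because it depends on the monitoring system in use.

```python
def render_overrides(port, password, loglevel="notice", maxmemory="4gb"):
    """Return redis.conf lines for the settings exposed to the deployer."""
    settings = [
        ("port", port),
        ("requirepass", password),
        ("masterauth", password),  # replicas use this to authenticate to masters
        ("loglevel", loglevel),
        ("maxmemory", maxmemory),
    ]
    return "\n".join("%s %s" % (key, value) for key, value in settings) + "\n"


if __name__ == "__main__":
    # Illustrative values only.
    print(render_overrides(8690, "P@ss123", loglevel="verbose", maxmemory="8gb"))
```

The rendered text could be appended to a base configuration file at install time, so that only these few values differ between environments.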

After everything succeeds:

/opt/redis/bin/redis-trib.rb create --replicas 1 10.224.2.141:8690 10.224.2.142:8690 10.224.2.143:8690 10.224.2.144:8690 10.224.2.145:8690 10.224.2.146:8690
[OK] All 16384 slots covered.

Published by

傅, 健
