Ran into NoHostAvailableException again, and the odd part was the same as last time: a check of the Cassandra nodes showed them all in Up status.
So it is worth writing this exception up and summarizing the root causes of the two occurrences:
(1) localDc misconfiguration
Everything had worked fine until a design review turned up a potential problem with the remote-DC failover we had enabled, as summarized in an earlier post:
once DC-level failover kicks in, can the remote DC still meet the performance requirements? If it cannot, how is the system any different from one that is simply down?
Given that, it is better to fail over the entire application layer rather than merely redirecting traffic to the remote Cassandra DC.
The application layer controls this as follows:
DCAwareRoundRobinPolicy childPolicy = isDCLevelFailoverSupport
        ? new DCAwareRoundRobinPolicy(localDC, 1, true)
        : new DCAwareRoundRobinPolicy(localDC);
After turning the feature toggle off (isDCLevelFailoverSupport = false),
the server would not start at all:
Caused by: java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
    at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
    at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:272)
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
    ... 34 more
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
    at com.datastax.driver.core.RequestHandler.reportNoMoreHosts(RequestHandler.java:205)
    at com.datastax.driver.core.RequestHandler.access$1000(RequestHandler.java:44)
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution.sendRequest(RequestHandler.java:271)
    at com.datastax.driver.core.RequestHandler.startNewExecution(RequestHandler.java:112)
    at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:92)
    at com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:134)
    at com.datastax.driver.core.AbstractSession.executeAsync(AbstractSession.java:63)
    ... 36 more
Looking into the driver code:
com.datastax.driver.core.policies.DCAwareRoundRobinPolicy
CopyOnWriteArrayList<Host> localLiveHosts = perDcLiveHosts.get(localDc); // all live hosts in the configured local DC
final List<Host> hosts = localLiveHosts == null ? Collections.<Host>emptyList() : cloneList(localLiveHosts);
final int startIdx = index.getAndIncrement();

return new AbstractIterator<Host>() {

    private int idx = startIdx;
    private int remainingLocal = hosts.size(); // number of live hosts matching the configured DC name

    // For remote DCs
    private Iterator<String> remoteDcs;
    private List<Host> currentDcHosts;
    private int currentDcRemaining;

    @Override
    protected Host computeNext() {
        while (true) {
            if (remainingLocal > 0) {
                remainingLocal--;
                int c = idx++ % hosts.size();
                if (c < 0) {
                    c += hosts.size();
                }
                return hosts.get(c);
            }

            if (currentDcHosts != null && currentDcRemaining > 0) {
                currentDcRemaining--;
                int c = idx++ % currentDcHosts.size();
                if (c < 0) {
                    c += currentDcHosts.size();
                }
                return currentDcHosts.get(c);
            }

            // Resolve the effective consistency level
            ConsistencyLevel cl = statement.getConsistencyLevel() == null
                    ? configuration.getQueryOptions().getConsistencyLevel()
                    : statement.getConsistencyLevel();

            // This is the key: with the single-argument constructor (remote access disabled),
            // dontHopForLocalCL is true (i.e., never break the DC-local consistency contract),
            // and our application runs at LOCAL_QUORUM, so the plan ends here with no host.
            if (dontHopForLocalCL && cl.isDCLocal())
                return endOfData();

            if (remoteDcs == null) {
                Set<String> copy = new HashSet<String>(perDcLiveHosts.keySet());
                copy.remove(localDc);
                remoteDcs = copy.iterator();
            }

            if (!remoteDcs.hasNext())
                break;

            String nextRemoteDc = remoteDcs.next();
            CopyOnWriteArrayList<Host> nextDcHosts = perDcLiveHosts.get(nextRemoteDc);
            if (nextDcHosts != null) {
                // Clone for thread safety
                List<Host> dcHosts = cloneList(nextDcHosts);
                currentDcHosts = dcHosts.subList(0, Math.min(dcHosts.size(), usedHostsPerRemoteDc));
                currentDcRemaining = currentDcHosts.size();
            }
        }
        return endOfData();
    }
};
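Why is dontHopForLocalCL true here? Paraphrased from the driver 2.x source (field assignments simplified), the single-argument constructor we fall back to when the toggle is off delegates like this:

public DCAwareRoundRobinPolicy(String localDc) {
    this(localDc, 0);
}

public DCAwareRoundRobinPolicy(String localDc, int usedHostsPerRemoteDc) {
    this(localDc, usedHostsPerRemoteDc, false); // allowRemoteDCsForLocalConsistencyLevel = false
}

public DCAwareRoundRobinPolicy(String localDc, int usedHostsPerRemoteDc,
                               boolean allowRemoteDCsForLocalConsistencyLevel) {
    this.localDc = localDc;
    this.usedHostsPerRemoteDc = usedHostsPerRemoteDc;
    this.dontHopForLocalCL = !allowRemoteDCsForLocalConsistencyLevel; // -> true for the 1-arg constructor
}

So new DCAwareRoundRobinPolicy(localDC) means usedHostsPerRemoteDc = 0 and dontHopForLocalCL = true: never leave the local DC when the consistency level is DC-local.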
In fact, by default the policy above is not used directly; it is wrapped as a child policy inside com.datastax.driver.core.policies.TokenAwarePolicy:
.withLoadBalancingPolicy(new TokenAwarePolicy(childPolicy))
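For context, a minimal wiring sketch of where that line lives (the contact point below is a placeholder, not from the original setup):

Cluster cluster = Cluster.builder()
        .addContactPoint("10.0.0.1")                              // placeholder contact point
        .withLoadBalancingPolicy(new TokenAwarePolicy(childPolicy)) // the policy built above
        .build();
Session session = cluster.connect();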
And this policy's query plan (the iterator that picks which node each request is sent to) is:
com.datastax.driver.core.policies.TokenAwarePolicy:
@Override
protected Host computeNext() {
    while (iter.hasNext()) {
        Host host = iter.next();
        if (host.isUp() && childPolicy.distance(host) == HostDistance.LOCAL) // must be in the local DC
            return host;
    }

    if (childIterator == null)
        childIterator = childPolicy.newQueryPlan(loggedKeyspace, statement);

    // In our case the child policy yields no host at all, for the reason shown above
    while (childIterator.hasNext()) {
        Host host = childIterator.next();
        // Skip it if it was already a local replica
        if (!replicas.contains(host) || childPolicy.distance(host) != HostDistance.LOCAL)
            return host;
    }
    return endOfData();
}
So, when the configured DC name does not belong to any Cassandra node and remote access is disabled, the driver concludes that no host is up and available.
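For reference, cl.isDCLocal() is true above precisely because the application uses a DC-local consistency level such as LOCAL_QUORUM; a sketch of how that is typically set (keyspace/table names are hypothetical):

// Per statement:
Statement stmt = new SimpleStatement("SELECT * FROM my_ks.my_table")
        .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);

// Or globally, when building the Cluster:
// .withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM))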
The logs actually do show what is happening, although they never tell you that the exception later on is caused by it:
[10/30/2017 08:51:07.771][][main]INFO DCAwareRoundRobinPolicy-Using provided data-center name 'MyDC' for DCAwareRoundRobinPolicy
[10/30/2017 08:51:07.772][][main]WARN DCAwareRoundRobinPolicy-Some contact points don't match local data center. Local DC = MyDC. Non-conforming contact points: ......
Fix: configure the correct local DC.
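A cheap way to catch this class of misconfiguration early is a fail-fast check at startup. A sketch, under the assumption that a connected cluster instance and the configured localDC string are already in hand (neither is from the original post):

Set<String> knownDcs = new HashSet<String>();
for (Host h : cluster.getMetadata().getAllHosts()) {
    knownDcs.add(h.getDatacenter()); // the DC name each node reports
}
if (!knownDcs.contains(localDC)) {
    throw new IllegalStateException("Configured local DC '" + localDC
            + "' matches no Cassandra node; known DCs: " + knownDcs);
}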
(2) system_auth is a single point of failure and the driver version is too old
Symptom: a Cassandra node went down and was restarted, and the application suddenly stopped working; the logs claimed no host was available, even though the nodes clearly were.
Analysis:
The logs show an authentication failure, after which the control connection stops retrying; so even after the nodes were restarted, the driver never noticed they were back up and kept reporting NoHostAvailableException:
[07/10/2017 02:28:56.817][][Reconnection-1]ERROR AbstractReconnectionHandler-Authentication error on host sjwdcaat101.webex.com/10.252.67.130:9042: Username and/or password are incorrect
[07/10/2017 02:28:56.817][][Reconnection-1]ERROR AbstractReconnectionHandler-Retry against testcasandra.test.com/10.224.2.110 have been suspended. It won't be retried unless the node is restarted.
Yet the very same credentials worked fine for a manual login, while the log insisted the username and/or password were incorrect. It turned out the system_auth keyspace's replication factor was set to 1, so once that node went down, authentication became impossible; and this is exactly the kind of error that older driver versions never retry, so the client stayed broken even after the node came back up.
Fix:
(1) Change the replication settings so system_auth is no longer a single point of failure (note that after raising the RF, the new replicas only receive data once a repair has run on system_auth):
ALTER KEYSPACE system_auth WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
(2) Upgrade the driver to 2.1.10+.
New version:
com.datastax.driver.core.AbstractReconnectionHandler
// Retrying on authentication errors makes sense for applications that can update the credentials
// at runtime, we don't want to force them to restart.
protected boolean onAuthenticationException(AuthenticationException e, long nextDelayMs) {
    return true;
}
Old version:
protected boolean onAuthenticationException(AuthenticationException e, long nextDelayMs) {
    return false;
}
Summary:
When every Cassandra node shows Up yet the client throws NoHostAvailableException, the eventual root cause often turns out to be almost laughably mundane, but the error itself is maddening: everything looks healthy while the client is completely dead. When you hit it, the first step should be to locate the earliest point in time where things went wrong, then reconstruct that scene and analyze it; only then can the investigation be properly targeted.