We have a 3-node Ignite 2.7.6 cluster where we defined an off-heap memory/data region "datawarm" with a max size of 10 GB and persistence enabled. We have been facing a very strange issue for more than a month. When connecting via SQLLine or the Java Thin Client API, suddenly only 2 of the three nodes, or sometimes only 1, respond (in a random order each time). Restarting the cluster resolves the issue every time, but it starts again 3-4 hours after the restart. When checking the logs on the node where the connection is not established, we only found the entry below. We have no clue how to resolve this.
Jul 21 18:08:13 node1.example.com Ignite[6097]: 2020-07-21 18:08:13,219 DEBUG c.e.d.Logger [grid-nio-worker-client-listener-0-#30] Got client connection from address: /10.0.0.12:42262
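For reference, a minimal sketch of a Java Thin Client connection with all three nodes listed for failover (the host names, the cache name and the default thin client port 10800 are placeholders, not our actual settings):

import org.apache.ignite.Ignition;
import org.apache.ignite.client.ClientCache;
import org.apache.ignite.client.IgniteClient;
import org.apache.ignite.configuration.ClientConfiguration;

public class ThinClientCheck {
    public static void main(String[] args) throws Exception {
        // Host names are placeholders; 10800 is the default thin client port.
        // Listing all three server nodes lets the client fail over if one stops answering.
        ClientConfiguration cfg = new ClientConfiguration().setAddresses(
                "node1.example.com:10800", "node2.example.com:10800", "node3.example.com:10800");
        try (IgniteClient client = Ignition.startClient(cfg)) {
            // A trivial put/get against a throwaway cache, just to confirm the node answers.
            ClientCache<Integer, String> cache = client.getOrCreateCache("connectivity_check");
            cache.put(1, "ok");
            System.out.println("Response from cluster: " + cache.get(1));
        }
    }
}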
After running Windows Redis 3.2 for over 4 years without issue, I am now getting the intermittent error "ClusterDown Hash slot not served" when connecting from my client application. The error occurs on roughly 50% of the calls to Redis.
The client uses StackExchange.Redis and is written in C#.
We have 2 Redis servers, with masters set up on ports (7000, 7001, 7002) and secondaries on ports (7003, 7004, 7005).
Using RedisInsight to view the Redis servers, we sometimes see the error below:
"The seed nodes have different cluster configuration. This means that your application may be reading from two different clusters."
We have inspected all the configurations and tried failovers, with no success in fixing the issue.
Any idea of what to look at would be much appreciated.
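For reference, one way to compare the topology each seed node reports is a small check like the sketch below - shown with Jedis (a Java client) purely for illustration, since our application uses StackExchange.Redis; the host name is a placeholder:

import redis.clients.jedis.Jedis;

public class SeedNodeCheck {
    public static void main(String[] args) {
        // Ports taken from our setup; the host name is a placeholder.
        int[] ports = {7000, 7001, 7002, 7003, 7004, 7005};
        for (int port : ports) {
            try (Jedis jedis = new Jedis("redis-host", port)) {
                // CLUSTER NODES should report the same node IDs and slot ranges from every
                // seed node; differences would match the "different cluster configuration"
                // warning shown by RedisInsight.
                System.out.println("Node " + port + ":");
                System.out.println(jedis.clusterNodes());
            }
        }
    }
}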
We have an Ambari cluster with 872 data-node machines, running Ambari version 2.6.x.
We currently have some network problems.
After a long investigation we found that the Ambari agent running on some machines does not communicate well with the Ambari server.
As a result we see strange behavior, such as 5 dead data nodes on the Ambari dashboard, while the data-node machines are definitely healthy.
Is it possible to set a more tolerant value in the Ambari agent configuration, so that the ack between the Ambari agent and the Ambari server is allowed a little more time, in order to ride out the network problems?
Something like a timeout or connection time between the Ambari agent and the Ambari server.
First of all, you need to find the root cause of why the data node is showing as dead.
The Ambari agent runs on every node. It is responsible for sending metrics and heartbeats to the Ambari server, which then publishes them to your Ambari web UI.
The NameNode waits for 10 minutes before it declares a data node dead and copies its blocks to other data nodes.
If a data node is showing as dead, please check the Ambari agent status on that specific node by running service ambari-agent status. In parallel, you can check ambari-agent.log on the worker node to see why the Ambari agent stopped working.
You can configure the HTTP timeouts the Ambari agent uses for service tasks:
https://github.com/apache/ambari/blob/trunk/ambari-agent/conf/unix/ambari-agent.ini
There's an HTTP timeout section; you can configure it based on your network throughput.
The file should be in /etc/ambari-agent/ambari.properties
We currently have a Redis cache cluster with 3 masters and 3 slaves hosted on 3 Windows servers (1 master/slave per server). We are using StackExchange.Redis as our client.
We have RDB disabled but AOF enabled, and we are experiencing some problems with the cluster in the following situation:
One of our servers became full and the Redis node on this server was unable to write to the AOF file (the error returned to the client was MISCONF Errors writing to the AOF file: No space left on device).
The cluster did not detect that the node was failing and so did not exclude it from the cluster.
All cache operations were blocked until we freed some space on the server.
We know that we don't need the AOF, so we have disabled it after the incident.
But we would like to confirm or refute our view of Redis clustering: in our understanding, if a node experiences a failure, the cluster should redirect all requests to another one. We have tested that when a master node is stopped, a slave is promoted to master, so we are confident that our cluster is working, but we are not sure why, in our case, the node was not marked as failed.
Is the cluster capable of detecting a node failure when the failure only shows up when a client makes a request to the cluster?
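For reference, a small check of the cluster state, shown with Jedis (Java) purely for illustration since our client is StackExchange.Redis; host and port are placeholders. As far as we understand, failure detection in Redis Cluster is driven by node-to-node pings and cluster-node-timeout rather than by command-level errors such as MISCONF, which would explain why the node was never marked as failing:

import redis.clients.jedis.Jedis;

public class FailureDetectionCheck {
    public static void main(String[] args) {
        // Host and port are placeholders.
        try (Jedis jedis = new Jedis("redis-host", 7000)) {
            // CLUSTER INFO shows whether the cluster considers itself healthy (cluster_state:ok).
            System.out.println(jedis.clusterInfo());
            // cluster-node-timeout governs how long a node can be unreachable to the other
            // nodes' pings before it is flagged; it does not react to MISCONF write errors.
            System.out.println(jedis.configGet("cluster-node-timeout"));
        }
    }
}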
I've configured 3 ZooKeeper and 3 ActiveMQ instances in 1 cluster.
Scenario
3 ActiveMQ instances, with only 1 master and the other two as slaves.
All 3 ActiveMQ instances are running, i.e. sudo service activemq status returns running, but checking the logs, 1 instance (activemq1) is currently waiting for other cluster members, 1 instance (activemq2) has stopped, and 1 instance (activemq3) has an error. Assuming that we only require two instances to elect a master, this setup should be able to run successfully.
Two ActiveMQ instances should be running.
The ZooKeeper instances are running fine.
Issue
Below are the stack traces of the respective ActiveMQ instances. Based on my understanding, at least two properly running ActiveMQ instances are needed for the cluster to nominate a master instance. Given that all ActiveMQ instances report running when issued sudo service activemq status, I'm assuming there is an issue inside each ActiveMQ instance - refer to the stack traces below. Now, I noticed in the logs that activemq1 only fails to run properly because the other ActiveMQ instances failed internally. Notice the stack trace of activemq2: it is stuck after it successfully connected to ZooKeeper, and activemq3 has an issue I still need to figure out. The issue was fixed when I restarted activemq2 and activemq3. However, I can't be sure this won't happen again, hence this question.
activemq1 shows the stack trace below, which I assume is because the other 2 ActiveMQ instances are running but have errors:
Session establishment complete on server 10.5.4.111/10.5.4.111:2181, sessionid = 0x1582db00708000c, negotiated timeout = 4000
Not enough cluster members connected to elect a master.
Not enough cluster members connected to elect a master.
Not enough cluster members connected to elect a master.
activemq2 has the stack trace below, which is the one I don't understand. It stopped after a successful connection to ZooKeeper, which should have been detected by the other ActiveMQ instances belonging to the cluster - activemq1 and activemq3:
Opening socket connection to server 10.5.4.111/10.5.4.111:2181
Socket connection established to 10.5.4.111/10.5.4.111:2181, initiating session
Session establishment complete on server 10.5.4.111/10.5.4.111:2181, sessionid = 0x1582db00708000d, negotiated timeout = 4000
activemq3 has the stack trace below:
org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:568)[apache-jsp-8.0.9.M3.jar:2.3]
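Given the quorum reasoning above, a quick way to see how many brokers have actually registered with ZooKeeper for the election is a check like the sketch below (assuming the default zkPath of /default, since zkPath is not set in the configuration further down; fewer than two entries would explain the "Not enough cluster members" messages):

import org.apache.zookeeper.ZooKeeper;
import java.util.List;

public class ElectionMembersCheck {
    public static void main(String[] args) throws Exception {
        // Address of one ZooKeeper node (taken from the logs above); session timeout in ms.
        ZooKeeper zk = new ZooKeeper("10.5.4.111:2181", 4000, event -> { });
        // replicatedLevelDB registers one entry per connected broker under zkPath
        // (default "/default"); the number of children is the number of registered brokers.
        List<String> members = zk.getChildren("/default", false);
        System.out.println("Registered brokers: " + members);
        zk.close();
    }
}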
Configuration for activemq
The previous config here used a 2s zkSessionTimeout, which is the default. I changed it to 4s, based on what I found from googling, to give an ActiveMQ instance more time to register itself with ZooKeeper.
<persistenceAdapter>
<replicatedLevelDB
directory="${activemq.data}/leveldb"
replicas="3"
bind="tcp://0.0.0.0:61619"
zkAddress="zookeeper_addresses_here"
hostname="activemq_hostname_here"
zkSessionTimeout="4s"
/>
</persistenceAdapter>
Configuration for zookeeper
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/my/data/dir
clientPort=2181
server.1=activemq1_privateIP:2888:3888
server.2=activemq2_privateIP:2888:3888
server.3=activemq3_privateIP:2888:3888
autopurge.purgeInterval=24
autopurge.snapRetainCount=5
Zookeeper version 3.4.9
ActiveMQ version 5.13.4
Setup via OpsWorks
The "directory" attribute of the master and slave MQ instances needs to refer to the same folder.
I've been trying to set up Enterprise Jenkins with the High Availability setup. The current setup consists of two Jenkins masters sharing the same Jenkins home, say master1 and master2, and an installation of the jenkins-ha-monitor-1.1-1.1 rpm on both of these masters, say monitor1 and monitor2. With this setup, according to the documentation at least, the HA plugin should work as expected. The promotion and demotion scripts are similar to the ones in the documentation (only the IP and interface are different, same approach), i.e.
For demotion
ifconfig eth0:2 down
For promotion
ifconfig eth0:2 the.floating.ip
Now, for the nodes to get registered correctly I have to start master1, master2, monitor1 and monitor2 in that order. Tailing the logs for both, I see that when the services are started in that order they are registered correctly by both monitor services as nodes in a cluster, and in the HA status GUI in the Jenkins console.
Now when master1 is killed by sending it a KILL signal, monitor2 recognizes this and runs the promotion script. But monitor1 keeps throwing:
Oct 24, 2012 3:47:36 PM com.cloudbees.jenkins.ha.singleton.HASingleton$3 suspect
INFO: Suspecting a node failure in a cluster: jenkins-master-1-285
Oct 24, 2012 3:47:39 PM com.cloudbees.jenkins.ha.singleton.HASingleton$3 suspect
INFO: Suspecting a node failure in a cluster: jenkins-master-1-285
continuously, without ever running the demotion script. Now, since master2 has taken up the floating IP via its promotion script, and master1 still has that IP because the demotion script is not run, the setup ends up with two boxes claiming the same IP. Moreover, restarting master1 does not do anything, i.e. master1 does not get added to the cluster as a secondary node, monitor1 still keeps spitting the above messages to the log, the floating IP keeps returning "Unable to connect", and master2 and monitor2 show the cluster as master2, monitor2 and monitor1. So my question/problem is twofold - why isn't master1 accepted back into the cluster? And why isn't the demotion script run as it should be?
Also, FYI, I have tried to do a
service jenkins stop
and in that case the demotion script runs but again there are similar issues when
service jenkins start
is run on the master that was stopped earlier, since the promotion script is run regardless of whether a primary Jenkins exists. And in this case the two monitors register different clusters, like so: monitor1: master1, monitor1 and monitor2: master2, monitor2.
Running ifconfig shows that both masters have taken up the floating IP at this point.
Any help is appreciated! Thanks!
Still under investigation with support. The originally reported problem (here) suggests that the two nodes are communicating fine, but promotions/demotions are not run correctly—either a bug in JGroups or in its usage in Jenkins high availability.
But further tests turned up problems with UDP multicast communication, which have been reported for RedHat/CentOS hosts. Work is underway to offer an alternate JGroups stack which does not rely on multicast (or UDP) at all, using the shared $JENKINS_HOME directory to register Jenkins and monitor instances (as TCP address:port records).
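As a rough illustration of the application side, the sketch below opens a TCP-based JGroups channel using the tcp.xml stack configuration bundled with JGroups and prints view changes; the variant described above would additionally use a file-based discovery protocol (such as JGroups' FILE_PING) pointed at the shared $JENKINS_HOME, which is an assumption about the eventual implementation rather than a description of it:

import org.jgroups.JChannel;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

public class HaViewSketch {
    public static void main(String[] args) throws Exception {
        // tcp.xml is the TCP-based protocol stack shipped with JGroups; it avoids UDP multicast.
        JChannel channel = new JChannel("tcp.xml");
        channel.setReceiver(new ReceiverAdapter() {
            @Override
            public void viewAccepted(View view) {
                // Membership changes, e.g. a master being suspected and dropped, show up here.
                System.out.println("Cluster view: " + view);
            }
        });
        channel.connect("jenkins-ha"); // cluster name is a placeholder
        Thread.sleep(60_000);
        channel.close();
    }
}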