DC/OS Mesos-Master rejoined and causes interruptions on the master agents

I'm having a strange issue today. Everything was still working fine yesterday when I left the office, but when I came back to work today my DC/OS dashboard showed me that there weren't any services running or nodes connected.
I've run into this issue once or twice before, and it was related to Marathon not being able to elect a leader. One of the 3 master nodes then also shows a lot of errors in the journal. This can be resolved by stopping and starting the dcos-marathon service on that host, which brings it back into the Marathon group.
That brought the nodes and services back. But now the dashboard sometimes tells me there is only one node connected, then 3 again, then just 1 again, and so on.
When I stop the dcos-mesos-master process on the conflicting host, the flapping stops and I have a stable master cluster (though probably not a really resilient one).
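For reference, the stop/start described above maps to ordinary systemd units on a DC/OS master; roughly (unit names as on a standard install):
# on the misbehaving master, restart Marathon so it rejoins leader election
sudo systemctl stop dcos-marathon
sudo systemctl start dcos-marathon
# if the mesos-master itself keeps flapping, it can be restarted the same way
sudo systemctl stop dcos-mesos-master
sudo systemctl start dcos-mesos-master
# and the journal can be watched for errors afterwards
sudo journalctl -u dcos-mesos-master -f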
It looks like the failing node is trying to become the leader, which causes this. I've tried searching for how to rejoin a failed mesos-master, but came up empty.
I'm running DC/OS on a CoreOS environment.

Although you describe the general behavior, you may need to provide more specifics such as the kernel version, DC/OS version, hardware specs, etc. The simplest answer I can give based on what's been provided is to reach out via their support channel on Slack ( https://dcos-community.slack.com/ ).

Related

What could be the reason for an OpenLDAP replica to skip a few items during synchronization

I have an OpenLDAP cluster with 6 nodes. When an item is added or deleted on the master, synchronization kicks in and the changes are replicated to the other slave nodes in the cluster, but sometimes one of the slave nodes (the same node every time) misses the updates. This leaves a difference between that slave node and the rest of the slave nodes and the master, so when a request goes to the unsynchronized slave it yields invalid results.
In the problematic slave's LDAP logs there is no error information for these operations against the master that would explain the miss, so I can't figure out what has caused this problem; bringing down that slave and re-adding it does not help either.
Has anyone faced a similar problem and figured out the cause?
I have posted the query to the OpenLDAP project itself, and they suggested upgrading, since the version I am using is 2.4.44, which is a couple of years old, and it looks like many replication-related fixes have gone in since 2.4.44.
Below is the OpenLDAP forum link for the same:
https://bugs.openldap.org/show_bug.cgi?id=9701
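One way to confirm that the problematic consumer has actually diverged is to compare the contextCSN of the suffix on the provider and on the suspect replica; a rough sketch (the suffix, host names and bind options below are placeholders for your setup):
# contextCSN on the provider/master
ldapsearch -H ldap://master.example.com -x -b "dc=example,dc=com" -s base contextCSN
# contextCSN on the suspect replica; the values should match when the replica is in sync
ldapsearch -H ldap://replica3.example.com -x -b "dc=example,dc=com" -s base contextCSN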

Best Practice to Upgrade Redis with Sentinels?

I have three redis nodes being watched by 3 sentinels. I've searched around and the documentation seems to be unclear as to how best to upgrade a configuration of this type. I'm currently on version 3.0.6 and I want to upgrade to the latest 5.0.5. I have a few questions on the procedure around this.
Is it ok to upgrade two major versions? I did this in our staging environment and it seemed to be fine. We use pretty basic redis functionality and there are no breaking changes between the versions.
Does order matter? Should I upgrade say all the sentinels first and then the redis nodes, or should the sentinel plane be last after verifying the redis plane? Should I do one sentinel/redis node at a time?
Any advice or experience on this would be appreciated.
I am surprised by the lack of response to this, but I understand that the subject kind of straddles something like stackoverflow and something like stack exchange. I'm also surprised at the lack of documentation I was able to find on the subject.
I did some extensive testing in a staging environment and then proceeded to our production and the procedure I followed seemed to work for the most part:
Upgrading from 3.0.6 to 5.0.5 in our case seems to be working without a hitch. As I said in the original post, we use the basics in redis and there hasn't been much changed from the client perspective.
I went forward upgrading in this order:
The first two sentinel peers and then the sentinel currently in the leader status.
Each of the redis nodes listed as slaves (now known as replicas).
After each node is upgraded, it will want to copy its dump.rdb from the master
A 5.0 node can sync from a 3.0 node, but once a 5.0 node is the master, a 3.0 node cannot sync from it, so once you've failed over to an upgraded node, you can't go back to the earlier version.
Finally, use the sentinels to fail over to an upgraded node as the new master, then upgrade the former master (a rough sketch of this step is below).
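As a rough sketch of that final failover step (assuming a sentinel listening on port 26379 and a master group named mymaster; adjust both to your setup):
# confirm which node the sentinels currently consider the master
redis-cli -p 26379 sentinel get-master-addr-by-name mymaster
# ask the sentinels to promote one of the already-upgraded replicas
redis-cli -p 26379 sentinel failover mymaster
# on the old master (now a replica), verify it is syncing before upgrading it
redis-cli -p 6379 info replication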
Hopefully someone might find this useful going forward.

How to configure Akka.Cluster for services that Crash when binding to port 0

What I am testing is the following scenario:
Start 2 Lighthouses, then start a 3rd service that is a member of the cluster. Its seed nodes are configured to be the two Lighthouses that were previously started.
Now this 3rd service has its HOCON set to bind to port 0, which does its job and gives me a random port.
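For context, the relevant part of the HOCON looks roughly like this (a sketch; the transport section name, system name and Lighthouse ports depend on your Akka.NET version and setup):
akka {
  remote.dot-netty.tcp {                       # older Akka.NET versions use helios.tcp here
    hostname = "localhost"
    port = 0                                   # bind to a random port
  }
  cluster.seed-nodes = [
    "akka.tcp://ClusterName@localhost:4053",   # Lighthouse 1 (port assumed)
    "akka.tcp://ClusterName@localhost:4054"    # Lighthouse 2 (port assumed)
  ]
}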
Now when I force-quit this service to simulate a crash, the logging output from Akka.NET gets REAL chatty (important parts below):
AssociationError...Tried to associate with unreachable remote address
address is now gated for 5000ms ... No connection could be made because the target machine actively refused it.
And it seems like it just goes on forever. I assume this is probably harmless and it just looks like a terrible error. The message itself makes sense: the service is literally gone, so it cannot and will never be able to connect.
Now if I restart the service, since it's configured to bind to port 0 for Akka.Remoting, it will get an entirely new port, so the Unreachable status of the previous, failed incarnation will never be resolved.
Is this the expected behavior? I also think there is a configuration setting that might come into play here:
auto-down-unreachable-after
Now this comes with its own warning:
Using auto-down implies that two separate clusters will automatically be formed in case of network partition.
Setting this does silence the messages:
auto-down-unreachable-after = 3s
And I get a new message after the node is marked unreachable:
Association to [akka.tcp://ClusterName@localhost:58977] having UID [983892349] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
"Remote actorsystem must be restarted to recover from this situation" seems pretty serious and something to avoid. At the same time, given that the service joins on a random port, it really is irrecoverable. In trying to learn more about the UID, it seems that it is assigned internally, so I can only guess there would not be any UID collisions later on, and this would therefore be the proper behavior.
This seems to be the only option outside of
log-info = off
to just silence the logs
I assume it's the logging of the Lighthouse services that is chatty, right? That is 'normal' behaviour of the Akka gossip protocol trying to communicate with the crashed node. When this happens, you have to configure what you want to happen to that node.
The right solution is not always the same for every situation; it can depend, for example, on whether you are running the services on a cloud microservices platform. But one of the options is indeed 'auto-downing'. This will mark the service as 'UNREACHABLE' (as you can see). This means that the node isn't out of the cluster, but the cluster continues to operate without the crashed node. That's also the reason the same node cannot rejoin: it is still marked as 'UNREACHABLE'.
Be aware that auto-downing can result in a 'split-brain' of the cluster, where the cluster splits into two parts (for example, one cluster of 4 nodes gets split into 2 clusters of 2 nodes). This is a situation you don't want, so it may not be the best solution!
Akka.NET has another solution you can configure to deal with this correctly: the Split Brain Resolver. More information on how to configure it: https://getakka.net/articles/clustering/split-brain-resolver.html
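A minimal sketch of what that might look like in HOCON (the exact downing-provider class name varies between Akka.NET versions, so check the linked docs for the one you run):
akka.cluster {
  # use the built-in split brain resolver instead of auto-down-unreachable-after
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
  split-brain-resolver {
    active-strategy = keep-majority   # alternatives include static-quorum, keep-oldest, down-all
  }
}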
These are all strategies to prevent 'split-brain' situations, and they involve sacrificing nodes to keep the cluster consistent. Use these strategies in combination with, for example, a microservices orchestration platform (so that instances restart themselves after crashing/exiting) to create a self-healing Akka cluster.

DC/OS has three roles (master, slave, slave_public); why can't they be put on one host?

I've just been investigating DC/OS, and I see that it has three roles: master, slave, and slave_public. I want to deploy a cluster that can host the master, slave, or slave_public roles on one host, but currently I can't do that.
I want to know why it was designed so that they can't be put on one host. If I do it anyway, could I get some suggestions?
This is just an idea I have. If it can't be done, I'll quit using DC/OS and use Mesos and Marathon instead.
Does anyone have the same idea as me? I look forward to your replies.
This is by design, and work is actually underway to enforce that a machine is installed with only one role, because things break with more than one.
If you're trying to demo / experiment with DC/OS and you only have one machine, you can use virtual machines or Docker to partition that one machine into multiple machines / parts on which you can install DC/OS. dcos-vagrant and dcos-docker can help you there.
As far as installing goes, though, the configuration for each of the three roles is incompatible with the others. The "master" role causes a whole bunch of pieces of software to be installed and started on a host (Mesos-DNS, Mesos master, Marathon, Exhibitor, ZooKeeper, 3dt, adminrouter, rexray, spartan, and navstar, among others), which listen on various ports. The "slave" role causes a machine to have a mesos-agent (Mesos renamed mesos-slave to mesos-agent, hence the disconnect) configured and started on it. The mesos-agent is configured to hand control of most ports greater than 1024 to tasks which are launched by Mesos frameworks on the agent. Several of those ports are used by services which run on masters, resulting in odd conflicts and hard-to-fix bad behavior.
In the case of running the "slave" and "slave_public" roles on the same host, the two conflict more directly, because both cause a mesos-agent to be run on the host, with slightly different configuration. Both mesos-agents (the one configured with the "slave" role and the one with the "slave_public" role) are configured to listen on port 5051. Only one of them can use that port though, so you end up with one of the agents being non-functional.
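You can see this directly on a host where both agent roles were installed (a rough check, assuming a systemd-based install and the standard unit names):
# only one process can hold the agent port
sudo ss -tlnp | grep 5051
# the losing agent unit will be failing / restarting
systemctl status dcos-mesos-slave dcos-mesos-slave-public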
DC/OS only supports running a node as either a master or an agent (slave). You are correct that Mesos itself does not have this limitation. But DC/OS is more than just Mesos/Marathon. To enable all the additional features of DC/OS, there are various components built around Mesos and Marathon. At times these components behave differently depending on whether they are running on a master or an agent, and at other times the components that exist on a master may or may not exist on an agent, or vice versa. So running a master and an agent on the same node would lead to conflicts/issues.
If you are looking to run a small development setup before scaling the solution out to a bigger distributed system, DC/OS Vagrant might be a good starting point.
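A rough sketch of trying it out with dcos-vagrant (machine names follow that project's m*/a*/p* convention for masters, private agents and public agents; see the repo's README for the full, version-specific steps):
git clone https://github.com/dcos/dcos-vagrant
cd dcos-vagrant
# example: one master, one private agent, one public agent, plus the bootstrap machine
vagrant up m1 a1 p1 boot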

Enterprise Jenkins HA plugin not working as it should

I've been trying to set up Enterprise Jenkins with the High Availability setup. The current setup consists of two Jenkins masters sharing the same Jenkins home, say master1 and master2, and an installation of the jenkins-ha-monitor-1.1-1.1 rpm on both of these masters, say monitor1 and monitor2. With this setup, according to the documentation at least, the HA plugin should work as expected. The promotion and demotion scripts are similar to the ones in the documentation (only the IP and interface are different, same approach), i.e.
For demotion
ifconfig eth0:2 down
For promotion
ifconfig eth0:2 the.floating.ip
Now for the nodes to get registered correctly, I have to start master1, master2, monitor1 and monitor2 in that order. Tailing the logs for both, I see that when the services are started in that order they are registered correctly by both monitor services as nodes in a cluster, and in the HA status GUI in the Jenkins console.
Now when master1 is killed by sending it a KILL signal, monitor2 recognizes this and runs the promotion script. But monitor1 keeps throwing:
Oct 24, 2012 3:47:36 PM com.cloudbees.jenkins.ha.singleton.HASingleton$3 suspect
INFO: Suspecting a node failure in a cluster: jenkins-master-1-285
Oct 24, 2012 3:47:39 PM com.cloudbees.jenkins.ha.singleton.HASingleton$3 suspect
INFO: Suspecting a node failure in a cluster: jenkins-master-1-285
continuously, without ever running the demotion script. Now since master2 has taken up the floating IP via its promotion script, and master1 still has that IP because the demotion script was never run, the setup ends up with two boxes claiming the same IP. Moreover, restarting master1 does not do anything, i.e. master1 does not get added to the cluster as a secondary node, monitor1 still keeps spitting the above messages to the log, the floating IP keeps returning "Unable to connect", and master2 and monitor2 show the cluster as master2, monitor2 and monitor1. So my question/problem is twofold: why isn't master1 accepted back into the cluster? And why isn't the demotion script run as it should be?
Also, FYI, I have tried to do a
service jenkins stop
and in that case the demotion script runs but again there are similar issues when
service jenkins start
is run on the master that was stopped earlier, since the promotion script is run regardless of whether a primary Jenkins exists. And in this case the two monitors register different clusters, like so: monitor1: master1, monitor1 and monitor2: master2, monitor2.
Running an ifconfig shows that both masters have taken up the floating ip at this point.
Any help is appreciated! Thanks!
Still under investigation with support. The originally reported problem (here) suggests that the two nodes are communicating fine, but promotions/demotions are not run correctly—either a bug in JGroups or in its usage in Jenkins high availability.
But further tests turned up problems with UDP multicast communication, which has been reported for RedHat/CentOS hosts. Work is underway to offer an alternate JGroups stack which does not rely on multicast (or UDP) at all, using the shared $JENKINS_HOME directory to register Jenkins and monitor instances (as TCP address:port records).
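If you want to verify the multicast problem yourself, JGroups ships small test programs that can be run on the two masters (a sketch; the jar name, multicast address and port are placeholders for your environment):
# on master1: listen for multicast test packets
java -cp jgroups.jar org.jgroups.tests.McastReceiverTest -mcast_addr 228.8.8.8 -port 45566
# on master2: send test packets; if they never show up on master1, UDP multicast between the hosts is broken
java -cp jgroups.jar org.jgroups.tests.McastSenderTest -mcast_addr 228.8.8.8 -port 45566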