Enterprise Jenkins HA plugin not working as it should - cloudbees

I've been trying to setup Enterprise Jenkins with the High Availabilty setup. The current setup consists of two jenkins masters sharing the same jenkins home, say master1 and master2, an installation of the jenkins-ha-monitor-1.1-1.1 rpm on both these masters, say monitor1 and monitor2. With this setup, according to the documentation atleast, the HA plugin should work as expected. Promotion and demotion scripts are similar to the ones in the documentation (only the ip and interface is different, same approach). i.e
For demotion
ifconfig eth0:2 down
For promotion
ifconfig eth0:2 the.floating.ip
Now for the nodes to get registered correctly I have to start master1, master2, monitor1 and monitor2 in that order. Tailing the logs for both I see that when the services are started in that order they are registered correctly by both monitor services as nodes in a cluster, and in the HA status gui in the jenkins console.
Now when master1 is killed by sending it a KILL signal monitor2 recognizes this and runs the promotion script. But monitor one keeps throwing :
Oct 24, 2012 3:47:36 PM
com.cloudbees.jenkins.ha.singleton.HASingleton$3 suspect INFO:
Suspecting a node failure in a cluster: jenkins-master-1-285 Oct 24,
2012 3:47:39 PM com.cloudbees.jenkins.ha.singleton.HASingleton$3
suspect INFO: Suspecting a node failure in a cluster:
jenkins-master-1-285
continuously without ever runnign the demotion script. Now since master2 has taken up the floating ip via its promotion script, and master1 still has that ip because demotion script is not run the setup ends up with two boxes claiming the same ip. Moreover restarting master1 does not do anything, i.e master1 does not get added to the cluster as a seconday node, monitor1 still keeps spitting the above messages to log, the floating ip keeps returning "Unable to connect" and master2 and monitor2 show the cluster as master2,monitor2 and monitor1. So my question/problem is twofold - why isnt master1 accepted back into the cluster? And why isn't the demotion script run as it should?
Also FYI i have tried to do a
service jenkins stop
and in that case the demotion script runs but again there are similar issues when
service jenkins start
is run on the master that was stopped earlier since the promotion script is run regardless of whether a primary jenkins exists. And in this case the two monitors register different clusters like so monitor1 : master1,monitor1 and monitor2 : master2,monitor2.
Running an ifconfig shows that both masters have taken up the floating ip at this point.
Any help is appreciated! Thanks!

Still under investigation with support. The originally reported problem (here) suggests that the two nodes are communicating fine, but promotions/demotions are not run correctly—either a bug in JGroups or in its usage in Jenkins high availability.
But further tests turned up problems with UDP multicast communication, which has been reported for RedHat/CentOS hosts. Work is underway to offer an alternate JGroups stack which does not rely on multicast (or UDP) at all, using the shared $JENKINS_HOME directory to register Jenkins and monitor instances (as TCP address:port records).

Related

Certain java-based containers throw "UnknownHostException"

I have two issues with my kubernetes.
kubernetes version 1.12.5, ubuntu16.04
the first issue is
Occasionally, containers on a specific node are restarted including kube-proxy
kernel: IPVS: rr TCP - no destination available
IPVS: __ip_vs_del_service:enter
net_ratelimit: callbacks suppressed
As these logs are continuously recorded,
The load avarage of node system resources is rather high.
Docker containers uploaded to the node keep repeating the restart.
In this case, node drain can relieve the symptoms.
the second issue
Certain java-based containers throw "UnknownHostException".
Restarting the container manually will resolve the symptoms.
Should I look at the container deployment settings?
Should I look at the cluster dns, resolve related settings?
I want to know if UnknownHostException is related to dns settings.
Can you give me some good comments?

Restarting managed servers by clusters without outage

I want to write script for restarting weblogics managed servers, which would do the following:
It would contain loop ,which would restart first nodes of all clusters at one time.
a.)FORCE_SHUTDOWN
b.)wait for status: SHUTDOWN
c.)START managed servers
d.)wait for status: RUNNING
e.)move to next node of each cluster and repeat until all managed servers are restarted.
So in first iteration it would restart all first nodes of each cluster, in second iteration it would restart the second nodes of each cluster and repeat this action until all managed servers are restarted.
I have not started to writing the script yet, I am newbie with weblogic and this is just concept. Do you have any suggestions how to achieve that goal?
Why reinvent the wheel?
rollingRestart
Category: Control Commands
Use with WLST: Online
Description Initiates a rolling restart of all servers in a domain or all servers in a specific cluster or clusters without interrupting
the service. This command provides the ability to sequentially restart
servers.
This operation involves the graceful shutdown of the servers, and the
servers being restarted without interrupting the service for the user.
Syntax
rollingRestart(target, [options])

DC/OS has three roles, they are master, slave, slave_public, why can't put them on one host?

I just investigate DC/OS, I find that DC/OS has three roles:master, slave, slave_public, I want to deploy a cluster which can host master, slave or slave_public roles on one host, but currently I can't do that.
I want to know that why can't put them on one host when designed. If I do that, could I get some suggestions?
I just have the idea. If I can't do, I'll quit using DCOS, I'll use mesos and marathon.
Is there someone has the idea with me? I look forward to the reply.
This is by design, and things are actually being worked on to re-enforce that an machine is installed with only one role because things break with more than one.
If you're trying to demo / experiment with DC/OS and you only have one machine, you can use Virtual Machines or Docker to partition that one machine into multiple machines / parts which you can install DC/OS on. dcos-vagrant and dcos-docker can help you there.
As far as installing though, the configuration for each of the three roles is incompatible with one another. The "master" role causes a whole bunch of pieces of software to be started / installed on a host (Mesos-DNS, Mesos master, marathon, exhibitor, zookeeper, 3dt, adminrouter, rexray, spartan, navstar among others) which listen on various ports. The "slave" role causes a machine to have a mesos-agent (mesos renamed mesos-slave to mesos-agent, hence the disconnect) configured and started on the agent. The mesos-agent is configured to control / most ports greater than 1024 to tasks which are launched by mesos frameworks on the agent. Several of those ports are used by services which are run on masters, resulting in odd conflicts and hard to fix bad behavior.
In the case of running the "slave" and "slave_public" on the same host, those two conflict more directly, because both of them cause mesos-agent to be run on the host, with slightly different configuration. Both the mesos-agent (the one configured with the "slave" role and the one with the "slave_public" role are configured to listen on port 5051. Only one of them can use it though, so you end up with one of the agents being non-functional.
DC/OS only supports running a node as either a master or an agent(slave). You are correct that Mesos does not have this limitation. But DC/OS is more than just a Mesos/Marathon. To enable all the additional features of DC/OS there are various components built around Mesos and Marathon. At times these components behave differently whether they are running on a master or an agent and at other times the components that exist on a master may or may not exist on an agent or vice versa. So running a master and an agent on the same node would lead to conflicts/issues.
If you are looking to run a small development setup before scaling the solution out to a bigger distributed system DC/OS Vagrant might be a good starting point.

DC/OS Mesos-Master rejoined and causes interruptions on the master agents

I'm having a strange issue today. First of all, every thing was still working fine yesterday when I left the office, but today when I went back to work my DC/OS dashboard showed my that there weren't any services running, or Nodes connected.
I've ran into this issue once or twice before and was related to the marathon not being able to elect a master. One of the 3 master nodes is then also showing a lot of errors in the journal. This can be resolved by stopping / starting the dcos-marathon service on that host, which brings it back into the marathon group.
I did see the Nodes and services again. But now it sometimes tells me there is only one Node connected and then 3 again, and just 1 again, etc..
When I stop the dcos-mesos-master process on the conflicting host, this stops and I have a stable master cluster (but probably not really resilient).
It looks like the failing node is trying to become the master, which causes this.. I've tried to search about rejoining a failed mesos-master.. but came up
I'm running DC/OS on a CoreOS environment.
Although a general behavior is described, you may need to provide more specifics such as the kernel version, dc/os version, specs and etc. The simplest answer I can provide based what's been given is to reach out via their support channel on Slack ( https://dcos-community.slack.com/ ).

Flink Jobmanager not able to see task managers

So I've installed an apache flink cluster on our network. I've done the configurations as illustrated below. This Master (JobManager) starts, and sends the start command to all the slaves via ssh. I can see that the task managers are running after they were started by the master node.
Config file on all nodes:
jobmanager.rpc.address: flmaster
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 2048
taskmanager.numberOfTaskSlots: 1
taskmanager.memory.preallocate: false
parallelism.default: 1
jobmanager.web.port: 8081
taskmanager.tmp.dirs: /apps/storage/runtime/flink/workspace
recovery.mode: zookeeper
recovery.zookeeper.quorum:zk1:2181, zk2:2181, zk3:2181
recovery.zookeeper.storageDir: /apps/runtime/flink/recovery
env.java.home: /apps/java/
Then i have a file called slaves in the config folder with a list of the slaves nodes.
flSlave1
flSlave2
flSlave3
I then start it
../bin/start-cluster.sh
This opens an ssh session to all the slave nodes, and starts the task manager. I can see this with ps ax | grep java
I can open the Web-Ui on flMaster:8081
On the WebUI I can see the slave node count is 0. I have no task managers.
As a test, I started the wordcount.jar job, and it tells me it cannot run the job since there are no slots open.
/apps/flink/bin/flink run /apps/flink/examples/batch/WordCount.jar
the response:
07/20/2016 13:19:01 Job execution switched to status FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job.*
Well I guess if there is no task managers/slave nodes, there will be no slots.
Any one ever seen this issue?
Use fully qualified hostname instead of short name. For e.g hostname.xyx.com instead of just hostname. OR you could also try using ip address.
Try doing a telnet on jobmanager machine rpc port. The taskmanagers talk with jobmanager through rpc. So check the network settings whether you are able to access the jobmanager and task managers' rpc ports or not.
Also check the blob server port. Check the taskmanager logs whether it is able to connect to the jobmanager blob server or not.