DC/OS has three roles (master, slave, slave_public); why can't they be put on one host? - dcos

I have just been investigating DC/OS, and I see that it has three roles: master, slave, and slave_public. I want to deploy a cluster in which a single host can hold the master, slave, or slave_public roles at the same time, but currently I can't do that.
I want to know why the design doesn't allow these roles to be put on one host. If I try to do it anyway, could I get some suggestions?
If it can't be done, I'll stop using DC/OS and use Mesos and Marathon directly instead.
Has anyone else had the same idea? I look forward to your replies.

This is by design, and work is actually underway to enforce that a machine is installed with only one role, because things break when there is more than one.
If you're trying to demo or experiment with DC/OS and you only have one machine, you can use virtual machines or Docker to partition that one machine into several smaller machines on which you can install DC/OS. dcos-vagrant and dcos-docker can help you there.
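For example, bringing up a small cluster with dcos-vagrant looks roughly like the sketch below. The template filename is an assumption from memory (the available VagrantConfig templates vary by release), so check the project README for the exact names.
# Sketch: run a small DC/OS cluster as VirtualBox VMs via dcos-vagrant.
git clone https://github.com/dcos/dcos-vagrant
cd dcos-vagrant
# Hypothetical template name: 1 master, 1 private agent, 1 public agent.
cp VagrantConfig-1m-1a-1p.yaml VagrantConfig.yaml
vagrant up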
As far as installing goes, though, the configuration for each of the three roles is incompatible with the others. The "master" role causes a whole bunch of pieces of software to be started / installed on a host (Mesos-DNS, Mesos master, Marathon, Exhibitor, ZooKeeper, 3dt, adminrouter, rexray, spartan, and navstar, among others), which listen on various ports. The "slave" role causes a machine to have a mesos-agent (Mesos renamed mesos-slave to mesos-agent, hence the disconnect) configured and started on the host. The mesos-agent is configured to hand out most ports greater than 1024 to tasks launched by Mesos frameworks on the agent. Several of those ports are used by services which run on masters, resulting in odd conflicts and hard-to-debug bad behavior.
In the case of running the "slave" and "slave_public" roles on the same host, the two conflict more directly, because both of them cause a mesos-agent to be run on the host, with slightly different configuration. Both mesos-agents (the one configured with the "slave" role and the one with the "slave_public" role) are configured to listen on port 5051. Only one of them can bind it, though, so you end up with one of the agents being non-functional.
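You can see the underlying constraint with any two processes that try to bind the same port; this little sketch uses nc (nothing DC/OS-specific) to mirror what happens when two mesos-agents on one host both want 5051:
# First listener binds the port successfully and waits in the background.
nc -l 5051 &
# Second listener fails with "Address already in use" -- the same failure mode
# as two mesos-agents both configured for port 5051 on one host.
# (Some netcat variants need the -p flag: nc -l -p 5051.)
nc -l 5051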

DC/OS only supports running a node as either a master or an agent (slave). You are correct that Mesos does not have this limitation, but DC/OS is more than just Mesos and Marathon. To enable all the additional features of DC/OS, various components are built around Mesos and Marathon. At times these components behave differently depending on whether they run on a master or an agent, and at other times a component that exists on a master may not exist on an agent, or vice versa. So running a master and an agent on the same node would lead to conflicts/issues.
If you are looking to run a small development setup before scaling the solution out to a bigger distributed system DC/OS Vagrant might be a good starting point.

Related

Hyper-V Anti Affinity

I'm trying to set up anti-affinity in a clustered Hyper-V setup but am struggling to get any VMs to stay apart. It seems that the anti-affinity rules are simply not honored.
Setup:
3 x Hyper-V servers (server1, server2, server3)
3 x VMs (web_test_1, web_test_2, web_test_3)
Attempt 1:
I ran the below script on server1:
# Put all three web VMs into the same anti-affinity class ("WEB Servers").
$WEBAntiAffinity = New-Object System.Collections.Specialized.StringCollection
$WEBAntiAffinity.Add("WEB Servers")
(Get-ClusterGroup -Name WEB_TEST_1).AntiAffinityClassNames = $WEBAntiAffinity
(Get-ClusterGroup -Name WEB_TEST_2).AntiAffinityClassNames = $WEBAntiAffinity
(Get-ClusterGroup -Name WEB_TEST_3).AntiAffinityClassNames = $WEBAntiAffinity
# Verify the class names were applied to each cluster group.
Get-ClusterGroup | Select-Object -Property Name,AntiAffinityClassNames
All three VMs had been created on server1 and were powered off before I ran the above.
When I powered them on, they all started and stayed on server1.
Attempt 2:
I ran the same script above on the additional servers (server2 and server3).
I powered off the VMs and powered them back on again; they again all remained on server1.
Attempt 3:
After having run the script on all the servers, I restarted the servers one by one. The VMs moved between nodes as normal during the reboots, but once all were rebooted I stopped all the VMs, moved them to server1 and then started them again.
My assumption was that two of them would move before powering on, but that didn't happen; they all started on server1.
Does anyone know what I'm doing wrong here? Am I missing some prerequisites? I haven't been able to find many examples of this online.
This is not called out specifically in Microsoft's documentation, but to make anti-affinity rules work in Hyper-V you also need System Center Virtual Machine Manager (SCVMM). SCVMM reads the anti-affinity rules (and other rules such as affinity, priority, etc.) and performs the migrations needed to apply them.
This is analogous to the VMware stack, where ESXi is the hypervisor, the VM configuration contains the rules, and vCloud Director or vSphere (DRS) actually applies them.

How can I configure Apache Zookeeper with redundancy on only two physical frames?

I would like to have a high-availability/redundant installation of Zookeeper running in my production environment. The problem is that I only have 2 physical frames available, so that rules out configuring a Zookeeper cluster/ensemble since I'd only have redundancy if the frame with the minority of servers goes down. What is the best practice in this situation? Is it possible to have a separate standalone install running on each frame connected to the same set of SOLR nodes or to use one server as primary and one as backup?
ZooKeeper needs at least 3 nodes to form a fault-tolerant ensemble. In your scenario, if you cannot get another machine, you can set up multiple ZooKeeper nodes on the same machine, each in a different directory and using different ports.
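As a minimal sketch of that layout (paths, ports and the start command are illustrative; adjust to your installation), each node gets its own dataDir, myid and client port, while the server.N lines list all three peers on localhost:
# Create three ZooKeeper node directories on one machine, each with its own config.
for i in 1 2 3; do
  mkdir -p /var/zookeeper/node$i
  echo $i > /var/zookeeper/node$i/myid
  cat > /var/zookeeper/node$i/zoo.cfg <<EOF
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper/node$i
clientPort=218$i
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890
EOF
done
# Start each node against its own config (exact invocation depends on your ZooKeeper version):
# bin/zkServer.sh start /var/zookeeper/node1/zoo.cfg
Note that this only protects against a single ZooKeeper process failing; if the frame hosting two of the three nodes goes down, the ensemble still loses quorum.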

DC/OS Mesos-Master rejoined and causes interruptions on the master agents

I'm having a strange issue today. First of all, everything was still working fine yesterday when I left the office, but today when I went back to work my DC/OS dashboard showed me that there weren't any services running or nodes connected.
I've run into this issue once or twice before, and it was related to Marathon not being able to elect a leader. One of the 3 master nodes then also shows a lot of errors in the journal. This can be resolved by stopping/starting the dcos-marathon service on that host, which brings it back into the Marathon group.
After that I did see the nodes and services again, but now the dashboard sometimes tells me there is only one node connected, then 3 again, then just 1 again, and so on.
When I stop the dcos-mesos-master process on the conflicting host, the flapping stops and I have a stable master cluster (though probably not a really resilient one).
It looks like the failing node is trying to become the master, which causes this. I've tried to search for how to rejoin a failed mesos-master, but came up empty.
I'm running DC/OS on a CoreOS environment.
Although the general behavior is described, you may need to provide more specifics such as the kernel version, DC/OS version, machine specs, etc. The simplest answer I can give based on what's been provided is to reach out via the community support channel on Slack (https://dcos-community.slack.com/).

Do I need to run the WebLogic node manager on a single machine that has multiple WebLogic instances?

Foreword: I'm using Java 6u45, WebLogic 10.3.6, and Ubuntu Desktop 14.04 64-bit.
I just started as a student assistant at one of my state's IT offices. On my first day I was tasked with testing WebLogic on Ubuntu (Windows isn't case sensitive, which caused issues later because WebLogic is...). I started experimenting with clustering, and my setup is now as follows:
1 Ubuntu machine
1 domain
6 servers: Admin server, wls1-4, and wlsmaster (wlsmaster was supposed to be what wls1 and wls2 reported to within the cluster because I set the cluster to be unicast, but that's a secondary question for now).
2 clusters: cluster1 and cluster2. wls1, wls2, and wlsmaster are on cluster1. wls3 and 4 are on cluster2.
Given my setup, do I even need to use the node manager, since I'm only using one physical machine? Secondary question: if I want to use unicast, how do I set the master? $state uses unicast for the few WebLogic servers we have, so I was told to check that out.
A few things:
No, you don't necessarily have to use a node manager, but it will make your life easier. When you log into the WebLogic admin console and attempt to start one of your servers, e.g. wls1-4, the Admin server will try to talk to the node manager to start them. Without the node manager you will have to start each server individually using the startManagedWebLogic.sh script, and if you need to bring servers up and down often that will be very annoying.
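As a rough sketch of what that looks like without a node manager (the domain path and admin URL here are placeholders for this example):
# Start the admin server first, then each managed server by name against the admin URL.
cd /path/to/domains/mydomain/bin
./startWebLogic.sh &
./startManagedWebLogic.sh wls1 http://localhost:7001 &
./startManagedWebLogic.sh wls2 http://localhost:7001 &
# ...and so on for wls3 and wls4; stopping them means running stopManagedWebLogic.sh per server.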
With regards to Unicast it is pretty easy to set up (we just leave all the default values alone). Here is the pertinent info from the Oracle Docs:
"Each of the Managed Servers in a WebLogic Server cluster has a name. For unicast clusters, WebLogic Server reads these Managed Server names and then sorts them into an ordered list by alphanumeric name. The first 10 Managed Servers in the list (up to 10 Managed Servers) become the first unicast clustering group. The second set of 10 Managed Servers (if applicable) becomes the second group, and so on until all Managed Servers in the cluster are organized into groups of 10 Managed Servers or less. The first Managed Server for each group becomes the group leader for the other (up to) nine Managed Servers in the group."
So you will want to name your master servers in such a way that they sort first alphanumerically in the cluster (for example, a server named a_master would sort before wls1 and wls2 and become the group leader). That said, for your use case I doubt you need those master servers at all. Just have 2 clusters, one with wls1-2 and one with wls3-4.

Enterprise Jenkins HA plugin not working as it should

I've been trying to set up Enterprise Jenkins with the High Availability setup. The current setup consists of two Jenkins masters sharing the same Jenkins home, say master1 and master2, and an installation of the jenkins-ha-monitor-1.1-1.1 rpm on both of these masters, say monitor1 and monitor2. With this setup, according to the documentation at least, the HA plugin should work as expected. The promotion and demotion scripts are similar to the ones in the documentation (only the IP and interface are different, same approach), i.e.
For demotion
ifconfig eth0:2 down
For promotion
ifconfig eth0:2 the.floating.ip
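On newer distributions where ifconfig is deprecated, the equivalent iproute2 one-liners would look roughly like this (the interface, prefix length, and the floating IP are placeholders, matching the alias above):
# Promotion: attach the floating IP to the interface.
ip addr add the.floating.ip/24 dev eth0
# Demotion: release it again.
ip addr del the.floating.ip/24 dev eth0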
Now, for the nodes to get registered correctly, I have to start master1, master2, monitor1 and monitor2 in that order. Tailing the logs of both monitors, I see that when the services are started in that order they are registered correctly by both monitor services as nodes in a cluster, and in the HA status GUI in the Jenkins console.
Now when master1 is killed by sending it a KILL signal, monitor2 recognizes this and runs the promotion script. But monitor1 keeps throwing:
Oct 24, 2012 3:47:36 PM com.cloudbees.jenkins.ha.singleton.HASingleton$3 suspect
INFO: Suspecting a node failure in a cluster: jenkins-master-1-285
Oct 24, 2012 3:47:39 PM com.cloudbees.jenkins.ha.singleton.HASingleton$3 suspect
INFO: Suspecting a node failure in a cluster: jenkins-master-1-285
continuously, without ever running the demotion script. Since master2 has taken up the floating IP via its promotion script, and master1 still has that IP because the demotion script is never run, the setup ends up with two boxes claiming the same IP. Moreover, restarting master1 does not do anything: master1 does not get added to the cluster as a secondary node, monitor1 still keeps spitting the above messages to the log, the floating IP keeps returning "Unable to connect", and master2 and monitor2 show the cluster as master2, monitor2 and monitor1. So my question/problem is twofold: why isn't master1 accepted back into the cluster? And why isn't the demotion script run as it should be?
Also, FYI, I have tried to do a
service jenkins stop
and in that case the demotion script runs, but again there are similar issues when
service jenkins start
is run on the master that was stopped earlier, since the promotion script is run regardless of whether a primary Jenkins already exists. And in this case the two monitors register different clusters, like so: monitor1: master1, monitor1 and monitor2: master2, monitor2.
Running ifconfig shows that both masters have taken up the floating IP at this point.
Any help is appreciated! Thanks!
Still under investigation with support. The originally reported problem (here) suggests that the two nodes are communicating fine, but promotions/demotions are not run correctly: either a bug in JGroups or in its usage in Jenkins high availability.
But further tests turned up problems with UDP multicast communication, which has been reported for RedHat/CentOS hosts. Work is underway to offer an alternate JGroups stack which does not rely on multicast (or UDP) at all, using the shared $JENKINS_HOME directory to register Jenkins and monitor instances (as TCP address:port records).