Resolving Chef Dependencies - automation

In my lab, I am currently managing a 20 nodes cluster with Cobbler and Chef. Cobbler is used for OS provisioning and basic network settings, which is working fine as expected. I can manage several OS distributions with preseed-based NQA installation and local repo mirroring.
We also successfully installed chef server and started managing nodes but chef is not working as I expected. The issue is that I am not being able to set node dependencies within chef. Our one important use case is this:
We are setting up ceph and openstack on these nodes
Ceph should be installed before openstack because openstack uses ceph as back-end storage
Ceph monitor should be installed before Ceph osd because creating osd requires talking to monitor
The dependencies between Openstack and Ceph does not matter because it is a dependency in one node; just installing openstack later would resolve the issue.
However, a problem arises with the dependency between ceph monitor and ceph osd. Ceph osd provisioning requires a running ceph monitor. Therefore, ceph osd recipe should always be run after ceph mon recipe finishes in another node. Our current method is just to run "chef-client" in "ceph-osd" node after "chef-client" run completely finishes in "ceph-mon" node but I think this is a too much of a hassle. Is there a way to set these dependencies in Chef so that nodes will provision sequentially according to their dependencies? If not, are there good frameworks who handles this?

In chef itself, I know no method for orchestrating (that's not chef Job).
A workaround given your use case could be to use tags and search.
You monitor recipe could tag the node at end (with tag("CephMonitor") or with setting any attribute you wish to search on).
After that the solr index of chef has to catch it up (usually in the minute) and you can use search in the Cephosd recipe you can do something like this:
CephMonitor = search(:node,"tags:CephMonitor") || nil
return if CephMonitor.nil?
[.. rest of the CephOsd recipe, using the CephMonitor['fqdn'] or other attribute from the node ..]
The same behavior can be used to avoid trying to run the OpenStack recipe until the osd has run.
The drawback if that it will take 2 or 3 chef run to get to a converged infrastructure.
I've nothing to recommend to do the orchestration, zookeeper or consul could help instead of tags and to trigger the runs.
Rundeck can tage the runs on different nodes and aggregate this in one job.
Which is best depends on your feeling there.

Related

Google cloud kubernetes cluster newbie question

I am a newbie of GKE. I created a GKE cluster with very simple setup. It only has on gpu node and all other stuff was default. After the cluster is up, I was able to list the nodes and ssh into the nodes. But I have two questions here.
I tried to install nvidia driver using the command:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
It output that:
kubectl apply --filename https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
daemonset.apps/nvidia-driver-installer configured
But 'nvidia-smi' cannot be found at all. Should I do something else to make it work?
On the worker node, there wasn't the .kube directory and the file 'config'. I had to copy it from the master node to the worker node to make things work. And the config file on the master node automatically updates so I have to copy again and again. Did I miss some steps in the creation of the cluster or how to resolve this problem?
I appreciate someone can shed some light on this. It drove me crazy after working on it for several days.
Tons of thanks.
Alex.
For the DaemonSet to work, you need to have a tag on your worker Node as cloud.google.com/gke-accelerator (see this line). The DaemonSet checks for this tag on a node before scheduling any pods for installing the driver. I'm guessing a default node pool you create did not have this tag on it. You can find more details on this on the GKE docs here.
The worker nodes, by design are just that worker nodes. They do not need privileged access to the Kubernetes API so they don't need any kubeconfig files. The communication between worker nodes and the API is strictly controlled through the kubelet binary running on the node. Therefore, you will never find kubeconfig files on a worker node. Also, you should never put them on the worker node either, since if a node gets compromised, the keys in that file can be used to damage the API Server. Instead, you should make it a habit to either use the master nodes for kubectl commands, or better yet, have the kubeconfig on your local machine, and keep it safe, and issue commands remotely to your cluster.
After all, all you need is access to an API endpoint for your Kubernetes API server, and it shouldn't matter where you access it from, as long as the endpoint is reachable. So, there is no need whatsoever to have kubeconfig on the worker nodes :)

Ignite error upgrading the setup in Kubernetes

While I upgraded the Ignite that is deployed in Kubernetes (EKS) for Log4j vulnerability, I get the error below
[ignite-1] Caused by: class org.apache.ignite.spi.IgniteSpiException: BaselineTopology of joining node (54b55de4-7742-4e82-9212-7158bf51b4a9) is not compatible with BaselineTopology in the cluster. Joining node BlT id (4) is greater than cluster BlT id (3). New BaselineTopology was set on joining node with set-baseline command. Consider cleaning persistent storage of the node and adding it to the cluster again.
The setup is a 3 node cluster, with native persistence enabled (PVC). This seems to be occurring many times in our journey with Apache Ignite, having followed the official guide.
I cannot clean the storage as the pod gets restarted every now and then, by the time I get the pod shell the pod crash & restarts.
This might happen to be due to the wrong startup order, starting nodes manually in reverse order may resolve this, but I'm not sure if that is possible in K8s. Another possible issue might be related to the baseline auto-adjustment that might change your baseline unexpectedly, I suggest you turn it off if it's enabled.
One of the workarounds to clean a DB of a failing POD might be (quite tricky) - to replace Ignite image with some simple image like a plain Debian or Alpine docker images (just to be able to access CLI) keeping the same PVC attached, and once you fix the persistence issue, set the Ignite image back. The other one is - to access underlying PV directly if possible and do surgery in place.

Altering host sysctl params from privileged container

we are using GKE for NET Core containers with ASP. Each ASP container uses at least one inotify instance (to watch Razer templates) and can use another to watch config files (if not explicitly disabled).
Linux default limit for number of inotify instances per host is 128 (fs.inotify.max_user_instances=128). Some instances are consumed by kubernetes itself (e.g. fluend daemons). So when lots of pods are deployed on single host, host runs out of free inotify instances and containers are stuck in crash loop.
Since we use GKE, we cannot manage worker nodes and alter sysctl settings directly.
My questions are:
Can I somehow alter sysctl setting for host VM through privileged container?
Is there a way to setup kubernetes scheduler to take number of free inotify instances (or at least a number of pods deployed) into account when selecting a node to deploy new pods?
As noted here, "Sysctls with no namespace are called node-level sysctls. If you need to set them, you must manually configure them on each node’s operating system, or by using a DaemonSet with privileged containers".
Regarding scheduling pods, there doesn't seem to be a way for the scheduler to take inotify or number of pods into account when scheduling. The scheduler is only aware of available resources (CPU and memory) and pod specs such as pod or node affinity.
To attain the kind of spread you are looking for will take a good deal of planning and use of both resource requests and pod affinity/anti-affinity. You can review this.

DC/OS running a service on each agent

Is there any way of running a service (single instance) on each deployed agent node? I need that because each agent needs to mount a storage from S3 using s3fs
The name of the feature you're looking for is "daemon tasks", but unfortunately, it's still in the planning phase for Mesos itself.
Due to the fact that schedulers don't know the entire state of the cluster, Mesos needs to add a feature to enable this functionality. Once in Mesos it can be integrated with DC/OS.
The primary workaround is to use Marathon to deploy an app with the UNIQUE constraint ("constraints": [["hostname", "UNIQUE"]]) and set the app instances to the number of agent nodes. Unfortunately this means you have to adjust the instances number when you add new nodes.

DC/OS has three roles, they are master, slave, slave_public, why can't put them on one host?

I just investigate DC/OS, I find that DC/OS has three roles:master, slave, slave_public, I want to deploy a cluster which can host master, slave or slave_public roles on one host, but currently I can't do that.
I want to know that why can't put them on one host when designed. If I do that, could I get some suggestions?
I just have the idea. If I can't do, I'll quit using DCOS, I'll use mesos and marathon.
Is there someone has the idea with me? I look forward to the reply.
This is by design, and things are actually being worked on to re-enforce that an machine is installed with only one role because things break with more than one.
If you're trying to demo / experiment with DC/OS and you only have one machine, you can use Virtual Machines or Docker to partition that one machine into multiple machines / parts which you can install DC/OS on. dcos-vagrant and dcos-docker can help you there.
As far as installing though, the configuration for each of the three roles is incompatible with one another. The "master" role causes a whole bunch of pieces of software to be started / installed on a host (Mesos-DNS, Mesos master, marathon, exhibitor, zookeeper, 3dt, adminrouter, rexray, spartan, navstar among others) which listen on various ports. The "slave" role causes a machine to have a mesos-agent (mesos renamed mesos-slave to mesos-agent, hence the disconnect) configured and started on the agent. The mesos-agent is configured to control / most ports greater than 1024 to tasks which are launched by mesos frameworks on the agent. Several of those ports are used by services which are run on masters, resulting in odd conflicts and hard to fix bad behavior.
In the case of running the "slave" and "slave_public" on the same host, those two conflict more directly, because both of them cause mesos-agent to be run on the host, with slightly different configuration. Both the mesos-agent (the one configured with the "slave" role and the one with the "slave_public" role are configured to listen on port 5051. Only one of them can use it though, so you end up with one of the agents being non-functional.
DC/OS only supports running a node as either a master or an agent(slave). You are correct that Mesos does not have this limitation. But DC/OS is more than just a Mesos/Marathon. To enable all the additional features of DC/OS there are various components built around Mesos and Marathon. At times these components behave differently whether they are running on a master or an agent and at other times the components that exist on a master may or may not exist on an agent or vice versa. So running a master and an agent on the same node would lead to conflicts/issues.
If you are looking to run a small development setup before scaling the solution out to a bigger distributed system DC/OS Vagrant might be a good starting point.