Altering host sysctl params from privileged container - asp.net-core

We are using GKE to run ASP.NET Core containers. Each ASP.NET Core container uses at least one inotify instance (to watch Razor templates) and can use another to watch config files (unless that is explicitly disabled).
The default Linux limit on the number of inotify instances is 128 per user on a host (fs.inotify.max_user_instances=128). Some instances are consumed by Kubernetes itself (e.g. the fluentd daemons). So when lots of pods are deployed on a single host, the host runs out of free inotify instances and containers get stuck in a crash loop.
Since we use GKE, we cannot manage the worker nodes and alter sysctl settings directly.
My questions are:
Can I somehow alter a sysctl setting on the host VM through a privileged container?
Is there a way to set up the Kubernetes scheduler to take the number of free inotify instances (or at least the number of pods deployed) into account when selecting a node for new pods?

As noted here, "Sysctls with no namespace are called node-level sysctls. If you need to set them, you must manually configure them on each node’s operating system, or by using a DaemonSet with privileged containers".
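A minimal sketch of the DaemonSet approach from the quote, written as a small Python script that prints a manifest as JSON (kubectl accepts JSON as well as YAML). The DaemonSet name, the busybox image, the kube-system namespace and the value 1024 are illustrative assumptions, not values from the question, and whether a given GKE node image permits this should be verified before relying on it.

```python
import json


def sysctl_daemonset(key: str, value: int, name: str = "inotify-sysctl") -> dict:
    """Build a DaemonSet that runs one privileged pod per node to set a node-level sysctl."""
    labels = {"app": name}
    # Set the sysctl once, then sleep so the pod stays Running instead of restarting.
    script = f"sysctl -w {key}={value} && while true; do sleep 3600; done"
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name, "namespace": "kube-system"},  # assumed namespace
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "sysctl",
                        "image": "busybox",  # assumed image that ships sh and sysctl
                        "command": ["sh", "-c", script],
                        # Privileged is what allows writing a node-level sysctl.
                        "securityContext": {"privileged": True},
                    }],
                },
            },
        },
    }


# 1024 is an arbitrary example value, not a recommendation.
manifest = sysctl_daemonset("fs.inotify.max_user_instances", 1024)
print(json.dumps(manifest, indent=2))
```

The printed document could be saved to a file and applied with kubectl apply -f. New nodes added to the pool would pick up the setting automatically because a DaemonSet schedules one pod on every node.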
Regarding scheduling pods, there doesn't seem to be a way for the scheduler to take inotify or number of pods into account when scheduling. The scheduler is only aware of available resources (CPU and memory) and pod specs such as pod or node affinity.
Attaining the kind of spread you are looking for will take a good deal of planning and the use of both resource requests and pod affinity/anti-affinity. You can review this.
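As a rough illustration of combining resource requests with pod anti-affinity, the sketch below builds a pod spec fragment as a Python dict and prints it as JSON. The app label, image name and request sizes are hypothetical placeholders. The anti-affinity term is a soft preference, so the scheduler tries to keep pods with the same label on different nodes but will still co-locate them if it has to.

```python
import json


def spread_pod_spec(app: str, image: str, cpu: str, memory: str) -> dict:
    """Pod spec fragment: resource requests plus a soft anti-affinity on the node hostname."""
    anti_affinity = {
        "preferredDuringSchedulingIgnoredDuringExecution": [{
            "weight": 100,
            "podAffinityTerm": {
                # Prefer nodes that do not already run a pod carrying the same app label.
                "labelSelector": {"matchLabels": {"app": app}},
                "topologyKey": "kubernetes.io/hostname",
            },
        }]
    }
    return {
        "affinity": {"podAntiAffinity": anti_affinity},
        "containers": [{
            "name": app,
            "image": image,
            # Requests are what the scheduler counts against a node's capacity.
            "resources": {"requests": {"cpu": cpu, "memory": memory}},
        }],
    }


# All values below are hypothetical.
pod_spec = spread_pod_spec("aspnet-app", "example/aspnet-app:latest", "250m", "256Mi")
print(json.dumps(pod_spec, indent=2))
```

Sizing the requests so that only a known number of pods fits on a node is an indirect way to cap pods per node, since the scheduler itself has no notion of inotify usage.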

Related

Certain java-based containers throw "UnknownHostException"

I have two issues with my Kubernetes cluster.
Kubernetes version 1.12.5, Ubuntu 16.04.
The first issue:
Occasionally, containers on a specific node are restarted, including kube-proxy. The kernel log shows:
kernel: IPVS: rr TCP - no destination available
IPVS: __ip_vs_del_service:enter
net_ratelimit: callbacks suppressed
While these logs are being recorded continuously, the load average on the node is rather high.
The Docker containers running on the node keep restarting.
In this case, draining the node relieves the symptoms.
The second issue:
Certain java-based containers throw "UnknownHostException".
Restarting the container manually will resolve the symptoms.
Should I look at the container deployment settings?
Should I look at the cluster dns, resolve related settings?
I want to know if UnknownHostException is related to dns settings.
Any advice would be appreciated.

HTTPD response time is increasing after 90 TPS

I am running a load test to tune Apache httpd to serve the maximum number of concurrent HTTPS requests. Below are the details of my test.
System
I dockerized httpd and deployed it in OpenShift with a pod configuration of 4 CPUs and 8 GB RAM.
The load is generated from JMeter with 200 threads, a 600-second ramp-up time, and an infinite loop count, as a long-duration run (JMeter runs in the same network on a VM with 16 CPUs and 32 GB RAM).
I compiled httpd with the worker MPM and deployed it in OpenShift.
Issue
1. httpd does not scale beyond 90 TPS, even after trying multiple worker MPM configurations (there is no difference between the default and higher settings).
2. Above 90 TPS, the average response time increases and TPS drops.
Please let me know what the issue could be, or whether any further information is required.
I don't have the answer, but I do have questions.
1/ What does your Dockerfile look like?
2/ What does your OpenShift cluster look like? How many nodes? Separate control plane and workers? What version?
2b/ Specifically, how is traffic entering the pod (if you are going in via a route, you'll want to look at your load balancer; if you want to exclude OpenShift from the equation then for the short term, expose a NodePort and have Jmeter hit that directly)
3/ Do I read correctly that your single pod was assigned 8G ram limit? Did you mean the worker node has 8G ram?
4/ How did you deploy the app -- raw pod, deployment config? Any cpu/memory limits set, or assumed? Assuming a deployment, how many pods does it spawn? What happens if you double it? Doubled TPS or not - that'll help point to whether the problem is inside httpd or inside the ingress route.
5/ What's the nature of the test request? Does it make use of any files stored on the network, or "local" files provisioned in a network PV?
And,
6/ What are you looking to achieve? Maximum concurrent requests in one container, or maximum requests in the cluster? If you haven't already, look to divide and conquer -- more pods on more nodes.
Most likely you have run into a bottleneck/limitation at the SUT. See the following post for a detailed answer:
JMeter load is not increasing when we increase the threads count
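One quick way to sanity-check whether the SUT is the bottleneck is Little's law (concurrency = throughput x average response time). The sketch below applies it to the numbers from the question. It assumes the JMeter test plan has no think time or timers, which the question does not state.

```python
def avg_response_time_s(threads: int, tps: float) -> float:
    """Little's law rearranged: response time = concurrency / throughput."""
    return threads / tps


def max_tps(threads: int, response_time_s: float) -> float:
    """Throughput ceiling for a fixed number of threads and a given response time."""
    return threads / response_time_s


# 200 JMeter threads plateauing at 90 TPS implies this average response time:
print(round(avg_response_time_s(200, 90), 2))  # 2.22
```

If the measured average response time is close to that figure while TPS stays flat, adding JMeter threads only lengthens the queue at the server, which is consistent with a limit on the SUT side rather than in the load generator.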

DC/OS running a service on each agent

Is there any way of running a service (single instance) on each deployed agent node? I need this because each agent needs to mount storage from S3 using s3fs.
The name of the feature you're looking for is "daemon tasks", but unfortunately, it's still in the planning phase for Mesos itself.
Because schedulers don't know the entire state of the cluster, Mesos needs to add a feature to enable this functionality. Once it is in Mesos, it can be integrated with DC/OS.
The primary workaround is to use Marathon to deploy an app with the UNIQUE constraint ("constraints": [["hostname", "UNIQUE"]]) and set the app instances to the number of agent nodes. Unfortunately this means you have to adjust the instances number when you add new nodes.
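The workaround above can be sketched as a small Python helper that builds the Marathon app definition, so the instance count can be regenerated whenever the number of agents changes. The app id, command, and cpus/mem values are hypothetical placeholders; only the UNIQUE hostname constraint and the instances field come from the answer.

```python
import json


def per_agent_app(app_id: str, cmd: str, agent_count: int) -> dict:
    """Marathon app definition that places at most one instance on each agent hostname."""
    if agent_count < 1:
        raise ValueError("agent_count must be at least 1")
    return {
        "id": app_id,
        "cmd": cmd,
        "cpus": 0.1,  # assumed sizing
        "mem": 64,    # assumed sizing, in MiB
        # Must be kept equal to the number of agent nodes by hand (or by a script like this).
        "instances": agent_count,
        "constraints": [["hostname", "UNIQUE"]],
    }


# Hypothetical id and command for a 5-agent cluster.
app = per_agent_app("/s3fs-mount", "s3fs example-bucket /mnt/s3 -f", 5)
print(json.dumps(app, indent=2))
```

The resulting JSON is the kind of document that would be submitted to Marathon when creating or updating the app. Setting instances higher than the number of agents would leave the extra instances waiting, because the UNIQUE constraint cannot be satisfied for them.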

Clone RabbitMQ admin users, etc. on replacement server

We have a couple of crusty AWS hosts running a RabbitMQ implementation in a cluster. We need to upgrade the hardware, and therefore we developed a Chef cookbook to spawn replacement servers.
One thing that we would rather not recreate by hand is the admin users, the queues, etc.
What is the best method to get that stuff from the old hosts to the new ones? I believe it's everything that lives in the /var/lib/rabbitmq/mnesia directory.
Is it wise to copy the files from one host to another?
Is there a programmatic means to do this?
Can it be coded into our Chef cookbook?
You can definitely export and import configuration via command line: https://www.rabbitmq.com/management-cli.html
I'm not sure about admin user, though.
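Besides the command-line tool, the management plugin exposes the same export and import over HTTP at /api/definitions. The sketch below builds those requests with the Python standard library. The host names and credentials in the commented usage are hypothetical, and it is worth checking on a test broker that the exported document contains everything you expect (it should include users, vhosts, permissions, queues, exchanges and bindings, as far as I know) before relying on it.

```python
import base64
import json
import urllib.request


def definitions_request(base_url, user, password, definitions=None):
    """Build a GET (export) request, or a POST (import) request when definitions is given."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {"Authorization": f"Basic {token}"}
    data = None
    if definitions is not None:
        data = json.dumps(definitions).encode()
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(f"{base_url}/api/definitions", data=data, headers=headers)


def copy_definitions(old_url, new_url, user, password):
    """Export definitions from the old broker and import them into the new one."""
    with urllib.request.urlopen(definitions_request(old_url, user, password)) as resp:
        definitions = json.load(resp)
    with urllib.request.urlopen(definitions_request(new_url, user, password, definitions)) as resp:
        return resp.status


# Hypothetical hosts and credentials; requires the management plugin on both brokers:
# copy_definitions("http://old-rabbit:15672", "http://new-rabbit:15672", "admin", "secret")
```

Because this is plain HTTP, the same two calls could be wrapped in a Chef resource or run as a one-off script after the cookbook has brought up the new servers.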
If you create new RabbitMQ nodes on your new hardware and join them to the existing cluster, the new nodes will get all the users. This is easy to try:
run a Docker container with a RabbitMQ image (with the management plugin) and create a user
run another container and add that node to the cluster of the first one
kill RabbitMQ on the first one, or delete the Docker container, and you will see that you still have the newly created user on the 2nd (but now master) node
I wrote docker since it's faster to create a cluster this way, but if you already have a cluster you could use it for testing if you prefer.
For the queues and exchanges, I don't want to quote almost everything found in the rabbitmq doc page for the high availability, but I will just say that you have to pay attention to the following:
exclusive queues because they are gone once the client connection is gone
queue mirroring (if you have any set up, if not it would be wise to consider it, if not even necessary)
I would do the migration gradually, waiting for the queues to empty and then killing off the nodes on the old hardware. It may be doable in a big-bang fashion, but that seems riskier. If you have a running system, then set up queue mirroring and try to find an appropriate moment to do a manual sync, but be careful: this has a huge impact on broker performance.
Additionally, there is the shovel plugin (I have to point out that I have not used it or even explored it), which may be another way to go since (quoting from the link):
In essence, a shovel is a simple pump. Each shovel:
connects to the source broker and the destination broker,
consumes messages from the queue,
re-publishes each message to the destination broker (using, by default, the original exchange name and routing_key).

Resolving Chef Dependencies

In my lab, I am currently managing a 20-node cluster with Cobbler and Chef. Cobbler is used for OS provisioning and basic network settings, and it is working as expected. I can manage several OS distributions with preseed-based NQA installation and local repo mirroring.
We also successfully installed a Chef server and started managing nodes, but Chef is not working as I expected. The issue is that I am unable to set node dependencies within Chef. One important use case of ours is this:
We are setting up ceph and openstack on these nodes
Ceph should be installed before openstack because openstack uses ceph as back-end storage
Ceph monitor should be installed before Ceph osd because creating osd requires talking to monitor
The dependency between OpenStack and Ceph does not matter, because it is a dependency within one node; just installing OpenStack later resolves the issue.
However, a problem arises with the dependency between the Ceph monitor and the Ceph OSD. Ceph OSD provisioning requires a running Ceph monitor. Therefore, the ceph osd recipe should always run after the ceph mon recipe has finished on another node. Our current method is to run "chef-client" on the "ceph-osd" node only after the "chef-client" run has completely finished on the "ceph-mon" node, but I think this is too much of a hassle. Is there a way to set these dependencies in Chef so that nodes are provisioned sequentially according to their dependencies? If not, are there good frameworks that handle this?
In Chef itself, I know of no method for orchestration (that's not Chef's job).
A workaround given your use case could be to use tags and search.
Your monitor recipe could tag the node at the end (with tag("CephMonitor"), or by setting any attribute you wish to search on).
After that, Chef's Solr index has to catch up (usually within a minute); then, in the CephOsd recipe, you can use search to do something like this:
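# search returns an array of nodes; an empty array means no monitor has been indexed yet
ceph_monitors = search(:node, "tags:CephMonitor")
return if ceph_monitors.empty?
ceph_monitor = ceph_monitors.first
[.. rest of the CephOsd recipe, using ceph_monitor['fqdn'] or other attribute from the node ..]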
The same approach can be used to avoid running the OpenStack recipe until the OSD recipe has run.
The drawback is that it will take 2 or 3 chef-client runs to get to a converged infrastructure.
I have no particular tool to recommend for the orchestration itself; ZooKeeper or Consul could be used instead of tags, and to trigger the runs.
Rundeck can stage the runs on different nodes and aggregate them into one job.
Which is best depends on your preference.