Increase Max pods per node with advanced networking on AKS - azure-container-service

we have created our Kubernetes cluster with Advanced Networking via the Azure Management UI.
Some time later we've run into the limitation of pods per node described here:
https://learn.microsoft.com/fi-fi/azure/aks/container-service-quotas
We need to change the limitation of 30 pods per node as it is very incomprehensible one for us. Before the advanced networking was possible at all, there was no such limitation and it was also undocumented at the moment, we've created the cluster. Could someone help, how to do change the max pods amount without recreation the whole cluster?
Regards, Gena

You cannot change the max number of pods per node on an existing cluster
https://learn.microsoft.com/fi-fi/azure/aks/networking-overview#configure-maximum---existing-clusters
You will need to redeploy a new cluster and specify the new max number during provisioning.

Late to the party, but with different solution.
As of June 2021 there is no need to redeploy the cluster - we can add a new node pool with required pods to node ratio.
The dialog box allows us to set new values and then we can use node labels to redirect pods or just shut down previous pool.

You can create a new nodepool and set max-pods:
az aks nodepool add --name mypool --node-vm-size Standard_E4s_v3 --node-count 1 -g AKS-resource-name --cluster-name cluster-name --max-pods 100
https://learn.microsoft.com/en-us/cli/azure/aks/nodepool?view=azure-cli-latest

No one mentionned it so I will
Be careful if you are using Azure CNI and want to set a high max-pods per node
Azure CNI will reserve the number of IP per node in your subnet, so this value should be calculated to avoid facing IP exhaustion issues
See https://learn.microsoft.com/en-us/azure/aks/configure-azure-cni

Related

Ignite error upgrading the setup in Kubernetes

While I upgraded the Ignite that is deployed in Kubernetes (EKS) for Log4j vulnerability, I get the error below
[ignite-1] Caused by: class org.apache.ignite.spi.IgniteSpiException: BaselineTopology of joining node (54b55de4-7742-4e82-9212-7158bf51b4a9) is not compatible with BaselineTopology in the cluster. Joining node BlT id (4) is greater than cluster BlT id (3). New BaselineTopology was set on joining node with set-baseline command. Consider cleaning persistent storage of the node and adding it to the cluster again.
The setup is a 3 node cluster, with native persistence enabled (PVC). This seems to be occurring many times in our journey with Apache Ignite, having followed the official guide.
I cannot clean the storage as the pod gets restarted every now and then, by the time I get the pod shell the pod crash & restarts.
This might happen to be due to the wrong startup order, starting nodes manually in reverse order may resolve this, but I'm not sure if that is possible in K8s. Another possible issue might be related to the baseline auto-adjustment that might change your baseline unexpectedly, I suggest you turn it off if it's enabled.
One of the workarounds to clean a DB of a failing POD might be (quite tricky) - to replace Ignite image with some simple image like a plain Debian or Alpine docker images (just to be able to access CLI) keeping the same PVC attached, and once you fix the persistence issue, set the Ignite image back. The other one is - to access underlying PV directly if possible and do surgery in place.

HTTPD response time is increasing after 90 TPS

I am doing load test to tune my apache to server maximum concurrent https request. Below is the details of my test.
System
I dockerized my httpd and deployed in openshift with pod configuration is 4CPU, 8GB RAM.
Running load from Jmeter with 200 thread, 600sec ramup time, loop is for infinite. duration is long run (Jmeter is running in same network with VM configuration 16CPU, 32GB RAM ).
I compiled by setting module with worker and deployed in openshift.
Issue
Httpd is not scaling more than 90TPS, even after tried multiple mpm worker configuration (no difference with default and higher configuration)
2.Issue which i'am facing after 90TPS, average time is increasing and TPS is dropping.
Please let me know what could be the issue, if any information is required further suggestions.
I don't have the answer, but I do have questions.
1/ What does your Dockerfile look like?
2/ What does your OpenShift cluster look like? How many nodes? Separate control plane and workers? What version?
2b/ Specifically, how is traffic entering the pod (if you are going in via a route, you'll want to look at your load balancer; if you want to exclude OpenShift from the equation then for the short term, expose a NodePort and have Jmeter hit that directly)
3/ Do I read correctly that your single pod was assigned 8G ram limit? Did you mean the worker node has 8G ram?
4/ How did you deploy the app -- raw pod, deployment config? Any cpu/memory limits set, or assumed? Assuming a deployment, how many pods does it spawn? What happens if you double it? Doubled TPS or not - that'll help point to whether the problem is inside httpd or inside the ingress route.
5/ What's the nature of the test request? Does it make use of any files stored on the network, or "local" files provisioned in a network PV.
And,
6/ What are you looking to achieve? Maximum concurrent requests in one container, or maximum requests in the cluster? If you've not already look to divide and conquer -- more pods on more nodes.
Most likely you have run into a bottleneck/limitation at the SUT. See the following post for a detailed answer:
JMeter load is not increasing when we increase the threads count

Migrate a (storm+nimbus) cluster to a new Zookeeper, without loosing the information or having downtime

I have a nimbus+storm cluster using Zookeeper, and I wish to move my cluster and point it to a new Zookeeper. Do you know if this is possible? Can I keep all the information of the old zookeeper and save it in the new one? Is it possible to do it without downtime?
I have looked in the internet for this procedure but I have not found much.
Would it be as simples as change the storm.yml file in both the master . and worker nodes? Do I need a restart afterwards?
# storm.zookeeper.servers:
# - "server1"
# - "server2"
If you just change storm.yml, you'd be pointing Storm at a new empty Zookeeper cluster, and it will be like you just installed Storm from scratch. More likely, you want to grow your Zookeeper cluster to include your new machines, then update storm.yml to point at the new machines, then shrink the cluster to exclude the machines you want to move away from. That way, your Zookeeper quorum is preserved even though you've moved to other physical machines.
This is easier to do on Zookeeper 3.5 with dynamic reconfiguration http://zookeeper.apache.org/doc/r3.5.5/zookeeperReconfig.html. I'm unsure whether Storm will run on Zookeeper 3.5, but you may consider investigating whether you can upgrade to 3.5 before growing/shrinking the cluster.
Otherwise you will have to do a rolling restart to add the new Zookeeper nodes, then do another one to remove the old machines once the cluster has stabilized.
Let me suggest a hack here. This was a script provided by microsoft for migration on HD Insight cluster , but you can change it and use it for your need.
The script can be downloaded from : https://github.com/hdinsight/hdinsight-storm-examples/tree/master/tools/zkdatatool-1.0 and you can read more about it here :
https://blogs.msdn.microsoft.com/azuredatalake/2017/02/24/restarting-storm-eventhub/
I have used it in the past when i had to migrate some stuff between PaaS clusters and i can confirm it works ok!

Altering host sysctl params from privileged container

we are using GKE for NET Core containers with ASP. Each ASP container uses at least one inotify instance (to watch Razer templates) and can use another to watch config files (if not explicitly disabled).
Linux default limit for number of inotify instances per host is 128 (fs.inotify.max_user_instances=128). Some instances are consumed by kubernetes itself (e.g. fluend daemons). So when lots of pods are deployed on single host, host runs out of free inotify instances and containers are stuck in crash loop.
Since we use GKE, we cannot manage worker nodes and alter sysctl settings directly.
My questions are:
Can I somehow alter sysctl setting for host VM through privileged container?
Is there a way to setup kubernetes scheduler to take number of free inotify instances (or at least a number of pods deployed) into account when selecting a node to deploy new pods?
As noted here, "Sysctls with no namespace are called node-level sysctls. If you need to set them, you must manually configure them on each node’s operating system, or by using a DaemonSet with privileged containers".
Regarding scheduling pods, there doesn't seem to be a way for the scheduler to take inotify or number of pods into account when scheduling. The scheduler is only aware of available resources (CPU and memory) and pod specs such as pod or node affinity.
To attain the kind of spread you are looking for will take a good deal of planning and use of both resource requests and pod affinity/anti-affinity. You can review this.

SQL Database Blue / Green deployments

I am dealing with the infrastructure for a new project. It is a standard Laravel stack = PHP, SQL server, and Nginx. For the PHP + Nginx part, we are using Kubernetes cluster - so scaling and blue/green deployments are taken care of.
When it comes to the database I am a bit unsure. We don't want to use Kubernetes for SQL, so the current idea is to go for Google Cloud SQL managed service (Are the competitors better for blue/green deployment of SQL?). The question is can it sync the data between old and new versions of the database nodes?
Let's say that we have 3 active Pods and at least 2 active database nodes (and a load balancer).
So the standard deployment should look like this:
Pod with the new code is created.
New database node is created with current data.
The new Pod gets new environment variables to connect to the new database.
Database migrations are run on the new database node.
Health check for the new Pod is run, if it passes Pod starts to receive traffic.
One of the old Pods is taken offline.
It should keep doing this iteration until all of the Pods and Database nodes are replaced.
The question is can this work with the database? Let's imagine there is a user on the website that is using the last OLD database node to write some data and when switched to the NEW database node the data are simply not there until the last database node is upgraded. Can they be synced behind the scenes? Does Google Cloud SQL managed service provide that?
Or is there a completely different and better solution to this problem?
Thank you!
I'm not 100% sure if this is what you are looking for, but for my understanding, Cloud SQL replicas would be a better solution. You can have read replicas [1], that are a copy of the master instance and have different options [2]
A read replica is a copy of the master that reflects changes to the master instance in almost real time. You create a replica to offload read requests or analytics traffic from the master. You can create multiple read replicas for a single master instance.
or a failover replica [3], that in case the master goes down, the data continue to be available there.
If an instance configured for high availability experiences an outage or becomes unresponsive, Cloud SQL automatically fails over to the failover replica, and your data continues to be available to clients. This is called a failover.
You can combine those if you need.