Given a Python application that polls a Kafka topic in an infinite loop and uploads the result to an S3 bucket after processing each received Kafka message.
What should be considered when defining readiness and liveness probes for Kubernetes?
Does it make sense to include the following in the readiness probe:
That the S3 bucket exists.
That the Kafka topic exists.
That the loop which polls the Kafka topic has been initialized.
And to have the liveness probe only check that the poll loop has not exited?
Is it strictly a bad practice to check such things in a readiness probe?
I would not check any of these things in Kubernetes probes. Have your application check for them itself at startup, and if the environment isn't suitable, exit immediately. Your pod will show up in CrashLoopBackOff state, and it will restart a couple of times, but it will be very clear that something is wrong.
There is some possibility that these things will fail while the application is running, but you should be able to notice this. A metrics system like Prometheus can help you notice if most of your S3 requests are failing, for example. If you can detect that your Kafka listener's main loop has exited, you can also just restart it.
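As an illustration, a minimal sketch of that fail-fast startup check might look like the following, assuming boto3 and kafka-python; the bucket, topic, and broker values are placeholders:

```python
import sys

import boto3
from kafka import KafkaConsumer  # kafka-python

BUCKET = "my-output-bucket"      # placeholder values; read these from your config
TOPIC = "my-input-topic"
BROKERS = ["kafka:9092"]

def check_environment():
    # Fail fast if the S3 bucket is missing or unreachable.
    try:
        boto3.client("s3").head_bucket(Bucket=BUCKET)
    except Exception as exc:
        sys.exit(f"S3 bucket {BUCKET!r} not reachable: {exc}")

    # Fail fast if the Kafka topic does not exist.
    consumer = KafkaConsumer(bootstrap_servers=BROKERS)
    try:
        if TOPIC not in consumer.topics():
            sys.exit(f"Kafka topic {TOPIC!r} does not exist")
    finally:
        consumer.close()

if __name__ == "__main__":
    check_environment()  # a non-zero exit here leads to CrashLoopBackOff
    # ... start the Kafka poll loop here ...
```

Exiting non-zero here is what produces the CrashLoopBackOff described above, which is usually easier to diagnose than a pod that is Running but never becomes ready.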
I have a situation where my acceptance test makes a connection to a RabbitMQ instance during the pipeline. But the RabbitMQ instance is private, making it impossible to make this connection from the pipeline.
I was wondering if making an API endpoint that runs this test and adding it to the startup probe would be a good approach to make sure this test passes.
If RabbitMQ is a container in your pod, yes; if it isn't, then you shouldn't.
There's no final answer to this, but the startup probe is just there to ensure that your pod is not falsely considered unhealthy by other probes just because it takes a little longer to start. It's aimed at legacy applications that need to build assets or compile stuff at startup.
If there were a place to put a connectivity test to RabbitMQ, it would be the liveness probe, but you should only do that if your application is entirely dependent on a connection to RabbitMQ; otherwise, things like your authentication would start failing just because the pod couldn't connect to the messaging queue. And what if you have a second app that connects to your endpoint as its liveness probe? And a third app that connects to the second one to check whether that app is alive? You could kill an entire ecosystem just because RabbitMQ rebooted or crashed for a moment.
Not recommended.
You could have that as part of your liveness probe IF your app is a worker; in that case, not having a connection to RabbitMQ would make the worker unusable.
Your acceptance tests should be placed in your CD pipeline, or in a post-deploy script step if you don't have a CD.
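If your app really is such a worker, a rough sketch of a worker-style liveness endpoint could look like this, assuming pika as the RabbitMQ client; the host and port are placeholders, and the HTTP server runs in a background thread so it doesn't block the worker:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import pika  # assumed RabbitMQ client library

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))

class LivenessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The worker is only useful while its RabbitMQ connection is open,
        # so liveness is tied to that connection here (and nothing else).
        self.send_response(200 if connection.is_open else 503)
        self.end_headers()

def start_liveness_server(port=8081):
    server = HTTPServer(("0.0.0.0", port), LivenessHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
```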
I am working on an application that, as far as I can see, is doing multiple health checks:
DB readiness probe
Another API dependency readiness probe
When I look at cluster logs, I see that my service, when it fails a DB check, just throws 500 and goes down. What I am failing to understand here is that if the DB or another API was down and I did NOT have a readiness probe, then my container would go down anyway. Also, I would still see that my application threw some 500s because the DB or another service was off.
What is the benefit of the readiness probe if my container was going down anyway? Another question I have: is a health check something that I should consider only if I am deploying my service to a cluster? If it were not a clustered microservice environment, would that increase or decrease the benefits of performing health checks?
There are three types of probes that Kubernetes uses to check the health of a Pod:
Liveness: Tells Kubernetes that something went wrong inside the container and that it's better to restart it to see if that resolves the error.
Readiness: Tells Kubernetes that the Pod is ready to receive traffic. Sometimes something happens that doesn't wholly incapacitate the Pod but makes it impossible to fulfill the client's request, for example losing the connection to a database or a failure in a third-party service. In this case, we don't want Kubernetes to restart the Pod, but we also don't want it to send the Pod traffic it can't fulfill. When a readiness probe fails, Kubernetes removes the Pod from the Service and stops sending it traffic. Once the error is resolved, Kubernetes adds it back.
Startup: Tells Kubernetes when a Pod has finished starting up. These probes are especially useful for applications that take a while to begin. While the Pod is starting, Kubernetes doesn't send liveness or readiness probes; if it did, they might interfere with the app's startup.
You can get more information about how probes work on this link:
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
Readiness probes are used in a few places. A big one is that non-ready pods are removed from all Services that reference them. They also matter for rolling updates on Deployments/StatefulSets, as the rollout won't continue until the new pods reach a ready state. In general, the checks used for readiness probes should only check the current service, so they shouldn't be reaching out to a database. Sometimes that's hard to implement and does indeed make them less useful. But check per-pod things, like whether the web server is listening on its port and can return HTTP responses.
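As a rough sketch of what "only checking the current service" can look like, here is a stdlib-only readiness endpoint that confirms the process is up and serving HTTP without touching the database or any downstream API; the path and port are placeholders:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class ReadinessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Report only on this pod: the process is up and can answer HTTP.
        # Deliberately no DB or downstream-API calls here.
        if self.path == "/readyz":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

def start_readiness_server(port=8080):
    server = HTTPServer(("0.0.0.0", port), ReadinessHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
```

The Deployment's readiness probe would then be an httpGet check against /readyz on that port.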
I am trying to deploy a pod to the cluster. The application I am deploying is not a web server. I have an issue with setting up the liveness and readiness probes. Usually, I would use something like /isActive and /buildInfo endpoint for that.
I've read this https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-command.
Wondering if I need to code a mechanism which will create a file and then somehow probe it from the deployment.yaml file?
Edit: this is what I used to keep the container running; not sure if that is the best way to do it:
- touch /tmp/healthy; while true; do sleep 30; done;
It does not make sense to create files in your application just for the liveness probe. In the K8s documentation this is just an example to show you how the exec command probe works.
The idea behind the liveness probe is twofold:
Avoid sending traffic to your Pods before they have fully started.
Detect unresponsive applications, due to lack of resources or deadlocks, where the application's main process is still running.
Given that your deployment doesn't seem to expect external traffic, you don't require a liveness probe for the first case. Regarding the second case, the question is how your application could lock up and how you would notice that externally, e.g. by monitoring a log file or similar.
Bear in mind that K8s will still monitor whether your application's main process is running, so restarts on application failure will still occur even without a liveness probe. If you can be fairly sure that your application is not prone to becoming unresponsive while still running, you can also do without a liveness probe.
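If you do conclude that the worker can silently hang, one common pattern (a sketch, not the only option) is to have the main loop touch a heartbeat file on every iteration and let an exec liveness probe check the file's age; the path and threshold below are placeholders:

```python
# heartbeat.py - illustrative heartbeat helper for a non-HTTP worker
import os
import sys
import time

HEARTBEAT_FILE = "/tmp/heartbeat"  # placeholder path
MAX_AGE_SECONDS = 120              # placeholder threshold

def touch_heartbeat():
    """Call once per iteration of the worker's main loop."""
    with open(HEARTBEAT_FILE, "w") as f:
        f.write(str(time.time()))

def check_heartbeat():
    """Run by the exec liveness probe: exit non-zero if the loop looks stuck."""
    try:
        age = time.time() - os.path.getmtime(HEARTBEAT_FILE)
    except OSError:
        sys.exit(1)  # file missing: the loop never ran
    sys.exit(0 if age < MAX_AGE_SECONDS else 1)

if __name__ == "__main__":
    check_heartbeat()
```

The exec liveness probe in deployment.yaml would then run this script (e.g. python /app/heartbeat.py) rather than checking a file that is only touched once at startup.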
Once or twice a day, some files are uploaded to an S3 bucket. I want the in-memory data of each server to be refreshed with the uploaded data on every S3 upload.
Note that there are multiple servers running, and I want to store the same data on all of them. Also, the servers scale based on traffic (new server instances come up and older ones go down, so the set of instances will not always be the same).
In short, I want to keep up-to-date data in each server's cache.
I want to build an architecture where auto-scaling of the servers is supported. I came across the AWS fan-out architecture using SNS and multiple SQS queues from which the different servers can poll.
How can we handle the auto-scaling of the queues with respect to the servers?
Or is there any other way to handle this scenario?
PS: I'm totally new to the AWS environment.
Any reference would be a great help.
To me there are a few things that you need to have to make this work. These are opinions and, as with most architectural designs, there is certainly more than one way to handle this.
I start with the assumption that you've got an application running on AWS compute of some sort (Elastic Beanstalk, Fargate, raw EC2s with auto scaling, etc.) and that you've solved for having the application installed and configured when a scale-up event occurs.
Conceptually the flow is: S3 event notification → SNS topic → one SQS queue per instance → application cache refresh.
The setup involves having the S3 bucket publish s3:ObjectCreated events to the SNS topic. These events will be published when an object in the bucket is created or updated.
Next:
During startup your application will pull the current data from S3.
As part of application startup, create a queue named after the instance ID of the EC2 (see here for some examples). The queue would need to subscribe to the SNS topic. If the queue already exists, that's not an error.
Your application would have a background thread or process that polls the SQS queue for messages.
If you get a message on the queue, that tells the application to refresh its cache from S3.
When an instance is shut down, there is an event from at least Elastic Beanstalk and the load balancers indicating that the instance is going away. Remove the SQS queue tied to the instance at that time.
The only issue might be that a hard crash of an environment would leave orphan queues. It may be advisable to either clean these up manually or have a periodic task do it. A sketch of the per-instance queue setup and polling is shown below.
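Here is a minimal sketch of that setup, assuming boto3; the topic ARN and the cache-refresh function are placeholders, and the SQS queue policy that allows the SNS topic to deliver messages is omitted for brevity:

```python
import json
import urllib.request

import boto3

TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:s3-upload-events"  # placeholder

def instance_id():
    # EC2 instance metadata (IMDSv1 shown for brevity)
    return urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).read().decode()

def setup_queue(sqs, sns):
    # Creating a queue that already exists with the same attributes is not an
    # error, so this is safe to run on every startup.
    queue_url = sqs.create_queue(QueueName=f"cache-refresh-{instance_id()}")["QueueUrl"]
    queue_arn = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]
    # A queue policy allowing the SNS topic to SendMessage is also required (omitted).
    sns.subscribe(TopicArn=TOPIC_ARN, Protocol="sqs", Endpoint=queue_arn)
    return queue_url

def poll_forever(sqs, queue_url, refresh_cache_from_s3):
    # Background thread/process: long-poll the per-instance queue and refresh the cache.
    while True:
        resp = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            event = json.loads(msg["Body"])   # SNS envelope wrapping the S3 event
            refresh_cache_from_s3(event)      # placeholder: reload the in-memory data
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```

On instance shutdown, the matching cleanup would be sqs.delete_queue(QueueUrl=queue_url) and, ideally, unsubscribing the queue from the topic, which is the step described above.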
When using the Bitnami Helm chart for Redis-Cluster, there is a redis-cluster-cluster-create job. However, when enabling istio-injection, this job never ends. If I disable istio-injection, the job quickly ends. Any solutions, or a reason why this phenomenon is happening?
Answering the main question
there is a redis-cluster-cluster-create job. However, when enabling istio-injection, this job never ends. If I disable istio-injection, the job quickly ends. Any solutions or reason why this phenomenon is happening?
The main issue here is that a Job is not considered complete until all of its containers have stopped running, and the Istio sidecar runs indefinitely; so while your task may have completed, the Job as a whole will not appear as completed in Kubernetes.
There is a GitHub issue about that.
Here is one of the workarounds, and you can find more workarounds here.
I can change the podAnnotations in the Redis-Cluster Helm chart, and when disabling istio-injection the Job doesn't spin up istio-proxy. However, the main 'cluster-create' job still never ends and eventually fails the deploy.
As mentioned here
So as a temporary workaround adding sidecar.istio.io/inject: "false" is possible but this disables Istio for any traffic to/from the annotated Pod. As mentioned, we leveraged Kubernetes Jobs for Integration Testing, which meant some tests may need to access the service mesh. Disabling Istio essentially means breaking routing — a show stopper.
So it might not actually work here. I suggest trying the quitquitquit approach, as it's the most commonly recommended workaround.
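For illustration, a sketch of the quitquitquit workaround might look like the following, assuming the pilot-agent admin endpoint on its default port 15020 and a placeholder run_job() standing in for the Job's real work (the actual chart job is a shell script, so the same idea is often expressed as a curl -X POST at the end of the Job's command):

```python
import urllib.request

def run_job():
    # Placeholder for the Job's actual work (e.g. the cluster-create commands).
    pass

def shut_down_istio_sidecar():
    # Ask the istio-proxy sidecar to exit so the Job's pod can complete.
    # Assumes the pilot-agent admin endpoint on the default port 15020.
    req = urllib.request.Request("http://localhost:15020/quitquitquit", method="POST")
    urllib.request.urlopen(req, timeout=5)

if __name__ == "__main__":
    try:
        run_job()
    finally:
        shut_down_istio_sidecar()
```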
Additionally, these GitHub issues are worth checking:
List of applications incompatible with Istio
helm stable/redis does not work with istio sidecar
Using Istio with CronJobs
Sidecar Containers