I am working on an application that, as far as I can see, performs multiple health checks:
DB readiness probe
Another API dependency readiness probe
When I look at the cluster logs, I can see that when my service fails a DB check it just returns 500 and goes down. What I am failing to understand is this: if the DB or the other API was down and I did not have a readiness probe, my container would go down anyway, and I would still see that my application returned some 500s because the DB or the other service was off.
What is the benefit of the readiness probe if my container was going down anyway? Another question I have: is a health check something I should consider only when deploying my service to a cluster? If it were not a clustered microservice environment, would that increase or decrease the benefit of performing health checks?
There are three types of probes that Kubernetes uses to check the health of a Pod:
Liveness: Tells Kubernetes that something went wrong inside the container and that it's better to restart it to see if that resolves the error.
Readiness: Tells Kubernetes whether the Pod is ready to receive traffic. Sometimes something happens that doesn't wholly incapacitate the Pod but makes it impossible to fulfill the client's request, for example losing the connection to a database or a failure in a third-party service. In this case, we don't want Kubernetes to restart the Pod, but we also don't want it to send the Pod traffic it can't fulfill. When a readiness probe fails, Kubernetes removes the Pod from the Service's endpoints and stops sending it traffic. Once the error is resolved, Kubernetes adds it back.
Startup: Tells Kubernetes when the application inside the container has finished starting. These probes are especially useful for applications that take a while to begin. Until the startup probe succeeds, Kubernetes doesn't run the liveness or readiness probes; if it did, they might interfere with the app's startup.
You can get more information about how probes work at this link:
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
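As a minimal sketch (the Pod name, image, port, endpoint paths, and timings below are illustrative, not taken from the question), a single container could combine all three probe types like this:

apiVersion: v1
kind: Pod
metadata:
  name: probe-demo              # illustrative name
spec:
  containers:
  - name: app
    image: example/app:1.0      # illustrative image
    ports:
    - containerPort: 8080
    startupProbe:               # allows up to 30 * 5 = 150s for the app to start
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 30
    readinessProbe:             # gates whether Services send traffic to this Pod
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 10
    livenessProbe:              # restarts the container after repeated failures
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3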
Readiness probes are used in a few places. A big one is that non-ready Pods are removed from all Services that reference them. They also matter for rolling updates on Deployments/StatefulSets, as the rollout won't continue until the new Pods reach a ready state. In general, the checks used for readiness probes should only cover the current service, so they shouldn't be reaching out to a database. Sometimes that's hard to implement and does indeed make them less useful. But check per-pod things, like whether the web server is listening on its port and can return HTTP responses.
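For instance (a sketch; the port is a placeholder), a readiness probe that only verifies the Pod's own web server is listening could be as simple as a TCP check:

    readinessProbe:
      tcpSocket:          # only checks that the server accepts TCP connections on this port
        port: 8080
      periodSeconds: 10
      failureThreshold: 3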
I have a situation where my acceptance test makes a connection to a RabbitMQ instance during the pipeline. But the RabbitMQ instance is private, which makes it impossible to establish this connection from the pipeline.
I was wondering if creating an API endpoint that runs this test and adding it to the startup probe would be a good approach to make sure this test passes.
If RabbitMQ is a container in your pod, yes; if it isn't, then you shouldn't.
There's no final answer to this, but the startup probe is just there to ensure that your pod is not falsely considered unhealthy by the other probes just because it takes a little longer to start. It's aimed at legacy applications that need to build assets or compile stuff at startup.
If there were a place to put a connectivity test to RabbitMQ, it would be the liveness probe, but you should only do that if your application is entirely dependent on a connection to RabbitMQ; otherwise, parts of it that don't need the queue, such as authentication, would go down just because you couldn't connect to the messaging queue. And what if you have a second app that uses your endpoint as a liveness probe? And a third app that connects to the second one to check if that app is alive? You could kill an entire ecosystem just because RabbitMQ rebooted or crashed briefly.
Not recommended.
You could have that as part of your liveness probe IF your app is a worker; then, not having a connection to RabbitMQ would make the worker unusable.
Your acceptance tests should run in your CD pipeline, or in a post-deploy script step if you don't have a CD.
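As a rough sketch of that worker case (the hostname, the use of nc, and the timings are assumptions, not from the answer above; 5672 is RabbitMQ's standard AMQP port), an exec liveness probe could simply check that the broker is reachable:

    livenessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        # placeholder hostname; assumes nc (netcat) is available in the image
        - nc -z rabbitmq 5672
      periodSeconds: 30
      failureThreshold: 3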
I am trying to deploy a pod to the cluster. The application I am deploying is not a web server. I have an issue with setting up the liveness and readiness probes. Usually, I would use endpoints like /isActive and /buildInfo for that.
I've read this https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-command.
I'm wondering if I need to code a mechanism that creates a file and then somehow probe it from the deployment.yaml file?
Edit: this is what I used to keep the container running, not sure if that is the best way to do it?
- touch /tmp/healthy; while true; do sleep 30; done;
It does not make sense to create files in your application just for the liveness probe. In the K8s documentation that is just an example to show you how the exec command probe works.
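For reference, the exec probe from that documentation page looks roughly like this (it simply runs a command inside the container and treats a non-zero exit code as a failure):

    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5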
The idea behind the liveness probe is twofold:
Avoid traffic on your Pods before they have been fully started.
Detect unresponsive applications due to lack of resources or deadlocks while the application's main process is still running.
Given that your deployments don't seem to expect external traffic, you don't require a liveness probe for the first case. Regarding the second case, the question is how your application could lock up and how you would notice that externally, e.g. by monitoring a log file or similar.
Bear in mind that K8s will still monitor whether your application's main process is running, so restarts on application failure will still occur even without a liveness probe. Therefore, if you can be fairly sure that your application is not prone to becoming unresponsive while still running, you can also do without a liveness probe.
Given a Python application which polls a Kafka topic in an infinite loop and uploads the result to an S3 bucket after processing each received Kafka message:
What should be considered when defining readiness and liveness probes for Kubernetes?
Does it make sense to include in the readiness probe:
That the S3 buckets exist.
That the Kafka topic exists.
That the loop which polls the Kafka topic has been initialized.
And should the liveness probe only check that the poll loop has not exited?
Is it strictly bad practice to check such things in a readiness probe?
I would not check any of these things in Kubernetes probes. Have your application check for them on its own at startup, and if the environment isn't suitable, exit immediately. Your pod will show up in CrashLoopBackOff state and it will restart a couple of times, but it will be very clear that something is wrong.
There is some possibility that these things will fail while the application is running, but you should be able to notice this. A metrics system like Prometheus can help you notice if most of your S3 requests are failing, for example. If you can detect that your Kafka listener's main loop has exited, you can also just restart it.
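If you do want Kubernetes to restart the pod when the poll loop dies, one possible pattern (a sketch; the file path, age threshold, and timings are assumptions) is to have the loop touch a heartbeat file on every iteration and check the file's age in an exec liveness probe:

    livenessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        # placeholder path; the poll loop is assumed to touch this file on each iteration
        - test -n "$(find /tmp/poll-heartbeat -mmin -2)"
      periodSeconds: 60
      failureThreshold: 2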
Our aim is to horizontally scale a .NET Core 2.0 Web API using Kubernetes. The Web API application will be served by Kestrel.
It looks like we can gracefully handle the termination of pods by configuring Kestrel's shutdown timeout, so now we are looking into how to probe the application to determine readiness and liveness.
Would it be enough to simply probe the Web API with an HTTP request? If so, would it be a good idea to create a new health-check controller to handle these probing requests, or would it make more sense to probe an actual endpoint that would be consumed in normal use?
What should we consider when differentiating between the liveness and readiness probes?
I would recommend performing health checks through separate endpoints.
In general, there are a number of good reasons for doing so, like:
Checking that the application is live/ready or, more generally, in a healthy state is not necessarily the same as sending a user request to your web service. When performing health checks you should define what makes your web service healthy: this could be, for example, checking access to external resources, like a database.
It is easier to control who can actually perform health checks through your endpoints.
More generally, you do not want to interfere with the actual service functionality: otherwise you would need to rethink the way you do health checks whenever you change your service's functionality. E.g. if your service interacts with a database, in a health-check context you want to verify that the connection to the database is fine, but you do not actually care much about the data being manipulated internally by your service.
Things get even more complicated if your web service is not stateless: in that case, you will need to make sure data remains consistent independently of your health checks.
As you pointed out, a good way to avoid any of the above could be setting up a separate Controller to handle health checks.
As an alternative, there is a standard library available in ASP.NET Core for enabling health checks on your web service: at the time of writing this answer, it is not officially part of ASP.NET Core and no NuGet packages are available yet, but there is a plan for this to happen in future releases. For now, you can easily pull the code from the official repository and include it in your solution as explained in the Microsoft documentation.
This is currently planned to be included in ASP.NET Core 2.2 as described in the ASP.NET Core 2.2 Roadmap.
I personally find it very elegant, as you configure everything through Startup.cs and Program.cs and won't need to explicitly create a new endpoint, since the library already handles that for you.
I have been using it in a few projects and I would definitely recommend it.
The repository includes an example specific for ASP.NET Core projects you can use to get quickly up to speed.
Liveness vs Readiness
In Kubernetes, you may then set up liveness and readiness probes over HTTP: as explained in the Kubernetes documentation, while the setup for both is almost identical, Kubernetes takes different actions depending on the probe:
Liveness probe from Kubernetes documentation:
Many applications running for long periods of time eventually transition to broken states, and cannot recover except by being restarted. Kubernetes provides liveness probes to detect and remedy such situations.
Readiness probe from Kubernetes documentation:
Sometimes, applications are temporarily unable to serve traffic. For example, an application might need to load large data or configuration files during startup. In such cases, you don’t want to kill the application, but you don’t want to send it requests either. Kubernetes provides readiness probes to detect and mitigate these situations. A pod with containers reporting that they are not ready does not receive traffic through Kubernetes Services.
So, while an unhealthy response to a liveness probe will cause the container (and so the application) to be killed and restarted, an unhealthy response to a readiness probe will simply cause the Pod to receive no traffic until it gets back to a healthy status.
What to consider when differentiating liveness and readiness probes?
For liveness probe:
I would recommend defining what makes your application healthy, i.e. the minimum requirements for user consumption, and implementing health checks based on that.
This typically involves external resources or applications running as separate processes, e.g. databases, web services, etc.
You may define health checks by using the ASP.NET Core Health Checks library or manually with a separate Controller.
For readiness probe:
You simply want to load your service to verify it actually responds in time, allowing Kubernetes to balance traffic accordingly. Trivially (and in most cases, as suggested by Lukas in another answer), you may use the exact same endpoint you would use for liveness but with different timeouts; this really depends on your needs and requirements.
What should we consider when differentiating between the liveness and readiness probes
My recommendation would be to provide a /health endpoint in your application, separate from your application endpoints. This is useful if you want to block your consumers from calling your internal health endpoint. Then you can configure Kubernetes to query your HTTP /health endpoint like in the example below.
apiVersion: v1
kind: Pod
metadata:
  name: goproxy
spec:
  containers:
  - name: goproxy
    image: k8s.gcr.io/goproxy:0.1
    ports:
    - name: http
      containerPort: 8080
    readinessProbe:
      httpGet:
        port: http
        path: /health
      initialDelaySeconds: 60
    livenessProbe:
      httpGet:
        port: http
        path: /health
Inside your /health endpoint you should check the internal state of your application and return a status code of either 200 if everything is OK or 503 if your application is having issues. Keep in mind that health checks are usually performed every 15 seconds for every instance, and if you perform expensive operations to determine your application's state you might slow down your application.
What should we consider when differentiating between the liveness and readiness probes
Usually the only difference between liveness and readiness probes is the timeouts configured on each probe. Maybe your application needs 60 seconds to start; then you would set the initial delay of your readiness probe to 60 while keeping the default liveness timings.
I have a JMS service targeted at a migratable target (using an Auto-Migrate Exactly-Once policy) in a cluster which consists of 2 managed servers. At any point in time the service is hosted on one of them, and the consumer (which is targeted at the cluster) is supposed to receive messages seamlessly no matter where the service is hosted.
When I manually switch the host of the migratable target (by clicking migrate) without turning the hosting managed server off, the consumer fails to receive messages sent to the queues, unless I turn off the previously hosting managed server, forcing the consumer to the new host.
I can rule out sender problems; I can see the messages in the queue right after they are sent.
I'll be grateful if anyone can advise on how to configure either the consumer or the migratable service to work seamlessly when migration happens.
I think that may just be a misunderstanding of how migration works. The docs state, for Auto-Migrate Exactly-Once:
indicates that if at least one Managed Server in the candidate list is running, then the JMS service will be active somewhere in the cluster if servers should fail or are shut down (either gracefully or forcibly). For example, a migratable target hosting a path service should use this option so if its hosting server fails or is shut down, the path service will automatically migrate to another server and so will always be active in the cluster. Note that this value can lead to target grouping. For example, if you have five exactly-once migratable targets and only one server member is started, then all five migratable targets will be activated on that server member.
The docs also state:
Manual Service Migration—the manual migration of pinned JTA and JMS-related services (for example, JMS server, SAF agent, path service, and custom store) after the host server instance fails
Your server/service has neither failed nor shut down; you are forcing it to migrate while a healthy host is still running, so it has not met the criteria for migration.
See more here as well.
I have some experience with something reminiscent of what you're looking at. There was some WLS-specific capability around recognizing reconfiguration of JMS destinations as part of their clustered server design.
In one case I had to call a WLS-specific method: weblogic.jms.extensions.WLSession.setExceptionListener(). This was on their implementation of the JMS Session interface. This is analogous to the standard JMS Connection.setExceptionListener().
With this WLS-specific capability, the WLSession.setExceptionListener() callback would occur at a point where the consuming client should tear down and re-establish the connection / session / consumer in reaction to a reconfiguration (migration) that had happened.