Kubernetes: Kafka pod shutdown after server stops - ssl

I am running a Kafka pod in Kubernetes with Rancher. I am using the Confluent image, and it connects properly to ZooKeeper. I am using the stable Helm chart with the Confluent Kafka image 5.3.1, and I also added SSL encryption to the Helm chart following this page.
It starts up properly, then shuts down abruptly and the pod restarts. I am getting this error in the log:
[2019-11-15 19:41:49,943] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)
[2019-11-15 19:41:49,945] INFO Shutting down SupportedServerStartable (io.confluent.support.metrics.SupportedServerStartable)
What is the SIGTERM error in Kafka pods, and how do I fix it?
Thank you

As mentioned by @Hamzatli, this is about liveness and readiness probes. Kubernetes thinks your pod is failing its probe (hitting a timeout) and sends SIGTERM to the pod so that it shuts down.
In your Helm chart's values.yaml there should be options for the liveness and/or readiness probes. Increase initialDelaySeconds to a value you consider long enough for Kafka to come up, so that Kubernetes doesn't send the shutdown signal too early during initial startup:
livenessProbe:
  enabled: true
  initialDelaySeconds: 60   # 60 seconds delay before the liveness probe starts
  timeoutSeconds: 5
readinessProbe:
  enabled: true
  initialDelaySeconds: 60   # 60 seconds delay before the readiness probe starts
  timeoutSeconds: 5
You can read more about this here.

This is a liveness and readiness problem. While Kafka is reading its topic snapshots at startup, the liveness probe pings Kafka and gets no response, so Kubernetes shuts Kafka down.
Remove the liveness and readiness probes (a sketch follows below).
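A minimal sketch of what that looks like in the chart's values.yaml, assuming the chart exposes the same enabled flags shown in the snippet above (raising initialDelaySeconds, as in the previous answer, is usually the safer option):
livenessProbe:
  enabled: false
readinessProbe:
  enabled: false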

Related

SASL_SSL and SSL listener co-existing in same Kafka Cluster

I am trying to migrate from an SSL Kafka listener to a SASL_SSL Kafka listener without disturbing ongoing traffic on the SSL listener/port. Is there any way to do this on Kafka version 1.1.1?
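A Kafka broker can expose several listeners with different security protocols at the same time, so the two can co-exist during a migration. A broker-side sketch for server.properties; the hostnames, ports and SASL mechanism below are placeholders, not taken from the question:
listeners=SSL://0.0.0.0:9093,SASL_SSL://0.0.0.0:9094
advertised.listeners=SSL://broker1.example.com:9093,SASL_SSL://broker1.example.com:9094
sasl.enabled.mechanisms=SCRAM-SHA-256
Clients can then be moved over to the SASL_SSL port gradually while existing connections keep using the SSL port.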

Cannot start an example build

I installed OKD 3.11 (single master, multiple nodes) via the OpenShift Ansible playbooks. When I try the nginx quickstart, I get the following error when I ask for the events of the pod:
Warning FailedCreatePodSandBox 23m kubelet, Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "3d57f7bf012f8737e202ac2db7291b58a3d5fde376ff431395584c165928d475" network for pod "nginx-example-1-build": NetworkPlugin cni failed to set up pod "nginx-example-1-build_geoffrey-samper" network: error getting ClusterInformation: Get https://10.43.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.43.0.1:443: i/o timeout, failed to clean up sandbox container "3d57f7bf012f8737e202ac2db7291b58a3d5fde376ff431395584c165928d475" network for pod "nginx-example-1-build": NetworkPlugin cni failed to teardown pod "nginx-example-1-build_geoffrey-samper" network: error getting ClusterInformation: Get https://10.43.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.43.0.1:443: i/o timeout
When I run oc status, I can't find any pod in the 10.* IP range:
In project default on server https://p04:8443
https://docker-registry-default (passthrough) (svc/docker-registry)
dc/docker-registry deploys docker.io/openshift/origin-docker-registry:v3.11
deployment #1 deployed about an hour ago - 1 pod
svc/kubernetes - xxx ports 443->8443, 53->8053, 53->8053
https://registry-console-default (passthrough) (svc/registry-console)
dc/registry-console deploys docker.io/cockpit/kubernetes:latest
deployment #1 deployed about an hour ago - 1 pod
svc/router - xxx ports 80, 443, 1936
dc/router deploys docker.io/openshift/origin-haproxy-router:v3.11
deployment #1 deployed about an hour ago - 1 pod
Does anyone know how to resolve this, or where to begin?
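The 10.43.0.1:443 address in the error looks like the ClusterIP of the kubernetes service (the svc/kubernetes entry in the oc status output) rather than a pod IP, which is why no pod in the 10.* range shows up; the Calico CNI plugin is timing out while calling the API through that service. A diagnostic sketch, assuming Calico runs as a DaemonSet in the kube-system namespace on this cluster; the pod name is a placeholder:
oc get svc kubernetes -n default
oc get pods -n kube-system -o wide
oc logs -n kube-system <calico-node-pod-name>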

RabbitMQ service starts and then stops due to RabbitMQ: Erlang machine seems to die continously, not restarted

I installed Erlang (OTP 21.2, Windows 64-bit binary) followed by RabbitMQ Server 3.7.9, and I also enabled the management plugin with rabbitmq-plugins enable rabbitmq_management.
Now whenever I try to start the RabbitMQ service, it stops after about a minute.
These are the events I'm getting:
Info  > RabbitMQ: Erlang service restarted
Error > RabbitMQ: Erlang machine seems to die continously, not restarted.
The RabbitMQ service terminated with the following error:
The process terminated unexpectedly.
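A place to start looking is the broker's own log and any Erlang crash dump. A sketch for the Windows command prompt, assuming the installer's default data directory under %APPDATA%\RabbitMQ (this location can vary); the newest rabbit@HOSTNAME.log normally records why the node stopped, and erl_crash.dump is only present if the Erlang VM actually crashed:
dir "%APPDATA%\RabbitMQ\log"
dir "%APPDATA%\RabbitMQ\erl_crash.dump"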

spring xd with rabbit transporter not picking up

I am trying to use rabbit as my transport in Spring XD. Earlier I installed Redis, configured the Redis properties in servers.yml, and it was working.
Now I have moved to RabbitMQ: I installed it and am able to run the rabbitmq-server service.
I made the following changes in servers.yml:
xd:
  transport: rabbit

# RabbitMQ properties
spring:
  rabbitmq:
    host: FQHN
    port: 5672
    username: guest
    password: guest
    virtual_host: /
But when I start xd-admin and the container, the container is still using Redis as the transport.
The Redis properties are commented out in servers.yml.
Please help me on this.
Thanks,
-Suyodha
I use rabbit transport all the time, with no problems.
First, I suggest you move to the recent 1.0.0.RELEASE. The rabbit host and port are now provided in the addresses property:
spring:
  rabbitmq:
    addresses: FQHN:5672
...
How are you launching the container and admin? Do you have any XD_ environment variables set? Using the supplied scripts, the servers.yml settings will normally override application.yml; servers.yml must be on the classpath. If you can't figure it out, post a DEBUG log of the container startup somewhere (too big for here) with DEBUG enabled for org.springframework.
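A quick way to check for stray XD_ environment variables that could be steering the transport back to Redis (plain shell, nothing Spring XD specific assumed):
env | grep '^XD_'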

How to configure RabbitMQ using Active/Passive High Availability architecture

I'm trying to set up a cluster of RabbitMQ servers to get highly available queues using an active/passive server architecture. I'm following these guides:
http://www.rabbitmq.com/clustering.html
http://www.rabbitmq.com/ha.html
http://karlgrz.com/rabbitmq-highly-available-queues-and-clustering-using-amazon-ec2/
My requirement for high availability is simple: I have two nodes (CentOS 6.4) with RabbitMQ (v3.2) and Erlang R15B03. Node1 must be the "active" node, responding to all requests, and Node2 must be the "passive" node that has all the queues and messages replicated from Node1.
To do that, I have configured the following:
Node1 with RabbitMQ working fine in non-cluster mode
Node2 with RabbitMQ working fine in non-cluster mode
The next thing I did was create a cluster between both nodes, joining Node2 to Node1 (guide 1). After that I configured a policy to mirror the queues (guide 2), replicating all the queues and messages among all the nodes in the cluster. This works: I can connect to either node and publish or consume messages while both nodes are available.
The problem occurs when I have a queue "queueA" that was created on Node1 (the master for queueA): when Node1 is stopped, I can't connect to queueA on Node2 to produce or consume messages. Node2 throws an error saying that Node1 is not accessible (I think queueA is not replicated to Node2, and Node2 can't be promoted to master of queueA).
The error is:
{"The AMQP operation was interrupted: AMQP close-reason, initiated by
Peer, code=404, text=\"NOT_FOUND - home node 'rabbit#node1' of durable
queue 'queueA' in vhost 'app01' is down or inaccessible\", classId=50,
methodId=10, cause="}
The sequence of steps used is:
Node1:
1. rabbitmq-server -detached
2. rabbitmqctl start_app
Node2:
3. Copy .erlang.cookie from Node1 to Node2
4. rabbitmq-server -detached
Join the cluster (Node2):
5. rabbitmqctl stop_app
6. rabbitmqctl join_cluster rabbit@node1
7. rabbitmqctl start_app
Configure Queue mirroring policy:
8. rabbitmqctl set_policy ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
Note: The pattern used for queue names is "" (all queues).
When I run 'rabbitmqctl list_policies' and 'rabbitmqctl cluster_status', everything looks OK.
Why can't Node2 respond when Node1 is unavailable? Is there something wrong with this setup?
You haven't specified the virtual host (app01) in your set_policy call, so the policy only applies to the default virtual host (/). This command line should work:
rabbitmqctl set_policy -p app01 ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
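To confirm the policy landed in the right virtual host (app01 is the vhost name from the error message):
rabbitmqctl list_policies -p app01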
In the web management console, is queueA listed as Node1 +1?
It sounds like there might be some issue with your setup. I've got a set of Vagrant boxes that are pre-configured to work as a cluster; it might be worth trying those and comparing them with your setup.
Only mirrored queues that are synchronised with the master are promoted to master after a failure. This is the default behaviour, but it can be changed to promote-on-shutdown always (a policy sketch follows below).
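A sketch of the same policy with promotion on shutdown always enabled; the policy key is ha-promote-on-shutdown, which requires RabbitMQ 3.4 or later (the question mentions 3.2), and the vhost and pattern are taken from the question:
rabbitmqctl set_policy -p app01 ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic","ha-promote-on-shutdown":"always"}'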
Read your reference carefully:
http://www.rabbitmq.com/ha.html
You could use a cluster of RabbitMQ nodes to construct your RabbitMQ
broker. This will be resilient to the loss of individual nodes in
terms of the overall availability of service, but some important
caveats apply: whilst exchanges and bindings survive the loss of
individual nodes, queues and their messages do not. This is because a
queue and its contents reside on exactly one node, thus the loss of a
node will render its queues unavailable.
Make sure that your queue is not durable or exclusive.
From the documentation (https://www.rabbitmq.com/ha.html):
Exclusive queues will be deleted when the connection that declared them is closed. For this reason, it is not useful for an exclusive
queue to be mirrored (or durable for that matter) since when the node
hosting it goes down, the connection will close and the queue will
need to be deleted anyway.
For this reason, exclusive queues are never mirrored (even if they
match a policy stating that they should be). They are also never
durable (even if declared as such).
From your error message:
{"The AMQP operation was interrupted: AMQP close-reason, initiated by
Peer, code=404, text=\"NOT_FOUND - home node 'rabbit#node1' of
durable queue 'queueA' in vhost 'app01' is down or inaccessible\", classId=50, methodId=10, cause="}
It looks like you created a durable queue.
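A quick way to check how the queue was actually declared (a sketch; owner_pid is non-empty only for exclusive queues, and app01 is the vhost from the error):
rabbitmqctl list_queues -p app01 name durable owner_pid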