Managing the health and well-being of multiple pods with dependencies - error-handling

We have several pods (as services/deployments) in our k8s workflow that are dependent on each other, such that if one goes into a CrashLoopBackOff state, then all these services need to be redeployed.
Instead of having to do this manually, is there a programmatic way of handling this?
Of course, we are trying to figure out why the pod in question is crashing.

If these are so tightly dependent on each other, I would consider these options:
a) Rearchitect your system to be more resilient to failure and to tolerate a pod being temporarily unavailable
b) Put all parts into one pod as separate containers, making the atomic design more explicit
If these don't fit your needs, you can use the Kubernetes API to create a program that automates the task of restarting all dependent parts. There are client libraries for multiple languages and integration is quite easy. The next step would be a custom resource definition (CRD) so you can manage your own system using an extension to the Kubernetes API.
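For the CRD route, here is a minimal sketch of what such an extension could look like. The group, kind, and field names are hypothetical, and you would still need to write a controller that watches these objects and restarts the listed deployments:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: dependencygroups.example.com   # hypothetical group and name
spec:
  group: example.com
  scope: Namespaced
  names:
    kind: DependencyGroup
    plural: dependencygroups
    singular: dependencygroup
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                # deployments that must be restarted together
                deployments:
                  type: array
                  items:
                    type: string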

The first thing to do is to make sure that the pods are started in the correct sequence. This can be done using initContainers, like this:
spec:
  initContainers:
    - name: waitfor
      image: jwilder/dockerize
      args:
        - -wait
        - "http://config-srv/actuator/health"
        - -wait
        - "http://registry-srv/actuator/health"
        - -wait
        - "http://rabbitmq:15672"
        - -timeout
        - 600s
Here your pod will not start until all the services in the list are responding to HTTP requests.
Next, you may want to define a liveness probe that periodically runs curl against the same services:
spec:
  livenessProbe:
    exec:
      command:
        - /bin/sh
        - -c
        - curl http://config-srv/actuator/health &&
          curl http://registry-srv/actuator/health &&
          curl http://rabbitmq:15672
Now if any of those services fails, your pod will fail the liveness probe, be restarted, and wait for the services to come back online.
That's just an example of how it can be done. In your case the checks may be different, of course.

Related

Kubernetes - env variables as API url

So I have an API that's the gateway for two other APIs.
Using Docker in WSL 2 (Ubuntu), this is how I run my gateway API:
docker run -d -p 8080:8080 -e A_API_URL=$A_API_URL -e B_API_URL=$B_API_URL registry:$(somePort)//gateway
I have two environment variables that are the API URIs of the two APIs. I just don't know how to make this work in the Kubernetes config.
env:
  - name: A_API_URL
    value: <need help>
  - name: B_API_URL
    value: <need help>
I get 500 or 502 errors when accessing them over the network.
I tried specifying the value of the env vars as:
their respective service's name
the complete URI (http://$(addr):$(port))
the relative path: /something/anotherSomething
Each API is deployed with a Deployment controller and a Service.
I'm at a loss; any help is appreciated.
You just have to hardwire them. Kubernetes doesn't know anything about your local machine. There are templating tools like Helm that could inject values the way Bash does in your docker run example, but that's generally not a good idea, since if anyone other than you runs the same command they could get different results. The values should look like http://servicename.namespacename.svc.cluster.local:port/whatever. So if the service is named foo in namespace default with port 8000 and path /api, that's http://foo.default.svc.cluster.local:8000/api.
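As a minimal sketch, assuming the two backing APIs are exposed by Services named a-api and b-api (hypothetical names) in the default namespace and listening on port 8000, the env block could look like this:
env:
  - name: A_API_URL
    # hypothetical Service name, namespace and port; substitute your own
    value: "http://a-api.default.svc.cluster.local:8000"
  - name: B_API_URL
    value: "http://b-api.default.svc.cluster.local:8000"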

S3 - Kubernetes probe

I have the following situation:
The application uses S3 to store data in Amazon. The application is deployed as a pod in Kubernetes. Sometimes one of the developers messes with the access credentials for S3 (e.g. user/password) and the application fails to connect to S3 - but the pod starts normally and kills the previous pod version that worked OK (since all readiness and liveness probes pass). I thought of adding an S3 check to the readiness probe - executing a HeadBucketRequest on S3, and if it succeeds the application is able to connect to S3. The problem here is that these requests cost money, and I really only need them at pod start.
Are there any best practices related to this?
If you (quote) "... really need them [the probes] only on start of the pod" then look into adding a startup probe.
In addition to the case startup probes were designed for - pods that take a longer time to start - a startup probe makes it possible to verify a condition only at pod startup time.
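A minimal sketch of such a startup probe, assuming the AWS CLI is available in the image and the bucket name is provided in an S3_BUCKET environment variable (both of these are assumptions, not part of the original setup):
startupProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # HeadBucket check: only runs while the pod is starting up
      - aws s3api head-bucket --bucket "$S3_BUCKET"
  failureThreshold: 5
  periodSeconds: 10
Once the startup probe succeeds it never runs again for that container, so the S3 requests are only made while the pod is starting.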
Readiness and liveness probes are for checking the health of the pod or container while it is running. Your scenario is quite unusual, but with readiness & liveness probes it won't work well, as they fire on an interval, which costs money.
In this case you can use a lifecycle hook:
containers:
  - image: IMAGE_NAME
    lifecycle:
      postStart:
        exec:
          command: ["/bin/sh", "-c", "script.sh"]
This will run the hook when the container starts; you can keep the shell file inside the pod or image.
Inside the shell file you can write the logic: if you get a 200 response, move ahead and let the container start.
https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/
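For illustration, a minimal sketch with the check written inline instead of a separate script.sh, assuming a hypothetical http://some-dependency/health endpoint that returns 200 once the dependency is reachable:
lifecycle:
  postStart:
    exec:
      command:
        - /bin/sh
        - -c
        # hypothetical endpoint; keep polling until it returns 200
        - until curl -sf http://some-dependency/health; do sleep 5; done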

How to control job scheduling in a better way in gitlab-ci?

I have jobs defined and executed in my GitLab projects via gitlab-ci. However, it doesn't handle interdependent jobs well, as there is no management of this case except doing it manually.
The case I have is a service, part of the overall app, that takes a long time to start. Starting this service is done within a job, while another job runs another service, also part of the overall app, that queries the former service. Due to the interdependence, I have simply delayed the execution of the later job so that the former job most probably has its service up and running.
I wanted to use Rundeck as a job scheduler, but I'm not sure if this can be done with GitLab. Maybe I am wrong about GitLab, so does GitLab allow better job scheduling?
Here's an example of what I am doing:
.gitlab-ci.yml
deploy:
  environment:
    name: $CI_ENVIRONMENT
    url: http://$CI_ENVIRONMENT.local.net:4999/
  allow_failure: true
  script:
    - sudo dpkg -i myapp.deb
    - sleep 30m  # here I wait for the service to be ready for later jobs to run successfully
    - "RESULT=`curl http://localhost:9999/api/test | grep Success`"
It looks like a typical use case for the trigger feature in gitlab-ci;
see gitlab-ci triggers.
The idea is, at the end of the long start-up job for service A, to use curl to trigger another pipeline:
deploy_service_a:
  stage: deploy
  script:
    - "curl --request POST --form token=TOKEN --form ref=master https://gitlab.example.com/api/v4/projects/9/trigger/pipeline"
  only:
    - tags

fig up: docker containers start synchronisation

For one of my home projects I decided to use docker containers and fig for orchestration (first time using those tools).
Here is my fig.yaml:
rabbitmq:
  image: dockerfile/rabbitmq:latest
mongodb:
  image: mongo
app:
  build: .
  command: python /code/app/main.py
  links:
    - rabbitmq
    - mongodb
  volumes:
    - .:/code
RabbitMQ's startup time is much slower than my application's loading time. Even though the rabbitmq container starts loading first (since it is in the app's links), when my app tries to connect to the RabbitMQ server it is not yet available (it's definitely a startup-timing problem, since if I just insert a 5-second sleep before connecting to RabbitMQ, everything works fine). Is there some standard way to resolve startup synchronisation problems?
Thanks.
I don't think there is a standard way to solve this, but it is a known problem and some people have acceptable workarounds.
There is a proposal on the Docker issue tracker about not considering a container as started until it is listening on the exposed ports. However, it likely won't be accepted due to other problems it would create elsewhere. There is a fig proposal on the same topic as well.
The easy solution is to do the sleep like @jcortejoso says. An example from http://blog.chmouel.com/2014/11/04/avoiding-race-conditions-between-containers-with-docker-and-fig/:
function check_up() {
  service=$1
  host=$2
  port=$3
  max=13 # 1 minute
  counter=1
  while true; do
    python -c "import socket;s = socket.socket(socket.AF_INET, socket.SOCK_STREAM);s.connect(('$host', $port))" \
      >/dev/null 2>/dev/null && break || \
      echo "Waiting that $service on ${host}:${port} is started (sleeping for 5)"
    if [[ ${counter} == ${max} ]]; then
      echo "Could not connect to ${service} after some time"
      echo "Investigate locally the logs with fig logs"
      exit 1
    fi
    sleep 5
    (( counter++ ))
  done
}
And then use check_up "DB Server" ${RABBITMQ_PORT_5672_TCP_ADDR} 5672 before starting your app server, as described in the link above.
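A usage sketch (the start.sh wrapper and its name are assumptions, not part of the original answer): the app entry in fig.yaml can hand control to a wrapper script that performs the check before launching the application:
app:
  build: .
  # start.sh defines check_up as above, runs
  #   check_up "DB Server" $RABBITMQ_PORT_5672_TCP_ADDR 5672
  # and then execs: python /code/app/main.py
  command: /bin/bash /code/start.sh
  links:
    - rabbitmq
    - mongodb
  volumes:
    - .:/code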
Another option is to use docker-wait. In your fig.yml:
rabbitmq:
  image: dockerfile/rabbitmq:latest
mongodb:
  image: mongo
rabbitmqready:
  image: aanand/wait
  links:
    - rabbitmq
app:
  build: .
  command: python /code/app/main.py
  links:
    - rabbitmqready
    - mongodb
  volumes:
    - .:/code
Similar problems I have encountered I have solved using a custom script set up as the CMD in my Dockerfiles. There you can run any check you wish (sleep for a while, or wait until the service is listening, for example). I don't think there is a standard way to do this; ideally the application itself would be able to wait for the external services to be up and running and then connect to them, but this is not possible in most cases.
For testing on our CI, we built a small utility that can be used in a Docker container to wait for linked services to be ready. It automatically finds all linked TCP services from their environment variables and repeatedly and concurrently tries to establish TCP connections until it succeeds or times out.
We also wrote a blog post describing why we built it and how we use it.

RabbitMQ management plugin with local cluster

Is there any reason that the rabbitmq-management plugin wouldn't work when I'm using 'rabbitmq-multi' to spin up a cluster of nodes on my desktop? Or, more precisely, that the management plugin would cause that spinup to fail?
I get Error: {node_start_failed,normal} when rabbitmq-multi starts rabbit_1@localhost.
The first node, rabbit@localhost, seems to start okay though.
If I take out the management plugins, all the nodes start up (and then cluster) fine. I think I'm using a recent enough Erlang version (5.8/OTP R14A according to the README in my erl5.8.2 folder). I'm using all the plugins that are listed as required on the plugins page, including mochiweb, webmachine, amqp_client, rabbitmq-mochiweb, rabbitmq-management-agent, and rabbitmq-management. Those plugins, and only those plugins.
The problem is that rabbitmq-multi only assigns sequential ports for AMQP, not HTTP (or STOMP or AMQPS or anything else the broker may open). Therefore each node tries to listen on the same port for the management plugin and only the first succeeds. rabbitmq-multi will be going away in the next release; this is one reason why.
I think you'll want to start the nodes without using rabbitmq-multi, just with multiple invocations of rabbitmq-server, using environment variables to configure each node differently. I use a script like:
start-node.sh:
#!/bin/sh
RABBITMQ_NODE_PORT=$1 RABBITMQ_NODENAME=$2 \
RABBITMQ_MNESIA_DIR=/tmp/rabbitmq-$2-mnesia \
RABBITMQ_PLUGINS_EXPAND_DIR=/tmp/rabbitmq-$2-plugins-scratch \
RABBITMQ_LOG_BASE=/tmp \
RABBITMQ_SERVER_START_ARGS="-rabbit_mochiweb port 5$1" \
/path/to/rabbitmq-server -detached
and then invoke it as:
start-node.sh 5672 rabbit
start-node.sh 5673 hare