How can watchtower avoid upgrading replicas at the same time

My web app has redundant replicas for high availability. How should I configure watchtower so that it doesn't upgrade the Docker containers simultaneously, which would take the entire site down?
Is there a way to configure a delay for each replica so that the upgrades don't all happen at the same time?
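One possible pattern (a sketch only, not an official watchtower recipe): run one watchtower instance per replica, each restricted to a single container name and given a staggered schedule, so the replicas are never checked and restarted at the same moment. The container names web-1/web-2 and the times below are placeholders; the sketch relies only on watchtower's WATCHTOWER_SCHEDULE setting and on passing container names as arguments. Newer watchtower versions also have a --rolling-restart option that may help when a single instance watches all replicas.
version: "3"
services:
  watchtower-a:
    image: containrrr/watchtower
    command: web-1                              # only watch this replica (placeholder name)
    environment:
      - "WATCHTOWER_SCHEDULE=0 0 3 * * *"       # check/upgrade at 03:00
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
  watchtower-b:
    image: containrrr/watchtower
    command: web-2                              # the other replica (placeholder name)
    environment:
      - "WATCHTOWER_SCHEDULE=0 30 3 * * *"      # check/upgrade 30 minutes later
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock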

Related

How do I configure Embedded Infinispan to handle K8s rolling updates?

I have a simple project that allows you to add keys to a distributed cache in an application that is running Infinispan version 13 in embedded mode. It is all published here.
I run a Kubernetes setup that can run in minikube. I observe that when I run my example with six pods and perform a rolling update, Infinispan performance degrades from the start of the rollout until four minutes after the last pod has restarted and created its cache. After that the cluster operates normally again. By degraded I mean that getting the count of items in the cache takes 2-3 seconds, compared to below 0.5 seconds in normal operation. With my setup this happens consistently, and it consistently recovers after four minutes.
When running the project on my local machine without a Kubernetes environment, I have not experienced the same kind of delays.
I have tried using TRACE logs, but can see no event of significance that happens after these four minutes.
Is there something obvious that I'm missing in my configuration of Infinispan (which you can see in my referenced project), or some additional operation that needs to be performed? (Currently I start the cache on startup and stop it on shutdown.)
A colleague found the following log entry when running Infinispan in non-embedded mode:
2022-01-09 14:56:45,378 DEBUG (jgroups-230,infinispan-server-2) [org.jgroups.protocols.UNICAST3] infinispan-server-2: removing expired connection for infinispan-server-0 (240058 ms old) from recv_table
After these log entries the service performance returned to normal. This led us to suspect that JGroups somehow keeps trying to use old connections to pods that have been removed. By changing the conn_close_timeout setting on UNICAST3 in the JGroups configuration from the default of 4 minutes to 10 seconds, we confirmed that the degradation then cleared after 10 seconds instead of 4 minutes.
Additionally, this fix only seems to work when the service runs as a StatefulSet and not when it runs as a Deployment. I don't have an explanation for exactly why that is, but in conclusion, making the service a StatefulSet and changing conn_close_timeout on UNICAST3 in the JGroups configuration fixed our problem.
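For reference, a minimal sketch of what that change looks like in the JGroups stack XML (the value is in milliseconds; the rest of the protocol stack is omitted):
<!-- fragment of the JGroups protocol stack; other protocols omitted -->
<UNICAST3 conn_close_timeout="10000"/>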

How can I setup Redis Cluster mode or master slave mode in PCF?

This is about a use case where we are trying to use Redis in PCF (Pivotal Cloud Foundry). We refresh the Redis cache with the required data once or twice daily, and an API then queries Redis and returns the response.
Our particular concern is that API queries must be served from Redis only, which means Redis has to be available at all times. But while we are refreshing the Redis DB, Redis cannot serve the API because the keys are being rewritten. To avoid that, we wanted to set up Redis in cluster mode or master-slave mode, so that while one instance is being written to, another can be read from.
How can we set up Redis cluster or master-slave mode in PCF to fulfil this requirement?
Please provide any other suggestions as well that you may have.
At the time I write this, the Redis for Pivotal Platform product does not support clustering. See Availability in the docs: https://docs.pivotal.io/redis/2-3/erc.html#offerings.
All Redis for Pivotal Platform services are single VMs without clustering capabilities. This means that planned maintenance jobs (e.g., upgrades) can result in 2–10 minutes of downtime, depending on the nature of the upgrade. Unplanned downtime (e.g., VM failure) also affects the Redis service.
Redis for Pivotal Platform has been used successfully in enterprise-ready apps that can tolerate downtime. Pre-existing data is not lost during downtime with the default persistence configuration. Successful apps include those where the downtime is passively handled or where the app handles failover logic.
If you require clustered Redis, you'd need to look at a different offering. Redis Labs has some offerings that integrate with PCF, you could use a Cloud Provider's Redis offering, or you could host your own.
If the solution you use isn't integrated into PCF, you can create a user-provided service with cf cups and provide the Redis credentials to your application that way. It will function just like a Redis service instance created through the marketplace.
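For example (the service name, app name, and credential values here are placeholders, not from the original answer):
cf cups my-redis -p '{"host":"redis.example.com","port":"6379","password":"s3cret"}'
cf bind-service my-app my-redis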

Should all pods using a Redis cache be constrained to the same node as the Redis cache itself?

We are running one of our services in a newly created Kubernetes cluster, and as part of that move we have switched it from the previous in-memory cache to a Redis cache.
Preliminary tests on our application, which exposes an API, show that we experience timeouts from the application to the Redis cache. I have no idea why, and the issue pops up very irregularly.
So I'm thinking the reason for these timeouts may actually be network related. Is it a good idea to add affinity so that we always run the Redis cache on the same nodes as the application, to prevent network issues?
The issues have not arisen during very-high-load situations, which concerns me a bit.
This is an opinion question so I'll answer in an opinionated way:
As you mentioned, I would try to put the Redis and application pods on the same node; that would rule out network issues on the wire. You can accomplish that with Kubernetes pod affinity. You can also use a nodeSelector, which lets you pin both the Redis and application pods to a specific node.
Another way to do this is to taint the nodes where you want to run these workloads and then add a toleration to both the Redis and application pods.
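As a rough sketch of the pod-affinity approach (the app: redis label is an assumption, not something from the original question), the application pod template would carry something like:
# in the application pod template; co-locate with pods labelled app: redis
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: redis
        topologyKey: kubernetes.io/hostname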
Hope it helps!

High-availability Redis?

I am currently setting up infrastructure for an app in AWS. The app is written in Django and uses Redis for some transactions. High availability is key for this application, and I am having a hard time getting my head around how to configure Redis for high availability.
Application-level changes are not an option.
Ideally I would like a Redis setup that I can read from and write to, and that replicates and scales when required.
The current setup is a Redis failover scenario: HAProxy --> Redis master --> replica slave.
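(For context, a failover front-end like this is commonly built with HAProxy TCP checks that keep only the node reporting role:master in rotation; the following is a generic sketch with placeholder addresses, not the exact configuration in use here:)
backend redis_back
    mode tcp
    option tcp-check
    tcp-check send PING\r\n
    tcp-check expect string +PONG
    tcp-check send info\ replication\r\n
    tcp-check expect string role:master
    tcp-check send QUIT\r\n
    tcp-check expect string +OK
    server redis-1 10.0.0.1:6379 check inter 1s
    server redis-2 10.0.0.2:6379 check inter 1s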
Could someone help me understand the various options, and how to scale Redis for high availability?
Use an AWS ElastiCache for Redis cluster with Multi-AZ enabled. It provides automatic failover and a single endpoint for reaching the master node.
If the master goes down, AWS routes that endpoint to another node. Everything happens automatically; you don't have to do anything.
Just make sure that if your application caches DNS-to-IP lookups, the cache TTL is set to around 60 seconds instead of the default.
http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/AutoFailover.html

spring-data-redis cluster recovery issue

We're running a 7-node Redis cluster with all nodes as masters (no slave replication). We use it as an in-memory cache, so we've commented out all saves in redis.conf, and we have the following other non-defaults in redis.conf:
maxmemory 30gb
maxmemory-policy allkeys-lru
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 5000
cluster-require-full-coverage no
The client for this cluster is a Spring Boot REST API application, using spring-data-redis with Jedis as the driver. We mainly use the Spring caching annotations.
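For readers unfamiliar with that pattern, annotation-driven caching looks roughly like this (a hypothetical example, not the actual service code):
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

// hypothetical service illustrating the annotation-driven caching described above
@Service
public class UserLookupService {

    @Cacheable(value = "users", key = "#id")
    public String findUser(String id) {
        // body runs only on a cache miss; the result is stored in the
        // Redis-backed "users" cache by spring-data-redis
        return expensiveLookup(id);
    }

    private String expensiveLookup(String id) {
        return "user-" + id; // placeholder for the real lookup
    }
}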
We had an issue the other day where one of the masters went down for a while. With a single master down in a 7-node cluster we noted a marked increase in the average response time for API calls involving Redis, which I would expect.
When the downed master was brought back online and re-joined the cluster, we had a massive spike in response time. Via New Relic I can see that the app started making a ton of Redis cluster calls (New Relic doesn't tell me which cluster subcommand was being used). Our normal average response time is around 5 ms; during this period it went up to 800 ms, and we had a few slow sample transactions that took more than 70 seconds. On all app JVMs I saw the number of active threads jump from the normal 8-9 up to around 300 during this time. We have configured the Tomcat HTTP thread pool to allow a maximum of 400 threads.
After about 3 minutes the problem cleared itself up, but I now have people questioning the stability of the caching solution we chose. New Relic doesn't give any insight into where the additional time on the long requests is being spent (it's apparently in an area that New Relic doesn't instrument).
I've made some attempts to reproduce this by running jmeter load tests against a development environment, and while I see some moderate response-time spikes when re-attaching a Redis cluster master, I don't see anything near what we saw in production. I've also run across https://github.com/xetorthio/jedis/issues/1108, but I'm not gaining any useful insight from that. I tried reducing spring.redis.cluster.max-redirects from the default 5 to 0, which didn't seem to have much effect on my load-test results. I'm also not sure how appropriate that change is for my use case.
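For reference, the client-side knobs involved live in application.properties and look roughly like this (the values are illustrative only, not the ones from these tests):
spring.redis.cluster.nodes=redis-1:6379,redis-2:6379,redis-3:6379
spring.redis.cluster.max-redirects=5
spring.redis.timeout=2000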