How do I configure Embedded Infinispan to handle K8s rolling updates? - kotlin

I have a simple project that allows you to add keys to a distributed cache in an application that is running Infinispan version 13 in embedded mode. It is all published here.
I run a Kubernetes setup that can run in minikube. I observe that when I run my example with six pods and perform a rolling update, my Infinispan performance degrades from the start of the rollout until four minutes after the last pod has restarted and created its cache. After this time the cluster operates normally again. By "degrades" I mean that getting the count of items in the cache takes 2-3 seconds to execute, compared to below 0.5 seconds in normal operation. With my setup this happens consistently, and it consistently recovers after four minutes.
When running the project on my local machine without a Kubernetes environment, I have not experienced the same kind of delays.
I have tried using TRACE logs, but I can see no event of significance around the four-minute mark.
Is there something obvious that I'm missing in my configuration of Infinispan (which you can see in my referenced project), or some additional operation that needs to be performed? (Currently I start the cache on startup and stop it on shutdown.)
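For reference, here is a minimal Kotlin sketch of the lifecycle described above (a clustered embedded cache started on startup and stopped on shutdown). The class, cache name and configuration are placeholders, not taken from the referenced project:

import org.infinispan.Cache
import org.infinispan.configuration.cache.CacheMode
import org.infinispan.configuration.cache.ConfigurationBuilder
import org.infinispan.configuration.global.GlobalConfigurationBuilder
import org.infinispan.manager.DefaultCacheManager

class EmbeddedCacheLifecycle {
    // Clustered cache manager using the default JGroups transport.
    private val manager = DefaultCacheManager(
        GlobalConfigurationBuilder.defaultClusteredBuilder().build()
    )

    // Called on application startup: define and start a distributed cache.
    fun start(): Cache<String, String> {
        val config = ConfigurationBuilder()
            .clustering().cacheMode(CacheMode.DIST_SYNC)
            .build()
        manager.defineConfiguration("keys", config)
        return manager.getCache<String, String>("keys")
    }

    // Called on application shutdown: stops the caches and leaves the cluster.
    fun stop() = manager.stop()
}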

A colleague found the following logs when running Infinispan in non-embedded mode:
2022-01-09 14:56:45,378 DEBUG (jgroups-230,infinispan-server-2) [org.jgroups.protocols.UNICAST3] infinispan-server-2: removing expired connection for infinispan-server-0 (240058 ms old) from recv_table
After these logs the service performance returned to normal. This led us to suspect that JGroups somehow tries to use old connections to pods that have been removed. By changing the conn_close_timeout setting on UNICAST3 in the JGroups configuration to 10 seconds instead of the default value of 4 minutes, we could confirm that the service degradation now cleared after 10 seconds instead of 4 minutes.
Additionally, it seems that this fix only works when the service runs as a StatefulSet and not when it runs as a Deployment. I don't have an explanation for exactly why this is, but in conclusion, making the service a StatefulSet and changing conn_close_timeout on UNICAST3 in the JGroups configuration fixed our problem.
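If you want to apply the same fix with embedded Infinispan, one option is to point the transport at a custom JGroups stack file. A minimal Kotlin sketch, assuming a file named jgroups-k8s.xml on the classpath that is a copy of default-jgroups-kubernetes.xml with conn_close_timeout lowered on UNICAST3:

import org.infinispan.configuration.global.GlobalConfigurationBuilder
import org.infinispan.manager.DefaultCacheManager

// jgroups-k8s.xml (hypothetical name) is a copy of default-jgroups-kubernetes.xml
// with UNICAST3 conn_close_timeout set to 10000 ms instead of the 4-minute default.
val global = GlobalConfigurationBuilder.defaultClusteredBuilder()
global.transport()
    .clusterName("infinispan-cluster")                    // illustrative cluster name
    .addProperty("configurationFile", "jgroups-k8s.xml")
val cacheManager = DefaultCacheManager(global.build())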

Related

Apache Ignite: Getting affinity for too old topology version that is already out of history (try to increase 'IGNITE_AFFINITY_HISTORY_SIZE')

I am getting this exception intermittently while trying to run co-located join queries on cached data. Below are some specifics of the environment and how the caches are initialized:
Running embedded with a spring boot application
Deployed in Kubernetes environment with TcpDiscoveryJdbcIpFinder
Running on 3+ nodes
The caches are created dynamically using BinaryObjects and QueryEntity
The affinity keys are forced to be a static value using AffinityKeyMapper (for the same group of data)
I am getting "Getting affinity for too old topology version that is already out of history (try to increase 'IGNITE_AFFINITY_HISTORY_SIZE')" sporadically. Sometimes this happens continuously for a few minutes; sometimes it works on a second or third try, and sometimes we don't see this error for hours. I have already increased IGNITE_AFFINITY_HISTORY_SIZE to 100000 and we are still getting this message.
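For context, a rough Kotlin sketch of the kind of setup described above. The cache name, value type and fields are placeholders, and the AffinityKeyMapper part is omitted:

import org.apache.ignite.Ignition
import org.apache.ignite.cache.QueryEntity
import org.apache.ignite.configuration.CacheConfiguration

// The affinity history size must be raised before the node starts
// (this is the same setting the error message refers to).
System.setProperty("IGNITE_AFFINITY_HISTORY_SIZE", "100000")

val ignite = Ignition.start()

// Dynamically created cache with a QueryEntity; key/value type names and fields are placeholders.
val entity = QueryEntity("java.lang.String", "TradeRecord")
    .setFields(linkedMapOf("price" to "java.lang.Double", "groupId" to "java.lang.String"))

val cacheCfg = CacheConfiguration<Any, Any>("trades")
    .setQueryEntities(listOf(entity))

// withKeepBinary() so values can be handled as BinaryObjects, as described above.
val cache = ignite.getOrCreateCache(cacheCfg).withKeepBinary<Any, Any>()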

How to change Ignite to maintenance mode?

What is Ignite maintenance mode, and how do I switch a node to this mode? I was stuck joining a node to the cluster: it complained about needing to clean up persistent data, but the data can only be cleaned (using control.sh) while the node is in maintenance mode.
This is a special mode, similar to running Windows in safe mode after a crash or data corruption, where most of the cluster functionality is disabled and the user is asked to perform some maintenance task to resolve the issue. The most straightforward example I can think of is cleaning up (removing) some corrupted files on disk, just like in your question. You can refer to the IEP-53: Maintenance Mode proposal for the details.
I don't think there is a way to enter this mode manually unless you trigger some preconfigured condition, like stopping a node in the middle of checkpointing with WAL disabled. Once the state is fixed, maintenance mode should be resolved automatically, allowing the node to join the cluster.
Also, from my understanding, this mode applies to a particular node rather than the complete cluster. I.e. you can have a 4-node cluster with only 1 node in maintenance mode; in that case, you have to run control.sh commands locally for the concrete failed node, not from another healthy node. If that's not the case, please provide more details or file a JIRA ticket, because the reported behavior looks quite broken to me.

GridGain Server partition loss

We have a 3-node GridGain server cluster and 3 client nodes deployed in GCP Kubernetes Engine. The cluster has native persistence enabled, and <property name="shutdownPolicy" value="GRACEFUL"/> is set as the shutdown policy. There is one backup for each cache. After an automatic cluster restart we get partition loss and need to reset these partitions by executing control commands.
Can you provide a proper solution for this? We have around 60GB of persistent data.
<property name="shutdownPolicy" value="GRACEFUL"/> is supposed to protect from partition loss if certain conditions are met:
1. The caches must be either PARTITIONED with backups > 0 or REPLICATED. Check your configs. The default cache config in Ignite is PARTITIONED with backups = 0 (for historical reasons), so the defaults won't work.
2. There must be more than one baseline node (only baseline nodes store data!). Here is the doc.
3. You must stop the nodes in a graceful way. This is a bit tricky since you don't always control this.
- If you stop with a kill to the process, make sure it uses SIGTERM and not SIGKILL, because the latter always kills the process immediately.
- If you stop with Ignite.close() this should just work.
- If you stop with Java System.exit() it'll work, but if you use System.halt() it won't (because halt() is not graceful).
- If you use orchestrators such as Kubernetes, you need to make sure they stop the nodes gracefully. For example, in Kubernetes you normally have to set terminationGracePeriodSeconds to a high value so that Kubernetes waits for the nodes to finish graceful shutdown instead of killing them.
- If you use custom startup scripts, you need to make sure they forward signals to the Ignite process.
To debug this, check the points above. I would normally start by looking at the server logs (with IGNITE_QUIET=false!) to see if the "Invoking shutdown hook" message is there. If it isn't, then your shutdown hook isn't getting called, and the problem is one of the points under 3. Otherwise, there should be other log messages explaining the situation.
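To make point 1 and the shutdown policy concrete, here is a rough Kotlin sketch of a node configuration along those lines (the cache name is illustrative, not taken from the poster's setup):

import org.apache.ignite.Ignition
import org.apache.ignite.ShutdownPolicy
import org.apache.ignite.cache.CacheMode
import org.apache.ignite.configuration.CacheConfiguration
import org.apache.ignite.configuration.DataRegionConfiguration
import org.apache.ignite.configuration.DataStorageConfiguration
import org.apache.ignite.configuration.IgniteConfiguration

// PARTITIONED cache with at least one backup, so a graceful shutdown always
// has another copy of each partition to fall back on.
val cacheCfg = CacheConfiguration<Any, Any>("myCache")
    .setCacheMode(CacheMode.PARTITIONED)
    .setBackups(1)

// Native persistence enabled on the default data region.
val storageCfg = DataStorageConfiguration()
    .setDefaultDataRegionConfiguration(
        DataRegionConfiguration().setPersistenceEnabled(true)
    )

val cfg = IgniteConfiguration()
    .setShutdownPolicy(ShutdownPolicy.GRACEFUL)   // programmatic equivalent of the XML property above
    .setDataStorageConfiguration(storageCfg)
    .setCacheConfiguration(cacheCfg)

val ignite = Ignition.start(cfg)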

Failed Deployment in App Engine Google Cloud

I am deploying my Node.js application to Google Cloud App Engine, but it is giving this error when making requests:
This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time.
This request may thus take longer and use more CPU than a typical request for your application.
I have also looked at some Stack Overflow answers, but they didn't work for me.
My app.yaml has this config:
runtime: nodejs10
Can anyone help me out?
You could add the following to your app.yaml:
inbound_services:
- warmup
And then implement a handler that will catch all warmup requests, so that your application doesn't get the full load. The full explanation is given here. Another detailed post about this topic can be found here.
Additionally, you can add automatic scaling options. You can play a bit with those to find the optimum for your application; the latency-related variables are especially important. Note that they can be set in the standard GAE environment.
automatic_scaling:
min_idle_instances: automatic
max_idle_instances: automatic
min_pending_latency: automatic
max_pending_latency: automatic
More scaling options can be found here.
The "request caused a new process to be started" notification usually occurred when there is no warm up request present in your application.
Can you try to implement a health check handler that only returns a ready status when the application is warmed up. This will allow your service to not receive traffic until it is ready.
Warning: Legacy health checks using the /_ah/health path are now deprecated, and you should migrate to use split health checks.
Here you can find split health checks for Node.js.
Liveness checks
Liveness checks confirm that the VM and the Docker container are running. Instances that are deemed unhealthy are restarted.
liveness_check:
  path: "/liveness_check"
  check_interval_sec: 30
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
Readiness checks
Readiness checks confirm that an instance can accept incoming requests. Instances that don't pass the readiness check are not added to the pool of available instances.
readiness_check:
  path: "/readiness_check"
  check_interval_sec: 5
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
  app_start_timeout_sec: 300
Edit
For App Engine Standard, which doesn't afford you that flexibility, hardware and software failures that cause early termination or frequent restarts can occur without prior warning (link).
App Engine attempts to keep manual and basic scaling instances running indefinitely. However, at this time there is no guaranteed uptime for manual and basic scaling instances. Hardware and software failures that cause early termination or frequent restarts can occur without prior warning and can take considerable time to resolve; thus, you should construct your application in a way that tolerates these failures.
Here are some good strategies for avoiding downtime due to instance restarts:
Reduce the amount of time it takes for your instances to restart or for new ones to start.
For long-running computations, periodically create checkpoints so that you can resume from that state.
Your app should be "stateless" so that nothing is stored on the instance.
Use queues for performing asynchronous task execution.
If you configure your instances to manual scaling: use load balancing across multiple instances, configure more instances than required to handle normal traffic, and write fall-back logic that uses cached results when a manual scaling instance is unavailable.
Instance Uptime

spring-data-redis cluster recovery issue

We're running a 7-node redis cluster, with all nodes as masters (no slave replication). We're using this as an in-memory cache, so we've commented out all saves in redis.conf, and we've got the following other non-defaults in redis.conf:
maxmemory 30gb
maxmemory-policy allkeys-lru
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 5000
cluster-require-full-coverage no
The client for this cluster is a Spring Boot REST API application, using spring-data-redis with Jedis as the driver. We mainly use the Spring caching annotations.
We had an issue the other day where one of the masters went down for a while. With a single master down in a 7-node cluster we noted a marked increase in the average response time for api calls involving redis, which I would expect.
When the down master was brought back online and re-joined the cluster, we had a massive spike in response time. Via New Relic I can see that the app started making a ton of redis cluster calls (New Relic doesn't tell me which cluster subcommand was being used). Our normal average response time is around 5 ms; during this time it went up to 800 ms, and we had a few slow sample transactions that took > 70 sec. On all app JVMs I see the number of active threads jump from a normal 8-9 up to around 300 during this time. We have configured the Tomcat HTTP thread pool to allow 400 threads max. After about 3 minutes the problem cleared itself up, but I now have people questioning the stability of the caching solution we chose. New Relic doesn't give any insight into where the additional time on the long requests is being spent (it's apparently in an area that New Relic doesn't instrument).
I've made some attempt to reproduce by running some jmeter load tests against a development environment, and while I see some moderate response time spikes when re-attaching a redis-cluster master, I don't see anything near what we saw in production. I've also run across https://github.com/xetorthio/jedis/issues/1108, but I'm not gaining any useful insight from that. I tried reducing spring.redis.cluster.max-redirects from the default 5 to 0, which didn't seem to have much effect on my load test results. I'm also not sure how appropriate a change that is for my use case.
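For reference, if you want to experiment with max-redirects programmatically rather than via the spring.redis.cluster.max-redirects property, a rough Kotlin sketch with spring-data-redis and Jedis might look like this (the node addresses and bean layout are made up for illustration):

import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import org.springframework.data.redis.connection.RedisClusterConfiguration
import org.springframework.data.redis.connection.jedis.JedisConnectionFactory

@Configuration
class RedisClusterConfig {

    @Bean
    fun redisConnectionFactory(): JedisConnectionFactory {
        // Hypothetical seed nodes; replace with the real cluster endpoints.
        val clusterConfig = RedisClusterConfiguration(
            listOf("redis-0:6379", "redis-1:6379", "redis-2:6379")
        )
        // Same knob as spring.redis.cluster.max-redirects; how low you can safely
        // go depends on how often slots actually move in your cluster.
        clusterConfig.setMaxRedirects(2)
        return JedisConnectionFactory(clusterConfig)
    }
}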