Infinispan Distributed Cache Issues - infinispan

I am new to Infinispan. We have setup a Infinispan cluster so that we can make use of the Distributed Cache for our CPU and memory intensive task. We are using UDP as the communication medium and Infinispan MapReduce for distributed processing. The problem we are using facing is with the throughput. When we run the program on a single node machine, the total program completes in around 11 minutes i.e. processing a few hundred thousand records and finally emitting around 400 000 records.
However, when we use a cluster for the same dataset, we are seeing a throughput of only 200 records per second being updated between the nodes in the cluster, which impacts the overall processing time. Not sure which configuration changes are impacting the throughput so badly. I am sure its something to do with the configuration of either JGroups or Infinispan.
How can this be improved?

Related

Thousands of TimeoutExceptions after switching to Redis Enterprise

We recently attempted to switch from Azure Redis to Redis Enterprise, unfortunately after about an hour we were forced to roll back due to performance issues. We're looking for advice on how to get to the root cause and proceed. Here's what I've figured out so far, but I'm happy to add any more details as necessary.
First off, the client is a .NET Framework app using StackExchange.Redis version 2.1.30. The Azure Redis instance is using 4 shards, and the Redis Enterprise instance is also configured for 4 shards.
When we switched over to Redis Enterprise, we would immediately see several thousand of these exceptions per 5 minute interval:
Timeout performing GET (5000ms), next: GET [Challenges]::306331, inst:
1, qu: 0, qs: 3079, aw: False, rs: ReadAsync, ws: Idle, in: 0,
serverEndpoint: xxxxxxx:17142, mc: 1/1/0, mgr: 9 of 10 available,
clientName: API, IOCP: (Busy=2,Free=998,Min=400,Max=1000), WORKER:
(Busy=112,Free=32655,Min=2000,Max=32767), Local-CPU: 4.5%, v:
2.1.30.38891 (Please take a look at this article for some common client-side issues that can cause timeouts:
https://stackexchange.github.io/StackExchange.Redis/Timeouts)
Looking at this error message, it appears there's tons of things in the WORKER thread pool (things waiting on a response from Redis Enterprise), but nearly nothing in the IOCP thread pool (responses from Redis waiting to be processed by our client code). So, there's some sort of bottleneck on the Redis side.
Using AppInsights, I created a graph of the busy worker threads (dark blue), busy IO threads (red), and CPU usage (light blue). We see something like this:
The CPU never really goes above 20% or so, the IO threads are barely a blip (I think the max is like 2 busy), but the worker threads kinda grow and grow until eventually everything times out and the process starts over again. A little after 7pm is when we decided to roll back to Azure Redis, so everything is great at that point. So, everything points to Redis being some sort of bottleneck. So, let's look at the Redis side of things.
During this time, Redis reported a max of around 5% CPU usage. Incoming traffic topped out around 1.4MB/s, and outgoing traffic topped out around 9.5MB/s. Ops/sec were around 4k. Latency around this time was 0.05ms, and the slowest thing in the SLOWLOG was like 15ms or so. In other words, the Redis Enterprise node was barely breaking a sweat and was easily able to keep up with the traffic being sent to it. In fact, we had 4 other nodes in the cluster that weren't even being used since Redis didn't even see the need to send anything to other nodes. Redis was basically just yawning.
From here, I was thinking maybe there were network bandwidth contraints. All of our VMs are configured for accelerated networking, and we should have 10gig connections to these machines. I decided to run an iperf between the client and the server:
I can transfer easily over 700Mbit/sec between the client and the Redis Enterprise server, yet the server is processing 9.5MB/sec easily. So, it doesn't appear the problem is network bandwidth.
So, here's where we stand:
The same code works great with Azure Redis, yet causes thousands of timeouts when we switch over to Redis Enterprise.
Redis Enterprise is handling 4,000 operations per second and sending out 9 megs a second, and can usually handle a single operation in a fraction of a ms, with the very longest being 15ms.
I can send 700+ Mb/sec between the client and server.
Yet, the WORKER thread pool builds up with pending requests to Redis and eventually times out.
I'm pretty stuck here. What's a good next step to diagnose this issue? Thanks!

RabbitMQ poor performance

We are facing bad performance in our RabbitMQ clusters. Even when idle.
Once installed the rabbitmq-top plugin, we see many processes with very high reductions/sec. 100k and more!
Questions:
What does it mean?
How to control it?
What might be causing such slowness without any errors?
Notes:
Our clusters are running on Kubernetes 1.15.11
We allocated 3 nodes, each with 8 CPU and 8 GB limits. Set vm_watermark to 7G. Actual usage is ~1.5 CPU and 1 GB RAM
RabbitMQ 3.8.2. Erlang 22.1
We don't have many consumers and producers. The slowness is also on a fairly idle environment
The rabbitmqctl status is very slow to return details (sometimes 2 minutes) but does not show any errors
After some more investigation, we found the actual reason was made up of two issues.
RabbitMQ (Erlang) run time configuration by default (using the bitnami helm chart) assigns only a single scheduler. This is good for some simple app with a few concurrent connections. Production grade with 1000s of connections have to use many more schedulers. Bumping up from 1 to 8 schedulers improved throughput dramatically.
Our monitoring that was hammering RabbitMQ with a lot of requests per seconds (about 100/sec). The monitoring hits the aliveness-test, which creates a connection, declares a queue (not mirrored), publishes a message and then consumes that message. Disabling the monitoring reduced load dramatically. 80%-90% drop in CPU usage and the reductions/sec also dropped by about 90%.
References
Performance:
https://www.rabbitmq.com/runtime.html#scheduling
https://www.rabbitmq.com/blog/2020/06/04/how-to-run-benchmarks/
https://www.rabbitmq.com/blog/2020/08/10/deploying-rabbitmq-to-kubernetes-whats-involved/
https://www.rabbitmq.com/runtime.html#cpu-reduce-idle-usage
Monitoring:
http://rabbitmq.1065348.n5.nabble.com/RabbitMQ-API-aliveness-test-td32723.html
https://groups.google.com/forum/#!topic/rabbitmq-users/9pOeHlhQoHA
https://www.rabbitmq.com/monitoring.html

Tinc/SHH/IPSec: tuning for high throughput

I have a dedicated 128GB ram server running memcached. 4 web servers connect to that one. They send a total of around
20k packets/sec.
Recently I decided to change connection from webservers to the memcached server from persistent SSH tunnels to using Tinc (for simplicity of setup and flexibility whenever I needed them to communicate on a new port).
This change has caused the overhead on the network roundtrip to increase significantly (see graphs). I noticed however, that the network overhead of using Tinc in favor of SSH-tunnels is a lot smaller (even faster than the previous SSH-tunnels!), when I use it for communicating between servers (e.g. my Postgresql database server), where the throughput is a lot lower < 10k packet per sec. I tried to distribute the memcached load between more servers, and suddenly the overhead from tinc/network dropped significantly.
Now, I do not understand WHY the tinc network overhead increases so dramatically, as the throughput goes up? It's like I hit some kind of bottle neck (and it defiantly is not CPU, since Newrelic report < 0.5% usage for the tinc process). Is there something I could tune in the Tinc setup, or is Tinc just a bad choice for high throughput? Should I use IPsec instead?

Why is Tomcat not scaling throughput with increased concurrent load?

For this test, I have a simple Java servlet that reads data in and calculates the CRC32 for it. When making serial requests of 512MB each, I get about 600MB/sec. That makes sense since I can't use all 24 cores available to me to calculate a CRC. The program driving this I/O is sitting on the local box to eliminate the possibility of networking issues. I am running Tomcat 8.0.24.0 on FreeBSD using OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode).
Next, I attempt the same test with 6 concurrent requests, expecting that the performance per request might be lower than 600MB/sec, but that the aggregate performance across all 6 requests would be significantly higher.
What I see is the CPU has some idle time at ALL times (so it doesn't appear that I'm CPU-bound). I also see that all processing threads in Tomcat are running concurrently as anticipated. However, it looks like I'm only getting around 800MB/sec in aggregate. The threads in Tomcat spend most of their time waiting to read from the socket, as shown below.
I would appreciate any thoughts on how to improve Tomcat throughput / why so much time is spent waiting for more data (which I assume is what's going on below).
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
at org.apache.tomcat.util.net.NioEndpoint$KeyAttachment.awaitLatch(NioEndpoint.java:1386)
at org.apache.tomcat.util.net.NioEndpoint$KeyAttachment.awaitReadLatch(NioEndpoint.java:1388)
at org.apache.tomcat.util.net.NioBlockingSelector.read(NioBlockingSelector.java:185)
at org.apache.tomcat.util.net.NioSelectorPool.read(NioSelectorPool.java:251)
at org.apache.tomcat.util.net.NioSelectorPool.read(NioSelectorPool.java:232)
at org.apache.coyote.http11.InternalNioInputBuffer.fill(InternalNioInputBuffer.java:133)
at org.apache.coyote.http11.InternalNioInputBuffer$SocketInputBuffer.doRead(InternalNioInputBuffer.java:177)
at org.apache.coyote.http11.filters.IdentityInputFilter.doRead(IdentityInputFilter.java:110)
at org.apache.coyote.http11.AbstractInputBuffer.doRead(AbstractInputBuffer.java:416)
at org.apache.coyote.Request.doRead(Request.java:469)
at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:342)
at org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:395)
at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:367)
at org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:190)
...

NServiceBus Pub/Sub Distributor/Worker Scenario Too Slow

I am working on a proof of concept implementation of NServiceBus v4.x for work.
Right now I have two subscribers and a single publisher.
The publisher can publish over 500 message per second. It runs great.
Subscriber A runs without distributors/workers. It is a single process.
Subscriber B runs with a single distributor powering N number of workers.
In my test I hit an endpoint that creates and publishes 100,000 messages. I do this publish with the subscribers off line.
Subscriber A processes a steady 100 messages per second.
Subscriber B with 2+ workers (same result with 2, 3, or 4) struggles to top 50 messages per second gross across all workers.
It seems in my scenario that the workers (which I ramped up to 40 threads per worker) are waiting around for the distributor to give them work.
Am I missing something possibly that is causing the distributor to be throttled? All Buses are running an unlimited Dev license.
System Information:
Intel Core i5 M520 # 2.40 GHz
8 GBs of RAM
SSD Hard Drive
UPDATE 08/06/2013: I finished deploying the system to a set of servers. I am experiencing the same results. Every server with a worker that I add decreases the performance of the subscriber.
Subscriber B has a distributor on one server and two additional servers for workers. With Subscriber B and one server with an active worker I am experiencing ~80 messages/events per second. Adding in another worker on an additional physical machine decreases that to ~50 messages per second. Also, these are "dummy messages". No logic actually happens in the handlers other than a log of the message through log4net. Turning off the logging doesn't increase performance.
Suggestions?
If you're scaling out with NServiceBus master/worker nodes on one server, then trying to measure performance is meaningless. One process with multiple threads will always do better than a distributor and multiple worker nodes on the same machine because the distributor will become a bottleneck while everything is competing for the same compute resources.
If the workers are moved to separate servers, it becomes a completely different story. The distributor is very efficient at doling out messages if that's the only thing happening on the server.
Give it a try with multiple servers and see what happens.
Rather than have a dummy handler that does nothing, can you simulate actual processing by adding in some sleep time, say 5 seconds. And then compare the results of having a subscriber and through the distributor?
Scaling out (with or without a distributor) is only useful for where the work being done by a single machine takes time and therefore more computing resources helps.
To help with this, monitor the CriticalTime performance counter on the endpoint and when you have the need, add in the distributor.
Scaling out using the distributor when needed is made easy by not having to change code, just starting the same endpoint in distributor and worker profiles.
The whole chain is transactional. You are paying heavy for this. Increasing the workload across machines will really not increase performance when you do not have very fast disk storage with write through caching to speed up transactional writes.
When you have your poc scaled out to several servers just try to mark a messages as 'Express' which does not do transactional writes in the queue and disable MSDTC on the bus instance to see what kind of performance is possible without transactions. This is not really usable for production unless you know where this is not mandatory or what is capable when you have a architecture which does not require DTC.