I have simple RabbitMQ cluster with 2 physical identical linux nodes: (CentOS, RabbitMQ 3.1.5, Erlang R15B, 2GB Ram, CPU 1xCore). Mirroring and synchronization of nodes is turned on.
I have two problems which bothers me:
In a normal situation everything is fine, but after restarting one of the nodes(by stop_app and start_app in the commandline) the whole cluster becomes unavaible to producers and consumers - I can't produce or receive messages from a queue during synchronization. Is this situation normal?
During synchronization I observed very high CPU load (almost 100%) on the slave node(that which was restarted). I measured the speed of synchronization - it's dramatic low (synchronization of 2 millions of messages takes above 3 hours). It's strange because producing of such amount takes much less. Is this situation normal too?
I've recently been tasked with looking into RabbitMQ at work and so have been deep in the documentation.
When synchronising this is the case. This is an extract from the RabbitMQ HA documentation here.
If a queue is set to automatically synchronise it will synchronise
whenever a new slave joins - becoming unresponsive until it has done
so.
If the messages are being read-from and persisted-to disk (either through choice or through memory limitations) there may be overhead there. You can see a chart on this blog entry (it's the last chart before the comments) which indicates that there are performance changes when reading from and writing to queues of that many messages. These charts are for older versions of RabbitMQ but I've not seen anything more recent.
Hope this helps!
Related
I'm using Rebus to communicate with rabbitmq and I've configured to use 4 workers with 4 max parallelism. Also I notice that the prefetch count is 1 (probably to have an even balancing between consumers).
My issue is happening when I have spike with let's say 1000 messages for instance, i notice high CPU usage on a kubernetes environment with 2 containers where the CPU is limited to let's say 0.5 CPU.
I've profiled the application with dotTrace and it's showing that the cause is at cancenationToken.WaitHandle.wait method inside the DefaultBackOffStrategy class.
My question is if the initial setup of workers/parallelism is making this happen or i need to tweek something in Rebus. I've also tried to change the prefetch count for each consumer but on the rabbitmq management UI this doesn't change the default value (which is one).
Also with the CPU profiler from visual studio and looking at the .net counters I notice some lock contention counters (can this be related)
Why is the CPU usage so high is my question at this point and a way to solve this properly.
Thanks for any help given.
Need some help in diagnosing and tuning the performance of my Redis set up (2 redis-server instances on an Ubuntu 14.04 machine). Note that a write-heavy Django web application shares the VM with Redis. The machine has 8 cores and 25GB RAM.
I recently discovered that background saving was intermittently failing (with a fork() error) even when RAM wasn't exhausted. To remedy this, I applied the setting vm.overcommit_memory=1 (was previously default).
Moreover vm.swappiness=2, vm.overcommit_ratio=50. I have disabled transparent huge pages in my set up as well via echo never > /sys/kernel/mm/transparent_hugepage/enabled (although haven't done echo never > /sys/kernel/mm/transparent_hugepage/defrag).
Right after changing the overcommit_memory setting, I noticed that I/O utilization went from 13% to 36% (on average). I/O operations per second doubled, the redis-server CPU consumption has more than doubled, and the memory it's consuming has gone up 66%. Consequently, the server response time has substantially gone up . This is how abruptly things escalated after applying vm.overcommit_memory=1:
Note that redis-server is the only ingredient showing escalation - gunicorn, nginx ,celery etc. are performing like before. Moreover, redis has become very spikey.
Lastly, New Relic has started showing me 3 redis instances instead of 2 (bottom most graph). I think the forked child is counted as the 3rd:
My question is: how can I diagnose and salvage performance here? Being new to server administration, I'm unsure how to proceed. Help me find out what's going on here and how I can fix it.
free -m has the following output (in case needed):
total used free shared buffers cached
Mem: 28136 27912 224 576 68 6778
-/+ buffers/cache: 21064 7071
Swap: 0 0 0
As you don't have swap enabled in your system ( which might be worth reconsidering if you have SSDs), ( and your swappiness was set to a low value), you can't blame it on increased swapping due to memory contention.
Your caching about 6GB of data inside the VFS cache. In case of contention this cache would have depleted in favor of process working memory, so I believe it's safe to say memory is not an issue all together.
It's a shot in the dark, but my guess is that your redis-server is configured to "sync"/"save" too often ( search for in the redis config file "appendfsync"), and that by removing the memory allocation limitation, it now actually does it's job :)
If the data is not super crucial, set appendfsync to never and perhaps tweek the save settings to cause less frequent saving.
BTW, regarding the redis & forked child, I believe you are correct.
When running in a cluster, if something wrong happens, a worker generally dies (JVM shutdown). It can be caused by many factors, most of the time it is a challenge (the biggest difficulty with storm?) to find out what causes the crash.
Of course, storm-supervisor restarts dead workers and liveness is quite good within a storm cluster, still a worker crash is a mess that we should avoid as it adds overhead, latency (can be very long until a worker is found dead and respawned) and data loss if you didn't design your topology to prevent that.
Is there an easy way / tool / methodology to check when and possibly why a storm worker crashes? They are not shown in storm-ui (whereas supervisors are shown), and everything needs manual monitoring (with jstack + JVM opts for instance) with a lot of care.
Here are some cases that can happen:
timeouts and many possible reasons: slow java garbage collection, bad network, bad sizing in timeout configuration. The only output we get natively from supervisor logs is "state: timeout" or "state: disallowed" which is poor. Also when a worker dies the statistics on storm-ui are rebooted. As you get scared of timeouts you end up using long ones which does not seem to be a good solution for real-time processing.
high back pressure with unexpected behaviour, starving worker heartbeats and inducing a timeout for instance. Acking seems to be the only way to deal with back pressure and needs good crafting of bolts according to your load. Not acking seems to be a no-go as it would indeed crash workers and get bad results in the end (even less data processed than an acking topology under pressure?).
code runtime exceptions, sometimes not shown in storm-ui that need manual checking of application logs (the easiest case).
memory leaks that can be found out with JVM dumps.
The storm supervisor logs restart by timeout.
you can monitor the supervisor log, also you can monitor your bolt's execute(tuple) method's performance.
As for memory leak, since storm supervisor does kill -9 the worker, the heap dump is likely to be corrupted, so i would use tools that monitor your heap dynamically or killing the supervisor to produce heap dumps via jmap. Also, try monitoring the gc logs.
I still recommend increasing the default timeouts.
If I were to design a huge distributed system whose throughput should scale linearly with the number of subscribers and number of channels in the system, which would be better ?
1) Redis Cluster (only for Redis 3.0 alpha, if its in cluster mode, you can publish in one node and subscribe in another completely different node, and the messages will propagate and reach you). The complexity of Publish is O(N+M), where N is the number of subscribed clients and M is the number of subscribed patterns in the system, but how does it scale when in a Redis Cluster ? I accept educated guesses on this.
2) ZeroMQ since 3.x, it does server-side filtering, so it also has some time complexity there, but I have not seen anything about it in the documentation. If I wanted to scale it, I could just have lots of servers publishing to whatever channels, and each subscriber would connect to all the servers, and subscribe for the desired channel. That seems nice.
So which of those is better for horizontal scaling of a huge publisher system ? What are other solutions I should look into ? Remember, I want to minimize latency and throughput, but being able to scale horizontally.
You want to minimize latency, I guess. The number of channels is irrelevant. The key factors are the number of publishers and number of subscribers, message size, number of messages per second per publisher, number of messages received by each subscriber, roughly. ZeroMQ can do several million small messages per second from one node to another; your bottleneck will be the network long before it's the software. Most high-volume pubsub architectures therefore use something like PGM multicast, which ZeroMQ supports.
In Redis, like in ZeroMQ, the bottleneck will be the network. Redis can reach millions of messages per second, at least as much if not more than ZeroMQ.
You should be aware that the current implementation of Redis Cluster distributes PUBLISH messages across all cluster nodes using the inter-node bus. This approach assumes that PUBLISH is extremely cheap on Redis (as explained in this issue on Github).
However, there is a small overhead involved which is inter-node communication. As you scale up this overhead will be more significant. There is another Redis Cluster implementation I'm aware of - please note it's a commercial one - in which channels or patterns are distributed across cluster nodes in a similar fashion to the way Redis keys are distributed. At least according to the vendor, this should save the overhead of inter-node communication and increase performance, but I have not benchmarked it myself.
The blocking VM performance is better overall, as there is no time lost in
synchronization, spawning of threads, and resuming blocked
clients waiting for values. So if you are willing to accept an higher
latency from time to time, blocking VM can be a good pick. Especially
if swapping happens rarely and most of your often accessed data
happens to fit in your memory.
This is default mode of Redis (and the only mode going forward I believe now VM is deprecated in 2.6), leaving the OS to handle paging (if/when required). I am correct in my understanding that it will take some time to get "hot" when booted/started. When working on a 1gb RAM node with a 16gb dataset, does Redis attempt to load it all into virtual memory at boot and thus 90%+ is immediately paged out, and only after some good amount of usages does the above statement hold true?
Redis VM was already deprecated in Redis 2.4, and has been removed in Redis 2.6. It is a dead end: don't use it.
I think you are confusing the blocking VM with OS paging. They are two different things.
OS paging is the default mode of Redis when Redis VM is not configured at all (whatever the blocking mode). The OS will swap Redis memory if it does not fit in physical memory. The event loop can be frozen at any time. When it happens, performance is abysmal because none of the Redis internal data structures is designed for this (no locality, no paging system).
Redis VM can be configured in non blocking mode (using I/O threads). When I/Os are done, the event loop is not blocked, and Redis is still responsive. However, when too many I/Os pile up, the I/O threads will be completely busy, and you end up with a responsive Redis, but unable to process any queries requiring I/Os.
Redis VM can also be configured in blocking mode. In this mode all I/Os are synchronously performed in the main event loop thread. So the event loop is frozen in case of I/O (for instance in case of a key miss). All clients are impacted. However, general performance (CPU consumption and latency) is better than with the non blocking mode because some threading scheduling/synchronization is saved.
In practice, the difference between OS paging and the Redis blocking VM is the granularity level. With Redis VM, the granularity is the key. With OS paging, well it is the page (a 4 KB block which can span on several unrelated keys).
In all 3 cases, the initial load of the dump file will be extremely slow and generate a peak of random I/Os on your system. As you pointed out, most objects will be loaded and then swapped out. The warm-up time will be significant.
Except if you have extreme locality in your data, or if you do not care at all about the latencies, using 1 GB RAM for a 16 GB dataset with the Redis VM is science-fiction IMO.
There is a reason why the Redis VM was phased out. By design, it will never perform as well as a disk-based datastore (which can exploit file mapping or direct I/Os to avoid the double buffering, and use adapted data structures like B-trees).
Redis as an in-memory store is excellent. But if you need to store something which is bigger than RAM, don't use it. Other (disk-based) stores will all perform much better.