I have a datapipeline component that reads SQS messages generated by an S3 upload trigger, then parses each message and publishes it for a batchpipeline component.
I have recently observed that in the production system, the datapipeline keeps crashing with an OutOfMemoryError under heavy load, but it never crashes when tested locally with a similar load. The batchpipeline, on the other hand, never crashes in production.
How do I go about debugging it when I can't reproduce it locally?
Having found a solution to the problem above after 2 weeks, I figured I'd document it for others and my future self.
I wasn't able to replicate the issue because the aws command-
aws s3 cp --recursive dir s3://input-queue/dir
somehow wasn't uploading files fast enough to generate the message volume needed to stress my local datapipeline. So I shut down the datapipeline and, once there were 10k SQS messages waiting in the queue, started it again; as expected, it crashed with an OutOfMemoryError after processing ~3,000 messages. It turned out that the pipeline could handle a continuous throughput, but it broke when it started against a 10k message backlog.
My hypothesis was that the issue was occurring because the Java garbage collector was unable to reclaim objects after execution. So I started analyzing the generated heap dump, and after a few days of research I stumbled on the likely root cause of the OutOfMemoryError: there were ~5,000 instances of my MessageHandlerTask class, when ideally they should have been GC'd after being processed rather than piling up.
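As an aside, if a dump isn't captured at the moment of the crash, the JVM can be told to write one automatically when the OOM happens. This is standard HotSpot tooling rather than anything specific to my setup, and the jar name below is just a placeholder:

java -Xmx1536m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/datapipeline.hprof -jar datapipeline.jar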
Further investigation along that line of thought led me to the root cause: the code was using Executors.newFixedThreadPool() to create the ExecutorService that tasks were submitted to. That factory method backs the pool with an unbounded queue of tasks, so if tasks are submitted faster than they can be executed, all of them wait in the queue, taking up a huge amount of memory.
That was exactly what was happening: messages were being polled faster than they could be processed, so a large number of perfectly valid MessageHandlerTask instances were created and filled up the heap whenever there was a message backlog.
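For illustration, the problematic wiring looked roughly like the sketch below. Only MessageHandlerTask and Executors.newFixedThreadPool() come from the actual code; the polling loop, pool size, queue URL, and SQS client wiring (AWS SDK for Java v1) are my assumptions.

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DataPipelinePoller {

    // Stand-in for the real task class; the actual parsing/publishing logic is omitted.
    static class MessageHandlerTask implements Runnable {
        private final Message message;
        MessageHandlerTask(Message message) { this.message = message; }
        @Override public void run() { /* parse the message and publish it for the batchpipeline */ }
    }

    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/input-queue"; // hypothetical

        // newFixedThreadPool() is backed by an unbounded LinkedBlockingQueue.
        ExecutorService executor = Executors.newFixedThreadPool(20);

        while (true) {
            for (Message message : sqs.receiveMessage(queueUrl).getMessages()) {
                // submit() never blocks and never rejects, so with a 10k backlog thousands of
                // MessageHandlerTask instances pile up in the work queue and fill the heap.
                executor.submit(new MessageHandlerTask(message));
            }
        }
    }
}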
The fix was to create a ThreadPoolExecutor with an ArrayBlockingQueue of capacity 100, so that there is a cap on the number of MessageHandlerTask instances (and their member variables) held in memory.
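A minimal sketch of the bounded replacement is below. The pool size is a parameter, and the rejection policy is my assumption (I haven't described above how a full queue is handled); CallerRunsPolicy is one option that throttles the polling thread instead of dropping tasks.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedExecutors {

    // Replacement for Executors.newFixedThreadPool(): same fixed-size pool, but the work
    // queue is capped at 100 tasks, so queued MessageHandlerTask instances (and their
    // member variables) can no longer pile up on the heap.
    public static ExecutorService newBoundedFixedThreadPool(int poolSize) {
        return new ThreadPoolExecutor(
                poolSize, poolSize,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(100),
                // Assumption: when the queue is full, the submitting (polling) thread runs
                // the task itself, which naturally slows down polling.
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}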
Having figured out the fix, I moved on to optimizing the pipeline for maximum throughput by varying the maximumPoolSize of the ThreadPoolExecutor. It turned out that some SQS connection exceptions occurred at higher thread counts, and further investigation revealed that increasing the SQS connection pool size alleviated the issue.
I ultimately settled on 40 threads for the given Xmx heap size of 1.5G and an SQS connection pool size of 80, so that the task threads do not run out of SQS connections while processing. This helped me achieve a throughput of 44 messages/s with just a single datapipeline instance.
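How the connection pool is sized depends on which SDK is in use. Assuming the AWS SDK for Java v1 (where the default maximum is 50 connections), the client wiring would look something like this:

import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;

public class SqsClientFactory {

    // 40 task threads share this client, so an 80-connection HTTP pool leaves headroom
    // for receive/delete calls without threads blocking while waiting for a free connection.
    public static AmazonSQS newTunedClient() {
        ClientConfiguration config = new ClientConfiguration()
                .withMaxConnections(80); // SQS connection pool size from the tuning above
        return AmazonSQSClientBuilder.standard()
                .withClientConfiguration(config)
                .build();
    }
}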
I also found out why the batchpipeline never crashed in production despite using a similar ExecutorService implementation: the datapipeline could be stressed by a burst of concurrent S3 uploads, but the messages for the batchpipeline were produced gradually by the datapipeline. Besides, the batchpipeline had a much higher throughput, which I benchmarked at 347 messages/s with a maximumPoolSize of 70.
Related
We have a performance issue with our RabbitMQ installation in production.
To explain the context: we have an initialization batch that creates around 60k messages. For business reasons, those messages must be processed in strict order and we can't lose any. As such, we have a single queue, which is durable and lazy, and one consumer (Spring Boot AMQP) with a prefetch of 10. Both are on the same virtual machine.
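For reference, the setup described boils down to roughly the following Spring AMQP declaration. This is a sketch under assumptions: the queue name is made up, and the prefetch of 10 is assumed to be set via the standard Spring Boot property.

import org.springframework.amqp.core.Queue;
import org.springframework.amqp.core.QueueBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class QueueConfiguration {

    // Durable, lazy queue: messages are kept on disk rather than in memory.
    // The consumer's prefetch of 10 is set separately, e.g. via
    // spring.rabbitmq.listener.simple.prefetch=10 in application.properties.
    @Bean
    public Queue initializationQueue() {
        return QueueBuilder.durable("initialization.queue") // hypothetical name
                .withArgument("x-queue-mode", "lazy")
                .build();
    }
}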
At first, the processing is fast enough, around 5 to 10 messages per second. But it progressively slows down until it reaches a cap of fewer than 20 messages per hour. It takes approximately 1 hour to reach this point.
After some investigation, we found that the problem comes from RabbitMQ: when we simply stop and restart it, performance goes back to normal and then slowly drops again. Restarting just the consumer doesn't change anything.
I suspect some resource bottleneck, but I can't figure out which one, as RAM, CPU, and disk all look fine. I am not really familiar with the Erlang virtual machine or with managing RabbitMQ itself, so I may have missed something.
Does someone have an idea of the source of the problem, or where I could look for more information on what is happening?
RabbitMQ characteristics:
Erlang 23.3.2
RabbitMQ 3.8.14
I am using WebLogic 10.3.6 with 8 managed servers and a session timeout configured as 600 seconds. The issue with my application is that when a session times out after 600 seconds (I receive STUCK thread alerts, which are also configured), I see slowness in my application. My question is:
Will all threads be impacted because of one STUCK thread (the STUCK thread was due to a DB transaction timeout)?
I assume they will not be, but wanted to confirm.
It depends on your application. In general, no, but if, for example, the stuck thread is holding a lock on a resource (database, file, etc.) used by other requests, those requests may be affected too. Also, depending on what the stuck thread is doing, it may use excessive resources (CPU, memory, disk, etc.). I suggest investigating why the thread is taking so long and whether the underlying cause can be fixed.
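To make the lock scenario concrete, here is a minimal, hypothetical sketch (none of these names come from the original application) of how one thread stuck in a slow DB call can stall every other request that needs the same lock:

import java.util.concurrent.TimeUnit;

public class StuckThreadDemo {

    private static final Object SHARED_LOCK = new Object();

    // Simulates a request thread stuck in a long DB transaction while holding a lock.
    static void slowDbRequest() {
        synchronized (SHARED_LOCK) {
            try {
                TimeUnit.SECONDS.sleep(600); // stand-in for a DB call hitting the transaction timeout
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Every other request that needs the same lock blocks behind the stuck thread.
    static void normalRequest() {
        synchronized (SHARED_LOCK) {
            // fast work would happen here
        }
    }

    public static void main(String[] args) {
        new Thread(StuckThreadDemo::slowDbRequest).start();
        for (int i = 0; i < 5; i++) {
            new Thread(StuckThreadDemo::normalRequest).start(); // these pile up waiting on SHARED_LOCK
        }
    }
}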
We are using RabbitMQ (3.6.6) to send analyses (millions of them) to different analyzers. These are very quick, and we were planning to use the delayed-message plugin to schedule monitoring tasks on the analyzed elements.
We were thinking about the rabbitmq-delayed-exchange-plugin; we have already run some tests and we need some clarification.
Currently:
We are scheduling millions of messages
Delays range from a few minutes to 24 hours
As previously said, these are tests, so we are using a machine with one core and 4 GB of RAM, which also has other apps running on it.
What happened with the high memory watermark set at 2.0 GB:
RabbitMQ eventually (after a day or so) starts consuming 100% CPU (the machine has only one core) and does not respond to the management interface or rabbitmqctl. This goes on for at least 18 hours (we always end up killing it, deleting the delayed-exchange Mnesia file on disk, about 100-200 MB, and restarting).
What happened with the high memory watermark set at 3.6 GB:
RabbitMQ was killed by the kernel because of high memory usage (4 GB of hardware RAM) after about a week of working like this.
The Mnesia file for the delayed exchange is about 1.5 GB.
RabbitMQ cannot start anymore, giving the trace below (we are assuming that, because it was terminated with a KILL, the messages in the delayed exchange somehow ended up corrupted):
{could_not_start,rabbit,
rabbitmq-server[12889]: {{case_clause,{timeout,['rabbit_delayed_messagerabbit#rabbitNode']}},
rabbitmq-server[12889]: [{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,
And right now we are asking ourselves: are we in over our heads using the RabbitMQ delayed exchange plugin for this volume of messages? If we are, then that's the end of it; we'll rethink and start over. But if not, what would be an appropriate hardware and/or configuration setup?
The RabbitMQ delayed exchange plugin is not designed to store millions of messages.
This is also documented on the plugin page:
Current design of this plugin doesn't really fit scenarios with a high
number of delayed messages (e.g. 100s of thousands or millions). See
#72 for details.
Read also here: https://github.com/rabbitmq/rabbitmq-delayed-message-exchange/issues/72
This plugin is often used as if RabbitMQ was a database. It is not.
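For context, publishing through the plugin looks roughly like the sketch below (exchange name, routing key, and host are made up). Every pending message, together with its x-delay header, is held by the plugin in a Mnesia table until the delay expires, which is why millions of scheduled messages exhaust memory and disk:

import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import java.util.HashMap;
import java.util.Map;

public class DelayedPublish {

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // hypothetical broker address
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        // The plugin's exchange type; the underlying routed-to type is given via x-delayed-type.
        Map<String, Object> exchangeArgs = new HashMap<>();
        exchangeArgs.put("x-delayed-type", "direct");
        channel.exchangeDeclare("monitor.delayed", "x-delayed-message", true, false, exchangeArgs);

        // Each published message carries its delay in milliseconds and sits in the
        // plugin's Mnesia store until that delay elapses.
        Map<String, Object> headers = new HashMap<>();
        headers.put("x-delay", 24 * 60 * 60 * 1000); // 24-hour delay
        AMQP.BasicProperties props = new AMQP.BasicProperties.Builder()
                .headers(headers)
                .build();
        channel.basicPublish("monitor.delayed", "analysis.monitor", props, "payload".getBytes());

        channel.close();
        connection.close();
    }
}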
As advised on the activemq-performance-module-users-manual webpage, I've tested (on an Intel i7 laptop with Windows 7 and an SSD drive) the performance of producing persistent messages on an ActiveMQ queue:
mvn activemq-perf:producer -Dproducer.destName=queue://TEST.FOO -Dproducer.deliveryMode=persistent
against a default installation of ActiveMQ 5.12.1.
The performance I got is around 300-400 messages per second.
On the activemq-performance page I have read much higher numbers:
When running the server on one box and a single producer and consumer thread in separate VMs on the other box, using a single topic we got around 21-22,000 messages/second using 1-2K messages.
On the other hand, when the messages are not persistent (-Dproducer.deliveryMode=nonpersistent), the producer's performance grows to 49,000 messages per second.
When the messages are sent asynchronously:
-Dproducer.deliveryMode=persistent -Dfactory.useAsyncSend=true
I get around 23000 messages sent per second.
From what I see here, stackoverflow-activemq-persistent-performance-on-different-operating-systems, it makes a difference which OS ActiveMQ runs on.
Can somebody give me some tips for getting better performance when writing persistent ActiveMQ messages?
The performance of sending persistent messages is all about disk-based I/O, as the message must be written to disk before the broker signals the client that the send has completed. The faster the disk, the better your throughput will be, all else being equal.
To work around some of this, you can send persistent messages in transactional batches, so that the individual sends complete immediately and the synchronization point is reduced to the transaction boundary.
Depending on the size of the text messages, you can also gain some performance by using compression; this can be turned on via an option on the ActiveMQConnectionFactory.
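Putting both suggestions together, a rough sketch against a local broker might look like the following (broker URL, queue name, and batch size are placeholders, not recommendations):

import javax.jms.Connection;
import javax.jms.DeliveryMode;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

public class PersistentBatchSender {

    public static void main(String[] args) throws Exception {
        ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        factory.setUseCompression(true);  // compress text payloads before sending
        // factory.setUseAsyncSend(true); // equivalent of -Dfactory.useAsyncSend=true

        Connection connection = factory.createConnection();
        try {
            connection.start();
            // Transacted session: the disk sync point moves from every send to every commit.
            Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
            Queue queue = session.createQueue("TEST.FOO");
            MessageProducer producer = session.createProducer(queue);
            producer.setDeliveryMode(DeliveryMode.PERSISTENT);

            int batchSize = 100; // tunable
            for (int i = 1; i <= 1000; i++) {
                producer.send(session.createTextMessage("message " + i));
                if (i % batchSize == 0) {
                    session.commit(); // one synchronization point per batch
                }
            }
            session.commit(); // flush the final partial batch
        } finally {
            connection.close();
        }
    }
}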
Recently, our production WebLogic has been taking too much time to process queues. Besides investigating the queues, DB queries, and other things, I thought I would look into any known memory and concurrency issues in WebLogic.
Does anyone know of any?
Summary of the problem:
We had 2 queues and 8-9 clusters. One of the queues was down for some reason, the other queue started to pile up, and WebLogic took forever to process it. DB I/O increased, and so did CPU consumption.
We had a similar production issue recently.
Check whether Flow Control is set at the connection factory level. With this setting, WebLogic can throttle message production when it sees that the queue is being overloaded.
WebLogic's checklist of things to do when you have a large message backlog is useful to compare against your own scenario.