Recently, our production WebLogic server has been taking too long to process queues. Besides investigating the queues, DB queries and so on, I wanted to look into any known memory and concurrency issues in WebLogic.
Does anyone know of any?
Summary of the problem:
We had 2 queues and 8-9 clusters. One of the queues went down for some reason, the other queue started to pile up, and WebLogic took forever to process it. DB I/O and CPU consumption increased as well.
We had a similar production issue recently.
Check if Flow Control is set at the connection factory level. With this setting, WebLogic can throttle message production when it sees that the queue is being overloaded.
WebLogic's checklist of things to do when you have a large message backlog is useful to compare against your own scenario.
I am trying to figure out the main reasons for stuck threads. WebLogic Server diagnoses a thread as stuck if it is continually working (not idle) for a set period of time. A user can tune a server's stuck thread detection behavior by changing the length of time before a thread is diagnosed as stuck (Stuck Thread Max Time), and by changing the frequency with which the server checks for stuck threads. My analysis is that it is caused either by contention or by other factors such as slow I/O or slow backends (DB queries, web services, RMI calls); rarely it is caused by bad coding or huge data (infinite loops).
Other than the reasons above, are there more reasons for a thread to get stuck?
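The detection rule described in the question can be sketched in a few lines. This is a minimal Python illustration, not WebLogic's implementation: the names `WorkerState`, `find_stuck` and `first_detection_time` are hypothetical, and the 600-second threshold and 60-second check interval are placeholder values chosen for the example.

```python
import math
from dataclasses import dataclass
from typing import List, Optional

STUCK_THREAD_MAX_TIME_S = 600  # placeholder threshold for this sketch
CHECK_INTERVAL_S = 60          # placeholder check frequency for this sketch


@dataclass
class WorkerState:
    name: str
    busy_since: Optional[float]  # time the current request started; None if idle


def find_stuck(workers: List[WorkerState], now: float,
               max_time: float = STUCK_THREAD_MAX_TIME_S) -> List[str]:
    """Return names of workers that have been busy for at least max_time seconds."""
    return [w.name for w in workers
            if w.busy_since is not None and now - w.busy_since >= max_time]


def first_detection_time(busy_since: float,
                         max_time: float = STUCK_THREAD_MAX_TIME_S,
                         interval: int = CHECK_INTERVAL_S) -> int:
    """Earliest periodic check (at multiples of interval) that flags the worker."""
    return math.ceil((busy_since + max_time) / interval) * interval
```

Because the check only runs periodically, a thread that crosses the threshold between two checks is reported at the next check, so both settings influence when an alert appears.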
Not sure what your question is here, but here are my 2 cents:
Bad coding can lead to stuck threads.
Say a developer uses a singleton map or hash that all servlets need to access. Under high load this can lead to contention for that resource and easily result in stuck threads.
Stuck threads can be caused by a slow-running server (high CPU).
Sometimes bugs in WLS can cause it to be busy with internal processes, resulting in stuck threads, for example WLS stuck in cluster communication.
You can even have stuck threads when the Admin server is waiting to hear from a managed server that failed.
The list can go on and on. Only by taking 3-4 thread dumps in a short span of time can one confirm the cause.
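One rough way to compare a handful of thread dumps is sketched below. It assumes the dumps have already been reduced to a mapping from thread name to its top stack frame; parsing a real dump is left out, and the function names as well as the thread and frame names in the usage note are made up for illustration.

```python
from collections import Counter
from typing import Dict, List


def threads_not_moving(dumps: List[Dict[str, str]]) -> Dict[str, str]:
    """Return {thread: top_frame} for threads whose top frame is identical in every dump."""
    if len(dumps) < 2:
        return {}  # a single dump cannot show whether a thread is making progress
    first = dumps[0]
    return {name: frame for name, frame in first.items()
            if all(d.get(name) == frame for d in dumps[1:])}


def hot_frames(dumps: List[Dict[str, str]]) -> Counter:
    """Count how many non-moving threads sit on each frame."""
    return Counter(threads_not_moving(dumps).values())
```

For example, if `ExecuteThread-1` and `ExecuteThread-2` both show a hypothetical `SharedCache.get` frame in every dump, `hot_frames` reports that frame with a count of 2, which points toward contention on a shared resource rather than a single slow backend call.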
I am using WebLogic 10.3.6 with 8 managed servers configured with a session timeout of 600 seconds. I have an issue with my application: when a session times out after 600 seconds (I receive STUCK alerts, which are also configured), the application becomes slow. My question is:
Will all threads be impacted because of one STUCK thread (the STUCK thread was due to a DB transaction timeout)?
I assume not, but wanted to confirm.
Depends on your application. In general no, but if, for example, the stuck thread is holding a lock on an object (database, file, etc.) that other requests need, those requests may be affected too. Also, depending on what the stuck thread is doing, it may use excessive resources (CPU, memory, disk, etc.). I suggest investigating why the thread is taking so long and if it's possible to
We are using RabbitMQ (3.6.6) to send analyses (millions) to different analyzers. These are very quick, and we were planning to use the rabbit-message-plugin to schedule monitoring of the analyzed elements.
We were considering rabbitmq-delayed-exchange-plugin, have already run some tests, and need some clarification.
Currently:
We are scheduling millions of messages
Delays range from a few minutes to 24 hours
As previously said, these are tests, so we are using a machine with one core and 4 GB of RAM which also has other apps running on it.
What happened with the high memory watermark set to 2.0G:
RabbitMQ eventually (after a day or so) starts consuming 100% CPU (only one core) and responds to neither the management interface nor rabbitmqctl. This goes on for at least 18 hours (we always end up killing it, deleting the Mnesia delayed file on disk, about 100-200 MB, and restarting).
What happened with the high memory watermark set to 3.6G:
RabbitMQ was killed by the kernel because of high memory usage (4 GB hardware) after about a week of running like this.
The Mnesia file for the delayed exchange is about 1.5G
RabbitMQ cannot start anymore and gives the trace below (we assume that, because it was terminated by a KILL, the messages in the delay store somehow ended up corrupted):
{could_not_start,rabbit,
rabbitmq-server[12889]: {{case_clause,{timeout,['rabbit_delayed_messagerabbit#rabbitNode']}},
rabbitmq-server[12889]: [{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,
And right now we are asking ourselves: are we in over our heads using the RabbitMQ delayed exchange plugin for this volume of information? If we are, then that is the end of the problem: rethink and restart. But if not, what would be an appropriate hardware and/or configuration setup?
The RabbitMQ delayed exchange plugin is not designed to store millions of messages.
This is also documented on the plugin page:
Current design of this plugin doesn't really fit scenarios with a high number of delayed messages (e.g. 100s of thousands or millions). See #72 for details.
Read also here: https://github.com/rabbitmq/rabbitmq-delayed-message-exchange/issues/72
This plugin is often used as if RabbitMQ was a database. It is not.
In our current project we use 8 worker role machines side by side that work a little differently than Azure may expect.
Short outline of the system:
each worker starts up to 8 processes that connect to the cloud queue and process messages
each process accesses three different cloud queues to collect messages for different purposes (delta recognition, backup, metadata)
each message leads to a WCF call to an ERP system to gather information, and the retrieved response is finally added to a Redis cache
this approach was chosen over many smaller machines due to cost and performance. While 24 one-core machines would manage about 400 calls/s to the ERP system, 8 four-core machines with 8 processes each do over 800 calls/s.
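To make the comparison concrete, the figures above can be broken down per machine and per core. The short Python sketch below uses only the numbers quoted in the outline (treated as round values); the function name is hypothetical, and the last line assumes purely linear scaling, which is an assumption for illustration only.

```python
import math


def throughput_breakdown(calls_per_s: float, machines: int, cores_per_machine: int) -> dict:
    """Split an aggregate call rate into per-machine and per-core rates."""
    cores = machines * cores_per_machine
    return {"per_machine": calls_per_s / machines, "per_core": calls_per_s / cores}


small = throughput_breakdown(400, machines=24, cores_per_machine=1)
large = throughput_breakdown(800, machines=8, cores_per_machine=4)

# Machines needed for 1200 calls/s if throughput scaled linearly (an assumption).
machines_for_target = math.ceil(1200 / large["per_machine"])
```

Under these figures the one-core setup delivers roughly 16.7 calls/s per core and the four-core setup 25 calls/s per core, and a linear extrapolation to 1200 calls/s would need 12 of the larger machines.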
Now to the question: when we increased the number of machines further to raise performance to 1200 calls/s, we experienced outages of Cloud Queue. At the same moment, 80% of the machines' processes stopped processing messages.
Here we have two problems:
Remote debugging is not possible for these processes, but it was possible to use dile to get some information out.
We use the GetMessages method of Cloud Queue to get up to 4 messages from the queue. Cloud Queue always answers with 0 messages. Reconnecting to the cloud queue does not help.
Restarting the workers does help, but shortly leads to the same problem.
Are we hitting the natural end of scalability of Cloud Queue and should switch to Service Bus?
Update:
I have not been able to fully understand the problem; I described it in "the natural borders of Cloud Queue".
To summarize:
The count of TCP connections was impressive. Actually too impressive (multiple hundreds).
Going back to the original memory size let the system operate normally again.
In my experience I have been able to get better raw performance out of Azure Cloud Queues than Service Bus, but Service Bus has better enterprise features (reliability, topics, etc.). Azure Cloud Queue should process up to 2K messages/second per queue.
https://azure.microsoft.com/en-us/documentation/articles/storage-scalability-targets/
You can also try partitioning to multiple queues if there is some natural partition key.
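A minimal sketch of such partitioning is shown below: a stable hash of the partition key selects one of several queues, so the same key always lands on the same queue. The function name and the queue names in the usage note are hypothetical, and no Azure SDK call is involved; the code only decides which queue name to use.

```python
import hashlib
from typing import Sequence


def pick_queue(partition_key: str, queue_names: Sequence[str]) -> str:
    """Map a partition key to one queue name using a stable hash."""
    if not queue_names:
        raise ValueError("at least one queue name is required")
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(queue_names)
    return queue_names[index]
```

For example, `pick_queue("customer-42", ["work-0", "work-1", "work-2"])` returns the same queue name on every call and on every machine, because SHA-256 does not depend on the process (unlike Python's built-in `hash` for strings).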
Make sure that your process doesn't have some sort of thread deadlock that is the real culprit. You can test this by connecting to the queue when it appears hung and trying to pull messages from it. If that works, the problem is your process, not the queue.
Also take a look at this to set up some other monitors:
https://azure.microsoft.com/en-us/documentation/articles/storage-monitor-storage-account/
It took some time to solve this issue:
First, a summary of the usage of the storage account:
We used the blob storage once a day pretty heavily.
The "normal" diagnostics that Azure provides out of the box also used the same storage account.
Some controlling processes used small tables to store and read information once an hour for about 20 minutes.
There may be up to 800 calls/s that try to increment a number to count calls to an ERP system.
When we recognized that the storage account was under heavy load, we split it up.
Now there are three physical storage accounts having 2 queues.
The original one still handles up to 800 calls/s for incrementing counters
Diagnostics are still on the original one
Controlling information has also been moved
The system has now been running for 2 weeks, working like a charm. There are several things we learned from that:
No, the infrastructure is "not just there" and it doesn't scale endlessly.
Even if we thought we didn't use "that much", taken together we used it quite heavily and in an uncontrolled way.
There are no "best practices" anywhere on the net that tell the complete story. Especially when starting to work with the storage account, a guide from MS would be quite helpful.
Exception handling in storage is quite bad. Even if the storage account is overused, I would expect some kind of exception and not just zero messages returned without any surrounding information.
Read complete story here: natural borders of cloud storage scalability
UPDATE:
Scalability is influenced by many factors. You may be interested in "Azure Service Bus: Massive count of listeners and senders" to be aware of some more pitfalls.
I am performance testing a piece of my code that works with ActiveMQ.
I use virtual topics there. When I send about 1000 concurrent requests to enqueue my messages, it takes ages to enqueue them all, and sometimes it just hangs partway and resumes after some time.
I am using the JDBC message store; I know some of the performance impact might be because of that.
Is this performance hit mainly due to virtual topics? On the ActiveMQ website they specify very high performance for topics (under ideal conditions, of course).
P.S.: 1 message takes almost 13-15 milliseconds to be enqueued and dequeued, which is far higher than the performance ActiveMQ claims to have:
http://activemq.apache.org/performance.html
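To put the 13-15 ms figure in perspective, it can be converted into a throughput. The Python sketch below assumes messages are handled strictly one after another with no pipelining or parallelism, which is a simplification; the function names are hypothetical.

```python
def messages_per_second(latency_ms: float) -> float:
    """Sequential throughput implied by a per-message round-trip latency."""
    return 1000.0 / latency_ms


def seconds_for_batch(message_count: int, latency_ms: float) -> float:
    """Time to push a batch through one message at a time."""
    return message_count * latency_ms / 1000.0


best = messages_per_second(13)    # roughly 77 messages/s
worst = messages_per_second(15)   # roughly 67 messages/s
batch_time = seconds_for_batch(1000, 15)
```

Under this assumption, 1000 messages take between 13 and 15 seconds end to end, so the measured latency alone accounts for a rate of only about 67 to 77 messages per second.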
The performance hit is mainly because of the JDBC message store. Virtual Topics do not differ much in performance compared to durable subscriptions.
Please use LevelDB or KahaDB if you want performance. The JDBC store is mainly there for compatibility with setups that already use failover-secured databases with backups etc. and want to use them for messages as well. You won't come even close to the numbers on the performance page with plain JDBC.