We have an infinispan cluster of two. Some days ago one of the hosts crashed, taking down one of the host-controllers.
Somehow as a result of this, we saw an unacceptable increase in Infinispan response time. It went from a couple of milliseconds to 15-20 seconds.
I am trying to investigate this, and cannot find anything. There are timeouts and exceptions in the log files, but nothing that is specific to this period.
My client timeouts are 10 seconds, but the spike is almost double that time.
We have a gap in our apache logs for approximately 45 minutes, after which follows an unusually high burst of log activity.
Early in the morning we normally get a few hundred requests per hour. Traffic was normal, then the logs went quiet for 45 minutes (during which people reported an inability to log in). After that, 4000 requests were written to the logs within a few minutes.
Is this consistent with the assumption that one or more runaway processes blocked the execution of other processes? Because Apache writes a log entry only after a request has completed, nothing got logged until the logjam was broken.
Is that a fair conclusion?
Yes, your assumption is indeed reasonable - we had a situation like this a couple of years back at the place where I work. If all Apache threads are blocked by long-running operations, this kind of behaviour may appear until at least one thread is freed up again. It doesn't need to be a "runaway" process per se, either - a heavy load, like one incurred by a DoS attack (or maybe just intensive site traffic), may also produce this "picture".
I admit to only having very basic familiarity with Apache administration, but have you checked whether your setup has sufficient resources to handle the >usual< traffic on the affected site?
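To make the "quiet, then a burst" picture concrete, here is a rough simulation. It is only a sketch under the assumption stated in the question (a log line is written when the request completes) plus the assumption that every request arriving while all workers are stalled completes when the stall ends. The function name, the arrival rate and the stall window are all made up for illustration.

```python
from collections import Counter

def log_times(arrivals, stall_start, stall_end):
    """Return the minute at which each request's log line is written.

    Simplified model: a request is logged when it completes, and a request
    that arrives while every worker is stalled completes when the stall ends.
    """
    return [stall_end if stall_start <= t < stall_end else t for t in arrivals]

# Hypothetical numbers: 5 requests arrive every minute for an hour,
# and every worker is stuck from minute 10 until minute 55.
arrivals = [minute for minute in range(60) for _ in range(5)]
logged = Counter(log_times(arrivals, stall_start=10, stall_end=55))

quiet_minutes = [m for m in range(60) if logged[m] == 0]
print("minutes with no log lines:", quiet_minutes[0], "to", quiet_minutes[-1])
print("lines written at minute 55:", logged[55])
```

In this toy model the log is silent for the whole stall and then receives everything that queued up in a single minute, which is the same shape as the gap and burst described in the question.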
Sessions on my ColdFusion server appear to be timing out every 20 minutes for one of my apps, even though I have high (on the order of many hours) timeouts set for both idletimeout and this.SessionTimeout in the CFC.
These timeouts occur regardless of whether I visit the pages during that 20 minute period — in other words, the sessions are not even idle for 20 minutes, it's just that 20 minutes after login, the user becomes unauthenticated again — the value of #IsUserLoggedIn()# becomes NO and the value for #GetAuthUser()# becomes blank.
I'm wondering if anyone has run into this before and if there are any fixes.
Also, it's not clear in the documentation how ColdFusion determines that the user and login session are idle. It would be great to know where this session data is stored and, ideally, to peek at it and see what might be causing this strange behavior.
Do other applications on the same server have longer timeouts that are working?
If they do not, then it is probably because a maximum session timeout is set in the ColdFusion Administrator. This is likely the cause.
Configuring and using session variables (CF9)
Specify a maximum session time-out. Application code cannot set a time-out greater than this value. The default value for this time-out is two days.
Also, can you edit your question to provide some code? Show us your application configuration.
Also, is there a chance you have an application with the same name and a different timeout configuration that is causing a conflict? Honestly, this is just a ballpark guess, because I'm very careful with application names.
I have a dedicated server that's been running for years, with no recent code or configuration changes, but suddenly about a week ago, the MS SQL Server DB has started becoming unresponsive, and shortly thereafter, the entire site goes down due to memory issues on the server. It is sporadic, which leads me to believe it could be a malicious DDOS-like attack, but I am not sure how to confirm what's going on.
After a reboot, it can stay up for a few days, or only a few hours, before I start seeing rampant occurrences of these Info messages in the Windows logs, shortly before it seizes up and fails. Research has not yielded any actionable info as of yet. Please help, and thank you.
Process 52:0:2 (0xaa0) Worker 0x07E340E8 appears to be non-yielding on Scheduler 0. Thread creation time: 13053491255443. Approx Thread CPU Used: kernel 280 ms, user 35895 ms. Process Utilization 0%%. System Idle 93%%. Interval: 6505497 ms.
New queries assigned to process on Node 0 have not been picked up by a worker thread in the last 2940 seconds. Blocking or long-running queries can contribute to this condition, and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads, or optimize current running queries. SQL Process Utilization: 0%%. System Idle: 91%%.
Here's a blog about the issue: danieladeniji.wordpress that should help you get started.
Seems unlikely that it would be a DDOS.
I've recently been reading up on messaging systems and have specifically looked at both RabbitMQ and NServiceBus. As I have understood it, if a message fails for some reason it is tried again immediately a number of times. Both systems then offer the possibility to try again later, for example in 5 seconds. When the five seconds have passed, the message is sent again a number of times.
I quote Vaughn Vernon in Implementing Domain-Driven Design (p.502):
The other way to handle this is to simply retry the send until it succeeds, perhaps using a Capped Exponential Back-off. In the case of RabbitMQ, retries could fail for quite a while. Thus, using a combination of message NAKs and retries could be the best approach. Still, if our process retries three times every five minutes, it could be all we need.
For NServiceBus, this is called second level retries, and when the retry happens, it happens multiple times.
Why does it need to happen multiple times? Why does it not retry once every five minutes? What is the chance that the first retry after five minutes fails and the second retry, probably just milliseconds later, should succeed?
And if it does not have to, thanks to some configuration option (is there one?), why do all the examples I have found use multiple retries?
My background is NServiceBus so my answer may be couched in those terms.
First level retries are great for very transient errors. Deadlocks are a perfect example of this. You try to change the database, and your transaction is chosen as the deadlock victim. In these cases, a first level retry is perfect. Most of the time, one first level retry is all you need. If there is a lot of contention in the database, maybe 2 or 3 retries will be good enough.
Second level retries are for your less transient errors. Think about things like a web service being down for 10 seconds, or a SQL Server database in a failover cluster switching over, which can take 30-60 seconds. If you retry a few milliseconds later, it's not going to do you any good, but 10, 20, 30 seconds later you might have a good shot.
However, the crux of the question is after 5 first level retries and then a delay, why try again 5 times before an additional delay?
First, on your first second-level retry, it's still possible that you could get a deadlock or other very transient error. After all, the goal is usually not to make as slow a system as possible so it would be preferable to not have to wait an additional delay before retrying if the problem is truly transient. Of course there's no way for the infrastructure to know just how transient the problem is.
The second reason is that it's just easier to configure if they're all the same. X levels of retry and Y tries per level = X*Y total tries and only 2 numbers in the configuration file. In NServiceBus, it's these 2 values plus the back-off time span, so the config looks like this:
<SecondLevelRetriesConfig Enabled="true" TimeIncrease="00:00:10" NumberOfRetries="3" />
<TransportConfig MaxRetries="3" />
That's fairly simple. Try 3 times. Wait 10 seconds. Try 3 times. Wait 20 seconds. Try 3 times. Wait 30 seconds. Try 3 times. Then you're done and you move on to an error queue.
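To spell that sequence out, here is a small sketch that only prints the schedule; it does not call NServiceBus. The function and parameter names are made up for illustration, loosely following the config attributes above, and it assumes the delay grows by one time increase per level as described.

```python
def retry_schedule(first_level_tries=3, second_level_retries=3, time_increase=10):
    """List the steps taken for one message that keeps failing.

    A burst of immediate tries comes first; before each further burst the
    delay grows by `time_increase` seconds.
    """
    steps = []
    for level in range(second_level_retries + 1):
        if level > 0:
            steps.append(f"wait {level * time_increase}s")
        steps.append(f"try x{first_level_tries}")
    steps.append("move to error queue")
    return steps

schedule = retry_schedule()
print(" -> ".join(schedule))

# Total attempts: one burst up front plus one burst per delayed retry.
bursts = sum(1 for step in schedule if step.startswith("try"))
print("total tries:", bursts * 3)
```

With the values from the config this gives four bursts of three tries, separated by waits of 10, 20 and 30 seconds, before the message is given up on.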
Configuring different values for each level would require a much more complex config story.
First Level Retries exist to compensate for quick issues like networking and database locks. This is configurable in NSB, so if you don't want them, you can turn them off. Second Level Retries are to compensate for longer outages. For example we use SLRs to compensate for a database that recycles every night at the same time.
The OOTB functionality increases the duration between SLRs because it assumes that if it didn't work the previous time, you will need more time to fix it. There exists a Retry Policy that is overridable, so you can change how the SLRs work.
In NSB, the FLRs always come first, and SLRs don't come into play unless the transaction is still failing after the FLRs. In addition, you can disable SLRs altogether and build your own custom Fault Manager, which has additional functionality. We have a process where a Fault Manager sends issues to a staffed help desk, as that is the only way to solve a particular subset of issues.
A few days ago, the SQL server (Microsoft SQL Server 2005) backing our site started occasionally timing out. It happens at seemingly random times, approximately every hour or two. Each episode usually lasts about 10 minutes, during which we see hundreds of timed-out requests. Under normal circumstances, most of our queries take less than 50ms. A query which takes a significant fraction of a second is an exception.
I have effectively killed a day trying to figure out at least something without any real progress. Normally, the server load is about 10-20%, and when the timeouts happen, we don’t see any increased CPU load. Also, there is nothing special happening during the timeouts; no overzealous web crawler, no heavy background tasks, no increased network traffic, no increased number of connections etc. Simply, everything looks as usual.
Not making any progress, we decided to restart it (and install the latest SP while we were at it), which seems to have fixed the problem. It has already been over six hours without any incident. Also, the CPU load has gone down to under 10%.
It almost seems as if the SQL server "deteriorated" over time. Perhaps some internal structure (some cache or statistic) got out of shape and caused the occasional problems. I don’t have any other explanation.
The only thing I noticed while monitoring the server (I got lucky once and was present when the timeouts were happening) was several long-running queries waiting on CXPACKET. But I learned that this is most likely just a consequence of some other problem. I wrote a script monitoring SQL requests, so hopefully, next time it happens, I will have more information.
Has anybody had similar experience? I’m not an SQL Server guru. Any suggestions are welcome.
Since everything looked normal (CPU, nothing special happening, no overzealous web crawler, no heavy background tasks, no increased network traffic, no increased number of connections etc.), I'd look into locking, blocking, or a race condition. Use this to see what (if anything) is locking when the timeouts are happening:
How to find out what SQL queries are being blocked and what's blocking them?
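Once you have a snapshot of who is waiting on whom, the session to look at first is usually the one at the head of the chain. A rough sketch of that step, assuming you have already pulled pairs of session_id and blocking_session_id (the column names in sys.dm_exec_requests) into a dict, with 0 meaning "not blocked". The function name and the sample data are made up.

```python
def head_blockers(requests):
    """Return session ids that block others but are not blocked themselves.

    `requests` maps session_id -> blocking_session_id, where 0 means
    "not blocked" (an assumption modelled on sys.dm_exec_requests).
    A blocker missing from the snapshot is treated as not blocked.
    """
    blockers = {b for b in requests.values() if b != 0}
    return sorted(b for b in blockers if requests.get(b, 0) == 0)

# Hypothetical snapshot: 61 and 62 wait on 55, 70 waits on 61, 80 is unrelated.
snapshot = {55: 0, 61: 55, 62: 55, 70: 61, 80: 0}
print(head_blockers(snapshot))
```

Here session 61 blocks 70 but is itself stuck behind 55, so only 55 is reported as the head of the chain.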