I have a dedicated server that has been running for years with no recent code or configuration changes, but about a week ago the MS SQL Server DB started becoming unresponsive, and shortly thereafter the entire site goes down due to memory issues on the server. It is sporadic, which leads me to believe it could be a malicious DDoS-like attack, but I am not sure how to confirm what's going on.
After a reboot, it can stay up for a few days, or only a few hours, before I start seeing rampant occurrences of these Info messages in the Windows logs, shortly before it seizes up and fails. Research has not yielded any actionable info yet; please help, and thank you.
Process 52:0:2 (0xaa0) Worker 0x07E340E8 appears to be non-yielding on Scheduler 0. Thread creation time: 13053491255443. Approx Thread CPU Used: kernel 280 ms, user 35895 ms. Process Utilization 0%%. System Idle 93%%. Interval: 6505497 ms.
New queries assigned to process on Node 0 have not been picked up by a worker thread in the last 2940 seconds. Blocking or long-running queries can contribute to this condition, and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads, or optimize current running queries. SQL Process Utilization: 0%%. System Idle: 91%%.
Here's a blog about the issue that should help you get started: danieladeniji.wordpress.
Seems unlikely that it would be a DDOS.
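If you want to check for yourself whether the instance is running out of worker threads when those messages appear, the scheduler DMVs are a quick place to look. A minimal sketch (SQL 2005 and later; you may need the dedicated admin connection if normal connections hang):

-- Work queued per scheduler vs. workers available to run it
SELECT scheduler_id,
       current_tasks_count,
       runnable_tasks_count,
       current_workers_count,
       active_workers_count,
       work_queue_count      -- tasks still waiting for a worker to pick them up
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255;    -- skip hidden/internal schedulers

A steadily growing work_queue_count while active_workers_count sits at the configured maximum usually points at worker starvation (typically from blocking) rather than at an external attack.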
I am trying to figure out the main reasons for stuck threads. WebLogic Server diagnoses a thread as stuck if it is continually working (not idle) for a set period of time. You can tune the server's stuck-thread detection behavior by changing the length of time before a thread is diagnosed as stuck (Stuck Thread Max Time) and the frequency with which the server checks for stuck threads. My analysis is that it is caused either by contention or by other reasons like slow I/O or slow backends (DB queries, web services, RMI calls); rarely it is caused by bad coding or huge data (infinite loops).
Other than the above, are there more reasons for a thread to get stuck?
Not sure what your question is here, but here are my 2 cents.
Bad coding can lead to stuck threads.
Say a developer uses a singleton map or hash that all servlets need to access; under high load this leads to contention for that resource and can easily produce stuck threads.
Stuck threads can be caused by a slow-running server (high CPU).
Sometimes bugs in WLS can cause it to be busy with internal processes, resulting in stuck threads, e.g. WLS stuck in cluster communication.
You can even have a stuck thread when the Admin server is waiting to hear from a managed server that failed.
The list can go on and on. Only by taking 3-4 thread dumps in a short span of time can one confirm the cause.
For this test, I have a simple Java servlet that reads data in and calculates the CRC32 for it. When making serial requests of 512MB each, I get about 600MB/sec. That makes sense since I can't use all 24 cores available to me to calculate a CRC. The program driving this I/O is sitting on the local box to eliminate the possibility of networking issues. I am running Tomcat 8.0.24.0 on FreeBSD using OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode).
Next, I attempt the same test with 6 concurrent requests, expecting that the performance per request might be lower than 600MB/sec, but that the aggregate performance across all 6 requests would be significantly higher.
What I see is the CPU has some idle time at ALL times (so it doesn't appear that I'm CPU-bound). I also see that all processing threads in Tomcat are running concurrently as anticipated. However, it looks like I'm only getting around 800MB/sec in aggregate. The threads in Tomcat spend most of their time waiting to read from the socket, as shown below.
I would appreciate any thoughts on how to improve Tomcat throughput / why so much time is spent waiting for more data (which I assume is what's going on below).
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
at org.apache.tomcat.util.net.NioEndpoint$KeyAttachment.awaitLatch(NioEndpoint.java:1386)
at org.apache.tomcat.util.net.NioEndpoint$KeyAttachment.awaitReadLatch(NioEndpoint.java:1388)
at org.apache.tomcat.util.net.NioBlockingSelector.read(NioBlockingSelector.java:185)
at org.apache.tomcat.util.net.NioSelectorPool.read(NioSelectorPool.java:251)
at org.apache.tomcat.util.net.NioSelectorPool.read(NioSelectorPool.java:232)
at org.apache.coyote.http11.InternalNioInputBuffer.fill(InternalNioInputBuffer.java:133)
at org.apache.coyote.http11.InternalNioInputBuffer$SocketInputBuffer.doRead(InternalNioInputBuffer.java:177)
at org.apache.coyote.http11.filters.IdentityInputFilter.doRead(IdentityInputFilter.java:110)
at org.apache.coyote.http11.AbstractInputBuffer.doRead(AbstractInputBuffer.java:416)
at org.apache.coyote.Request.doRead(Request.java:469)
at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:342)
at org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:395)
at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:367)
at org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:190)
...
We have a gap in our apache logs for approximately 45 minutes, after which follows an unusually high burst of log activity.
Normally we get a few hundred requests per hour at this time of day, early in the morning. Traffic was normal, then the logs went quiet for 45 minutes (during which people reported an inability to log in). After that, 4,000 requests were written to the logs within a few minutes.
Is this consistent with the assumption that one or more runaway processes blocked the execution of other processes? Because Apache writes a log entry only after a request completes, nothing got logged until the logjam was broken.
Is that a fair conclusion?
Yes, your assumption is indeed reasonable - we had a situation like this a couple years back at the place where I work. If all apache threads are blocked by long-running operations, this kind of behaviour may appear, until at least one thread is freed up again. It doesn't need to be a "runaway" process per se, either - a heavy load, like one incurred by a DOS attack (or maybe just intensive site traffic) may also produce this "picture".
I admit to only having very basic familiarity with Apache administration, but have you checked whether your setup has sufficient resources to handle the >usual< traffic on the affected site?
On our production server, at a specific time of day, the thread count keeps climbing to the point where, although CPU utilization is normal (30-50%), queries start to run slowly and we see a lot more blocking statements.
I am not sure where to look. When our site runs normally, the thread count is around 150 threads, but during a specific window each day (between 1:30 and 2:30) it climbs to 270 threads. There are no extra SQL transactions going on; everything is as normal as before, but the thread count grows and SQL starts behaving very, very slowly.
After restarting the SQL service, the thread count immediately returns to normal, and our site functions fine for another 24 hours.
We are using SQL Server 2005 on a 24-core machine.
Any ideas?
Blocking statements steal workers (sys.dm_os_workers), so the server spawns more workers to handle the incoming tasks. With 24 cores you get roughly 700 max worker threads out of the box, so seeing 270 threads is not an issue; it is well within normal operating parameters. Your real problem must be the blocking, and you have to investigate it accordingly: who is blocking whom, and why. My bet is that you have a job running between 1:30 and 2:30 that is locking large portions of the database (a delete job, perhaps?) and your queries block on the locked rows. You'll have to investigate, find the root cause, and act accordingly. Rebooting is not a solution, nor is blaming unrelated symptoms (thread count). Use Activity Monitor, use Who Is Active, and follow the Waits and Queues methodology. There are plenty of ways to identify the real problem. SQL Server will never appear slow because of thread count; it simply doesn't work like that.
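To see the actual blocking while it is happening, something as simple as this is often enough (a minimal sketch, SQL 2005 compatible; Who Is Active gives you a much richer version of the same picture):

-- Who is blocking whom, and what they are running
SELECT r.session_id,
       r.blocking_session_id,   -- 0 means the request is not blocked
       r.wait_type,
       r.wait_time,
       r.status,
       t.text AS running_sql
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id > 50         -- skip system sessions
ORDER BY r.blocking_session_id DESC, r.wait_time DESC;

Run it a few times between 1:30 and 2:30 and the head of the blocking chain (the session everyone else is waiting on) should become obvious.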
You can control the degree of parallelism using the MAXDOP query hint. For more details please check this article:
http://blog.sqlauthority.com/2010/03/15/sql-server-maxdop-settings-to-limit-query-to-run-on-specific-cpu/
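As an illustration only (the table and column names here are made up), the hint goes at the end of the statement:

-- Cap this one query at 4 parallel threads, regardless of the server-wide setting
SELECT CustomerId, SUM(Amount) AS Total
FROM dbo.Orders                 -- hypothetical table, for illustration
GROUP BY CustomerId
OPTION (MAXDOP 4);

Note that MAXDOP limits parallelism for that query only; it won't fix blocking by itself.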
Thanks for your valuable feedback. Yes, it is true that SQL Server was not behaving weirdly on its own; our site, which is based on Ektron CMS, was responsible. One piece of Ektron CMS functionality (PageBuilder) was holding locks on the table badly while operating on this piece of content. We have around 10 million users on our site, and since this was blocking the tables, SQL Server went nuts and did not respond very well.
We have finally eliminated the issue.
Since a few days ago, the SQL server (Microsoft SQL Server 2005) backing our site has started occasionally timing out. It happens at seemingly random times, approximately every hour or two. An incident usually lasts about 10 minutes, during which we see hundreds of timed-out requests. Under normal circumstances, most of our queries take less than 50 ms; a query that takes a significant fraction of a second is an exception.
I have effectively killed a day trying to figure out at least something without any real progress. Normally, the server load is about 10-20%, and when the timeouts happen, we don’t see any increased CPU load. Also, there is nothing special happening during the timeouts; no overzealous web crawler, no heavy background tasks, no increased network traffic, no increased number of connections etc. Simply, everything looks as usual.
Not making any progress, we decided to restart it (and install the latest SP while we were at it), which seems to have fixed the problem. It has already been over six hours without any incident. Also, the CPU load has gone down to under 10%.
It almost seems as if the SQL Server "deteriorated" over time. Perhaps some internal structure (a cache or statistics) got out of shape and caused the occasional problems. I don't have any other explanation.
The only thing I noticed while monitoring the server (I got lucky once and was present when the timeouts were happening) was several long-running queries waiting on CXPACKET. But I have learned that this is most likely just a consequence of some other problem. I wrote a script to monitor SQL requests, so hopefully next time it happens I will have more information.
Has anybody had similar experience? I’m not an SQL Server guru. Any suggestions are welcome.
Since everything looked normal (CPU, nothing special happening, no overzealous web crawler, no heavy background tasks, no increased network traffic, no increased number of connections, etc.), I'd look into locking/blocking/race conditions. Use this to see what (if anything) is locking when the timeouts are happening:
How to find out what SQL queries are being blocked and what's blocking them?
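Something along these lines, run while the timeouts are happening, will show the same thing (a minimal sketch; SQL 2005 and later):

-- Everything currently waiting, what it waits on, and who holds it up
SELECT wt.session_id,
       wt.wait_type,             -- LCK_M_* means lock waits; CXPACKET is parallelism
       wt.wait_duration_ms,
       wt.blocking_session_id,
       wt.resource_description
FROM sys.dm_os_waiting_tasks AS wt
WHERE wt.session_id > 50         -- skip system sessions
ORDER BY wt.wait_duration_ms DESC;

If the long waits are LCK_M_* with a non-null blocking_session_id, you have classic blocking; the CXPACKET waits you saw are usually just the parallel branches of a query waiting on each other.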