Since a few days ago, the SQL server (Microsoft SQL Server 2005) backing our site has started occasionally timeouting. It is happening at seemingly random times approximately every hour or two. It usually takes about 10 minutes during which we see hundreds of timeouted requests. Under normal circumstances, most of our queries take less than 50ms. A query which takes a significant fraction of a second is an exception.
I have effectively killed a day trying to figure out at least something without any real progress. Normally, the server load is about 10-20%, and when the timeouts happen, we don’t see any increased CPU load. Also, there is nothing special happening during the timeouts; no overzealous web crawler, no heavy background tasks, no increased network traffic, no increased number of connections etc. Simply, everything looks as usual.
Not making any progress, we decided to restart it (and install the latest SP since we were in it) which seems to have fixed the problem. It has been already over six hours without any incident. Also, the CPU load has gone down under 10%.
It almost seems like if the SQL server "deteriorated" overtime. Perhaps, some internal structure (some cache or statistic) got out shape and caused the occasional problems. I don’t have any other explanation.
The only thing I noticed when I was monitoring the server (and got lucky once to be present when the timeouts were happening), I saw several long running queries waiting on CXPACKET. But I learned that this is most likely just a consequence of some other problem. I wrote a script monitoring SQL requests, and so hopefully, next time it happens, I will have more information.
Has anybody had similar experience? I’m not an SQL Server guru. Any suggestions are welcome.
since everything looked normal: CPU, nothing special happening, no overzealous web crawler, no heavy background tasks, no increased network traffic, no increased number of connections etc. I'd look into locking\blocking\race condition. Use this to see what (if anything) is locking when the time-out are happening:
How to find out what SQL queries are being blocked and what's blocking them?
Related
I have VPS on hetzner. Server is located in Germany.
It has 256GB RAM, 6CPUs (12 threads).
I have a file which since yesterday, is requested about 30 times in one second. File has 2 Select, 2 Update, 2 Insert queries, so I assumed (not sure how this works) from this file server has about 180 requests per second. So right after this requests started, all the websites on the server just started loading poorly. I made this file run just one select query and than die. This didn't help. In WHM load is aboiut 0.02.
I've checked for error logs and there is no max_user_connection or any error there.
I have enabled slow query log and checked log file. there is nothing (I've tested it with select sleep(10) and this query was logged).
This is visit statistics, please bring your attention to may 30th:
Bandwidth stats for last 24 hours:
There are many errors like this in ssl_log (diff IPs of course):
188.121.206.150 - - [30/May/2018:19:50:03 +0200] "-" 408 - "-" "-"
I've been searching web a lot and couldn't find any solution. Could anyone at least tell what should I monitor or where. I have full access to anything there is possible inside the server. Any help is appreciated.
UPDATE 1
I have subdomain: banners.analyticson.com (access allowed for now) and there I have all the images and html5 files that are requested.
Take one image for example : https://banners.analyticson.com/img/suy8G1S6RU.jpg
It needs too much time to load. As I noticed, this sub domain has some issue.
Script, that I mentioned earlier (with 6 queries) just tries to get one of those banners to the user, so result of that script is to return one banner from banners.analyticson.com.
UPDATE 2
I've checked my script, it is fine. It takes less than 1 second to complete.
I've also checked Top command and there is a result. I'm not sure if $MEM value is fine.
You're going to have to narrow the problem down...
There are multiple potential issues.
First thing to eliminate would be the performance of your new script on a development laptop - I assume you're using PHP, so use the profiling tools to work out what is going on. If it's a database query, you'll see which one by looking at the profiler.
If your PHP script and database queries are fine, the next thing to look at: it sounds like you've hit some bottleneck resource on your infrastructure. In these cases, scripts that run fine as a single request start queueing for the bottleneck resource, and every new request adds to the queue until the whole server starts to crawl. This can be a bit of a puzzle - start with top and keep digging.
Next, I'd look at configuration of Apache to make sure everything is squeaky clean - Apache used to have a default to do a reverse DNS lookup for every request, which slows the server down rather impressively on production. You may also want to look at your SSL configuration - the error you report is linked to a load balancer issue.
If it's not as simple as memory, CPU etc., you're into more esoteric issues. You may need to ramp up a load testing rig so you can experiment without affecting the live site - typically, I do this on a machine as similar to live as possible, using Apache JMeter to generate load, and find the "inflection point". Typically, you see response times increase linearly with the number of concurrent requests, until you hit the bottleneck resource, at which point the response time increases rapidly. As a simple example, if you have 10 database connections available, response time should increase linearly up to 10 concurrent connections, and then become much larger from 11 up.
Knowing where the inflection point is and being able to recreate it allows you to use PHP profiling tools under load. This is a lot of work.
UPDATE
You're using php-cgi; this is easily the most inefficient way of running PHP scripts. Your server is barely breaking a sweat - CPU and memory basically idle. Here's a comparison for how to run PHP; consider changing to mod_php.
I have a SQL Server database in production and it has been live for 2 months. A while ago the web application associated with it loading takes too long time. And sometimes it says timeout occurred.
Found a quick fix by running a command 'exec sp_updatestats' will fixed the problem. But I need to be run that one consistently (for every 5 minutes).
So I created a Windows service with timer and started on server. My question is what are the root causes and possible permanent solutions? Anyone?
Here is a Most expensive query from Activity Monitor
WAITFOR(RECEIVE TOP (1) message_type_name, conversation_handle, cast(message_body AS XML) as message_body from [SqlQueryNotificationService-2eea594b-f994-43be-a5ed-d9a47837a391]), TIMEOUT #p2;
To diagnose a poorly performing queries you need to:
Identify the poorly performing query, e.g. via application logging, a SQL Profiler trace filtered to show only queries with a longer duration than a certain threshold etc...
Get an execution plan for the query
At that point you can start to try to figure out what the performance issue is.
Given that exec sp_updatestats fixes your issue it could be that statistics on certain tables are out of date (this is a pretty common cause of performance issues). If thats the case then you might be able to tweak your statistics or at least rebuild only those statistics that are causing issues.
Its worth noting that updating statistics will also cause cached execution plans to become invalid, and so its plausible that your issue is unrelated to statistics - you need to collect more information about the poorly performing queries before looking at solutions.
Your most expensive query looks like its waiting for a message, i.e. its in your list of long running queries because its designed to run for a long time, not because its expensive.
Thanks for everyone i found a solution for my issue . Its quite different I've enabled sql dependency module on my sql server by setting up enable broker on , thats the one causing timeout query so by setting it to false everything is fine working now.
On a client is being raised the error "Timeout" to trigger some commands against the database.
My first test option for correction is to increase the CommandTimeout to 99999 ... but I am afraid that this treatment generates further problems.
Have experienced it ...?
I wonder if my question is relevant, and/or if there is another option more robust and elegant correction.
You are correct to assume that upping the timeout is not the correct approach. Typically, I look for log running queries that are running around the timeouts. They will typically stand out in the areas of duration and reads.
Then I'll work to reduce the query run time using this method:
https://www.simple-talk.com/sql/performance/simple-query-tuning-with-statistics-io-and-execution-plans/
If it's a report causing issues and you can't get it running faster, you may need to start thinking about setting up a reporting database.
CommandTimeout is a time, that the client is waiting for a response from server. If the query is run in the main VCL thread then the whole application is "frozen" and might be marked "not responding" by Windows. So, would you expect your users to wait at frozen app for 99999 sec?
Generally, leave the Timeout values at default and rather concentrate on tunning the queries as Sam suggests. If you happen to have long running queries (ie. some background data movement, calculations etc in Stored Procedures) set the CommandTimeout to 0 (=INFINITE) but run them in a separate thread.
I decided I wanted to try out Microsoft SQL Azure, as many people have talked very highly about it. It should be fast, flexible, cheap and many other things.
I got it up and running, migrated my data to Azure and hooked up the connectionstring. I tried to run some queries on the database, and was shocked about how slow even simple queries were. A "SELECT *" from a table with 700 rows took 7 seconds. My page also seems extremely slow, compared to when I used a localhost managent studio or a database on a shared hosting.
Now, when I setup my server, I couldn't pick a physical position. However, I live in Denmark, and I can see the server is the "South Central US". This might be the issue.
I don't use any stored procedures (so I guess no parameter sniffing).. I can also see my indexes is transfered succesfully.
Any ideas on what to do? Any performance things I am missing?
I ran into this very issue the last few days. Change your database tier from basic to standard and you will see a HUGE increase in performance. I am working on a query intensive dashboard at the moment, it took a 20 sec response time down to 2 sec response.
I've used Azure now for the last many years, and my original question is pretty much solved.
My main take-aways after dealing with Azure databases for a while:
It's extremely important that your application and database is placed in the same region. If not, then you will have a slow application. Recently I had an API and app running on two different regions - it took ~1 second for every response.. After moving it to same, it was instant
If your application has a high load, it's often a good idea to upgrade. This happens earlier than you might expect
Pick the nearest region - it really matters
On our production server, for some reason at specific amount of time, thread count goes over and over to the certain point that though CPU Utilization is normal(30-50%), but the query starting to run slow we so lot more blocking statements.
I am not sure where to look at it, basically when our site runs normally, the thread count is around 150 threads, but during the specific time in a day(during 1:30 to 2:30) it come up to 270 threads.there are no extra sql transaction goes on, everything as normal as it was before but thread count grows and sql start behaving very very slow.
After restarting the SQL service immediately thread count comes to normal, and our site function fine for another 24 hours.
We are using SQL Server 2005, it is 24 core machine.
any idea?
Blocking statements steal workers (sys.dm_os_workers) so the server will spawn more workers to handle the incoming tasks. At 24 cores you'll have some 700 max worker threads out-of-the-box by default. So seeing 270 'threads' is not an issue, is well within the normal functioning parameters. You real problem must be the blocking, and you have to investigate it accordingly: who is blocking who and why. My bet is that you have a job running between 1:30 and 2:30 that is locking large portions of the database (a delete job perhaps?) and your queries block on locked rows. You'll have to investigate, find the root cause, and act accordingly. Reboot is not a solution, nor is blaming unrelated components (thread count). Use Activity Monitor, use Who Is Active, follow the methodical approach of Waits and Queues methodology. There are plenty of ways to identify the real problem. SQL Server will never appear slow due to thread count. It simply doesn't work like that.
You can control the degree of parallalism using the MAXDOP query hint. For more details please check this article:
http://blog.sqlauthority.com/2010/03/15/sql-server-maxdop-settings-to-limit-query-to-run-on-specific-cpu/
Thanks for your valuable feedback, yes it is true it is nothing which sql is behaving weiredly, it is something our Site which is based on Ektron CMS is responsible for, one of the Functionality of Ektron CMS (which is PageBuilder), while doing operation on this piece of Content was holding of the table badly, we have around 10 million users to our site, and probably since this was blocking the tables SQL Server goes nuts and does not respond very well.
We have finally eliminated the issue.