I have a problem that has puzzled me for a while now. Once in a while, say 4-5 times a week, we get timeouts from the database at HH:03 (or sometimes HH:02, I think).
I've been digging into the scheduled tasks on the server to investigate whether something brings the server to its knees performance-wise, without any findings.
I've even gone so far as to build a watchdog for the application: when a query has only 1 second left of its maximum query time, it captures the database's process list and emails it to me. The process list always contains just one entry, and that's the one that is about to get a timeout exception.
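For reference, a minimal sketch of the kind of check the watchdog runs, assuming it reads the SQL Server DMVs (the columns and filter here are illustrative, not our exact code):

-- Illustrative only: list active requests and their current statement
-- at the moment the watchdog fires (SQL Server 2008 R2 DMVs)
SELECT r.session_id,
       r.status,
       r.wait_type,
       r.total_elapsed_time,
       t.text
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.session_id <> @@SPID;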
To further add to the complexity, we have many customers on this application, but only one of them gets this timeout. All customers run the same code but have different databases and different application pools with different application pool identities.
The application is an ASP.NET application. The database is Microsoft SQL Server 2008 R2 Express Edition.
Has anyone heard of something like this? Can anyone give me any pointers about what to investigate in order to resolve this issue?
Kind regards
Background:
Two nights ago, the old-as-hell and very poorly designed website for the company I work for was attacked by a bot that submitted 5000+ phony orders. In the course of deleting all of those false orders from the database, SQL Server Management Studio crashed, and the application had to be stopped via Task Manager and restarted. After that I was getting optimistic concurrency control errors when trying to delete some of the fake records, and had to complete the cleanup via DELETE statements.
(Yes, I KNOW it's generally bad practice to delete records from the results pane, but for people like me who aren't actually programmers but get stuck with the IT work because we're the only ones who know how to find the on switch, it makes me less paranoid about deleting a record I didn't mean to.)
Ever since then, there is a specific page in the admin section of the site that takes a VERY long time to perform a SELECT query for a specific range. The query will complete if you sit there long enough, but here's a screenshot of the ColdFusion error box that comes up with it:
[Screenshot: ColdFusion error message]
I suspect that between the bot attack and Management Studio crashing in the middle of a DELETE query, part of the table is corrupted, which is why the query exceeds the allowable time limit. I don't know if our webhost has a backup of the database (I've been in contact with them for the last couple of days).
What tools can I use to check for and repair errors on that table?
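From what I've read, the built-in consistency checks in SQL Server are the DBCC commands; is something like this the right approach (the table and database names below are placeholders, not ours)?

-- Check a single suspect table for corruption
DBCC CHECKTABLE ('dbo.Orders');
-- Or check the whole database
DBCC CHECKDB ('MyDatabase') WITH NO_INFOMSGS;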
We've released a new game on Facebook that uses SQL Azure and we're getting intermittent connection timeouts.
I dealt with this earlier and implemented a 'retry' solution that seemed to have dealt with the transient connection issues.
However, now that the game is out I'm seeing it happen again. Not often, but it is happening. When it happens, I try logging into the SQL Azure Management web portal and I get a connection timeout there too. Same with trying SSMS.
The query itself is the first one in the game, and it's a simple SELECT on a table with 4 records.
After about 4 minutes, the timeouts stop and everything is good for a day or two.
Since these are players around the country, I don't have direct contact with the users.
I'm looking for any advice on how I can figure out what's going on.
Thanks,
Tim
FYI: http://apps.facebook.com/RelicBall/
Depending on how much compute you have in front of your database, I would put a limit on the number of connections the pool can create, via the connection string.
Try setting it like this if, for example, you have 2 compute instances in front of the database:
Max Pool Size=70;
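For clarity, here's a hypothetical full connection string with the cap in place (the server, database, and credentials are placeholders):

Server=tcp:myserver.database.windows.net,1433;Database=MyGameDb;User ID=myuser@myserver;Password=...;Encrypt=True;Max Pool Size=70;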
SQL Database can only handle 180 concurrent connections; this is a hard limit. Note that when you are hitting the connection limit, a retry framework will make matters worse, as it keeps trying to connect for a period of time, leading to further downtime. This might be the reason you see several minutes of timeouts: that's the compute retry frameworks giving up.
http://msdn.microsoft.com/en-us/library/windowsazure/ff394114.aspx
Have a look with the following:
-- monitor connections
SELECT
    e.connection_id,
    s.session_id,
    s.login_name,
    s.last_request_end_time,
    s.cpu_time
FROM sys.dm_exec_sessions s
INNER JOIN sys.dm_exec_connections e
    ON s.session_id = e.session_id
GO
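And to see how close you are getting to the 180-connection cap, a quick count like this should do (assuming the DMV is exposed on your tier):

-- Count currently open connections
SELECT COUNT(*) AS open_connections
FROM sys.dm_exec_connections;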
You should try to add caching to your application design; this can greatly reduce your application's overhead on the database and is recommended practice with SQL Azure, especially since you can have connection issues. I have seen this type of issue before and it was connection limits, so it may be worth investing a bit of time in that direction to see if that is the cause. If not, I would open a ticket with MS Support.
HTH, good luck.
EDIT: Premium databases obviously raise the limits on connections, so that is also worth investigating as a quick fix for this issue, and potentially a long-run one.
http://blogs.technet.com/b/dataplatforminsider/archive/2013/07/23/premium-preview-for-windows-azure-sql-database-now-live.aspx
I have a SQL job that runs every night and does various inserts/updates/deletes. The job contains 40 steps, which mainly execute stored procedures.
It had been running fine up until a week ago, when the run time suddenly went up from 2.5 hours to over 5 hours, sometimes even 8, 9, or 10!
Could one of you please give me some pointers?
First of all, let me recommend a valuable resource on the Simple-Talk site: a detailed methodology for troubleshooting performance issues on SQL Server.
Was the insert you mention a huge bulk insert that could affect performance? After a huge load the query execution plans could be different, and you may need to re-tune your table structure, indexes, etc.
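If the load is the suspect, a first step I would consider is refreshing statistics so the optimizer re-plans against the new data distribution; a sketch, with the table name as a placeholder:

-- Rebuild statistics on the heavily loaded table with a full scan
UPDATE STATISTICS dbo.BigTable WITH FULLSCAN;
-- Or refresh statistics across the whole database
EXEC sp_updatestats;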
If the run time suddenly changed and no changes were made to the queries or your database structure, then I would ask myself several questions:
First, is the process still taking this long, or did it run slowly only once? Maybe it is running smoothly now and the issue arose only once. Nevertheless, try to find what triggered the bad performance; it can happen again and take down your server.
Is the server a dedicated SQL Server? If not, check whether some new tasks unrelated to the SQL engine have been configured; maybe a new task is doing heavy I/O and therefore your CRUD operations take longer.
If it is a dedicated server, then check that no new job has been added that could drag down your existing jobs. Check this SO link for details on jobs set up from SQL Agent, and see the history sketch after this list.
Maybe low memory due to another process on the same server?
And there is a lot more to check, but before going deeper I would verify that nothing external (not related to SQL Server) was the reason for the delay in the process execution.
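As a starting point for several of those checks, the SQL Agent history in msdb shows exactly when the run times jumped; a sketch (step_id = 0 rows are the overall job outcome, and run_duration is an HHMMSS-encoded integer):

-- Compare recent run durations per job to spot when the slowdown began
SELECT j.name,
       h.run_date,
       h.run_duration
FROM msdb.dbo.sysjobs j
INNER JOIN msdb.dbo.sysjobhistory h
    ON j.job_id = h.job_id
WHERE h.step_id = 0
ORDER BY j.name, h.run_date DESC;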
We have an application that has been deployed to 50+ websites. Across these sites we have noticed a piece of strange behaviour, which we have now tracked down to one specific query. Very occasionally, usually once or twice a day, one of our debugging scripts reports
2006 : MySQL server has gone away
I know there are a number of reasons this error can be thrown, but the strangest thing is that every single time it is thrown, it comes from the same SQL query being run. There is nothing strange or complex about this query; it looks like this:
SELECT `advert_only` FROM `products` WHERE `id` = '6197'
This query must run tens of thousands of times a day for various different product IDs, so it certainly doesn't fail every time. It fails randomly on seemingly random sites across our 4 servers. There is seemingly no commonality; one small thing we have noticed is that it sometimes happens on 2 or 3 page loads in a row for 1 specific person, as we also track the IP of the person it has happened to.
This is on CentOS 5 servers running MySQL 5.0.81
This is kind of out in left field, but you should check your hard disk's SMART data for any errors. If there are issues reading from "that" sector, then there may be problems. If you have a RAID unit, I wouldn't worry too much about this. I wouldn't give a high probability to this being the problem, but if you are really stumped then it might be worth it.
On http://bugs.mysql.com/bug.php?id=1011 the second comment says that "the 'MySQL server has gone away' error is caused by a query longer than the max_allowed_packet."
There is some more information on fixing it here: http://bogdan.org.ua/2008/12/25/how-to-fix-mysql-server-has-gone-away-error-2006.html
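If max_allowed_packet is indeed the culprit, checking and raising it is straightforward; the 64 MB value below is just an example:

-- Check the current limit (bytes)
SHOW VARIABLES LIKE 'max_allowed_packet';
-- Raise it on the running server; also set it in my.cnf to survive restarts
SET GLOBAL max_allowed_packet = 64 * 1024 * 1024;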
That means the SQL connection was idle for too long. Check whether some slow operations are performed before your SQL query runs.
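To test the idle-connection theory, compare how long the connection sits idle against the server's timeout settings:

-- Connections idle longer than these are dropped server-side (MySQL)
SHOW VARIABLES LIKE 'wait_timeout';
SHOW VARIABLES LIKE 'interactive_timeout';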
I have read-only access to a SQL view. The query below times out. How can I avoid this?
select
    count(distinct Status)
from
    [MyTable] with (NOLOCK)
where
    MemberType = 6
The error message I get is:
Msg 121, Level 20, State 0, Line 0
A transport-level error has occurred when receiving results from the server (provider: TCP Provider, error: 0 - The semaphore timeout period has expired.)
Your query is probably fine. "The semaphore timeout period has expired" is a network error, not a SQL Server timeout.
There is apparently some sort of network problem between you and the SQL Server.
Edit: However, apparently the query runs for 15-20 minutes before giving the network error. That is a very long time, so perhaps the network error is related to the long execution time. Optimizing the underlying view might help.
If [MyTable] in your example is a view, can you post the view definition so that we can have a go at optimizing it?
Although there is clearly some kind of network instability or something interfering with your connection (at 15 minutes it is possible that you are crossing a NAT boundary, or that something in your network is dropping the session), I would think you want such a simple(?) query to return well within any anticipated timeout (like 1 s).
I would talk to your DBA and get an index created on the underlying tables on (MemberType, Status). If there isn't a single underlying table, or these are more complex and created by the view or a UDF, and you are running SQL Server 2005 or above, have him consider indexing the view (basically materializing the view in an indexed fashion).
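A sketch of that index, assuming a single underlying table; the name dbo.Members is a placeholder, not the real schema:

-- Covering index for: WHERE MemberType = 6 ... COUNT(DISTINCT Status)
CREATE NONCLUSTERED INDEX IX_Members_MemberType_Status
    ON dbo.Members (MemberType, Status);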
You could put an index on MemberType.
Please check your Windows system event log for any errors, specifically those with "Event Source: Dhcp". It's very likely a networking error related to DHCP, such as an address lease that expired. It shouldn't be a problem related to SQL Server or the query itself.
Just search the internet for "The semaphore timeout period has expired" and you'll get plenty of suggestions for what might solve your problem. Unfortunately, there doesn't seem to be one single solution for it.
Do you have an index defined over the Status and MemberType columns?
How many records do you have? Are there any indexes on the table? Try this:
;with a as (
    select distinct Status
    from MyTable
    where MemberType = 6
)
select count(Status)
from a
My team was experiencing these issues intermittently with long-running SSIS packages. It had been happening since Windows Server patching.
Our SSIS and SQL servers are on separate VM servers.
Working with our Wintel servers team, we rebooted both servers, and for the moment the problem appears to have gone away.
The engineer has said that they're unsure whether the issue was the patches or the new VMTools that they updated at the same time. We'll monitor for now; if the timeout problems recur, they'll try rolling back the VMXNET3 driver first, and if that doesn't work, take off the June rollup patches.
So for us the issue had nothing to do with our SQL queries (we're loading billions of new rows, so it has to be long-running).
This happens because another instance of SQL Server is running, so you need to kill it first; then you will be able to log in to SQL Server.
To do that, go to Task Manager and end the SQL Server process, then go to Services.msc and start the SQL Server service.
While I would be tempted to blame my issues on the network (I'm getting the same error with my query, which is much, much bigger and involves a lot of loops), I think this is not the case.
Unfortunately it's not that simple. The query runs for 3+ hours before getting that error, and apparently it crashes at the same point whether it runs as a query in SSMS or as a job on SQL Server (I haven't looked into the details of that yet, so I'm not sure it's the same error; definitely the same spot, though).
So, just in case someone comes here with a similar problem, this thread:
https://www.sqlservercentral.com/Forums/569962/The-semaphore-timeout-period-has-expired
suggests that it may equally well be a hardware issue or an actual timeout.
My loops aren't even in terms of the time required for each (they depend on the sales level in a given month), so a good month takes about 20 minutes to calculate (the query looks at 4 years).
That means it's entirely possible that I need to optimise my query. I would even say it's likely, as some of the changes I made included new tables, which are heaps... So, another round of indexing my data before tearing into VM config and hardware tests.
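For the heap point, a sketch of the kind of fix I mean (the table and column names are placeholders, not my actual schema):

-- Give a new heap table a clustered index so large scans stop paying
-- forwarded-record and allocation-order penalties
CREATE CLUSTERED INDEX CIX_SalesStaging_SaleDate
    ON dbo.SalesStaging (SaleDate);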
Being aware that this is an old question: I'm on SQL Server 2012 SE, SSMS is the 2018 beta, and the VM that SQL Server runs on has exclusive use of 132 GB of RAM (30% of the total), 8 cores, and 2 TB of SSD SAN.