SQL Connection Issues... next steps - sql

We've released a new game on Facebook that uses SQL Azure and we're getting intermittent connection timeouts.
I dealt with this earlier and implemented a 'retry' solution that seemed to have dealt with the transient connection issues.
However, now that the game is out I'm seeing it happen again. Not often, but it is happening. When it happens, I try logging into the SQL Azure Management web portal and I get a connection timeout there too. Same with trying SSMS.
The query itself is the first one of the game and it's a simple select on a table with 4 records.
After about 4 minutes, the timeouts stop and everything is good for a day or two.
Since these are players around the country, I don't have direct contact with the users.
I'm looking for any advice on how I can figure out what's going on.
Thanks,
Tim
FYI: http://apps.facebook.com/RelicBall/

Depending on how much compute you have in front of your database, I would limit the size of the connection pools that can be created, using the connection string.
For example, if you have 2 compute instances in front of the database, try setting:
Max Pool Size=70;
SQL Database can only handle 180 connections; this is a hard limit. When you are hitting the connection limit, a retry framework can make matters worse, because it keeps trying to connect for a period of time, which leads to further downtime. This might be the reason you see several minutes of timeouts: that is roughly how long it takes for the retry frameworks on the compute instances to give up.
http://msdn.microsoft.com/en-us/library/windowsazure/ff394114.aspx
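As a rough sketch of the arithmetic behind a value like 70, assuming the 180-connection limit quoted above and a hypothetical headroom kept free for SSMS and the portal (the function name and the headroom figure are illustrative assumptions, not part of any Azure API):

```python
def max_pool_size(connection_limit: int, instances: int, headroom: int = 40) -> int:
    """Split a database-wide connection limit evenly across app instances.

    headroom is a hypothetical number of connections kept free for
    management tools and ad hoc queries.
    """
    if instances < 1:
        raise ValueError("need at least one instance")
    usable = connection_limit - headroom
    if usable < instances:
        raise ValueError("headroom leaves fewer than one connection per instance")
    return usable // instances


if __name__ == "__main__":
    size = max_pool_size(connection_limit=180, instances=2)
    print(f"Max Pool Size={size};")
```

With 2 instances and 40 connections of headroom this gives 70 per instance, so both pools together can never exhaust the database limit on their own.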
Have a look with the following:
-- monitor connections
SELECT
e.connection_id,
s.session_id,
s.login_name,
s.last_request_end_time,
s.cpu_time
FROM
sys.dm_exec_sessions s
INNER JOIN sys.dm_exec_connections e
ON s.session_id = e.session_id
GO
You should try to add caching to your application design. This can greatly reduce your application's overhead on the database and is recommended practice with SQL Azure, especially as you can have connection issues. I have seen this type of issue before and it was connection limits, so it may be worth investing a bit of time in that direction to see if that is the cause. If not, I would open a ticket with MS Support.
HTH, good luck.
EDIT: Premium databases raise the limits on connections, so that is also worth investigating as a quick fix for this issue and potentially a long-term one.
http://blogs.technet.com/b/dataplatforminsider/archive/2013/07/23/premium-preview-for-windows-azure-sql-database-now-live.aspx
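To illustrate the caching suggestion, here is a minimal in-process sketch, assuming the data changes rarely enough that a short time-to-live is acceptable. The class name, the 60-second TTL and the `load_settings` stand-in are all hypothetical; a real deployment might use a shared cache instead.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny in-process cache: reuse a loaded value until it is ttl seconds old."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # fresh enough: no database round trip
        value = loader()  # stale or missing: query the database
        self._entries[key] = (now, value)
        return value


if __name__ == "__main__":
    calls = 0

    def load_settings():
        # Stand-in for the real "simple select on a table with 4 records".
        global calls
        calls += 1
        return ["row1", "row2", "row3", "row4"]

    cache = TTLCache(ttl_seconds=60)
    cache.get("settings", load_settings)
    cache.get("settings", load_settings)
    print("database calls:", calls)
```

The second `get` is served from memory, so a small lookup table like the one in the question would cost one connection per TTL window per instance rather than one per player.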

Related

Cannot access SQL azure

Just had a bizarre issue with SQL Azure, and it happened during a small phase just before full go-live, with some users doing data entry.
"Database 'dbname' on server 'xxx' is not currently available. Please retry the connection later. If the problem persists, contact customer support."
When I tried to connect via SQL Azure database website I got:
"Firewall check failed.
Resource ID : 1. The request minimum guarantee is 0,
maximum limit is 180 and the current usage for the database is 0.
However, the server is currently too busy to support request greater than 0 for this database."
Looking at the databases section of the Azure Management website the site reported it couldn't access the DB, but I didn't capture the exact error message unfortunately.
Bizarrely, a couple of my users were still able to login to our system website that access the DB, and view and save data. Eventually they lost connection too however.
After an hour or so, the databases came back to life and we could fully access them again.
I have looked at the server's master db event table using queries from here, and there were a couple of connection failures but nothing interesting. No throttling or deadlocks, just a couple of failed connections that said "Client may have timed out when establishing connection. Try increasing the connection timeout." in the description.
Any ideas where else to look?
Business users have had a massive drop in confidence because of this.
What you're describing normally occurs because of:
1) The SQL connection limit being hit. Assuming you don't see this often, it is unlikely to be the cause, but it is worth checking; putting a limit on your connection pool can help.
2) Your neighbours being extremely noisy, so that the node re-adjusts.
3) Hardware failure and Microsoft bringing your database back online in a different node. This can take some time.
Normally I have seen this when Microsoft have throttled or had problems with a box and had to recover everyone over. Because you are on a shared system, you have to keep in mind that they are also recovering everyone else on that node, and thus this sometimes takes time.
The best bet, if you are worried and need to get a resolution for the business, is to open a support ticket with MS and give them the time you saw this and the error message. They will investigate, and generally they have really good back-end telemetry that will point to a reason. This will allow you to give the business a resolution, and then you can make a call on future plans and contingencies. You have to keep in mind, though, that SQL Azure is a shared system and transient errors can happen; you might need to design more failover into your designs.
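One way to design for transient errors without hammering a struggling node is a bounded retry with backoff. The following is a minimal sketch under stated assumptions: `TransientError` is a hypothetical marker exception that the application would raise for retryable failures, and the attempt count and delays are illustrative defaults, not recommended values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. a connection timeout)."""


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying transient failures a bounded number of times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up instead of piling onto a struggling server
            # Exponential backoff with jitter so clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

Keeping the number of attempts small means a failed call surfaces to the caller quickly, where it can fall back to cached data or a friendly error, rather than holding a connection slot for minutes.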

SQL Azure - One session locking entire DB for Update and Insert

SQL Azure issue.
I've got an issue that manifests as the following exception on our (asp.net) site:
Timeout expired. The timeout period elapsed prior to completion of
the operation or the server is not responding. The statement has been
terminated.
It also results in update and insert statements never completing in SSMS. There aren't any X or IX locks present when querying sys.dm_tran_locks, and there are no transactions when querying sys.dm_tran_active_transactions or sys.dm_tran_database_transactions.
The problem is present for every table in the database but other databases on the same instance don't cause the problem. The duration of the issue can be anywhere from 2 minutes to 2 hours and doesn't happen at any specific times of day.
The database is not full.
At one point this issue didn't resolve itself, but I was able to resolve it by querying sys.dm_exec_connections, finding the longest running session, and then killing it. The odd thing is that the connection was 15 minutes old, but the lock issue had been present for over 3 hours.
Is there anything else I can check?
EDIT
As per Paul's answer below. I'd actually tracked down the problem before he answered. I will post the steps I used to figure this out below, in case they help anyone else.
The following queries were run when a "timeout period" was present.
select * from sys.dm_exec_requests
As we can see, all the WAIT requests are waiting on session 1021 which is the replication request! The TM Request indicates a DTC transaction and we don't use distributed transactions. You can also see the wait_type of SE_REPL_COMMIT_ACK which again implicates replication.
select * from sys.dm_tran_locks
Again waiting on session 1021
SELECT * FROM sys.dm_db_wait_stats ORDER BY wait_time_ms desc
And yes, SE_REPL_CATCHUP_THROTTLE has a total wait time of 8094034 ms, that is 134.9 minutes!!!
Also see the following forum for details on this issue.
http://social.technet.microsoft.com/Forums/en-US/ssdsgetstarted/thread/c3003a28-8beb-4860-85b2-03cf6d0312a8
I've been given the following answer in my communication with Microsoft (we've seen this issue with 4 of our 15 databases in the EU data center):
Question: Have there been changes to these soft throttling limits in the last three weeks, i.e. since my problems started?
Answer: No, there has not.
Question: Are there ways we can prevent or be warned we are approaching a limit?
Answer: No. The issue may not be caused by your application but can be caused by other tenants relying on the same physical hardware. In other words, your application can have very little load and still run into the problem. In other words, your own traffic may be a cause of this problem, but it can just as well be caused by other tenants relying on the same physical hardware. There's no way to know beforehand that the issue will soon occur - it can occur at any time without warning. The SQL Azure operations team does not monitor this type of error, so they won't automatically try to solve the problem for you. So if you run into it you have two options:
1) Create a copy of your db and use that and hope the db is placed on another server with less load.
2) Contact Windows Azure Support and inform them about the problem and let them do Option 1 for you.
You might be running into the SE_REPL* issues that are currently plaguing a lot of folks using Sql Azure (my company included).
When you experience the timeouts, try checking your wait requests for wait types of:
SE_REPL_SLOW_SECONDARY_THROTTLE
SE_REPL_COMMIT_ACK
Run the following to check your wait types on current connections:
SELECT TOP 10 r.session_id, r.plan_handle,
r.sql_handle, r.request_id,
r.start_time, r.status,
r.command, r.database_id,
r.user_id, r.wait_type,
r.wait_time, r.last_wait_type,
r.wait_resource, r.total_elapsed_time,
r.cpu_time, r.transaction_isolation_level,
r.row_count
FROM sys.dm_exec_requests r
You can also check a history of sorts for this by running:
SELECT * FROM sys.dm_db_wait_stats
ORDER BY wait_time_ms desc
If you're seeing a lot of SE_REPL* wait types and these are staying set on your connections for any length of time, then basically you're screwed.
Microsoft are aware of the problem, but I've had a support ticket open for a week with them now and they're still working on it apparently.
The SE_REPL* waits happen when the Sql Azure replication slaves fall behind.
Basically the whole db suspends queries while replication catches up :/
So essentially the aspect that makes Sql Azure highly available is causing databases to become randomly unavailable.
I'd laugh at the irony if it wasn't killing us.
Have a look at this thread for details:
http://social.technet.microsoft.com/Forums/en-US/ssdsgetstarted/thread/c3003a28-8beb-4860-85b2-03cf6d0312a8
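If you want to automate the check described in this answer, a small helper along these lines could flag the affected requests. This is only a sketch: it assumes the rows from sys.dm_exec_requests have already been fetched into dictionaries by whatever driver you use, and the sample rows below are made up for illustration.

```python
from typing import Any, Dict, Iterable, List


def replication_waits(requests: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the requests whose current wait type starts with SE_REPL."""
    return [
        r for r in requests
        if (r.get("wait_type") or "").startswith("SE_REPL")  # NULL wait_type -> ""
    ]


if __name__ == "__main__":
    # Illustrative rows only; real ones would come from sys.dm_exec_requests.
    sample = [
        {"session_id": 101, "wait_type": "SE_REPL_COMMIT_ACK", "wait_time": 42000},
        {"session_id": 102, "wait_type": None, "wait_time": 0},
        {"session_id": 103, "wait_type": "SE_REPL_SLOW_SECONDARY_THROTTLE", "wait_time": 9000},
    ]
    for row in replication_waits(sample):
        print(row["session_id"], row["wait_type"], row["wait_time"])
```

Logging the result each time a timeout is caught gives you timestamps and wait types to attach to a support ticket.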

How to avoid Sql Query Timeout

I have RO access on a SQL View. This query below times out. How to avoid this?
select
count(distinct Status)
from
[MyTable] with (NOLOCK)
where
MemberType=6
The error message I get is:
Msg 121, Level 20, State 0, Line 0
A transport-level error has occurred when receiving results from the server (provider: TCP Provider, error: 0 - The semaphore timeout period has expired.)
Your query is probably fine. "The semaphore timeout period has expired" is a Network error, not a SQL Server timeout.
There is apparently some sort of network problem between you and the SQL Server.
edit: However, apparently the query runs for 15-20 min before giving the network error. That is a very long time, so perhaps the network error could be related to the long execution time. Optimization of the underlying View might help.
If [MyTable] in your example is a View, can you post the View Definition so that we can have a go at optimizing it?
Although there is clearly some kind of network instability or something interfering with your connection (at 15 minutes, it is possible that you are crossing a NAT boundary or that something in your network is dropping the session), I would think you want such a (simple?) query to return well within any anticipated timeout (like 1s).
I would talk to your DBA and get an index created on the underlying tables on MemberType, Status. If there isn't a single underlying table or these are more complex and created by the view or UDF, and you are running SQL Server 2005 or above, have him consider indexing the view (basically materializing the view in an indexed fashion).
You could put an index on MemberType.
Please check your Windows system event log for any errors specifically for the "Event Source: Dhcp". It's very likely a networking error related to DHCP. Address lease time expired or so. It shouldn't be a problem related to the SQL Server or the query itself.
Just search the internet for "The semaphore timeout period has expired" and you'll get plenty of suggestions for what might solve your problem. Unfortunately, there doesn't seem to be a single solution for this problem.
Do you have an index defined over the Status column and MemberType column?
How many records do you have? Are there any indexes on the table? Try this:
;with a as (
select distinct Status
from MyTable
where MemberType=6
)
select count(Status)
from a
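As a quick sanity check that the CTE form returns the same answer as the original COUNT(DISTINCT ...), here is a small sketch using Python's built-in sqlite3 module. SQLite is only a stand-in here, not SQL Server, so this says nothing about which form SQL Server's optimizer prefers; the table contents are made up, and the index mirrors the (MemberType, Status) suggestion from the other answers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (MemberType INTEGER, Status TEXT)")
conn.executemany(
    "INSERT INTO MyTable (MemberType, Status) VALUES (?, ?)",
    [(6, "active"), (6, "active"), (6, "lapsed"), (6, None), (7, "banned")],
)
# Composite index: filter column first, counted column second.
conn.execute("CREATE INDEX ix_member_status ON MyTable (MemberType, Status)")

# Original form of the query.
direct = conn.execute(
    "SELECT COUNT(DISTINCT Status) FROM MyTable WHERE MemberType = 6"
).fetchone()[0]

# Rewritten form using a CTE.
via_cte = conn.execute(
    """
    WITH a AS (SELECT DISTINCT Status FROM MyTable WHERE MemberType = 6)
    SELECT COUNT(Status) FROM a
    """
).fetchone()[0]

print(direct, via_cte)
```

Both forms skip NULL statuses and agree on the count, so the rewrite is safe to try, but any speed-up is more likely to come from the index than from the rewrite itself.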
My team were experiencing these issues intermittently with long running SSIS packages. This has been happening since Windows server patching.
Our SSIS and SQL servers are on separate VM servers.
Working with our Wintel Servers team we rebooted both servers and for the moment, the problem appears to have gone away.
The engineer has said that they're unsure if the issue is the patches or new VMTools that they updated at the same time. We'll monitor for now and if the timeout problems recur, they'll try rolling back the VMXNET3 driver, first, then if that doesn't work, take off the June Rollup patches.
So for us the issue is nothing to do with our SQL Queries (we're loading billions of new rows so it has to be long running).
This happens because another instance of SQL Server is running, so you need to kill it first; then you will be able to log in to SQL Server.
To do that, go to Task Manager and kill (End Task) the SQL Server service, then go to Services.msc and start the SQL Server service.
While I would be tempted to blame my issues - I'm getting the same error with my query, which is much, much bigger and involves a lot of loops - on the network, I think this is not the case.
Unfortunately it's not that simple. Query runs for 3+ hours before getting that error and apparently it crashes at the same time if it's just a query in SSMS and a job on SQL Server (did not look into details of that yet, so not sure if it's the same error; definitely same spot, though).
So just in case someone comes here with a similar problem, this thread:
https://www.sqlservercentral.com/Forums/569962/The-semaphore-timeout-period-has-expired
suggests that it may equally well be a hardware issue or an actual timeout.
My loops aren't even (they depend on sales level in given month) in terms of time required for each, so good month takes about 20 mins to calculate (query looks at 4 years).
That way it's entirely possible I need to optimise my query. I would even say it's likely, as some changes I did included new tables, which are heaps... So another round of indexing my data before tearing into VM config and hardware tests.
Being aware that this is an old question: I'm on SQL Server 2012 SE, SSMS is 2018 Beta, and the VM that SQL Server runs on has exclusive use of 132GB of RAM (30% total), 8 cores, and 2TB of SSD SAN.

How do I determine the optimal number of connections that can be open on my SQL Server 2000 DB?

What is the optimal number of connections that can be open on a SQL Server 2000 DB? I know that at the previous company I worked for, on a Tru64 box with Oracle 8i on an 8-processor machine, we'd figured out that 8*12 = 96 connections seemed to be a good number. Is there any such calculation for SQL Server 2000? The DB runs on a 2-processor (hyper-threaded, 4) machine. There are a lot of transactions that run against the DB. The reason I ask is that we have an app that typically tends to leave around 100 connections open even if it is not doing anything, and I am having difficulty explaining that this might be a cause of our performance issues. Maybe SQL Server does not have such a limitation... Can any of you pour forth some wisdom on this? Much appreciated. Thanks,
I should add it is the Standard Edition.
If you don't know if this is your performance bottleneck then you should be trying to determine that, not trying to limit the connections or something.
If you haven't, you should:
1) Use SQL Profiler to find long-running queries.
2) Monitor your db server's CPU load, memory/page file usage, and network usage.
3) Find one of your longest running queries (see #1 above) and write a very lean test app that can throw this query at your db server during peak load and record some response times.
If #1 and #2 don't uncover anything, and #3 shows your db server has slow response times during load then you know you have a problem like "too many connections". But if you haven't done #3 then it seems advisable to do that, as mucking with connection limits and such seems like it will just create artificial bottlenecks, and not really get you to the root of your problem, IMO.
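The "very lean test app" from step #3 can be little more than a timing loop. Here is a minimal sketch, assuming you supply `run_query` yourself as a function that executes the slow query through whatever driver your application already uses; the function name and the summary fields are illustrative choices.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_query(run_query: Callable[[], object], repetitions: int = 20) -> Dict[str, float]:
    """Call run_query repeatedly and summarise wall-clock response times in seconds."""
    if repetitions < 1:
        raise ValueError("repetitions must be at least 1")
    samples: List[float] = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a function that runs the slow query.
    print(time_query(lambda: sum(range(100_000)), repetitions=5))
```

Running it once off-peak and once during peak load gives two sets of numbers to compare, which is more convincing evidence than the open connection count alone.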
Your performance issue will not be caused by the number of connections.
In addition to sliderhouserules' answer: as a quick fix, I'd suggest switching off hyperthreading rather than limiting your connections.
link1, link2 (note: this guy worked on the MS SQL 2005 code)
Each connection takes a trivial amount of memory. A shared db lock is for stability only.
This blog post on MSDN indicates there is no limit - at least in the Express editions: http://blogs.msdn.com/euanga/archive/2006/03/09/545576.aspx
And this indicates that it might be 256, for lite editions - http://blogs.msdn.com/stevelasker/archive/2006/04/10/SqlEverywhereInfo.aspx
This also shows no limit: http://channel9.msdn.com/forums/TechOff/169030-The-difference-between-SQL-Server-2005-Express-and-Developer-Edition/?CommentID=299642
addition - from a comment, http://msdn.microsoft.com/en-us/library/aa196730(SQL.80).aspx indicates the max is 32767, while there is no "ideal"
If the app is a long running app and it's on the same server, if the app leaves open db handles that have created a lock this is truly bad for performance. You can check something like select * from sys.dm_tran_locks or sp_lock to give you an idea.

SQL Server Express Idle Mode Partial Data Returns?

I'm attempting to help our network engineers troubleshoot a situation for one of our clients. This client purchased a point-of-sale system from quite literally a "mom-and-pop" vendor, and said vendor recommended SQL Server Express 2005 as the back-end database to save the client from having to incur extra licensing fees. (Please don't get me started on that!)
We didn't write the app, and because it's a commercial app, we have no source code available. (Not that it would help us if we did; the thing was built in PowerBuilder, so we don't have tooling for it.) The app does none of its own logging, that we can ascertain. All we have to go on is SQL Server Express's own logging.
In the application, an end user swipes a membership card. Occasionally (a few times a day), the swipe will not return data from the database. The message on screen will say, "Member 123 not found." (The member numbers are actually six digits, "000123.") A rescan immediately afterward returns the member data correctly.
We've eliminated the scanner itself as a source of issues -- it routinely scans the full six-digit number. A scan of SQL Server Express's log indicates that it is coming back online from being idle, often at the point of the scan (but also at several other times per day). (Idle mode is explained here.)
I understand that allocating/deallocating RAM the way SQL Express does is a time-consuming process, especially if we're talking about hundreds of megabytes at a time -- which appears to be the case.
What we're not sure of is whether or not we're getting back partial data, or if the app is simply failing to connect to the database and displaying a generic error message. Since everything is so opaque, and the client is (for obvious reasons) unwilling to pay us to sit in their facility for 8 hours or so to physically see it happen (perhaps with network monitoring/packet sniffing tools), we're kind of at a loss.
At this point, our recommendation is that the client upgrade to SQL Server 2005 Workgroup Edition, with 5 CALs. But that doesn't completely sit well with me as the solution to this issue, because I'm reasonably certain that no SQL Server ever returns partial data -- if you can't connect, you can't connect. (That said, I still recommend it because it's a solution to a number of their other issues!)
I don't have much experience with Express. (I never use it for anything but local development, and there only at home; I certainly never recommend it to my clients.)
My question to those who might have experience with Express is, have you ever seen an instance of SQL Express return partial data, without the app itself being the cause of it? Specifically, have you seen this behavior when returning from idle mode?
(For what it's worth, we're inclined to believe that the app is failing to connect and merely displaying a generic error message, lopping off leading zeroes on the member ID when it does. That seems the most reasonable answer -- a third question might be, do you guys concur with that assessment?)
I've never heard of or experienced SQL Server Express returning partial data. It's essentially the same code base as the full SQL Server.
It is more likely that the application is experiencing a timeout (which defaults to 30 seconds) due to SQL Server Express going idle. The application probably receives a timeout that it does not expect and does not handle it well.
The problem and possible solutions are discussed in this forum thread: http://social.msdn.microsoft.com/forums/en-US/sqlexpress/thread/a8fbf8d6-9949-47a5-a32b-50f8131f1127/
I suspect you have a connection string that looks like this:
Data Source=.\SQLEXPRESS; Integrated Security=True;AttachDbFilename=|DataDirectory|\myDatabase.mdf;User Instance=True
From the referenced thread:
This connection string will cause an initial connection to the main instance (.\SQLEXPRESS) and then instruct the main instance to spawn a new instance of SQL Server under the user's context and attach the database specified to that new User Instance. The User Instance is a completely separate running instance of SQL Server from the main instance that is unique to the user and that will be shut down when there are no longer any connections to it.
This is totally different than attaching a database to the main instance, which stays running at all times, unless you've manually shut it down. If your question is about the main instance going into an Idle state, then your question is not unique to SQL Express and you should ask this question in the Database Engine forum. I believe all Editions of SQL Server have an Idle state and the other forum would be where you can find out how to affect that behavior.
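To make the timeout explanation above concrete, here is a purely hypothetical sketch of how a client could tell "the database did not answer" apart from "the member does not exist". Nothing here reflects the actual PowerBuilder app, whose source is unavailable; `fetch` stands in for whatever data-access call it makes, and the `name` field is an assumption.

```python
from typing import Callable, Optional


def lookup_member(member_id: int, fetch: Callable[[str], Optional[dict]], attempts: int = 2) -> str:
    """Look up a member, reporting timeouts separately from genuine misses."""
    key = f"{member_id:06d}"  # keep the leading zeroes: 123 -> "000123"
    for attempt in range(1, attempts + 1):
        try:
            row = fetch(key)
        except TimeoutError:
            if attempt == attempts:
                return f"Database did not respond while looking up member {key}; please rescan."
            continue  # the first call may simply have woken an idle server
        if row is None:
            return f"Member {key} not found."
        return f"Member {key}: {row['name']}"
    raise ValueError("attempts must be at least 1")
```

An app written this way would retry once silently, which matches the observation that an immediate rescan succeeds, and would never show a misleading "not found" for a member who exists.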