Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
I am responsible for a third-party application (no access to source) running on IIS and SQL Server 2005 (500 concurrent users, 1TB data, 8 IIS servers). We have recently started to see significant blocking on the database (after months of running this application in production with no problems). This occurs at random intervals during the day, approximately every 30 minutes, and affects between 20 and 100 sessions each time. All of the sessions eventually hit the application time out and the sessions abort.
The problem disappears and then gradually re-emerges. The SPID responsible for the blocking always has the following features:
WAIT TYPE = ASYNC_NETWORK_IO
The SQL being run is “(#claimid
varchar(15))SELECT claimid, enrollid,
status, orgclaimid, resubclaimid,
primaryclaimid FROM claim WHERE
primaryclaimid = #claimid AND
primaryclaimid <> claimid)”. This is
relatively innocuous SQL that should
only return one or two records, not a
large dataset.
NO OTHER SQL statements have been
implicated in the blocking, only this
SQL statement.
This is parameterized SQL for which
an execution plan is cached in
sys.dm_exec_cached_plans.
This SPID has an object-level S lock on the claim table, so all UPDATEs/INSERTs to the claim table are also blocked.
HOST ID varies. Different web servers are responsible for the blocking sessions. E.g., sometimes we trace back to web server 1, sometimes web server 2.
When we trace back to the web server implicated in the blocking, we see the following:
There is always some sort of
application related error in the
Event Log on the web server, linked
to the Host ID and Host Process ID
from the SQL Session.
The error messages vary, usually some
sort of SystemOutofMemory. (These
error messages seem to be similar to
error messages that we have seen in
the past without such dramatic
consequences. We think was happening
before, but didn’t lead to blocking.
Why now?)
No known problems with the network
adapters on either the web servers or
the SQL server.
(In any event the record set returned by the offending query would be small.)
Things ruled out:
Indexes are regularly defragmented.
Statistics regularly updated.
Increased sample size of statistics
on claim.primaryclaimid.
Forced recompilation of the cached
execution plan.
Created a compound index with
primaryclaimid, claimid.
No networking problems.
No known issues on the web server.
No changes to application software on
web servers.
We hypothesize that the chain of events goes something like this:
Web server process submits SQL
above.
SQL server executes the SQL, during
which it acquires a lock on the
claim table.
Web server process gets an error and
dies.
SQL server session is hung waiting
for the web server process to read
the data set.
SQL Server sessions that need to get
X locks on parts of the claim table
(anyone processing claims) are
blocked by the lock on the claim
table and remain blocked until they
all hit the application time out.
Any suggestions for troubleshooting while waiting for the vendor's assistance would be most welcome.
Is there a way to force SQL Server to lock at the row/page level for this particular SQL statement only?
Is there a way to set a threshold on ASYNC_NETWORK_IO waits only?
ASYNC_NETWORK_IO is caused by clients not able to receive data quick enough and filling network buffers (simply put). There is no magic SQL Server setting to fix it.
reboot the client (even if it's web server)
ensure NICs are set correctly (firmware, full duplex etc)
ensure physical cables are ok (any packet losses etc?)
etc
It is not a SQL Server issue, as such...
Blog article 1
BOL:
ASYNC_NETWORK_IO Occurs on network
writes when the task is blocked behind
the network. Verify that the client is
processing data from the server.
And another with link to MS whitepaper
I had the same problem and it got solved when I disabled the Kaspersky antivirus on the client.
Related
Just had a bizarre issue with SQL Azure, and it's happened in a small phase just before full go live with some users doing some data entry.
"Database 'dbname' on server 'xxx' is not currently available. Please rety the connection later. If the problem persists, contact customer support."
When I tried to connect via SQL Azure database website I got:
"Firewall check failed.
Resource ID : 1. The request minimum guarantee is 0,
maximum limit is 180 and the current usage for the database is 0.
However, the server is currently too busy to support request greater than 0 for this database."
Looking at the databases section of the Azure Management website the site reported it couldn't access the DB, but I didn't capture the exact error message unfortunately.
Bizarrely, a couple of my users were still able to login to our system website that access the DB, and view and save data. Eventually they lost connection too however.
After an hour or so, the databases came back to life and we could fully access them again.
I have looked at the servers master db event table using queries from here and there was a couple of connection failures but nothing interesting. No throttling or deadlocks, a couple of failed connections that said "Client may have timed out when establishing connection. Try increasing the connection timeout." in the description
Any ideas where else to look?
Business users have had a massive drop in confidence because of this.
What your describing normally occurs because of :
1) SQL Connection limit being hit. Assuming you don't see this often you unlikely to be the cause. But worth checking putting a limit on your connection pool can help.
2)You neighbours being extremely noisy and thus the node re-adjusts.
3) Hardware failure and Microsoft bringing your database back online in a different node. This can take some time.
Normally I have seen this when Microsoft have throttled or had problems with a box and had to recover everyone over. Because you are on a shared system you have to keep in mind that they are recovering everyone else also in that node also and thus sometimes this takes time.
The best bet if you are worried and need to get a resolution for the business is to open a support ticket with MS and give them the time and error message you saw this. They will investigate and generally they have really good back end telemetry that will point to a reason. This will allow you to give the business a resolution and then you can make a call on future plans and contingencies. You have to keep in mind though that SQL Azure is shared system and transient errors can happen, you might need to design more failover into your designs.
SQL Azure issue.
I've got an issue that manifests as the following exception on our (asp.net) site:
Timeout expired. The timeout period elapsed prior to completion of
the operation or the server is not responding. The statement has been
terminated.
It also results in update and insert statements never completing in SMSS. There aren't any X or IX locks present when querying: sys.dm_tran_locks and there are no transactions when querying sys.dm_tran_active_transactions or sys.dm_tran_database_transactions.
The problem is present for every table in the database but other databases on the same instance don't cause the problem. The duration of the issue can be anywhere from 2 minutes to 2 hours and doesn't happen at any specific times of day.
The database is not full.
At one point this issue didn't resolve itself but I was able to resolve the issue by querying sys.dm_exec_connections finding the longest running session, and then killing it. The odd thing is, that the connection was 15 minutes old, but the lock issue had been present for over 3 hours.
Is there anything else I can check?
EDIT
As per Paul's answer below. I'd actually tracked down the problem before he answered. I will post the steps I used to figure this out below, in case they help anyone else.
The following queries were run when a "timeout period" was present.
select * from sys.dm_exec_requests
As we can see, all the WAIT requests are waiting on session 1021 which is the replication request! The TM Request indicates a DTC transaction and we don't use distributed transactions. You can also see the wait_type of SE_REPL_COMMIT_ACK which again implicates replication.
select * from sys.dm_tran_locks
Again waiting on session 1021
SELECT * FROM sys.dm_db_wait_stats ORDER BY wait_time_ms desc
And yes, SE_REPL_CATCHUP_THROTTLE has a total wait time of 8094034
ms, that is 134.9minutes!!!
Also see the following forum for details on this issue.
http://social.technet.microsoft.com/Forums/en-US/ssdsgetstarted/thread/c3003a28-8beb-4860-85b2-03cf6d0312a8
I've been given the following answer in my communication with
Microsoft (we've seen this issue with 4 of our 15 databases in the EU
data center):
Question: Have there been changes to these soft
throttling limits in the last three weeks ie since my problems
started?
Answer: No, there has not.
Question: Are there ways we can
prevent or be warned we are approaching a limit?
Answer: No. The issue
may not be caused by your application but can be caused by other
tenants relying on the same physical hardware. In other words, your
application can have very little load and still run into the problem.
In other words, your own traffic may be a cause of this problem, but
it can just as well be caused by other tenants relying on the same
physical hardware. There's no way to know beforehand that the issue
will soon occur - it can occur at any time without warning. The SQL
Azure operations team does not monitor this type of error, so they
won't automatically try to solve the problem for you. So if you run
into it you have two opitions:
Create a copy of your db and use that and hope the db is placed on another server with less load.
Contact Windows Azure Support and inform the about the problem and let them do Option 1 for you
You might be running into the SE_REPL* issues that are currently plaguing a lot of folks using Sql Azure (my company included).
When you experience the timeouts, try checking your wait requests for wait types of:
SE_REPL_SLOW_SECONDARY_THROTTLE
SE_REPL_COMMIT_ACK
Run the following to check your wait types on current connections:
SELECT TOP 10 r.session_id, r.plan_handle,
r.sql_handle, r.request_id,
r.start_time, r.status,
r.command, r.database_id,
r.user_id, r.wait_type,
r.wait_time, r.last_wait_type,
r.wait_resource, r.total_elapsed_time,
r.cpu_time, r.transaction_isolation_level,
r.row_count
FROM sys.dm_exec_requests r
You can also check a history of sorts for this by running:
SELECT * FROM sys.dm_db_wait_stats
ORDER BY wait_time_ms desc
If you're seeing a lot of SE_REPL* wait types and these are staying set on your connections for any length of time, then basically you're screwed.
Microsoft are aware of the problem, but I've had a support ticket open for a week with them now and they're still working on it apparently.
The SE_REPL* waits happen when the Sql Azure replication slaves fall behind.
Basically the whole db suspends queries while replication catches up :/
So essentially the aspect that makes Sql Azure highly available is causing databases to become randomly unavailable.
I'd laugh at the irony if it wasn't killing us.
Have a look at this thread for details:
http://social.technet.microsoft.com/Forums/en-US/ssdsgetstarted/thread/c3003a28-8beb-4860-85b2-03cf6d0312a8
I am using IIS counters to monitor "bussiness" of IIS. Specially I like 2 of them:
Current Anonymous Users (The number of users who currently have an anonymous request pending with the WWW service. In IIS 6.0, Current Users (Anonymous or NonAnonymous) is the number of requests currently being worked on by the server)
Total Anonymous User (The number of users who have established an anonymous request since the WWW service started. This counter does not increment when files are being served from the kernel cache.)
Because my bottle neck is database, which happend to be Oracle 10g, I wonder if similar counters could be taken from Oracle server (on database level).
Basicly speaking I would like to know how many request to database ABC is waiting to be served at the moment of my request, and how many request was served since (last reset, beginnig of the day...)
How I could get this data of Oracle Server ?
V$SESSION can be used to determine how many database sessions are active at the current instant in time. This query will show you the number of user sessions (rather than background sessions that the Oracle database itself creates) are active at the current instant. You may want to further restrict this to be the number of active sessions where the USERNAME is the user that your middle tier is connecting as or the MACHINE from which the session is created is one of your middle tier servers.
SELECT COUNT(*)
FROM v$session
WHERE status = 'ACTIVE'
AND type = 'USER'
There is no easy mapping in Oracle for the "number of requests served" of a web browser. From a database perspective, there aren't any markers of when a "request" begins and ends. You could potentially count transactions but the Oracle database itself is constantly issuing transactions in the background which is likely to cause problems if you wanted a measure that would map closely to the number of web pages served.
That said, however, using counters to diagnose and monitor Oracle database performance is not a particularly good idea. Oracle has much more sophisticated monitoring and tuning tools available. Depending on the edition (standard or enterprise) and whether you've licensed the Performance and Tuning Pack, you'd be much better served grabbing an AWR report from a time period when the database was the bottleneck and analyzing that to see what needs to be tuned.
Question: Is it possible to stop SSMS from monitoring the service status of registered servers?
Details:
SSMS 2008 monitors the service status of every registered server. From what I have seen it seems to reach out to every registered server every minute or so to check it's status, in my case that is over 100 servers. This process has raised issues with our Security and Network departments. Network identified it initially as suspicious traffic due to the fact that it appeard as an unknown utility was scanning the network for SQL Servers. Security was concerned because the Security Event Logs on each server are being filled up with my logon events.
I have looked all over for a setting but can't seem to find one. Am I missing it somewhere?
TIA,
Brian
I finally found an answer!!
While it is not possible (at least that I've found) to stop SSMS from checking the service status of registered servers it is possible to change the interval at which it checks it.
The short version is to create the following registry keys (DWORD):
(SQL Server 2008)
HKLM\Software\Microsoft\Microsoft SQL Server\100\Tools\Shell | PollingInterval = 600 (decimal)
(SQL Server 2005)
HKLM\Software\Microsoft\Microsoft SQL Server\90\Tools\Shell | PollingInterval = 600 (decimal)
This will make SSMS connect automatically every minute instead of every few seconds.
See this MS Connect Post for details.
Since it doesn't appear that there's any way to stop these status checks by SSMS, can you focus on helping them to see their harmlessness?
Can the network group allow certain exceptions to this particular rule (pinging servers on port 1433) in their scanning software, which would allow you and your group to monitor SQL Server uptime? Even if you weren't using SSMS, this type of sweeping monitoring activity is pretty common, and you'll know the requests will only ever come from a handful of workstations.
I don't think these SQL status checks generate any more events in the security log than any other activity, so maybe they were just concerned because it was something they weren't expecting. Could the security group be convinced that these events aren't dangerous, again as long as they're coming from certain approved workstations?
If neither of these is an option (or even if it is), you could help mitigate the problem by not connecting to all your SQL servers at once. Maybe just connect to the ones you need at the time - it looks like loading the entire list actively connects to each of them, but just connecting to the ones you intend to use in that session might help reduce the number of network sessions open.
I hope this helps - if it doesn't, or you've got some additional input that might help find a workaround, please post it!
I'm attempting to help our network engineers troubleshoot a situation for one of our clients. This client purchased a point-of-sale system from quite literally a "mom-and-pop" vendor, and said vendor recommended SQL Server Express 2005 as the back-end database to save the client from having to incur extra licensing fees. (Please don't get me started on that!)
We didn't write the app, and because it's a commercial app, we have no source code available. (Not that it would help us if we did; the thing was built in PowerBuilder, so we don't have tooling for it.) The app does none of its own logging, that we can ascertain. All we have to go on is SQL Server Express's own logging.
In the application, an end user swipes a membership card. Occasionally (a few times a day), the swipe will not return data from the database. The message on screen will say, "Member 123 not found." (The member numbers are actually six digits, "000123.") A rescan immediately afterward returns the member data correctly.
We've eliminated the scanner itself as a source of issues -- it routinely scans the full six-digit number. A scan of SQL Server Express's log indicates that it is coming back online from being idle, often at the point of the scan (but also at several other times per day). (Idle mode is explained here.)
I understand that allocating/deallocating RAM the way SQL Express does is a time-consuming process, especially if we're talking about hundreds of megabytes at a time -- which appears to be the case.
What we're not sure of is whether or not we're getting back partial data, or if the app is simply failing to connect to the database and displaying a generic error message. Since everything is so opaque, and the client is (for obvious reasons) unwilling to pay us to sit in their facility for 8 hours or so to physically see it happen (perhaps with network monitoring/packet sniffing tools), we're kind of at a loss.
At this point, our recommendation is that the client upgrade to SQL Server 2005 Workgroup Edition, with 5 CALs. But that doesn't completely sit well with me as the solution to this issue, because I'm reasonably certain that no SQL Server ever returns partial data -- if you can't connect, you can't connect. (That said, I still recommend it because it's a solution to a number of their other issues!)
I don't have much experience with Express. (I never use it for anything but local development, and there only at home; I certainly never recommend it to my clients.)
My question to those who might have experience with Express is, have you ever seen an instance of SQL Express return partial data, without the app itself being the cause of it? Specifically, have you seen this behavior when returning from idle mode?
(For what it's worth, we're inclined to believe that the app is failing to connect and merely displaying a generic error message, lopping off leading zeroes on the member ID when it does. That seems the most reasonable answer -- a third question might be, do you guys concur with that assessment?)
I've never heard of or experienced SQL Server Express returning partial data. It's essentially the same code base as the full SQL Server.
It is more likely that the application is experiencing a timeout (which defaults to 30 seconds) due to SQL Server Express going idle. The application probably receives a timeout that it does not expect and does not handle it well.
The problem and possible solutions are discussed in this forum thread: http://social.msdn.microsoft.com/forums/en-US/sqlexpress/thread/a8fbf8d6-9949-47a5-a32b-50f8131f1127/
I suspect you have a connection string that looks like this:
Data Source=.\SQLEXPRESS; Integrated Security=True;AttachDbFilename=|DataDirectory|\myDatabase.mdf;User Instance=True
From the referenced thread:
This connection string will cause an
initial connection to the main
instance (.\SQLEXPRESS) and then
instruct the main instance to spawn a
new instance of SQL Server under the
user's context and attach the database
specified to that new User Instance.
The User Instance is a completely
separate running instance of SQL
Server form the main instance that is
unique to the user and that will be
shut down when there are no longer any
connections to it.
This is totally different that
attaching a database to the main
instance, which stays running at all
times, unless you've manually shut it
down. If your question is about the
main instance going into an Idle
state, then your question is not
unique to SQL Express and you should
ask this question in the Database
Engine forum. I believe all Editions
of SQL Server have an Idle state and
the other forum would be where you can
find out how to affect that behavior.