Query execution slow when scaling DTUs in Azure SQL Database

I am doing a POC with real-time scenarios for a SaaS product that must handle a high volume of messages; the load reaches its peak within a few seconds (send/process). On the listener side, each message is processed and the computed data is stored in Azure SQL Database (a separate elastic pool, 100 eDTU, Standard tier). To mimic this, I send and process messages in parallel with a few nodes and threads. In this setup I see some slowness in the first few seconds of database operations; once DTU usage reaches its maximum level, query execution is normal.
Is this expected behavior?
What happens if a query executes while the DTUs are being scaled?
How can I avoid this?

When you scale the service tier of an Azure SQL Database up or down, open transactions are rolled back, server logins may be disconnected, query plans may vary because the number of threads available to a query changes, and the data cache (buffer pool) and plan cache are cleared.
Since the data cache is empty, the first time you run a query it has to do a lot of physical I/O and memory allocation, so it is slow. If you look at the slow queries, they will probably show PAGEIOLATCH_SH and MEMORY_ALLOCATION_EXT waits, which correspond to pages being read from disk into the buffer pool. The second time you run the query, the data is already in the data cache and it runs faster.
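The cold-cache effect can be illustrated with a toy model. This is a minimal sketch, assuming a plain LRU page cache; SQL Server's real buffer manager is more sophisticated, and `ToyBufferPool` and `run_query` are hypothetical names. It only shows why the first run after the cache is cleared does physical reads and the second run does none.

```python
from collections import OrderedDict
from typing import Iterable


class ToyBufferPool:
    """Toy LRU page cache that counts physical reads (misses) and logical reads."""

    def __init__(self, capacity_pages: int):
        self.capacity = capacity_pages
        self.pages: "OrderedDict[int, bool]" = OrderedDict()
        self.physical_reads = 0
        self.logical_reads = 0

    def read_page(self, page_id: int) -> None:
        self.logical_reads += 1
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # hit: mark as most recently used
            return
        # Miss: the page must come from disk (in SQL Server this shows up as PAGEIOLATCH_* waits)
        self.physical_reads += 1
        self.pages[page_id] = True
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page

    def clear(self) -> None:
        """Simulates the cache being emptied, e.g. by a scale operation."""
        self.pages.clear()


def run_query(pool: ToyBufferPool, page_ids: Iterable[int]) -> int:
    """Read every page the query touches; return how many came from disk."""
    before = pool.physical_reads
    for page_id in page_ids:
        pool.read_page(page_id)
    return pool.physical_reads - before


pool = ToyBufferPool(capacity_pages=1000)
pages = list(range(200))
print(run_query(pool, pages))  # cold cache: every page is a physical read
print(run_query(pool, pages))  # warm cache: no physical reads
```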
If the database sustains high DTU usage for an extended period, it is throttled, and you may see connection timeouts and poor query performance.
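Because logins can be disconnected and open transactions rolled back while the tier changes, client code usually needs to retry. A minimal sketch, assuming your data access layer raises a hypothetical `TransientDbError` for dropped connections; the attempt count and delays are placeholder values, not recommendations from the answer above.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientDbError(Exception):
    """Hypothetical error raised by your data layer when a connection is dropped."""


def run_with_retry(operation: Callable[[], T], max_attempts: int = 5,
                   base_delay_s: float = 0.5, max_delay_s: float = 8.0,
                   sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `operation`, retrying TransientDbError with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientDbError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # 0.5 s, 1 s, 2 s, ... capped at max_delay_s; jitter keeps clients from retrying in lockstep
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
    raise AssertionError("unreachable")
```

The `operation` passed in should be the whole unit of work (open the connection, run the transaction, commit), since a transaction that was open during the scale operation has already been rolled back and must be redone from the start.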

Related

How to set/modify the connection pool from an Azure web app to Azure SQL Database - slow app issue

I have a performance issue with my .NET app hosted on an Azure web app, which connects to Azure SQL DB with a custom connection string.
The more users there are, the slower the app gets. Therefore I am wondering whether there are improvements to make at the connection pool level.
How can I check the currently configured pool size? How can I detect SQL issues when handling requests from different users? And how do I set the pool size?
Thank you for your help.
I think it's related to the resource limits of Azure SQL Database.
When the app gets slower as the number of users grows, one of the most common reasons is that database resource limits are being reached.
Compute (DTUs and eDTUs / vCores)
When database compute utilization (measured by DTUs and eDTUs, or vCores) becomes high, query latency increases and can even time out.
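To see whether compute is the limit, utilization is commonly read from `sys.dm_db_resource_stats`, where DTU percentage is taken as the highest of the CPU, data I/O, and log write percentages. A minimal sketch, assuming you have already fetched those rows into Python; the sample values below are made up for illustration and the 90% threshold is an arbitrary placeholder.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResourceSample:
    # Field names mirror columns of sys.dm_db_resource_stats
    avg_cpu_percent: float
    avg_data_io_percent: float
    avg_log_write_percent: float


def dtu_percent(sample: ResourceSample) -> float:
    """DTU utilization is driven by whichever dimension is closest to its limit."""
    return max(sample.avg_cpu_percent, sample.avg_data_io_percent, sample.avg_log_write_percent)


def saturated_intervals(samples: List[ResourceSample], threshold: float = 90.0) -> List[int]:
    """Indexes of the intervals where DTU utilization reached the threshold."""
    return [i for i, s in enumerate(samples) if dtu_percent(s) >= threshold]


# Made-up example rows: the second interval is limited by data I/O, not CPU
samples = [
    ResourceSample(40.0, 20.0, 10.0),
    ResourceSample(35.0, 97.5, 12.0),
    ResourceSample(60.0, 30.0, 5.0),
]
print(saturated_intervals(samples))
```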
Storage
When database space used reaches the max size limit, inserts and updates that increase the data size fail and clients receive an error message. SELECTs and DELETEs continue to succeed.
Sessions and workers (requests)
The maximum number of sessions and workers are determined by the service tier and compute size (DTUs and eDTUs). New requests are rejected when session or worker limits are reached, and clients receive an error message. While the number of connections available can be controlled by the application, the number of concurrent workers is often harder to estimate and control. This is especially true during peak load periods when database resource limits are reached and workers pile up due to longer running queries.
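On the application side, the number of connections is governed by the ADO.NET connection pool, which SqlClient configures from the `Max Pool Size` (default 100) and `Min Pool Size` (default 0) connection string keywords. A minimal sketch for checking what a given connection string resolves to; the parser is deliberately simplified (it ignores quoted values that contain `;` or `=`), and the server and database names are placeholders.

```python
DEFAULT_MAX_POOL_SIZE = 100  # SqlClient default when the keyword is absent
DEFAULT_MIN_POOL_SIZE = 0


def parse_connection_string(conn_str: str) -> dict:
    """Split 'Key=Value;Key2=Value2' into a dict with lower-cased keys (simplified)."""
    result = {}
    for part in conn_str.split(";"):
        if "=" not in part:
            continue
        key, value = part.split("=", 1)
        result[key.strip().lower()] = value.strip()
    return result


def effective_pool_settings(conn_str: str) -> dict:
    """Pool settings that apply to this connection string, filling in defaults."""
    opts = parse_connection_string(conn_str)
    return {
        "pooling": opts.get("pooling", "true").lower() in ("true", "yes"),
        "min_pool_size": int(opts.get("min pool size", DEFAULT_MIN_POOL_SIZE)),
        "max_pool_size": int(opts.get("max pool size", DEFAULT_MAX_POOL_SIZE)),
    }


print(effective_pool_settings(
    "Server=tcp:myserver.database.windows.net;Database=mydb;Max Pool Size=200"))
```

Raising `Max Pool Size` only lets the app open more sessions; if the database is already at its worker limit, more connections will not make queries finish faster.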
For details, please refer to: What happens when database resource limits are reached.
If your Azure SQL DB is a single database, you can refer to these documents:
Azure SQL Database vCore-based purchasing model limits for a single database.
Resource limits for single databases using the DTU-based purchasing model.
Choose the most appropriate service tier.
For the performance issue itself, you can also use Monitoring and performance tuning, which helps you troubleshoot performance issues and improve performance.
Hope this helps.

How safe is it to clear wait stats?

I have been facing performance issues on the production server, and while reading about them on the internet I came across Brent Ozar's article about wait stats.
I want to try that, but I am not sure how safe it is to run. My production environment is constantly busy with SSIS jobs, and I don't want to kill any job or the server. So, I have a few questions:
Is it safe to run the following while queries or SQL jobs are running on the server?
DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR);
DBCC SQLPERF('sys.dm_os_latch_stats', CLEAR);
What is the difference between update stats and clearing wait stats?
Clearing wait stats has no effect on SQL Server performance; it just removes the accumulated wait stats information. You should still have a valid reason to do it, although many DBAs and SQL Server users do it quite often when troubleshooting performance issues. The only downside is that you lose valuable information about past waits. There is a way around that: before clearing the wait stats, query sys.dm_os_wait_stats and save its current output, then clear the stats and start your monitoring. That way you at least keep the statistics from before the clear.
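A related technique that avoids clearing altogether is to take two snapshots of `sys.dm_os_wait_stats` some time apart and look only at the difference. A minimal sketch, assuming each snapshot has already been read into a dict of `wait_type -> wait_time_ms`; the wait types and numbers below are illustrative only.

```python
from typing import Dict, List, Tuple

# wait_type -> cumulative wait_time_ms, as read from sys.dm_os_wait_stats
Snapshot = Dict[str, int]


def wait_deltas(before: Snapshot, after: Snapshot, top_n: int = 5) -> List[Tuple[str, int]]:
    """Wait types that accumulated the most wait time between two snapshots."""
    deltas = []
    for wait_type, after_ms in after.items():
        delta = after_ms - before.get(wait_type, 0)
        if delta < 0:
            # Counter went backwards: stats were cleared or the instance restarted
            delta = after_ms
        if delta > 0:
            deltas.append((wait_type, delta))
    deltas.sort(key=lambda item: item[1], reverse=True)
    return deltas[:top_n]


before = {"PAGEIOLATCH_SH": 1000, "CXPACKET": 500}
after = {"PAGEIOLATCH_SH": 4000, "CXPACKET": 600, "WRITELOG": 50}
print(wait_deltas(before, after))
```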
What is the difference between update stats and clearing wait stats?
They are not related to each other in any way. Statistics (the ones you refer to with update stats) describe the distribution of data, that is, how the data in SQL Server is distributed. SQL Server uses them for cardinality estimation, which helps the optimizer produce a good cost-based plan for a query. Clearing wait stats (statistics about which resources queries waited on) does not affect the data distribution statistics.
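To make "distribution statistics feed cardinality estimation" concrete, here is a heavily simplified sketch of an equality estimate from a histogram. The field names follow the histogram columns that `DBCC SHOW_STATISTICS` reports (`RANGE_HI_KEY`, `EQ_ROWS`, `AVG_RANGE_ROWS`), but the logic is an illustration under that assumption, not the optimizer's actual algorithm (for example, values outside the histogram are handled specially by the real estimator). Nothing here touches wait stats, which is the point of the answer above.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HistogramStep:
    range_hi_key: int      # upper boundary value of this step
    eq_rows: float         # rows whose value equals range_hi_key
    avg_range_rows: float  # average rows per distinct value inside the step (below range_hi_key)


def estimate_equality_rows(steps: List[HistogramStep], value: int) -> Optional[float]:
    """Simplified estimate for `WHERE col = value`; steps sorted by range_hi_key."""
    keys = [s.range_hi_key for s in steps]
    i = bisect_left(keys, value)
    if i == len(steps) or (i == 0 and value < keys[0]):
        return None  # outside the histogram: real estimators treat this case specially
    step = steps[i]
    if step.range_hi_key == value:
        return step.eq_rows
    return step.avg_range_rows


steps = [HistogramStep(10, 5, 0), HistogramStep(20, 8, 3.5), HistogramStep(30, 2, 1.0)]
print(estimate_equality_rows(steps, 15))  # falls inside the step ending at 20
```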

ASPState SQL Server database mirroring high IO

We are currently having issues with mirroring the ASPState database. We have around 10,000 active users online 9-5 every day, and the ASPState database is so write-heavy that passing these writes to the mirror drives the mirror's disk IO very high, which keeps making both servers inaccessible because of the latency of writing the data on the mirror. We're using SQL Server 2012 Standard, so we can't use asynchronous mode.
We're running SQL Server on Amazon EC2 instances with EBS-backed volumes and 1000 IOPS. In your view, should this be enough? We seem to have very smooth periods with over 15,000 users online, and other times with only 10,000 users online where we have issues with disk queue lengths on the mirror (the backup server, not the principal server).
The principal can be writing to the aspstate.mdf files at a constant 10-20 mbps when the disk queue length goes up.
We're going to increase the IOPS to 2000 in the meantime, since we've currently had to disable mirroring. However, would you expect this, and has anyone handled this sort of volume before?
Regards
Liam
The bottleneck with a high-transaction workload like ASPState is not the data file but the transaction log. With synchronous mirroring, additional latency is introduced by the network and by the synchronous commit at the mirror. This latency will not be tolerable if you have a large number of ASPState requests. Keep in mind that, unless specified otherwise, with session state enabled each ASP.NET page request requires 2 updates to a session state row. So if you have 10,000 active users clicking once every 15 seconds, that requires about 1,300 I/Os per second for transaction log writes alone on each database.
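The estimate above is simple arithmetic: 10,000 users / 15 s = about 667 requests per second, times 2 updates each = about 1,333 log writes per second. A minimal sketch of the same calculation, assuming (as the answer does) one log write per session state update; the helper name is hypothetical.

```python
def session_log_writes_per_sec(active_users: int, seconds_between_clicks: float,
                               updates_per_request: int = 2) -> float:
    """Log writes per second on each database, assuming one log write per update."""
    requests_per_sec = active_users / seconds_between_clicks
    return requests_per_sec * updates_per_request


for users in (10_000, 15_000):
    print(users, round(session_log_writes_per_sec(users, 15)))
```

Under this assumption, both user counts from the question already exceed the 1000 IOPS provisioned per volume, before any data file writes are counted.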
If you must have HA for session state, I suggest failover clustering to eliminate network latency. You might also consider tuning session state by specifying the read-only or disabled session state directive for pages that don't need full session state. Consider using an in-memory session state solution instead of the out-of-the-box ASPState session state database if you need to support a large number of users. Also remember that session state data is temporary, so you can forgo durability.

What is "Excessive resource usage" in SQL Azure?

I searched online for a while about what "Excessive resource usage" means on SQL Azure, but still can't get a clear idea.
Some articles suggest that queries taking too long, using too much memory, etc. will cause "Excessive resource usage". But if I use simple queries and a simple data structure, what will happen?
For example: I use a 1 GB SQL Azure database for session state. Since a session is a very small string and is saved/deleted all the time, I don't think it will grow to 1 GB even with millions of sessions at once. You can calculate it: 1 million sessions of 20 chars each take only 20 MB of space, considering a 20-minute expiry etc. That cannot even come close to 1 GB. But there will be lots and lots of queries, each of them very simple and fast thanks to an index.
I want to know whether this usage would be considered "Excessive resource usage". Is there any hard number that limits the usage?
Btw, in the example above, if everything happens in the same datacenter, the whole cost is the 1 GB database, which is $10 a month, right?
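The size estimate in the question checks out: 1,000,000 sessions x 20 characters = 20,000,000 bytes, about 20 MB, or about 40 MB if the column is Unicode (2 bytes per character). A minimal sketch of that calculation; `row_overhead_bytes` is a hypothetical knob for per-row overhead and index space, which the question does not specify.

```python
def session_storage_bytes(sessions: int, chars_per_session: int,
                          bytes_per_char: int = 1, row_overhead_bytes: int = 0) -> int:
    """Rough storage needed for session rows (row_overhead_bytes is an assumed placeholder)."""
    return sessions * (chars_per_session * bytes_per_char + row_overhead_bytes)


print(session_storage_bytes(1_000_000, 20) / 1_000_000, "MB")                    # single-byte chars
print(session_storage_bytes(1_000_000, 20, bytes_per_char=2) / 1_000_000, "MB")  # Unicode chars
```

Even with generous overhead the data stays far below 1 GB, which is why the answers focus on query throughput rather than storage.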
Unfortunately the answer is 'it depends'. I think the best reference (with guidance) on the SQL Azure query throttle is the TechNet article on SQL Azure Performance. It provides details about the metrics that are monitored and the mechanism of the throttle.
The reason I say it depends is that the throttle is non-deterministic for any given user, because it is activated based on the total load on the node (a physical SQL Server in the Azure DC). While the subscribers who get throttled are the ones delivering the greatest load, the level at which the throttle kicks in depends on the total load on the node. So if you are on a quiet node (where other tenant DBs are relatively inactive), you will be able to put through a lot more throughput than on a busy node.
It is very appealing to use 1GB SQL Azure DBs for session state storage; you've identified the cost benefits. You are taking a risk though. One way to mitigate this risk is to partition across at least two SQL Azure 1GB DBs and adjust the load yourself based on whether one of the DBs starts hitting the throttle.
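One way to "partition across at least two DBs and adjust the load yourself" is sketched below. This is a minimal sketch under stated assumptions: the shard names are hypothetical placeholders, throttling is detected elsewhere and reported via `mark_throttled`, and the shard index is embedded in the session ID so that reads always return to the database holding the row while new sessions avoid a throttled database.

```python
import itertools
import uuid
from typing import List, Set


class SessionShardRouter:
    """Spreads new sessions over several databases (hypothetical shard names)."""

    def __init__(self, shard_names: List[str]):
        if not shard_names:
            raise ValueError("need at least one shard")
        self.shards = shard_names
        self.throttled: Set[int] = set()
        self._next = itertools.cycle(range(len(shard_names)))

    def mark_throttled(self, shard: int, throttled: bool = True) -> None:
        if throttled:
            self.throttled.add(shard)
        else:
            self.throttled.discard(shard)

    def new_session_id(self) -> str:
        # Round-robin over healthy shards; if every shard is throttled, use the last one tried
        for _ in range(len(self.shards)):
            shard = next(self._next)
            if shard not in self.throttled:
                break
        return f"{shard}-{uuid.uuid4().hex}"

    def shard_for(self, session_id: str) -> str:
        """Database that holds this session's row, read from the ID prefix."""
        return self.shards[int(session_id.split("-", 1)[0])]


router = SessionShardRouter(["sessiondb0", "sessiondb1"])
router.mark_throttled(0)
print(router.shard_for(router.new_session_id()))  # new sessions avoid the throttled shard
```

Existing sessions stay on a throttled shard until they expire; only new sessions are steered away, which fits short-lived session state.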
Another option, if you want deterministic throughput, is to use the Windows Azure Cache to back your session state store. The Cache has hard, predefined limits on query throughput, so you can plan for it more easily (see the Azure Caching FAQ including Limits). The Cache approach is probably a bit more expensive but carries a lower risk of problems.

Is it possible to get sub-1-second latency with transactional replication?

Our database architecture consists of two SQL Server 2005 servers, each with an instance of the same database structure: one for all reads and one for all writes. We use transactional replication to keep the read database up to date.
The two servers are very high-spec indeed (the write server has 32GB of RAM), and are connected via a fibre network.
When deciding upon this architecture we were led to believe that the latency for data to be replicated to the read server would be in the order of a few milliseconds (depending on load, obviously). In practice we are seeing latency of around 2-5 seconds in even the simplest of cases, which is unsatisfactory. By a simplest case, I mean updating a single value in a single row in a single table on the write db and seeing how long it takes to observe the new value in the read database.
What factors should we be looking at to achieve latency below 1 second? Is this even achievable?
Alternatively, is there a different mode of replication we should consider? What is the best practice for the locations of the data and log files?
Edit
Thanks to all for the advice and insight. I believe the latency periods we are experiencing are normal; we were misled by our DB hosting company as to what latency to expect!
We're using the technique described near the bottom of this MSDN article (under the heading "scaling databases"), and we'd failed to deal properly with this warning:
The consequence of creating such specialized databases is latency: a write is now going to take time to be distributed to the reader databases. But if you can deal with the latency, the scaling potential is huge.
We're now looking at implementing a change to our caching mechanism that enforces reads from the write database when an item of data is considered to be "volatile".
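The "read from the write database when data is volatile" idea can be sketched as a router that remembers recent writes. A minimal sketch, assuming that "volatile" simply means "written within a time window longer than the observed replication latency"; the 5-second window is a placeholder, and this tracking is per process only (a web farm would need the write timestamps in a shared cache).

```python
import time
from typing import Callable, Dict


class ReadRouter:
    """Sends reads for recently written keys to the write (publisher) database."""

    def __init__(self, volatility_window_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self.window = volatility_window_s  # should exceed the replication latency you observe
        self.clock = clock
        self.last_write: Dict[str, float] = {}

    def record_write(self, key: str) -> None:
        self.last_write[key] = self.clock()

    def database_for_read(self, key: str) -> str:
        written_at = self.last_write.get(key)
        if written_at is not None and self.clock() - written_at < self.window:
            return "write"  # the subscriber may not have this change yet
        return "read"
```

A real implementation would also prune old entries from `last_write` so it does not grow without bound.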
No. It's highly unlikely you could achieve sub-1s latency times with SQL Server transactional replication even with fast hardware.
If you can get 1 - 5 seconds latency then you are doing well.
From here:
Using transactional replication, it is possible for a Subscriber to be a few seconds behind the Publisher. With a latency of only a few seconds, the Subscriber can easily be used as a reporting server, offloading expensive user queries and reporting from the Publisher to the Subscriber.
In the following scenario (using the Customer table shown later in this section) the Subscriber was only four seconds behind the Publisher. Even more impressive, 60 percent of the time it had a latency of two seconds or less. The time is measured from when the record was inserted or updated at the Publisher until it was actually written to the subscribing database.
I would say it's definitely possible.
I would look at:
Your network
Run ping commands between the two servers and see if there are any issues
If the servers are next to each other you should have < 1 ms.
Bottlenecks on the server
This could be network traffic (volume)
Such as network cards not being configured for 1 Gb/s
Anti-virus or other software
Do some analysis on some queries and see whether you can identify indexes or locking that might be a problem
See whether any of the selects on the read database might be blocking the writes.
Add WITH (NOLOCK) and see whether this makes a difference on one or two of the queries you're analyzing.
Essentially, you have a problem in a complicated system: you need to determine which component is causing it and fix it.
Transactional replication is probably best if the reports/selects you need to run must be up to date. If they don't, you could look at log shipping, although that would add some downtime with each import.
For data/log files, make sure they're on separate drives so performance is maximized.
Something to remember about transactional replication is that a single update now requires several operations to happen for that change to occur.
First, you update the source table.
Next, the Log Reader Agent sees the change and writes it to the distribution database.
Next, the Distribution Agent sees the new entry in the distribution database, reads that change, and then runs the correct stored procedure on the subscriber to update the row.
If you monitor the statement run times on the two servers you'll probably see that they are running in just a few milliseconds. However it is the lag time while waiting for the log reader and distribution agent to see that they need to do something which is going to kill you.
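The lag described above can be approximated with a simple model: each agent may be idle for up to its polling interval before it notices new work. A minimal sketch, assuming a change arrives at a uniformly random point in each agent's cycle (so the expected wait per hop is half the interval); both agents expose a polling interval setting, but the 5-second and 10 ms values below are placeholders, not measured or documented defaults.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ReplicationHop:
    name: str
    polling_interval_s: float  # how long the agent may sleep before noticing new work
    execution_s: float         # time to actually move or apply the change


def estimate_latency(hops: List[ReplicationHop]) -> Tuple[float, float]:
    """(expected, worst_case) seconds for one change to reach the subscriber."""
    expected = sum(h.polling_interval_s / 2 + h.execution_s for h in hops)
    worst = sum(h.polling_interval_s + h.execution_s for h in hops)
    return expected, worst


hops = [
    ReplicationHop("Log Reader Agent", polling_interval_s=5.0, execution_s=0.01),
    ReplicationHop("Distribution Agent", polling_interval_s=5.0, execution_s=0.01),
]
print(estimate_latency(hops))
```

The model makes the answer's point explicit: statement execution contributes milliseconds, while the waiting between hops dominates, so sub-second latency requires agents that wake up far more often than every few seconds.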
If you truly need sub-second processing time, then you will want to look into writing your own processing engine to move data from one server to another. I would recommend using SQL Server Service Broker for this, because everything stays native to SQL Server and no third-party code has to be written.