CXSYNC_PORT wait type in Azure SQL Database

I'm facing this issue intermittently now, where a query (called from a stored procedure) hits the CXSYNC_PORT wait type and remains in it for a long time (sometimes 8 hours at a stretch). I have to kill the process and then rerun the procedure. The procedure is called every 2 hours from an ADF pipeline.
What's the reason for this behavior, and how do I fix the issue?

I searched a lot, and there are no Microsoft documents that discuss the wait type CXSYNC_PORT. Others have asked the same question, but still with no more details.
Most suggestions are to ask the same question in other forums, or to ask a professional engineer for help, who will deal with your problem separately and confidentially.
You can ask Azure support for detailed help: https://learn.microsoft.com/en-us/azure/azure-portal/supportability/how-to-create-azure-support-request
And here's the same question, where a Microsoft engineer gave more details about the issue:
As part of a fix, CXPACKET waits were further broken down into CXSYNC_CONSUMER and CXSYNC_PORT (with data transfer waits still reported as CXPACKET) so as to distinguish between the different wait times for a correct diagnosis of the problem.
Basically, CXPACKET is divided into 3: CXPACKET, CXSYNC_PORT, and CXSYNC_CONSUMER. CXPACKET is used for data transfer sync, while CXSYNC_* are used for other synchronizations. CXSYNC_PORT is used for synchronizing the opening/closing of an exchange port between a consuming thread and a producing thread. Long waits here may indicate server load and a lack of available threads. Plans containing a sort may contribute to this wait type because the complete sort may occur before the port is synchronized.
Please refer to this link, What is causing wait type CXSYNC_PORT and what to do about it?, for more useful information. But for now, there isn't an exact solution.
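In the meantime, you can at least confirm that these waits are what your query is accumulating. A minimal diagnostic sketch (not from the original thread; sys.dm_db_wait_stats is the per-database wait view in Azure SQL Database):

SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_db_wait_stats
WHERE wait_type IN ('CXPACKET', 'CXSYNC_PORT', 'CXSYNC_CONSUMER')
ORDER BY wait_time_ms DESC;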

Use the query hint OPTION(MAXDOP 1).
This will run your long-running query on a single thread, and you won't get the CX* wait types. In my experience this can yield a massive 10-20x decrease in execution time and will free up CPU for other tasks, since there is no context switching and thread coordination activity.
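For example, on a hypothetical aggregation query (the table and column names below are placeholders, not from the question), the hint is appended to the end of the statement:

SELECT o.CustomerId, SUM(o.Amount) AS TotalAmount
FROM dbo.Orders AS o
GROUP BY o.CustomerId
OPTION (MAXDOP 1);

If you'd rather not touch every statement, the same cap can be applied database-wide in Azure SQL Database with ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 1, but the hint keeps the change scoped to the one problematic query.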

Related

boost::asio and boost::thread_group where each thread has its own libpqxx connection

I'm trying to combine boost::asio and boost::thread_group, where each thread has its own libpqxx (Postgres) connection to the database. I can't seem to find any examples of asio/thread_group where the thread the task runs on has connection-specific information. Asio seems to be specialized for the case where the task contains all the information required to run it. Am I looking at the wrong combination to solve my specific problem?
I have a lot of requests coming in to my program, and each of these requests requires SQL commands to be run against the DB (TimescaleDB in my case). These requests must be run on a limited number of connections against the DB (normally 8 in total).
My plan was to set up a thread_group of 8 threads, each with its own connection to the DB and each running asio::run, so that I could submit new queries via asio::post and get a callback via signals2 when the result comes in.
Asio "hides" the threads, and thanks to asio strands you can more or less avoid concurrency issues. In short, you just hand tasks to asio, and as soon as a thread is available your task is run. But asio has a learning curve, as does concurrency in general...
As you describe your problem, thread-local storage is the answer: give each worker thread its own connection in thread-local storage, and any task that runs on a given thread uses that thread's connection.

How can I recreate a blocking process which uses FETCH API_CURSOR?

My organization has recently had trouble with some SQL Server blocking processes. dbWarden has successfully reported blocking to us, but we often have the blocking SQL text reported as 'FETCH API_CURSOR'.
So, we're looking to alter the blocking alerts trigger in dbWarden to use sys.dm_exec_cursors and sys.dm_exec_sql_text to retrieve the text in the case where we find 'FETCH API_CURSOR' reported.
Trouble is, I cannot seem to come up with a way to recreate/simulate a blocking situation on our development server that will report as 'FETCH API_CURSOR'. I've started from the VB script here on SQL Authority to recreate the open cursor, but I cannot for the life of me figure out how to make it blocking.
I've seen many methods for recreating blocking transactions (open a transaction in one window but do not commit/close it, then try an update on the same table in another), but none that would utilize FETCH API_CURSOR in a way that would allow us to test successfully. I'm somewhat at a loss here.
Has anyone had success in simulating blocking cursors in the past and can offer suggestions?
I'd suggest you use the Profiler tool to capture the actual code that creates and fetches the cursor. That way, you'd see exactly what's going on in the application. It is not so difficult to reproduce similar blocking on a development server.
Let's say one thread fetches rows from a cursor and another thread tries to UPDATE the same rows. See what's going on under the hood. The reading thread creates a cursor to fetch the result of a SELECT back to the application. This technology is ancient and extremely slow, and nowadays only some old (mostly Java) applications use cursors for this purpose. Rows get fetched one by one, with the client driving the process, so it takes time. During this time, the reading thread holds shared locks on the data it reads. This is by design: SQL Server is a locking database; it uses locks to function properly.
If another thread tries to update a row that has been locked with a shared lock, it gets blocked: the updating thread takes update (U) locks while searching for rows to update and must convert them to something more serious (an exclusive lock) to do the modification, but it can't while another thread owns an S lock on the same row. So I'd try to create a cursor, fetch several rows, and try to update them in another tab. If you run into difficulties, try increasing the reading transaction's isolation level.
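A minimal sketch of that repro (the table and column names are hypothetical, and you may need to adjust the cursor type or isolation level, as noted above, before the blocking shows up):

-- Session 1: read through a cursor and keep the shared locks.
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRANSACTION;
DECLARE c CURSOR FOR SELECT Id FROM dbo.SomeTable;
OPEN c;
FETCH NEXT FROM c;  -- fetch a row; leave the cursor and transaction open

-- Session 2 (another tab): this should now block behind session 1.
UPDATE dbo.SomeTable SET SomeColumn = 0 WHERE Id = 1;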
But seriously, I don't think you have to reproduce this or a similar scenario on a development server. Stop using cursors for recordset fetching! Reads will be much faster, and blocking issues will be reduced a lot. It's been a while since 1989, when cursors saw their glory days. Client DB-access libraries have evolved a lot; it is worth picking up the fruits of that progress. Even in Java, whether or not to use them is a configuration option.
I apologize if the cursor is being used on purpose in this case. It is very improbable, but possible. I haven't seen such 'proper' cursor usage for ages! I'd be delighted to run into one more proper case.

TADOStoredProc/TADOQuery CommandTimeout...?

On a client machine, a "Timeout" error is raised when running some commands against the database.
My first option for a fix is to increase the CommandTimeout to 99999... but I am afraid that this treatment will generate further problems.
Has anyone experienced this...?
I wonder whether my concern is justified, and/or whether there is another, more robust and elegant correction.
You are correct to assume that upping the timeout is not the correct approach. Typically, I look for long-running queries that are running up against the timeouts. They will usually stand out in the areas of duration and reads.
Then I'll work to reduce the query run time using this method:
https://www.simple-talk.com/sql/performance/simple-query-tuning-with-statistics-io-and-execution-plans/
If it's a report causing issues and you can't get it running faster, you may need to start thinking about setting up a reporting database.
CommandTimeout is the time the client waits for a response from the server. If the query runs in the main VCL thread, then the whole application is "frozen" and might be marked "not responding" by Windows. So, would you expect your users to wait at a frozen app for 99999 seconds?
Generally, leave the timeout values at their defaults and concentrate on tuning the queries, as Sam suggests. If you do have long-running queries (e.g. some background data movement, calculations, etc. in stored procedures), set the CommandTimeout to 0 (= INFINITE) but run them in a separate thread.

SQL Azure - One session locking entire DB for Update and Insert

SQL Azure issue.
I've got an issue that manifests as the following exception on our (asp.net) site:
Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. The statement has been terminated.
It also results in update and insert statements never completing in SSMS. There aren't any X or IX locks present when querying sys.dm_tran_locks, and there are no transactions when querying sys.dm_tran_active_transactions or sys.dm_tran_database_transactions.
The problem is present for every table in the database, but other databases on the same instance don't cause the problem. The duration of the issue can be anywhere from 2 minutes to 2 hours, and it doesn't happen at any specific time of day.
The database is not full.
At one point this issue didn't resolve itself, but I was able to resolve it by querying sys.dm_exec_connections, finding the longest-running session, and then killing it. The odd thing is that the connection was only 15 minutes old, but the lock issue had been present for over 3 hours.
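The kill step looked roughly like this (a sketch; the session id below is illustrative):

-- find the oldest connection
SELECT TOP 1 session_id, connect_time
FROM sys.dm_exec_connections
ORDER BY connect_time ASC;

-- then terminate it (replace 999 with the session_id returned above)
KILL 999;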
Is there anything else I can check?
EDIT
As per Paul's answer below. I'd actually tracked down the problem before he answered. I will post the steps I used to figure this out below, in case they help anyone else.
The following queries were run when a "timeout period" was present.
select * from sys.dm_exec_requests
As we can see, all the WAIT requests are waiting on session 1021, which is the replication request! The TM Request command indicates a DTC transaction, and we don't use distributed transactions. You can also see the wait_type of SE_REPL_COMMIT_ACK, which again implicates replication.
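A narrower version of that query makes the blocking chain easier to see (a sketch, not part of the original diagnosis):

SELECT session_id, blocking_session_id, wait_type, wait_time, command
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;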
select * from sys.dm_tran_locks
Again waiting on session 1021
SELECT * FROM sys.dm_db_wait_stats ORDER BY wait_time_ms DESC
And yes, SE_REPL_CATCHUP_THROTTLE has a total wait time of 8094034 ms; that is 134.9 minutes!
Also see the following forum for details on this issue.
http://social.technet.microsoft.com/Forums/en-US/ssdsgetstarted/thread/c3003a28-8beb-4860-85b2-03cf6d0312a8
I've been given the following answer in my communication with Microsoft (we've seen this issue with 4 of our 15 databases in the EU data center):
Question: Have there been changes to these soft throttling limits in the last three weeks, i.e. since my problems started?
Answer: No, there have not.
Question: Are there ways we can prevent this, or be warned that we are approaching a limit?
Answer: No. The issue may not be caused by your application; it can be caused by other tenants relying on the same physical hardware. In other words, your own traffic may be a cause of this problem, but it can just as well be caused by other tenants on the same hardware, so your application can have very little load and still run into it. There's no way to know beforehand that the issue will soon occur; it can occur at any time without warning. The SQL Azure operations team does not monitor this type of error, so they won't automatically try to solve the problem for you. So if you run into it you have two options:
Create a copy of your db (as sketched below) and use that, and hope the copy is placed on another server with less load.
Contact Windows Azure Support, inform them about the problem, and let them do option 1 for you.
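For option 1, the database copy is a single T-SQL statement run in the master database of the logical server (the database names here are placeholders):

CREATE DATABASE MyDb_Copy AS COPY OF MyDb;

Once the copy is online, repoint your connection strings at it and hope it landed on quieter hardware.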
You might be running into the SE_REPL* issues that are currently plaguing a lot of folks using SQL Azure (my company included).
When you experience the timeouts, try checking your wait requests for wait types of:
SE_REPL_SLOW_SECONDARY_THROTTLE
SE_REPL_COMMIT_ACK
Run the following to check your wait types on current connections:
SELECT TOP 10 r.session_id, r.plan_handle,
r.sql_handle, r.request_id,
r.start_time, r.status,
r.command, r.database_id,
r.user_id, r.wait_type,
r.wait_time, r.last_wait_type,
r.wait_resource, r.total_elapsed_time,
r.cpu_time, r.transaction_isolation_level,
r.row_count
FROM sys.dm_exec_requests r
You can also check a history of sorts for this by running:
SELECT * FROM sys.dm_db_wait_stats
ORDER BY wait_time_ms desc
If you're seeing a lot of SE_REPL* wait types and these are staying set on your connections for any length of time, then basically you're screwed.
Microsoft are aware of the problem, but I've had a support ticket open with them for a week now, and apparently they're still working on it.
The SE_REPL* waits happen when the SQL Azure replication slaves fall behind.
Basically, the whole db suspends queries while replication catches up :/
So essentially the aspect that makes SQL Azure highly available is causing databases to become randomly unavailable.
I'd laugh at the irony if it wasn't killing us.
Have a look at this thread for details:
http://social.technet.microsoft.com/Forums/en-US/ssdsgetstarted/thread/c3003a28-8beb-4860-85b2-03cf6d0312a8

Batch printing exception

I get this error while printing multiple .xps documents to a physical printer:
' Get the target print queue and spool the XPS document to it.
Dim defaultPrintQueue As PrintQueue = GetForwardPrintQueue(My.Settings.SelectedPrinter)
Dim xpsPrintJob As PrintSystemJobInfo
' AddJob(jobName, documentPath, fastCopy) spools the file to the queue.
xpsPrintJob = defaultPrintQueue.AddJob(JobName, Document, False)
Documents are spooled successfully until a print job exception occurs.
The InnerException is "Insufficient memory to continue the execution of the program."
The source is PresentationCore.dll.
Where should I start searching?
When attempting to perform tasks that may fail due to temporary or permanent restrictions on some resource, I tend to use a back-off strategy. This strategy has been followed on things as diverse as message queuing and socket opens.
The general process for such a strategy is as follows.
set maxdelay to 16   # maximum time period between attempts
set maxtries to 10   # maximum attempts
set delay to 0
set tries to 0
while more actions needed:
    if delay is not 0:
        sleep delay
    attempt action
    if action failed:
        add 1 to tries
        if tries is greater than maxtries:
            exit with permanent error
        if delay is 0:
            set delay to 1
        else:
            double delay
            if delay is greater than maxdelay:
                set delay to maxdelay
    else:
        set delay to 0
        set tries to 0
This allows the process to run at full speed in the vast majority of cases but backs off when errors start occurring, hopefully giving the resource provider time to recover. The gradual increase in delays allows for more serious resource restrictions to recover and the maximum tries catches what you would term permanent errors (or errors that are taking too long to recover).
I actually prefer this try-it-and-catch-failure approach to the check-if-okay-then-try one since the latter can still often fail if something changes between the check and the try. This is called the "better to seek forgiveness than ask permission" method, which also works quite well with bosses most of the time, and wives a little less often :-)
One particularly useful case was a program which opened a separate TCP session for each short-lived transaction. On older hardware, the closed sockets (those in the TCP TIME_WAIT state) eventually disappeared before they were needed again.
But, as the hardware got faster, we found that we could open sessions and do work much quicker and Windows was running out of TCP handles (even when increased to the max).
Rather than having to re-engineer the communications protocol to maintain sessions, this strategy was implemented to allow graceful recovery in the event handles were starved.
Granted it's a bit of a kludge but this was legacy software approaching end-of-life, where bug fixes are often just enough to get it working and it wasn't deemed strategic enough to warrant spending a lot of money in fixing it properly.
Update: It may be that there's a (more permanent) problem with PresentationCore. This KB article states that there's a memory leak in WPF within .NET 3.5 SP1 (of which your print driver may be a client).
If the back-off strategy doesn't fix your problem (it may not, if it's a leak in a long-lived process), you might want to try applying the hotfix. Me, I'd replicate the problem in a virtual machine and then patch that to test it (but I'm an extreme paranoid).
It was found by googling "PresentationCore Insufficient memory to continue the execution of the program" and checking the first link here. Search for the string "hotfix that relates to this issue" on that page.
Before adding a new job to the queue you should check the queue state. See the PrintQueue.IsOutOfMemory property and the related properties that can be queried to verify that the queue is not in an error state.
Of course, pax's hint to use a defensive strategy when accessing resources like printers is best practice. For starters, you may want to put the line that adds the job into a try block.
You might want to consider launching a new process to handle the printing of each document; the overhead should be low compared to the effort of printing the documents.