SQL MIN_ACTIVE_ROWVERSION() value does not change for a long while

We're troubleshooting a sort of Sync Framework between two SQL Server databases on separate servers (both SQL Server 2008 Enterprise 64-bit SP2, 10.0.4000.0), connected through linked servers, and we've reached a point where we're stuck.
The logic to identify which are the records "pending to be synced" is of course based on ROWVERSION values, including the use of MIN_ACTIVE_ROWVERSION() to avoid dirty reads.
All SELECT operations are encapsulated in SPs on each "source" side. This is a schematic sample of one SP:
PROCEDURE LoaderRetrieve(@LastStamp bigint, @Rows int)
BEGIN
    ...
    (vars handling)
    ...
    SET TRANSACTION ISOLATION LEVEL SNAPSHOT
    SELECT TOP (@Rows) Field1, Field2, Field3
    FROM Table
    WHERE [RowVersion] > @LastStampAsRowVersionDataType
      AND [RowVersion] < @MinActiveVersion
    ORDER BY [RowVersion]
END
The approach works just fine: we usually sync records at the expected rate of 600k/hour (a job every 30 seconds, batch size = 5k). But at some point the sync process stops finding any records to transfer, even though there are several thousand records with a ROWVERSION value greater than the @LastStamp parameter.
When checking the reason, we found that MIN_ACTIVE_ROWVERSION() has a value less than (or only slightly greater than, by just 5 or 10 increments) the @LastStamp being searched. This of course shouldn't be a problem, since the MIN_ACTIVE_ROWVERSION() approach was introduced to avoid dirty reads and subsequent issues, BUT:
The problem we sometimes see while the above scenario occurs is that the value of MIN_ACTIVE_ROWVERSION() does not change for a long (really long) period of time, like 30 or 40 minutes, sometimes more than an hour. And this value is far lower than the @@DBTS value.
We first thought this was related to a pending DB transaction not yet committed. As per the MSDN definition of MIN_ACTIVE_ROWVERSION():
Returns the lowest active rowversion value in the current database. A rowversion value is active if it is used in a transaction that has not yet been committed.
But when checking sessions (sys.sysprocesses) with open_tran > 0 while the issue was happening, we couldn't find any session with a waittime greater than a few seconds, apart from one or two sessions with a waittime of around 5 minutes.
So at this point we're struggling to understand the situation: MIN_ACTIVE_ROWVERSION() does not change for a very long time, and no uncommitted transactions with long waits are found within that time frame.
I'm not a DBA, and it could be that we're missing something needed to analyze this problem; researching forums and blogs didn't turn up any other clue. So far open_tran > 0 was the valid explanation, but under the circumstances I've described, it's clear that something else is going on and I don't know what.
Any feedback is appreciated.

Well, I finally found the solution after digging a bit more.
The problem is that we were looking for sessions with a long waittime, when what we really needed to find were sessions whose batch had been active for a while.
If there's a session where open_tran = 1, the last_batch field from sys.sysprocesses should be checked to find out how long that transaction has been open (and of course still active, not yet committed).
Using this query:
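-- last_batch is recorded in server local time, so compare it with GETDATE()
SELECT
    batchDurationMin  = DATEDIFF(second, last_batch, GETDATE()) / 60.0,
    batchDurationSecs = DATEDIFF(second, last_batch, GETDATE()),
    hostname, open_tran, *
FROM sys.sysprocesses a
WHERE spid > 50          -- skip system sessions
  AND a.open_tran > 0    -- only sessions with an open transaction
ORDER BY last_batch ASC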
we could identify a session with a transaction that had been open for 30+ minutes. With the hostname values and some more checks within the web services (and also using DBCC INPUTBUFFER), we found the responsible process.
So the final answer is that there was indeed an active session with an uncommitted transaction, and therefore MIN_ACTIVE_ROWVERSION() did not change. We were just looking for processes with the wrong criteria.
Now that we know which process behaves like this, the next step will be to improve it.
Hope this is useful to someone else.
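As a complement to the sys.sysprocesses query above, here is a minimal sketch that uses the transaction DMVs, which report when each transaction actually began rather than when the session's last batch started. It assumes the caller has VIEW SERVER STATE permission; the column aliases are only illustrative.

```sql
-- How far MIN_ACTIVE_ROWVERSION() lags behind the last used rowversion.
-- With no active rowversions, MIN_ACTIVE_ROWVERSION() returns @@DBTS + 1.
SELECT MIN_ACTIVE_ROWVERSION() AS min_active_rowversion,
       @@DBTS                  AS last_used_rowversion;

-- Sessions holding open transactions, oldest transaction first.
SELECT
    st.session_id,
    tx.transaction_id,
    tx.name AS transaction_name,
    tx.transaction_begin_time,   -- server local time
    DATEDIFF(second, tx.transaction_begin_time, GETDATE()) AS open_seconds,
    es.host_name,
    es.program_name,
    es.login_name
FROM sys.dm_tran_session_transactions AS st
JOIN sys.dm_tran_active_transactions AS tx
    ON tx.transaction_id = st.transaction_id
JOIN sys.dm_exec_sessions AS es
    ON es.session_id = st.session_id
WHERE es.is_user_process = 1
ORDER BY tx.transaction_begin_time ASC;
```

A session near the top of the second result set with a large open_seconds value is a likely candidate for whatever is pinning MIN_ACTIVE_ROWVERSION().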

Related

Finding statistical outliers in timestamp intervals with SQL Server

We have a bunch of devices in the field (various customer sites) that "call home" at regular intervals, configurable at the device but defaulting to 4 hours.
I have a view in SQL Server that displays the following information in descending chronological order:
DeviceInstanceId uniqueidentifier not null
AccountId int not null
CheckinTimestamp datetimeoffset(7) not null
SoftwareVersion string not null
Each time the device checks in, it will report its id and current software version which we store in a SQL Server db.
Some of these devices are in places with flaky network connectivity, which obviously prevents them from operating properly. There are also a bunch in datacenters where administrators regularly forget about it and change firewall/ proxy settings, accidentally preventing outbound communication for the device. We need to proactively identify this bad connectivity so we can start investigating the issue before finding out from an unhappy customer... because even if the problem is 99% certainly on their end, they tend to feel (and as far as we are concerned, correctly) that we should know about it and be bringing it to their attention rather than vice-versa.
I am trying to come up with a way to query all distinct DeviceInstanceId that have currently not checked in for a period of 150% their normal check-in interval. For example, let's say device 87C92D22-6C31-4091-8985-AA6877AD9B40 has, for the last 1000 checkins, checked in every 4 hours or so (give or take a few seconds)... but the last time it checked in was just a little over 6 hours ago now. This is information I would like to highlight for immediate review, along with device E117C276-9DF8-431F-A1D2-7EB7812A8350 which normally checks in every 2 hours, but it's been a little over 3 hours since the last check-in.
It seems relatively straightforward to brute-force this, looping through all the devices, examining the average interval between check-ins, seeing what the last check-in was, comparing that to current time, etc... but there's thousands of these, and the device count grows larger every day. I need an efficient query to quickly generate this list of uncommunicative devices at least every hour... I just can't picture how to write that query.
Can someone help me with this? Maybe point me in the right direction? Thanks.
I am trying to come up with a way to query all distinct DeviceInstanceId that have currently not checked in for a period of 150% their normal check-in interval.
I think you can do:
select *
from (select DeviceInstanceId,
             datediff(second, min(CheckinTimestamp), max(CheckinTimestamp)) / nullif(count(*) - 1, 0) as avg_secs,
             max(CheckinTimestamp) as max_CheckinTimestamp
      from t
      group by DeviceInstanceId
     ) t
-- sysdatetimeoffset() keeps the comparison offset-aware, since CheckinTimestamp is a datetimeoffset
where max_CheckinTimestamp < dateadd(second, - avg_secs * 1.5, sysdatetimeoffset());

Write to the same row at the same time without locking?

What I need to do is to write to the same row from two different sources (procedures/methods/services).
The first call that comes in creates the row, and the next one just updates it.
This needs to happen without any locking taking place. And if possible I would like to be able to call either source just once (not repeatedly by dealing with locking errors)
Here is kinda what I have now in a third procedure that the others call and just inserts a row (only inserts into the xyz) or returns true if there is a row.
That way it's just fast and unlikely that both calls arrive at the same time.
IF EXISTS(SELECT * FROM [dbo].[Wait] WHERE xyz = @xyz)
BEGIN
    -- The row exists because the other data source
    -- has already inserted a row with the same xyz
    -- UPDATE THE ROW WITH DATA COMING IN
END
ELSE
BEGIN
    -- No row with value xyz exists so we INSERT it with
    -- the extra data.
END
I know it doesn't guarantee no lock. But in my case it's actually unlikely that both arrive at the same time, and even if they did, it's user controlled, so they will get an error and just try again. BUT I want to solve this.
I have been seeing Row Versioning popping up but I'm not sure if that helps or how I should use it.
Have a look at Michael J. Swart's article Mythbusting: Concurrent Update/Insert Solutions. It walks through the do's and don'ts, including the fact that MERGE on its own doesn't do a great job of solving concurrency issues.
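For illustration, one commonly used pattern is an update-then-insert inside a short transaction with UPDLOCK and HOLDLOCK hints. This is a sketch, not necessarily what the article recommends, and it does not avoid locking; it just keeps the lock brief so that neither caller needs to retry. The Payload column, the variable types, and an index on xyz are assumptions that are not in the question.

```sql
-- Hypothetical inputs; in the real procedure these would be parameters.
DECLARE @xyz int = 42, @payload nvarchar(100) = N'example';

BEGIN TRANSACTION;

-- UPDLOCK + HOLDLOCK holds a range lock on the key until COMMIT, so a
-- concurrent caller with the same xyz waits here instead of both callers
-- deciding that the row is missing and both attempting the INSERT.
UPDATE [dbo].[Wait] WITH (UPDLOCK, HOLDLOCK)
SET    Payload = @payload
WHERE  xyz = @xyz;

-- No row was updated, so this caller is the first one in: create the row.
IF @@ROWCOUNT = 0
    INSERT INTO [dbo].[Wait] (xyz, Payload)
    VALUES (@xyz, @payload);

COMMIT TRANSACTION;
```

Without an index on xyz the range lock can cover far more than the one key, which would make the blocking noticeably worse.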

Why is the row count increasing but the max id does not change?

This is ... interesting. I'm trying to delete a bunch of records (~2 million) in a table. After waiting for about 4 hours for a simple delete to finish, I started investigating.
delete from mytable where date > getutcdate()
If I do a few count queries, the total row count increases but the max id (identity) does not change.
select count(1) from mytable with(nolock)
select max(Id) from mytable with(nolock)
I made sure I had the only connections open by killing every session that wasn't from my IP address.
select * from sys.dm_exec_connections
kill 123
kill 124
kill 125
-- etc
Yet still, the total row count increases and the max Id stays the same. What on earth could cause this??
Update
It looks like my original query is still running. I swear I already killed it, but if I try to kill it now, it says "Command(s) completed successfully", and yet it still shows as running if I run this query again:
SELECT * 
FROM sys.dm_exec_sessions s
LEFT JOIN sys.dm_exec_connections c
ON  s.session_id = c.session_id
LEFT JOIN sys.dm_db_task_space_usage tsu
ON  tsu.session_id = s.session_id
LEFT JOIN sys.dm_os_tasks t
ON  t.session_id = tsu.session_id
AND t.request_id = tsu.request_id
LEFT JOIN sys.dm_exec_requests r
ON  r.session_id = tsu.session_id
AND r.request_id = tsu.request_id
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) TSQL
Update 2
The row count finally stopped going up, and the locks are released, so it does appear that it was rolling back for a couple hours. With some help from my boss, who just happened to be online tonight (it's 2 AM EST), we rebuilt a few indexes and tried a different approach.
DELETE_MORE:
DELETE TOP (5000) FROM mytable WHERE date > getutcdate()
IF @@ROWCOUNT > 0 GOTO DELETE_MORE
Yes, that's a GOTO. Yes, that's the only reason I've ever found in my professional career to use one. Now that that's out of the way ... this will delete rows in groups of 5000, minimizing locks and rollbacks if it fails. This seems to be working well as it is running as I type this.
If you have completely ruled out anyone else modifying the table, then it may well be that the rollback is still completing. Querying the table with nolock bypasses locking, so you can be reading data that is being rolled back, and the rollback may have begun at the max id and be working backwards. Try executing the count query without the nolock. Open a new session and run sp_who2 and sp_lock to see if your query is being blocked. sp_who2 will also show whether there is a spid doing a rollback. Run KILL <spid> WITH STATUSONLY to get more info.
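The rollback checks mentioned above could look roughly like this. The session id 123 is a placeholder borrowed from the question's kill list; substitute the spid that sp_who2 reports as rolling back.

```sql
-- Report rollback progress for a session that has already been killed.
-- This does not kill anything again; it only prints a status message.
KILL 123 WITH STATUSONLY;

-- The same information from the DMVs: a killed session that is still
-- rolling back shows a command such as KILLED/ROLLBACK.
SELECT session_id,
       command,
       status,
       percent_complete,
       estimated_completion_time   -- milliseconds
FROM sys.dm_exec_requests
WHERE command LIKE '%ROLLBACK%';
```

If the session is not actually rolling back, KILL ... WITH STATUSONLY returns an error saying so, which is itself a useful answer.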

Can I avoid locking rows for unnecessary amounts of time [in Django]?

Take the following code for example (ignore the lack of usefulness in its functionality, as it's just a simple example to include the things I need):
import urllib2

from django.db import transaction

@transaction.commit_on_success
def test_for_success(username):
    person = Person.objects.select_for_update().get(username=username)
    response = urllib2.urlopen(URL_TO_SOME_SLOW_API, some_data)
    if '<YES>' in response.read():
        person.successes += 1
        person.save()
My question about the example has to do with when the queries hit the database. Clearly the first query will lock the Person row, and then I'm calling a slow API, which could take 3 seconds to respond, causing the row to be locked for 3 seconds. Am I understanding this correctly? And in the case of slow API hits happening in my transaction, if I move my queries so that the SELECT FOR UPDATE doesn't happen until after all the slow API requests, will this have the seemingly obvious effect of not locking my rows for seconds at a time (select_for_update is unavoidable in my application)? Or am I misunderstanding, and somehow none of the SQL actually hits the database until the end of the transaction?
Your assumptions about your code are correct. If you look at the select_for_update() docs, this call does lock those rows in the database until the transaction ends. In effect the row stays locked for the duration of your urllib request.
If you were to move the database call into the conditional after the request, you are correct again that the row would be locked for a much shorter amount of time (though if that code is called a lot, some clients will still block on the call due to contention).

SQL update working not insert

OK, I am going to do my best describing this. I have an SP which takes in XML and updates and inserts into another table. This was working yesterday. All I changed today was loading the temp table with OPENXML instead of xml.nodes. I even changed it back and I am still getting this interesting issue. I have an update and an insert in the same transaction. The update works and then the insert hangs, no error, no nothing... going on 9 minutes. It normally takes 10 seconds. No blocking processes according to master.sys.sysprocesses. The funny thing is that the SELECT of the insert returns no rows, as they are already in the database. The update updates 72438 rows in
SQL Server Execution Times:
CPU time = 1359 ms, elapsed time = 7955 ms.
ROWS AFFECTED(72438)
I am out of ideas as to what could be causing my issue. Permissions? I don't think so. Space? I don't think so, because an error would be returned.
queries:
UPDATE [Sales].[dbo].[WeeklySummary]
SET [CountryId] = I.CountryId
,[CurrencyId] = I.CurrencyId
,[WeeklySummaryType] = @WeeklySummaryTypeId
,[WeeklyBalanceAmt] = M.WeeklyBalanceAmt + I.WeeklyBalanceAmt
,[CurrencyFactor] = I.CurrencyFactor
,[Comment] = I.Comment
,[UserStamp] = I.UserStamp
,[DateTimeStamp] = I.DateTimeStamp
FROM
[Sales].[dbo].[WeeklySummary] M
INNER JOIN
#WeeklySummaryInserts I
ON M.EntityId = I.EntityId
AND M.EntityType = I.EntityType
AND M.WeekEndingDate = I.WeekEndingDate
AND M.BalanceId = I.BalanceId
AND M.ItemType = I.ItemType
AND M.AccountType = I.AccountType
and
INSERT INTO [Sales].[dbo].[WeeklySummary]
([EntityId]
,[EntityType]
,[WeekEndingDate]
,[BalanceId]
,[CountryId]
,[CurrencyId]
,[WeeklySummaryType]
,[ItemType]
,[AccountType]
,[WeeklyBalanceAmt]
,[CurrencyFactor]
,[Comment]
,[UserStamp]
,[DateTimeStamp])
SELECT
I.[EntityId]
, I.[EntityType]
, I.[WeekEndingDate]
, I.[BalanceId]
, I.[CountryId]
, I.[CurrencyId]
, @WeeklySummaryTypeId
, I.[ItemType]
, I.[AccountType]
, I.[WeeklyBalanceAmt]
, I.[CurrencyFactor]
, I.[Comment]
, I.[UserStamp]
, I.[DateTimeStamp]
FROM
#WeeklySummaryInserts I
LEFT OUTER JOIN
[Sales].[dbo].[WeeklySummary] M
ON I.EntityId = M.EntityId
AND I.EntityType = M.EntityType
AND I.WeekEndingDate = M.WeekEndingDate
AND I.BalanceId = M.BalanceId
AND I.ItemType = M.ItemType
AND I.AccountType = M.AccountType
WHERE M.WeeklySummaryId IS NULL
UPDATE
Trying the advice here worked for a while. I run the following before my stored procedure call:
UPDATE STATISTICS Sales.dbo.WeeklySummary;
UPDATE STATISTICS Sales.dbo.ARSubLedger;
UPDATE STATISTICS dbo.AccountBalance;
UPDATE STATISTICS dbo.InvoiceUnposted;
UPDATE STATISTICS dbo.InvoiceItemUnposted;
UPDATE STATISTICS dbo.InvoiceItemUnpostedHistory;
UPDATE STATISTICS dbo.InvoiceUnpostedHistory;
EXEC sp_recompile N'dbo.proc_ChargeRegister'
Still stalling at the Insert Statement, which again inserts 0 rows.
There are really only a few things that can be going on, and the trick here is to eliminate them in order, from simplest to most complex.
STEP 1: Hand craft a set of XML to run that will produce exactly one insert and no updates, so you can go "back to basics" as it were and establish that the code is still doing what you expect, and the result is exactly what you expect. This may seem silly or unnecessary but you really need this reality check to start.
STEP 2: Hand craft a set of XML that will produce a medium-size set of inserts, still with no updates. Based on your experience with the routine, try to find something that will run in a 3-4 seconds. Perhaps 5000 rows. Does it continue to behave as expected?
STEP 3: Assuming steps 1 and 2 pass easily, the next most likely problem is TRANSACTION SIZE. If your update hits 74,000 rows in a single statement, then SQL Server must allocate resources to be able to roll back all 74,000 rows in the case of an abort. Generally you should assume the resources (and time) required to maintain a transaction explode exponentially as the row count goes up. So hand-craft one more set of inserts that contains 50,000 rows. You should find it takes dramatically more time. Let it finish. Is it 10 minutes, an hour? If it takes a long time but finishes, you have an issue with TRANSACTION SIZE, the server is choking trying to keep track of everything required to roll back the insert in the event of failure.
STEP 4: Determine if your entire stored procedure is operating within a single implicit transaction. If it is, the matter is considerably worse, because SQL Server is tracking everything required to roll back both the 74,000 updates and the ??? inserts in a single transaction. See this page:
http://msdn.microsoft.com/en-us/library/ms687099(v=vs.85).aspx
STEP 5: If you've got a single implicit transaction, you can either A) turn that off, which may help some but will not entirely fix the problem, or B) break the sproc into two separate calls, one for updates and one for inserts, so that at least the two are in separate transactions.
STEP 6: Consider "chunking". This is a technique for avoiding exploding transaction costs. Considering just the INSERT to get us started, you wrap the insert into a loop that begins and commits a transaction inside each iteration, and exits when affected rows is zero. The INSERT is modified so that you pull only the first 1000 rows from the source and insert them (that 1000 number is kind of arbitrary, you may find 5000 produces better performance, you have to experiment a bit). Once the INSERT affects zero rows, there are no more rows to handle and the loop exits.
QUICK EDIT: The "chunking" system works because the complete throughput for a large set of rows looks something like a quadratic. If you execute an INSERT that affects a huge number of rows, the total time for all rows to be handled explodes. If on the other hand you break it up and go row-by-row, the overhead of opening and committing each statement causes the total time for all rows to explode. Somewhere in the middle, when you've "chunked" out 1k rows per statement, the transaction requirements are at their minimum and the overhead of opening and committing the statement is negligible, and the total time for all rows to be handled is a minimum.
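The chunking loop described in STEP 6 could be sketched as follows, reusing the INSERT from the question. The batch size and the type and value of @WeeklySummaryTypeId are assumptions, and the sketch assumes the six join columns identify a row uniquely in #WeeklySummaryInserts, so that rows inserted by one batch are excluded from the next one by the anti-join.

```sql
DECLARE @BatchSize int = 1000;          -- arbitrary starting point; tune by experiment
DECLARE @Rows int = 1;
DECLARE @WeeklySummaryTypeId int = 1;   -- hypothetical; a parameter in the real proc

WHILE @Rows > 0
BEGIN
    BEGIN TRANSACTION;

    -- Insert at most @BatchSize of the rows that are still missing.
    INSERT INTO [Sales].[dbo].[WeeklySummary]
        ([EntityId], [EntityType], [WeekEndingDate], [BalanceId],
         [CountryId], [CurrencyId], [WeeklySummaryType], [ItemType],
         [AccountType], [WeeklyBalanceAmt], [CurrencyFactor],
         [Comment], [UserStamp], [DateTimeStamp])
    SELECT TOP (@BatchSize)
         I.[EntityId], I.[EntityType], I.[WeekEndingDate], I.[BalanceId],
         I.[CountryId], I.[CurrencyId], @WeeklySummaryTypeId, I.[ItemType],
         I.[AccountType], I.[WeeklyBalanceAmt], I.[CurrencyFactor],
         I.[Comment], I.[UserStamp], I.[DateTimeStamp]
    FROM #WeeklySummaryInserts I
    LEFT OUTER JOIN [Sales].[dbo].[WeeklySummary] M
        ON  I.EntityId       = M.EntityId
        AND I.EntityType     = M.EntityType
        AND I.WeekEndingDate = M.WeekEndingDate
        AND I.BalanceId      = M.BalanceId
        AND I.ItemType       = M.ItemType
        AND I.AccountType    = M.AccountType
    WHERE M.WeeklySummaryId IS NULL;

    -- Capture the count immediately; COMMIT would reset @@ROWCOUNT.
    SET @Rows = @@ROWCOUNT;

    COMMIT TRANSACTION;
END
```

Each iteration commits on its own, so a failure rolls back at most one batch, and the loop ends once an iteration inserts zero rows.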
I had a problem where the stored proc was actually getting recompiled in the middle of running because it was deleting rows from a temp table. My situation doesn't look like yours, but mine was so odd that reading about it might give you some ideas.
Unexplained SQL Server Timeouts and Intermittent Blocking
I think you should post the full stored proc because the problem doesn't look to be where you think it is.