SQL deadlocking and timing out almost constantly

Looks like today is going to be another rubbish one. We recently upgraded our SQL box to a complete monster with loads of cores and RAM, but we are stuck with our old DB schema, which is crapola. Our old SQL box had problems, but nothing like what we are experiencing with the new one. On the day of the rollout it was running super fast; within a week it's a complete mess...
Our .NET app, used by a couple of hundred people or so, is generating a huge number of deadlocks and timeouts on the SQL box, and we are struggling to work out why. We have checked all the indexes and they are as good as they can be right now. Some of the major tables are way too wide and have a stupid number of triggers on them, but there is nothing we can do about this now.
A lot of the PIDs seem to be the same for the same users who are trying multiple times. So, for instance:
User: user1 Time: 09:21 Error Message: Transaction (Process ID 76) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
User: user1 Time: 09:22 Error Message: Transaction (Process ID 76) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
etc. When we moved the DB to the new box, it was backed up from the old one and restored to the new one...
If anyone has any suggestions as to something we can do, I will buy them multiple pints.
thanks
nat

Deadlocks don't necessarily need high load to occur. They tend to be the byproduct of design issues in terms of which processes are locking data in which order, for how long, etc.
There are some useful features in SQL Server Profiler (2008 article here) to help you track and analyse deadlocks. I'd recommend this as the best starting point. If you're lucky, you'll find that there are just one or two culprits where you can readily, for instance, remove transactions or shorten them to alleviate the situation.
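The error message itself says "Rerun the transaction", so while the culprits are being tracked down, the application can at least retry deadlock victims instead of showing the error to the user. Below is a minimal sketch of such a retry loop, written in Python for brevity (the app in the question is .NET). `DeadlockError`, `run_with_deadlock_retry` and the back-off values are hypothetical names and numbers, not part of any driver; with SQL Server you would detect the deadlock by checking for error number 1205.

```python
import random
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for the driver error raised for a deadlock victim."""


def run_with_deadlock_retry(transaction, max_attempts=3, base_delay=0.05, sleep=time.sleep):
    """Run `transaction` (a callable doing one whole unit of work), retrying on deadlock.

    The whole transaction is rerun from the start, because the server has
    already rolled back the victim's work.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DeadlockError:
            if attempt == max_attempts:
                raise  # give up and let the caller see the error
            # Wait a little longer on each attempt, with jitter so two
            # colliding sessions do not retry in lockstep.
            sleep(base_delay * attempt * (1 + random.random()))
```

This only hides the symptom. Retrying keeps users working, but the deadlock graphs are still what tell you which statements need fixing.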

Related

Azure SQL Managed Instance - Blocked by Negative SPID

I have a nightly "archive and delete" process that archives data outside of a 150-day sliding window to Azure Blob Storage, and then deletes the data. In the past, the deletion process ran for about two hours and of course blocked all sorts of other processes (this is a big, busy table). So, we modified the process to delete in chunks, and that helped with the blocking of other processes.
However, recently the deletion process has been taking 12+ hours to run. When checking for blocking, it's constantly blocked by SPID -5 ... and I understand, this is supposedly an orphaned DT. However, none of the queries I run to get the GUID return any rows, for example:
SELECT DISTINCT(request_owner_guid) AS UoW_Guid
FROM sys.dm_tran_locks
WHERE request_session_id = -5
Any suggestions on what I need to do here? This is becoming a real problem. Thanks.
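For readers unfamiliar with the "delete in chunks" pattern mentioned in the question, here is a minimal sketch. It uses Python with SQLite only so that it runs anywhere; the `events` table, the `created` column and `delete_in_chunks` are hypothetical names. In T-SQL the same idea is usually written as a loop around `DELETE TOP (n)`. It does not address the SPID -5 blocking itself.

```python
import sqlite3


def delete_in_chunks(conn, cutoff, chunk_size=1000):
    """Delete rows older than `cutoff` in small transactions; return total rows deleted."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM events WHERE rowid IN "
            "(SELECT rowid FROM events WHERE created < ? LIMIT ?)",
            (cutoff, chunk_size),
        )
        conn.commit()  # end the transaction so locks are released between chunks
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


if __name__ == "__main__":
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created INTEGER)")
    demo.executemany("INSERT INTO events (created) VALUES (?)", [(d,) for d in range(500)])
    demo.commit()
    print(delete_in_chunks(demo, cutoff=350, chunk_size=100))
```

Committing after each chunk is the point of the pattern: other sessions get a chance to acquire their locks between chunks instead of waiting for one long transaction.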

VIEW repeatedly deadlocked by application-side commands

I have a schema-bound view (SSMS 2008 R2) running off a set of tables maintained and updated by a front-end application. Earlier this week, after a deployment to update the application, the view suddenly became the deadlock victim every time it's run in Prod, despite running successfully in Dev through Staging.
Running a trace and grabbing the deadlock graph showed the competing DELETE statement came from the application (it doesn't UPDATE records; instead it DELETEs and INSERTs).
Edit1: deadlocks are being caused by competing application-side commands with IX-level locks. VIEW issues S-level locks, but the competing commands continue to deadlock, with the VIEW query consistently being the victim process. Setting isolation to 'read uncommitted' does not resolve the issue.
The VIEW recursively outer-joins on the same tables multiple times to create a linked history of records. I suspect this is the functionality which makes the VIEW too complex to evade the timing of locks. It seems to work half the days and then will consistently deadlock on others.
Is this simply a capacity issue, or is there a better way to build reporting structures that would remedy the deadlocking issues?
If you're getting a lot of deadlocking in the view, it may be worthwhile breaking it down into a larger number of simpler views. Where a schema-bound view has an index drawn from multiple tables, it can also be particularly prone to locking issues.

FirebirdSQL queries get stuck at 12:00PM

I'm running Firebird 2.5 (and have also tried earlier versions) on Windows. Every day after 12:00 PM, insert/update queries on one specific table hang, but they complete successfully by 12:35 or so, no matter when they were started. It seems that Firebird is doing some kind of maintenance on the table that takes half an hour to complete, during which time the table cannot be written to (but reads are fast). The table itself is really small, some 10,000 rows, compared to the millions of rows we have in other tables, and the other tables do not get stuck.
I haven't been able to find any reason or solution. I tried dumping the table and restoring it, which didn't help. I also tried switching between SuperServer and Classic and changing versions, with no success.
Has anyone experienced a problem like this?
No. Firebird doesn't have any internal maintenance procedures bound to a specific time of day. It seems there is some task on your server scheduled to run at 12:00 PM, or there are network users of the server who start doing some heavy access at 12:00 PM.
The only maintenance FB does is "garbage collection" (getting rid of old record versions), and this is done on a "when needed" basis (usually when records are selected; see GCPolicy in firebird.conf), not at some predefined time.
Do you experience this hang only during these certain hours, or is it always slow to insert into that table? Have you checked the server load during the slowdown (i.e. in the task manager, is the CPU maxed out)? Anyway, here are some ideas to check:
What constraints / triggers do you have on the table? If they involve some extensive checks (ie against the other tables which contain millions of rows) this could be the reason inserts take so long.
Perhaps there is some other service which is triggered at that time? Ie do you have a cron job to make backup of the DB at that time? Or perhaps some other system service which runs at that time with higher priority slows down the server?
Do you have the trace service active for the table? See fbtrace.conf in the Firebird root directory. If it is active, extensive logging might be the cause of the slowdown; if it isn't active, using it might help you find the cause.
What are the settings for ForcedWrites / UnflushedWrites (see firebird.conf)? Does changing them make a difference?
Is there something logged for this troublesome timeframe in firebird.log?
To me it looks like you have a process which starts at 12:00 and does something which locks the entire table. Use the monitoring table or the trace manager to see if there is any connection or active transaction which looks suspicious.
I also think your own transactions are started with the WAIT clause without a LOCK TIMEOUT. You might want to change this to NO WAIT, or to WAIT with a LOCK TIMEOUT, so that your transactions fail either immediately or after the timeout.
My suggestion is to use the TRACE API in 2.5 to track down what is happening near or around that time. That should help get you more information as to what is happening.
I use this for debugging http://upscene.com/products.misc.fbtm.php kinda buggy itself, but when it is working it is a god send.
Are some client connections going down at 12:00 PM? I had a similar problem on a table of 70,000 records:
Client "A" has a permanently open DB connection running something like "select * from TABLE". This is a "read only transaction", but it is reason enough for the server to generate record versions. Why?
Client "B" made massive updates to this table, and the server tries to preserve the world as it was when "A" started its "select". This is normal for transactional DB servers, and it's implemented by creating copies of the record data before it is updated.
So in my case 170,000 record versions existed for this TABLE. You can measure this with
gstat -r -t TABLE db.fdb | grep versions
If client "B" goes down, the count of record versions stops growing. Client "A" is the guilty one: by freezing all these versions, it forces the server to hold on to them. Finally, if client "A" goes down (or, for example, a firewall rule cuts all pending connections), Firebird is happy to start getting rid of the now useless record versions.
This "sweep" (?!) is badly programmed (even in 2.5.2): CPU is at 3% and it only handles <10,000 versions per minute, so this TABLE has a performance of about 2%.

Firebird backup restore is frustrating, is there a way to avoid it?

I am using Firebird, but lately the database has been growing really seriously.
There are a lot of delete statements running, as well as updates/inserts, and the database file size grows really fast.
After tons of deleted records the database size doesn't decrease and, even worse, I have the feeling that queries are actually getting slowed down a bit.
To fix this, a daily backup/restore process has been introduced, but because of the time it takes to complete, I could say that it is really frustrating to use Firebird.
Any ideas on workarounds or solutions will be welcome.
Also, I am considering switching to Interbase because I heard from a friend that it does not have this issue. Is that so?
We have a lot of huge Firebird databases in production but have never had an issue with database growth. Yes, every time a record is deleted or updated, an old version of it is kept in the file. But sooner or later the garbage collector will sweep it away. Once both processes balance each other, the database file will grow only by the size of new data and indices.
As a general precaution against enormous database growth, try to make your transactions as short as possible. In our applications we use one READ ONLY transaction for reading all the data. This transaction stays open for the whole application lifetime. For every batch of insert/update/delete statements we use short separate transactions.
Slow database operations could result from outdated index statistics. Here you can find an example of how to recalculate statistics for all indices: http://www.firebirdfaq.org/faq167/
Check whether you have unfinished transactions in your applications. If a transaction is started but not committed or rolled back, the database will keep its own record versions for each transaction after the oldest active transaction.
You can check the database statistics (with gstat or an external tool): they show the oldest transaction and the next transaction. If the difference between those numbers keeps growing, you have the stuck transaction problem.
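The check described above can be automated. The sketch below is in Python and assumes that the header statistics (as printed by `gstat -h`) contain lines labelled "Oldest transaction" and "Next transaction", each followed by a number. The `transaction_gap` helper and the sample text are hypothetical; the numbers are made up for illustration and are not output from a real database.

```python
import re


def transaction_gap(gstat_header):
    """Return (oldest, next, gap) parsed from gstat header statistics text."""
    def field(label):
        match = re.search(
            r"^\s*" + re.escape(label) + r"\s+(\d+)\s*$", gstat_header, re.MULTILINE
        )
        if match is None:
            raise ValueError(f"'{label}' not found in gstat output")
        return int(match.group(1))

    oldest = field("Oldest transaction")
    newest = field("Next transaction")
    return oldest, newest, newest - oldest


# Illustrative sample only (assumed layout, invented numbers).
SAMPLE = """\
    Oldest transaction      2140
    Oldest active           2141
    Oldest snapshot         2141
    Next transaction        98765
"""

if __name__ == "__main__":
    print(transaction_gap(SAMPLE))
```

Running this on a schedule and logging the gap makes it easy to see whether the difference is stable or keeps growing.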
There are also monitoring tools to check the situation; one I've used is Sinatica Monitor for Firebird.
Edit: Also, database file doesn't shrink automatically ever. Parts of it get marked as unused (after sweep operation) and will be reused. http://www.firebirdfaq.org/faq41/
The space occupied by deleted records will be re-used as soon as it is garbage collected by Firebird.
If GC is not happening (transaction problems?), DB will keep growing, until GC can do its job.
Also, there is a problem when you do a massive delete in a table (e.g. millions of records): the next select on that table will "trigger" the garbage collection, and performance will drop until GC finishes. The only workaround is to do the massive deletes at a time when the server is not heavily used and run a sweep afterwards, making sure that there are no stuck transactions.
Also, keep in mind that if you are using "standard" tables to hold temporary data (i.e. info is inserted and deleted several times), you can get a corrupted database in some circumstances. I strongly suggest you start using the Global Temporary Tables feature.

How to kill/resolve a reeeeally long-running update in SQL Server

A colleague of mine (I promise it was a colleague!) has left an update running on our main SQL Server since last Thursday (yes that's right folks, we're pushing 100 hours now!). The SQL in question (in one transaction, I might add) is:
update daily_prices
set min_date = (select min(a.date)
                from daily_prices a
                where a.key = daily_prices.key
                  and a.iid = daily_prices.iid)
(Yeah I know, heinous...)
The total cost in the query plan is coming out as 22186.7, the estimated number of rows to update is around 151 million.
We obviously need to resolve this query one way or another. We realise that if we kill the query we're going to generate some brutal rollback, but we've got no way of knowing how far it has gotten. The only thing we do know is this entry from sys.dm_exec_requests:
session_id  status     query_text              cpu_time  total_elapsed_time  reads     writes    logical_reads
52          suspended  update daily_prices...  2328469   408947075           13831137  42458588  151809497
So my question is, what would be our best course of action?
wait it out
kill it and roll back, and hope that it rolls back before the next ice age
something else?
I personally would want to wait it out unless I thought it had no chance of finishing this week; the rollback at this stage could take far longer than the query has taken to date. If it's a production server, I really wouldn't take option 2 and kill it unless I absolutely had to.
In terms of regaining some control and a working system: if you have suitable backups, bring another database online and restore the backup and tlog backups into it, but do not restore beyond the point at which the transaction started (or it will still have to roll it back). This at least gives you a system you could continue dev work against, but it is unlikely to be the ideal situation for a prod system.
If it's a production server, have some kind words with the individual about testing queries and query plans before executing them. I am sure many DBAs can suggest less polite methods of instruction :)
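One thing worth testing before a statement like this goes near production is whether it can be split up, so that no single transaction covers all 151 million rows and a kill only rolls back one batch. The sketch below shows the idea in Python with SQLite, used only so that it runs anywhere. The table and column names come from the question; `group_min`, `update_min_dates` and the batch size are hypothetical. Whether this is actually faster on the real table depends on its indexes, which the question does not describe.

```python
import sqlite3


def update_min_dates(conn, groups_per_batch=1000):
    """Set min_date per (key, iid) group, committing after each batch of groups."""
    # Compute each group's minimum once, instead of once per row.
    conn.execute(
        'CREATE TEMP TABLE group_min AS '
        'SELECT "key" AS k, iid, MIN("date") AS min_date '
        'FROM daily_prices GROUP BY "key", iid'
    )
    conn.commit()
    last_rowid = 0
    batches = 0
    while True:
        rows = conn.execute(
            "SELECT rowid, k, iid, min_date FROM group_min "
            "WHERE rowid > ? ORDER BY rowid LIMIT ?",
            (last_rowid, groups_per_batch),
        ).fetchall()
        if not rows:
            break
        conn.executemany(
            'UPDATE daily_prices SET min_date = ? WHERE "key" = ? AND iid = ?',
            [(min_date, k, iid) for _, k, iid, min_date in rows],
        )
        conn.commit()  # each batch is its own short transaction
        last_rowid = rows[-1][0]
        batches += 1
    conn.execute("DROP TABLE group_min")
    conn.commit()
    return batches
```

Because progress is committed batch by batch, you also know how far the job has gotten at any moment, which is exactly the information that was missing in the situation described in the question.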
So we got fed up with waiting for our transaction to complete, (after a full week on
one piece of SQL, who wouldn't?), and as it was interfering with our backup
process, we thought killing it was a necessary evil.
The database started to rollback the transaction.
5 days passed.
We noted from some posts elsewhere on the internet that sometimes some magic
happened when the database was restarted and the transaction would "go away".
Although these claims are generally debunked*, and it makes no sense, we thought we
had nothing left to lose, so we gave it a go. We knew the database would go into
recovery mode, but it was becoming increasingly sick and unable
to run anything but its current rollback work anyway, and we've seen SQL Server misbehave by hogging system resources and not diverting them to where the work needs doing.
(* we also know enough database theory to know that the DB wouldn't just "forget"
about a transaction in progress, but we were also seeing stack dumps in the
SQL Server error logs which kind of told us that the SQL Server was getting
increasingly grumpy at the amount of rollback it was having to undertake)
So we restarted the database.
Sure enough the database went into recovery mode. However, the SQL Server event Log
was now giving us an update every 20 seconds or so as to how long it was going to
take (in all, it reckoned about 25 hours from the log messages, but it ended up being
just an hour and a half (!)).
Whether this method of recovery/rollback is faster, I would strongly doubt (as I expect
SQL Server had to do the same level of work to unwind the transaction as before); however, it did finish within an hour and a half. Either way, I don't want to make a habit of restarting my production database when it is halfway through a rollback. The update messages in the event log were an absolute godsend, as anyone who has written a batch program
will tell you; however inaccurate they turned out to be, at least they were a worst case.
As we had the luxury of being the only two people using this production box, choosing to
send the database into recovery mode worked for us, and gave us informational messages we
didn't have access to with just our previous rollback state (or at least nothing we could
interpret given our lacking DBA skills). Would I recommend doing this in future?
....Absolutely not, however, hopefully the concerned parties have learnt their lesson, and
we can ask the board for some money for a proper development server! (epic Joel-Test fail!)