Efficiency UPDATE WHERE ... vs CASE ... WHEN - sql

In the case of a big table (~5 millions lines) what is more efficient to update all lines that matches a condition (circa 1000 rows):
1. A simple update statement ?
UPDATE table SET last_modif_date = NOW() WHERE condition;
2. A case when that perform an update if condition matches
UPDATE table SET
last_modif_date = (CASE WHEN CONDITION THEN NOW() ELSE last_modif_date END)
And why ?
Thanks in advance

I've made a simple test - and the results was that the where version is more efficient then the case version.
Here is the test I've made:
/*
-- Create a numbers (tally) table if you don't already have one
SELECT TOP 10000 IDENTITY(int,1,1) AS Number
INTO Tally
FROM sys.objects s1
CROSS JOIN sys.objects s2
ALTER TABLE Tally ADD CONSTRAINT PK_Tally PRIMARY KEY CLUSTERED (Number)
*/
-- Create a dates table with 10000 rows
SELECT Number As Id, DATEADD(DAY, Number, DATEADD(DAY, -5000, GETDATE())) As TheDate
INTO DatesTest
FROM Tally
-- Update using a where clause
UPDATE DatesTest
SET TheDate = GETDATE()
WHERE Id % 100 = 0
-- Drop and re-create the same dates table
DROP TABLE DatesTest
SELECT Number As Id, DATEADD(DAY, Number, DATEADD(DAY, -5000, GETDATE())) As TheDate
INTO DatesTest
FROM Tally
-- Update using case
UPDATE DatesTest
SET TheDate = CASE WHEN Id % 100 = 0 THEN GETDATE() ELSE TheDate END
As you can see from the execution plan - The where clause version is only 7% of all execution cost while the case version is 34%.
I'ld say we have a winner.

A general best practice is not to update or delete without a where clause because people make mistakes and recovering from some of them can be a real pain.
Beyond that, even non-updating updates can have a significant impact on the database beyond just a poor performing query. Executing an update without a where can also lead to excessive locking and blocking of other concurrent operations.
I can not summarize this article any better than Paul White did for himself, here are some more things to consider:
SQL Server contains a number of optimisations to avoid unnecessary logging or page flushing when processing an UPDATE operation that will not result in any change to the persistent database.
Non-updating updates to a clustered table generally avoid extra logging and page flushing, unless a column that forms (part of) the cluster key is affected by the update operation.
If any part of the cluster key is ‘updated’ to the same value, the operation is logged as if data had changed, and the affected pages are marked as dirty in the buffer pool. This is a consequence of the conversion of the UPDATE to a delete-then-insert operation.
Heap tables behave the same as clustered tables, except they do not have a cluster key to cause any extra logging or page flushing. This remains the case even where a non-clustered primary key exists on the heap. Non-updating updates to a heap therefore generally avoid the extra logging and flushing (but see below).
Both heaps and clustered tables will suffer the extra logging and flushing for any row where a LOB column containing more than 8000 bytes of data is updated to the same value using any syntax other than ‘SET column_name = column_name’.
Simply enabling either type of row versioning isolation level on a database always causes the extra logging and flushing. This occurs regardless of the isolation level in effect for the update transaction.
~ The Impact of Non-Updating Updates - Paul White

The result are just different since the UPDATE SET with CASE will update ALL the rows and will triggers on update and so on EVEN if in your case your new values is in fact the previous value.
The UPDATE with a WHERE clause will UPDATE only the rows you really want to update and trigger only on these rows. Assuming you have an index, it is more efficient in most cases.
The only way for you to be sure is to analyse the actual execution plan of both queries and compare. Anyway the use of an UPDATE SET CASE over UPDATE WHERE is somewhat unnatural.

Usually, 'where' clause would be more efficient than 'Case' statement if you have an index on the column that you are using. 'Case' statement will make the query (search) sequential, meaning that every row will be checked for the condition. Where clause with index on the column (that can be used in the condition that you are applying), will only scan a subset of records and will use faster search algorithm (in most cases).
Bottom line is the actual query(condition that you are using) and the indexes can make a lot of differences in the performance of the statement. The best way to analyse the query, is to use the execution plan of the query in SQL server management studio.
You can read this post to learn more about execution plans:
https://www.simple-talk.com/sql/performance/execution-plan-basics/

Related

SQL Batched Updates to large table with many indexes

We have a table with 18 columns, 7 of them bit columns, with over 100 million rows. It has 6 non-clustered indexes, 5 of which have the column I need to update.
The primary key (clustered) is a uniqueidentifier, called EntityID
I need to update one of the bit flags on this table using a different table which contains the values I need to sync. My manager asked me to write the update to run in batches since even the smallest of updates take a while due to all the indexes and the shear number of rows in the table. He also asked that the update run based on the EntityID sorted ASC, he mentioned something about reducing the number of pages being read.
I've written probably 5 different versions of a sorted batched update, and they work, but I'm interested to see if there is already a well polished template I could use to do this.
select 1
while(##rowcount > 0)
begin
update top (100000) t
set t.bit = s.bit
from table t
join tbls s
on s.EntityID = t.EntityID
and t.bit != s.bit
end
I would advise against sorting. Let the query optimizer do its thing.
If you have any t.bit is null I would do that separate as an or slows down the update.
I suggest you disable all the indexes, update, and then enable in indexes.
You would need to do some testing, and it really depends on if you can stop other queries during this time, but quite often it's much faster to
drop the indexes
do the inserts/updates
recreate the indexes

SQL Server : how do I add a hint to every query against a table?

I have a transactional database. one of the tables is almost empty (A). it has a unique indexed column (x), and no clustered index.
2 concurrent transactions:
begin trans
insert into A (x,y,z) (1,2,3)
WAITFOR DELAY '00:00:02'; -- or manually run the first 2 lines only
select * from A where x=1; -- small tables produce a query plan of table scan here, and block against the transaction below.
commit
begin trans
insert into A (x,y,z) (2,3,4)
WAITFOR DELAY '00:00:02';
-- on a table with 3 or less pages this hint is needed to not block against the above transaction
select * from A with(forceseek) -- force query plan of index seek + rid lookup
where x=2;
commit
My problem is that when the table has very few rows the 2 transactions can deadlock, because SQL Server generates a table scan for the select, even though there is an index, and both wait on the lock held by the newly inserted row of the other transaction.
When there are lots of rows in this table, the query plan changes to an index seek, and both happily complete.
When the table is small, the WITH(FORCESEEK) hint forces the correct query plan (5% more expensive for tiny tables).
is it possible to provide a default hint for all queries on a table to pretend to have the 'forceseek' hint?
the deadlocking code above was generated by Hibernate, is it possible to have hibernate emit the needed query hints?
we can make the tables pretend to be large enough that the query optimizer selects the index seek with the undocumented features in UPDATE STATISTICS http://msdn.microsoft.com/en-AU/library/ms187348(v=sql.110).aspx . Can anyone see any downsides to making all tables with less than 1000 rows pretend they have 1000 rows over 10 pages?
You can create a Plan Guide.
Or you can enable Read Committed Snapshot isolation level in the database.
Better still: make the index clustered.
For small tables that experience high update ratio, perhaps you can apply the advice from Using tables as Queues.
Can anyone see any downsides to making all tables with less than 1000 rows pretend they have 1000 rows over 10 pages?
If the table appears in a another, more complex, query (think joins) then the cardinality estimates may cascade wildly off and produce bad plans.
You could create a view that is a copy of the table but with the hint and have queries use the view instead:
create view A2 as
select * from A with(forceseek)
If you want to preserve the table name used by queries, rename the table to something else then name the view "A":
sp_rename 'A', 'A2';
create view A as
select * from A2 with(forceseek)
Just to add another option you may consider.
You can lock entire table on update by using
ALTER TABLE MyTable SET LOCK_ESCALATION = TABLE
This workaround is fine if you do not have too many updates that will queue and slow performance.
It is table-wide and no updates to other code is needed.

Why do I get "The log file for database 'tempdb' is full"

Let we have a table of payments having 35 columns with a primary key (autoinc bigint) and 3 non-clustered, non-unique indeces (each on one int column).
Among the table's columns we have two datetime fields:
payment_date datetime NOT NULL
edit_date datetime NULL
The table has about 1 200 000 rows.
Only ~1000 of rows have edit_date column = null.
9000 of rows have edit_date not null and not equal to payment_date
Others have edit_date=payment_date
When we run the following query 1:
select top 1 *
from payments
where edit_date is not null and (payment_date=edit_date or payment_date<>edit_date)
order by payment_date desc
server needs a couple of seconds to do it. But if we run query 2:
select top 1 *
from payments
where edit_date is not null
order by payment_date desc
the execution ends up with The log file for database 'tempdb' is full. Back up the transaction log for the database to free up some log space.
If we replace * with some certain column, see query 3
select top 1 payment_date
from payments
where edit_date is not null
order by payment_date desc
it also finishes in a couple of seconds.
Where is the magic?
EDIT
I've changed query 1 so that it operates over exactly the same number of rows as the 2nd query. And still it returns in a second, while query 2 fills tempdb.
ANSWER
I followed the advice to add an index, did this for both date fields - everything started working quick, as expected. Though, the question was - why in this exact situation sql server behave differently on similar queries (query 1 vs query 2); I wanted to understand the logic of the server optimization. I would agree if both queries did used tempdb similarly, but they didn't....
In the end I mark as the answer the first one, where I saw the must-be symptoms of my problem and the first, as well, thoughts on how to avoid this (i.e. indeces)
This is happening cause certain steps in an execution plan can trigger writes to tempdb in particular certain sorts and joins involving lots of data.
Since you are sorting a table with a boat load of columns, SQL decides it would be crazy to perform the sort alone in temp db without the associated data. If it did that it would need to do a gazzilion inefficient bookmark lookups on the underlying table.
Follow these rules:
Try to select only the data you need
Size tempdb appropriately, if you need to do crazy queries that sort a gazzilion rows, you better have an appropriately sized tempdb
Usually, tempdb fills up when you are low on disk space, or when you have set an unreasonably low maximum size for database growth.
Many people think that tempdb is only used for #temp tables. When in fact, you can easily fill up tempdb without ever creating a single temp table. Some other scenarios that can cause tempdb to fill up:
any sorting that requires more memory than has been allocated to SQL
Server will be forced to do its work in tempdb;
if the sorting requires more space than you have allocated to tempdb,
one of the above errors will occur;
DBCC CheckDB('any database') will perform its work in tempdb -- on
larger databases, this can consume quite a bit of space;
DBCC DBREINDEX or similar DBCC commands with 'Sort in tempdb' option
set will also potentially fill up tempdb;
large resultsets involving unions, order by / group by, cartesian
joins, outer joins, cursors, temp tables, table variables, and
hashing can often require help from tempdb;
any transactions left uncommitted and not rolled back can leave
objects orphaned in tempdb;
use of an ODBC DSN with the option 'create temporary stored
procedures' set can leave objects there for the life of the
connection.
USE tempdb
GO
SELECT name
FROM tempdb..sysobjects
SELECT OBJECT_NAME(id), rowcnt
FROM tempdb..sysindexes
WHERE OBJECT_NAME(id) LIKE '#%'
ORDER BY rowcnt DESC
The higher rowcount, values will likely indicate the biggest temporary tables that are consuming space.
Short-term fix
DBCC OPENTRAN -- or DBCC OPENTRAN('tempdb')
DBCC INPUTBUFFER(<number>)
KILL <number>
Long-term prevention
-- SQL Server 7.0, should show 'trunc. log on chkpt.'
-- or 'recovery=SIMPLE' as part of status column:
EXEC sp_helpdb 'tempdb'
-- SQL Server 2000, should yield 'SIMPLE':
SELECT DATABASEPROPERTYEX('tempdb', 'recovery')
ALTER DATABASE tempdb SET RECOVERY SIMPLE
Reference : https://web.archive.org/web/20080509095429/http://sqlserver2000.databases.aspfaq.com:80/why-is-tempdb-full-and-how-can-i-prevent-this-from-happening.html
Other references : http://social.msdn.microsoft.com/Forums/is/transactsql/thread/af493428-2062-4445-88e4-07ac65fedb76

Delete statement in SQL is very slow

I have statements like this that are timing out:
DELETE FROM [table] WHERE [COL] IN ( '1', '2', '6', '12', '24', '7', '3', '5')
I tried doing one at a time like this:
DELETE FROM [table] WHERE [COL] IN ( '1' )
and so far it's at 22 minutes and still going.
The table has 260,000 rows in it and is four columns.
Does anyone have any ideas why this would be so slow and how to speed it up?
I do have a non-unique, non-clustered index on the [COL] that i'm doing the WHERE on.
I'm using SQL Server 2008 R2
update: I have no triggers on the table.
Things that can cause a delete to be slow:
deleting a lot of records
many indexes
missing indexes on foreign keys in child tables. (thank you to #CesarAlvaradoDiaz for mentioning this in the comments)
deadlocks and blocking
triggers
cascade delete (those ten parent records you are deleting could mean
millions of child records getting deleted)
Transaction log needing to grow
Many Foreign keys to check
So your choices are to find out what is blocking and fix it or run the deletes in off hours when they won't be interfering with the normal production load. You can run the delete in batches (useful if you have triggers, cascade delete, or a large number of records). You can drop and recreate the indexes (best if you can do that in off hours too).
Disable CONSTRAINT
ALTER TABLE [TableName] NOCHECK CONSTRAINT ALL;
Disable Index
ALTER INDEX ALL ON [TableName] DISABLE;
Rebuild Index
ALTER INDEX ALL ON [TableName] REBUILD;
Enable CONSTRAINT
ALTER TABLE [TableName] CHECK CONSTRAINT ALL;
Delete again
Deleting a lot of rows can be very slow. Try to delete a few at a time, like:
delete top (10) YourTable where col in ('1','2','3','4')
while ##rowcount > 0
begin
delete top (10) YourTable where col in ('1','2','3','4')
end
In my case the database statistics had become corrupt. The statement
delete from tablename where col1 = 'v1'
was taking 30 seconds even though there were no matching records but
delete from tablename where col1 = 'rubbish'
ran instantly
running
update statistics tablename
fixed the issue
If the table you are deleting from has BEFORE/AFTER DELETE triggers, something in there could be causing your delay.
Additionally, if you have foreign keys referencing that table, additional UPDATEs or DELETEs may be occurring.
Preventive Action
Check with the help of SQL Profiler for the root cause of this issue. There may be Triggers causing the delay in Execution. It can be anything. Don't forget to Select the Database Name and Object Name while Starting the Trace to exclude scanning unnecessary queries...
Database Name Filtering
Table/Stored Procedure/Trigger Name Filtering
Corrective Action
As you said your table contains 260,000 records...and IN Predicate contains six values. Now, each record is being search 260,000 times for each value in IN Predicate. Instead it should be the Inner Join like below...
Delete K From YourTable1 K
Inner Join YourTable2 T on T.id = K.id
Insert the IN Predicate values into a Temporary Table or Local Variable
It's possible that other tables have FK constraint to your [table].
So the DB needs to check these tables to maintain the referential integrity.
Even if you have all needed indexes corresponding these FKs, check their amount.
I had the situation when NHibernate incorrectly created duplicated FKs on the same columns, but with different names (which is allowed by SQL Server).
It has drastically slowed down running of the DELETE statement.
Check execution plan of this delete statement. Have a look if index seek is used. Also what is data type of col?
If you are using wrong data type, change update statement (like from '1' to 1 or N'1').
If index scan is used consider using some query hint..
If you're deleting all the records in the table rather than a select few it may be much faster to just drop and recreate the table.
Is [COL] really a character field that's holding numbers, or can you get rid of the single-quotes around the values? #Alex is right that IN is slower than =, so if you can do this, you'll be better off:
DELETE FROM [table] WHERE [COL] = '1'
But better still is using numbers rather than strings to find the rows (sql likes numbers):
DELETE FROM [table] WHERE [COL] = 1
Maybe try:
DELETE FROM [table] WHERE CAST([COL] AS INT) = 1
In either event, make sure you have an index on column [COL] to speed up the table scan.
I read this article it was really helpful for troubleshooting any kind of inconveniences
https://support.microsoft.com/en-us/kb/224453
this is a case of waitresource
KEY: 16:72057595075231744 (ab74b4daaf17)
-- First SQL Provider to find the SPID (Session ID)
-- Second Identify problem, check Status, Open_tran, Lastwaittype, waittype, and waittime
-- iMPORTANT Waitresource select * from sys.sysprocesses where spid = 57
select * from sys.databases where database_id=16
-- with Waitresource check this to obtain object id
select * from sys.partitions where hobt_id=72057595075231744
select * from sys.objects where object_id=2105058535
After inspecting an SSIS Package(due to a SQL Server executing commands really slow), that was set up in a client of ours about 5-4 years before the time of me writing this, I found out that there were the below tasks:
1) insert data from an XML file into a table called [Importbarcdes].
2) merge command on an another target table, using as source the above mentioned table.
3) "delete from [Importbarcodes]", to clear the table of the row that was inserted after the XML file was read by the task of the SSIS Package.
After a quick inspection all statements(SELECT, UPDATE, DELETE etc.) on the table ImportBarcodes that had only 1 row, took about 2 minutes to execute.
Extended Events showed a whole lot PAGEIOLATCH_EX wait notifications.
No indexes were present of the table and no triggers were registered.
Upon close inspection of the properties of the table, in the Storage Tab and under general section, the Data Space field showed more than 6 GIGABYTES of space allocated in pages.
What happened:
The query ran for a good portion of time each day for the last 4 years, inserting and deleting data in the table, leaving unused pagefiles behind with out freeing them up.
So, that was the main reason of the wait events that were captured by the Extended Events Session and the slowly executed commands upon the table.
Running ALTER TABLE ImportBarcodes REBUILD fixed the issue freeing up all the unused space. TRUNCATE TABLE ImportBarcodes did a similar thing, with the only difference of deleting all pagefiles and data.
Older topic but one still relevant.
Another issue occurs when an index has become fragmented to the extent of becoming more of a problem than a help. In such a case, the answer would be to rebuild or drop and recreate the index and issuing the delete statement again.
As an extension to Andomar's answer, above, I had a scenario where the first 700,000,000 records (of ~1.2 billion) processed very quickly, with chunks of 25,000 records processing per second (roughly). But, then it starting taking 15 minutes to do a batch of 25,000. I reduced the chunk size down to 5,000 records and it went back to its previous speed. I'm not certain what internal threshold I hit, but the fix was to reduce the number of records, further, to regain the speed.
open CMD and run this commands
NET STOP MSSQLSERVER
NET START MSSQLSERVER
this will restart the SQL Server instance.
try to run again after your delete command
I have this command in a batch script and run it from time to time if I'm encountering problems like this. A normal PC restart will not be the same so restarting the instance is the most effective way if you are encountering some issues with your sql server.

Unexpected #temp table performance

Bounty open:
Ok people, the boss needs an answer and I need a pay rise. It doesn't seem to be a cold caching issue.
UPDATE:
I've followed the advice below to no avail. How ever the client statistics threw up an interesting set of number.
#temp vs #temp
Number of INSERT, DELETE and UPDATE statements
0 vs 1
Rows affected by INSERT, DELETE, or UPDATE statements
0 vs 7647
Number of SELECT statements
0 vs 0
Rows returned by SELECT statements
0 vs 0
Number of transactions
0 vs 1
The most interesting being the number of rows affected and the number of transactions. To remind you, the queries below return identical results set, just into different styles of tables.
The following query are basicaly doing the same thing. They both select a set of results (about 7000) and populate this into either a temp or var table. In my mind the var table #temp should be created and populated quicker than the temp table #temp however the var table in the first example takes 1min 15sec to execute where as the temp table in the second example takes 16 seconds.
Can anyone offer an explanation?
declare #temp table (
id uniqueidentifier,
brand nvarchar(255),
field nvarchar(255),
date datetime,
lang nvarchar(5),
dtype varchar(50)
)
insert into #temp (id, brand, field, date, lang, dtype )
select id, brand, field, date, lang, dtype
from view
where brand = 'myBrand'
-- takes 1:15
vs
select id, brand, field, date, lang, dtype
into #temp
from view
where brand = 'myBrand'
DROP TABLE #temp
-- takes 16 seconds
I believe this almost completely comes down to table variable vs. temp table performance.
Table variables are optimized for having exactly one row. When the query optimizer chooses an execution plan, it does it on the (often false) assumption that that the table variable only has a single row.
I can't find a good source for this, but it is at least mentioned here:
http://technet.microsoft.com/en-us/magazine/2007.11.sqlquery.aspx
Other related sources:
http://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=125052
http://databases.aspfaq.com/database/should-i-use-a-temp-table-or-a-table-variable.html
Run both with SET STATISTICS IO ON and SET STATISTICS TIME ON. Run 6-7 times each, discard the best and worst results for both cases, then compare the two average times.
I suspect the difference is primarily from a cold cache (first execution) vs. a warm cache (second execution). The output from STATISTICS IO would give away such a case, as a big difference in the physical reads between the runs.
And make sure you have 'lab' conditions for the test: no other tasks running (no lock contention), databases (including tempdb) and logs are pre-grown to required size so you don't hit any log growth or database growth event.
This is not uncommon. Table variables can be (and in a lot of cases ARE) slower than temp tables. Here are some of the reasons for this:
SQL Server maintains statistics for queries that use temporary tables but not for queries that use table variables. Without statistics, SQL Server might choose a poor processing plan for a query that contains a table variable
Non-clustered indexes cannot be created on table variables, other than the system indexes that are created for a PRIMARY or UNIQUE constraint. That can influence the query performance when compared to a temporary table with non-clustered indexes.
table variables use internal metadata in a way that prevents the engine from using a table variable within a parallel query (this means that it wont take advantage of multi-processor machines).
A table variable is optimized for one row, by SQL Server (it assumes 1 row will be returned).
I'm not 100% that this is the cause, but the table var will not have any statistics whereas the temp table will.
SELECT INTO is a non-logged operation, which would likely explain most of the performance difference. INSERT creates a log entry for every operation.
Additionally, SELECT INTO is creating the table as part of the operation, so SQL Server knows automatically that there are no constraints on it, which may factor in.
If it takes over a full minute to insert 7000 records into a temp table (persistent or variable), then the perf issue is almost certainly in the SELECT statement that's populating it.
Have you run DBCC FREEPROCCACHE and DBCC DROPCLEANBUFFERS before profiling? I'm thinking that maybe it's using some cached results for the second query.