SQL Batched Updates to large table with many indexes

We have a table with 18 columns, 7 of them bit columns, and over 100 million rows. It has 6 non-clustered indexes, 5 of which include the column I need to update.
The primary key (clustered) is a uniqueidentifier called EntityID.
I need to update one of the bit flags on this table from a different table that contains the values I need to sync. My manager asked me to write the update to run in batches, since even the smallest updates take a while due to all the indexes and the sheer number of rows in the table. He also asked that the update run in ascending EntityID order; he mentioned something about reducing the number of pages being read.
I've written probably 5 different versions of a sorted batched update, and they work, but I'm interested to see whether there is already a well-polished template I could use to do this.

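-- [table] is the target table, tbls the source and bit the flag column (placeholder names)
select 1 -- sets @@rowcount to 1 so the loop is entered
while (@@rowcount > 0)
begin
    -- update at most 100,000 mismatched rows per pass; the loop ends when none remain
    update top (100000) t
    set t.bit = s.bit
    from [table] t
    join tbls s
        on s.EntityID = t.EntityID
        and t.bit != s.bit
end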
I would advise against sorting. Let the query optimizer do its thing.
If any t.bit values are NULL, I would handle those in a separate pass, since adding an OR to the condition slows down the update.
I suggest you disable all the indexes, run the update, and then re-enable the indexes.
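
If a sorted variant is still wanted, one common pattern is to walk the clustered key in ranges, so that each batch touches a contiguous run of EntityID values. The following is only a sketch under stated assumptions: dbo.BigTable, dbo.SyncSource and Flag are hypothetical names, inline DECLARE initialisation needs SQL Server 2008 or later, and no row is assumed to have the all-zero GUID as its key.

```sql
-- Hypothetical names: dbo.BigTable (target), dbo.SyncSource (source), Flag (bit column).
DECLARE @BatchSize int = 100000;
-- Start below every real key; a row whose key is the all-zero GUID would be skipped.
DECLARE @LastID uniqueidentifier = '00000000-0000-0000-0000-000000000000';
DECLARE @NextID uniqueidentifier;

WHILE 1 = 1
BEGIN
    -- Find the upper bound of the next key range in clustered-key order.
    SELECT @NextID = MAX(k.EntityID)
    FROM (
        SELECT TOP (@BatchSize) EntityID
        FROM dbo.BigTable
        WHERE EntityID > @LastID
        ORDER BY EntityID
    ) AS k;

    -- MAX over an empty set yields NULL: no keys are left to process.
    IF @NextID IS NULL BREAK;

    -- Update only the mismatched rows inside the current key range.
    UPDATE t
    SET t.Flag = s.Flag
    FROM dbo.BigTable AS t
    JOIN dbo.SyncSource AS s
        ON s.EntityID = t.EntityID
    WHERE t.EntityID > @LastID
      AND t.EntityID <= @NextID
      AND t.Flag <> s.Flag;

    SET @LastID = @NextID;
END
```

Unlike the TOP-based loop, this version visits every key range exactly once, including ranges where nothing needs to change, so it ends after a predictable number of passes. As in the loop above, rows where either flag is NULL are not matched by the <> comparison and would need a separate pass.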

You would need to do some testing, and it really depends on whether you can stop other queries during this time, but quite often it's much faster to:
1. drop the indexes
2. do the inserts/updates
3. recreate the indexes
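
As a rough illustration of these steps in SQL Server, a non-clustered index can be disabled and later rebuilt instead of being dropped and recreated, which keeps its definition in place. The names dbo.BigTable and IX_BigTable_Flag are placeholders, not taken from the question.

```sql
-- Hypothetical names: dbo.BigTable and IX_BigTable_Flag are placeholders.
-- Disable a non-clustered index before the bulk change.
-- Do not disable the clustered index: the table becomes inaccessible until it is rebuilt.
ALTER INDEX IX_BigTable_Flag ON dbo.BigTable DISABLE;

-- ... run the inserts/updates here ...

-- Rebuilding the index re-enables it.
ALTER INDEX IX_BigTable_Flag ON dbo.BigTable REBUILD;
```

While an index is disabled, queries that relied on it fall back to other access paths, so this only suits a window in which other workloads can tolerate slower reads.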

Related

Efficiency UPDATE WHERE ... vs CASE ... WHEN

In the case of a big table (~5 million rows), which is more efficient for updating all rows that match a condition (circa 1000 rows)?
1. A simple update statement?
UPDATE table SET last_modif_date = NOW() WHERE condition;
2. A CASE expression that performs the update only if the condition matches?
UPDATE table SET
last_modif_date = (CASE WHEN CONDITION THEN NOW() ELSE last_modif_date END)
And why ?
Thanks in advance

I ran a simple test, and the result was that the WHERE version is more efficient than the CASE version.
Here is the test I've made:
/*
-- Create a numbers (tally) table if you don't already have one
SELECT TOP 10000 IDENTITY(int,1,1) AS Number
INTO Tally
FROM sys.objects s1
CROSS JOIN sys.objects s2
ALTER TABLE Tally ADD CONSTRAINT PK_Tally PRIMARY KEY CLUSTERED (Number)
*/
-- Create a dates table with 10000 rows
SELECT Number As Id, DATEADD(DAY, Number, DATEADD(DAY, -5000, GETDATE())) As TheDate
INTO DatesTest
FROM Tally
-- Update using a where clause
UPDATE DatesTest
SET TheDate = GETDATE()
WHERE Id % 100 = 0
-- Drop and re-create the same dates table
DROP TABLE DatesTest
SELECT Number As Id, DATEADD(DAY, Number, DATEADD(DAY, -5000, GETDATE())) As TheDate
INTO DatesTest
FROM Tally
-- Update using case
UPDATE DatesTest
SET TheDate = CASE WHEN Id % 100 = 0 THEN GETDATE() ELSE TheDate END
As you can see from the execution plan, the WHERE version accounts for only 7% of the total execution cost, while the CASE version accounts for 34%.
I'd say we have a winner.

A general best practice is not to update or delete without a where clause because people make mistakes and recovering from some of them can be a real pain.
Beyond that, even non-updating updates can have a significant impact on the database beyond just a poor performing query. Executing an update without a where can also lead to excessive locking and blocking of other concurrent operations.
I cannot summarize this article any better than Paul White did himself. Here are some more things to consider:
SQL Server contains a number of optimisations to avoid unnecessary logging or page flushing when processing an UPDATE operation that will not result in any change to the persistent database.
Non-updating updates to a clustered table generally avoid extra logging and page flushing, unless a column that forms (part of) the cluster key is affected by the update operation.
If any part of the cluster key is ‘updated’ to the same value, the operation is logged as if data had changed, and the affected pages are marked as dirty in the buffer pool. This is a consequence of the conversion of the UPDATE to a delete-then-insert operation.
Heap tables behave the same as clustered tables, except they do not have a cluster key to cause any extra logging or page flushing. This remains the case even where a non-clustered primary key exists on the heap. Non-updating updates to a heap therefore generally avoid the extra logging and flushing (but see below).
Both heaps and clustered tables will suffer the extra logging and flushing for any row where a LOB column containing more than 8000 bytes of data is updated to the same value using any syntax other than ‘SET column_name = column_name’.
Simply enabling either type of row versioning isolation level on a database always causes the extra logging and flushing. This occurs regardless of the isolation level in effect for the update transaction.
~ The Impact of Non-Updating Updates - Paul White

The results are simply different: the UPDATE SET with CASE updates ALL the rows, fires update triggers for all of them, and so on, EVEN where the new value is in fact the previous value.
The UPDATE with a WHERE clause updates only the rows you really want to update and fires triggers only for those rows. Assuming you have an index, it is more efficient in most cases.
The only way to be sure is to analyse the actual execution plans of both queries and compare them. In any case, using UPDATE SET CASE in place of UPDATE WHERE is somewhat unnatural.

Usually, a WHERE clause is more efficient than a CASE expression if you have an index on the column you are filtering on. The CASE version makes the statement sequential, meaning that every row is checked against the condition. A WHERE clause with an index on the column (one that can be used by the condition you are applying) only touches a subset of the records and uses a faster search algorithm (in most cases).
The bottom line is that the actual query (the condition you are using) and the indexes can make a big difference to the performance of the statement. The best way to analyse the query is to use its execution plan in SQL Server Management Studio.
You can read this post to learn more about execution plans:
https://www.simple-talk.com/sql/performance/execution-plan-basics/

What sort of Index for 'AND' columns?

I have a table to store people and want to select where the person is not marked as "deleted". I have a clustered primary key on the ID column (PersonID).
The 'Deleted' column is a DATETIME, nullable, and is populated when deleted.
My query looks like this:
SELECT *
FROM dbo.Person
WHERE PersonID = 100
AND Deleted IS NULL
This table can grow to around 40,000 people.
Should I have an index that covers the Deleted flag as well?
I may also query things like:
SELECT *
FROM Task t
INNER JOIN Person p
ON p.PersonID = t.PersonID
AND p.Deleted IS NULL
WHERE t.TaskTypeId = 5
AND t.Deleted IS NULL
Task table estimate is about 1.5 million rows.
I think I need an index that covers both the PK and the deleted flag on both tables, i.e. on (Task.TaskId, Task.Deleted) and (Person.PersonID, Person.Deleted)?
The reason I'm rethinking the indexes is a number of deadlocks occurring in complex procedures. I'd like to reduce the number of rows locked on selects/writes/updates, as well as get a performance gain.

Since you are using SQL Server 2008, the fastest option for querying might well be a filtered index. For this Deleted column, which is a nullable DATETIME, you could try something like this index:
CREATE NONCLUSTERED INDEX Filtered_Deleted_Index
ON dbo.Person(Deleted)
WHERE Deleted IS NOT NULL
This will get you the smallest valid set in both use cases you listed above (for querying dbo.Person and also joining with Tasks).
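
One caveat worth checking against the actual plans: the optimizer only considers a filtered index when the query's predicate falls within the index filter, so queries that ask for Deleted IS NULL need a filter written the same way. A minimal sketch for the Task query from the question, assuming the table lives in the dbo schema and using a hypothetical index name:

```sql
-- IX_Task_TaskTypeId_NotDeleted is a hypothetical name; dbo.Task assumes the dbo schema.
-- Filtered indexes require SQL Server 2008 or later.
CREATE NONCLUSTERED INDEX IX_Task_TaskTypeId_NotDeleted
ON dbo.Task (TaskTypeId)
INCLUDE (PersonID, Deleted)   -- carries the join column and the filtered column in the index
WHERE Deleted IS NULL;        -- same predicate as the queries use
```

For the single-row Person lookup by PersonID, the clustered primary key already finds the row directly, so an extra index is unlikely to change much there.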

Your instinct is (generally speaking) sound - an index that contains all columns needed for the query is called a covering index, which in this case would require:
CREATE INDEX Person_PersonID_Deleted ON Person(PersonID, Deleted);
You are unlikely to get much performance benefit on index lookup by adding the Deleted column, since searching for null is (usually) ignored, but having these indexes means that accessing the table can be bypassed entirely for Person.
You could also try creating:
CREATE INDEX Task_TaskTypeId_Deleted ON Task(TaskTypeId, Deleted);
which will avoid accessing Task rows that are marked as "deleted", so Task would then only be accessed for non-deleted rows. However, if most of your Tasks are not deleted, I wouldn't bother with this index.
It's worth trying out various combinations of index(es) to see which combination gives the best result.

Since the primary key is PersonID, adding another index with extra columns after PersonID will not improve the selectivity of the index, although it may avoid the need to look up the record by rowid to filter on Deleted. With only 3% of records filtered out, that's nothing, so don't create another index on Person.
As for the Task table, it very much depends on the selectivity of TaskTypeId = 5 AND Deleted IS NULL, i.e. how many records match the criteria. In general, a sequential search (full table scan) is faster than an index scan with row lookup if more than 20% of the records are selected. For very large tables where the data is widely distributed (e.g. physically every 10th record is selected), the performance threshold is below 10%.
So, if more than 10-20% of Task records are type 5, and only 3% of records are deleted, no index will improve performance, because the fastest access plan is likely a merge join of two full table scans.

SQL Server : how do I add a hint to every query against a table?

I have a transactional database. One of the tables (A) is almost empty. It has a unique indexed column (x) and no clustered index.
2 concurrent transactions:
-- transaction 1
begin tran
insert into A (x,y,z) values (1,2,3)
WAITFOR DELAY '00:00:02'; -- or manually run the first 2 lines only
select * from A where x=1; -- small tables produce a query plan of table scan here, and block against the transaction below.
commit
-- transaction 2 (run concurrently from a second connection)
begin tran
insert into A (x,y,z) values (2,3,4)
WAITFOR DELAY '00:00:02';
-- on a table with 3 or less pages this hint is needed to not block against the above transaction
select * from A with(forceseek) -- force query plan of index seek + rid lookup
where x=2;
commit
My problem is that when the table has very few rows the 2 transactions can deadlock, because SQL Server generates a table scan for the select, even though there is an index, and both wait on the lock held by the newly inserted row of the other transaction.
When there are lots of rows in this table, the query plan changes to an index seek, and both happily complete.
When the table is small, the WITH(FORCESEEK) hint forces the correct query plan (5% more expensive for tiny tables).
Is it possible to provide a default hint for all queries on a table, so that they behave as if they had the 'forceseek' hint?
The deadlocking code above was generated by Hibernate. Is it possible to have Hibernate emit the needed query hints?
We can make the tables pretend to be large enough that the query optimizer selects the index seek, using the undocumented features in UPDATE STATISTICS http://msdn.microsoft.com/en-AU/library/ms187348(v=sql.110).aspx . Can anyone see any downsides to making all tables with fewer than 1000 rows pretend they have 1000 rows over 10 pages?

You can create a Plan Guide.
Or you can enable Read Committed Snapshot isolation level in the database.
Better still: make the index clustered.
For small tables that experience high update ratio, perhaps you can apply the advice from Using tables as Queues.
"Can anyone see any downsides to making all tables with less than 1000 rows pretend they have 1000 rows over 10 pages?"
If the table appears in another, more complex query (think joins), then the cardinality estimates may cascade wildly off and produce bad plans.
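
Two of the options above can be sketched in a few lines. These are illustrations rather than tested fixes for this deadlock: MyDatabase and CIX_A_x are placeholder names, and the question's own A and x are reused as-is.

```sql
-- Option: read committed snapshot isolation. MyDatabase is a placeholder name.
-- The change needs exclusive access; ROLLBACK IMMEDIATE rolls back open transactions.
ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

-- Option: cluster the table on x. CIX_A_x is a hypothetical index name.
-- The existing non-clustered unique index on x then becomes redundant
-- and could be dropped afterwards.
CREATE UNIQUE CLUSTERED INDEX CIX_A_x ON A (x);
```

With read committed snapshot enabled, readers see the last committed version of a row instead of waiting on writers' locks, which is why it is suggested for the blocked SELECT. It does add row-versioning overhead in tempdb.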

You could create a view that is a copy of the table but with the hint and have queries use the view instead:
create view A2 as
select * from A with(forceseek)
If you want to preserve the table name used by queries, rename the table to something else then name the view "A":
exec sp_rename 'A', 'A2';
-- create view must be the first statement in its batch, hence the batch separator
go
create view A as
select * from A2 with(forceseek)

Just to add another option you may consider.
You can lock the entire table on update by using
ALTER TABLE MyTable SET LOCK_ESCALATION = TABLE
This workaround is fine if you do not have too many updates that will queue up and slow performance.
It is table-wide, and no changes to other code are needed.

Using more than one index per table is dangerous?

In a former company I worked at, the rule of thumb was that a table should have no more than one index (allowing the odd exception, and certain parent-tables holding references to nearly all other tables and thus are updated very frequently).
The idea being that often, indexes cost the same or more to uphold than they gain. Note that this question is different to indexed-view-vs-indexes-on-table as the motivation is not only reporting.
Is this true? Is this index-purism worth it?
In your career do you generally avoid using indexes?
What are the general large-scale recommendations regarding indexes?
Currently and at the last company we use SQL Server, so any product specific guidelines are welcome too.

You need to create exactly as many indexes as you need to create. No more, no less. It is as simple as that.
Everybody "knows" that an index will slow down DML statements on a table. But for some reason very few people actually bother to test just how "slow" it becomes in their context. Sometimes I get the impression that people think that adding another index will add several seconds to each inserted row, making it a game changing business tradeoff that some fictive hotshot user should decide in a board room.
I'd like to share an example that I just created on my 2 year old pc, using a standard MySQL installation. I know you tagged the question SQL Server, but the example should be easily converted. I insert 1,000,000 rows into three tables. One table without indexes, one table with one index and one table with nine indexes.
drop table if exists numbers;
drop table if exists one_million_rows;
drop table if exists one_million_one_index;
drop table if exists one_million_nine_index;
/*
|| Create a dummy table to assist in generating rows
*/
create table numbers(n int);
insert into numbers(n) values(0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
/*
|| Create a table consisting of 1,000,000 consecutive integers
*/
create table one_million_rows as
select d1.n + (d2.n * 10)
+ (d3.n * 100)
+ (d4.n * 1000)
+ (d5.n * 10000)
+ (d6.n * 100000) as n
from numbers d1
,numbers d2
,numbers d3
,numbers d4
,numbers d5
,numbers d6;
/*
|| Create an empty table with 9 integer columns.
|| One column will be indexed
*/
create table one_million_one_index(
c1 int, c2 int, c3 int
,c4 int, c5 int, c6 int
,c7 int, c8 int, c9 int
,index(c1)
);
/*
|| Create an empty table with 9 integer columns.
|| All nine columns will be indexed
*/
create table one_million_nine_index(
c1 int, c2 int, c3 int
,c4 int, c5 int, c6 int
,c7 int, c8 int, c9 int
,index(c1), index(c2), index(c3)
,index(c4), index(c5), index(c6)
,index(c7), index(c8), index(c9)
);
/*
|| Insert 1,000,000 rows in the table with one index
*/
insert into one_million_one_index(c1,c2,c3,c4,c5,c6,c7,c8,c9)
select n, n, n, n, n, n, n, n, n
from one_million_rows;
/*
|| Insert 1,000,000 rows in the table with nine indexes
*/
insert into one_million_nine_index(c1,c2,c3,c4,c5,c6,c7,c8,c9)
select n, n, n, n, n, n, n, n, n
from one_million_rows;
My timings are:
1m rows into table without indexes: 0,45 seconds
1m rows into table with 1 index: 1,5 seconds
1m rows into table with 9 indexes: 6,98 seconds
I'm better with SQL than statistics and math, but I'd like to think that:
Adding 8 indexes to my table, added (6,98-1,5) 5,48 seconds in total. Each index would then have contributed 0,685 seconds (5,48 / 8) for all 1,000,000 rows. That would mean that the added overhead per row per index would have been 0,000000685 seconds. SOMEBODY CALL THE BOARD OF DIRECTORS!
In conclusion, I'd like to say that the above test case doesn't prove shit. It just shows that tonight, I was able to insert 1,000,000 consecutive integers into a table in a single-user environment. Your results will be different.

That is utterly ridiculous. First, you need multiple indexes in order to perform correctly. For instance, if you have a primary key, you automatically have an index. That means you can't index anything else under the rule you described. So if you don't index foreign keys, joins will be slow, and if you don't index fields used in the WHERE clause, queries will still be slow. Yes, you can have too many indexes, as they do add time to inserting, updating and deleting records, but having more than one is not dangerous; it is a requirement for a system that performs well. And I have found that users tolerate a longer time to insert better than they tolerate a longer time to query.
Now the exception might be for a system that takes thousands of readings per second from some automated equipment. This is a database that generally doesn't have indexes to speed inserts. But usually these types of databases are also not used for reading, the data is transferred instead daily to a reporting database which is indexed.

Yes, definitely - too many indexes on a table can be worse than no indexes at all. However, I don't think there's any good in having the "at most one index per table" rule.
For SQL Server, my rule is:
index any foreign key fields - this helps JOINs and is beneficial to other queries, too
index any other fields when it makes sense, e.g. when lots of intensive queries can benefit from it
Finding the right mix of indices - weighing the pros of speeding up queries vs. the cons of additional overhead on INSERT, UPDATE, DELETE - is not an exact science - it's more about know-how, experience, measuring, measuring, and measuring again.
Any fixed rule is bound to be more counterproductive than anything else.
The best content on indexing comes from Kimberly Tripp, the Queen of Indexing; see her blog posts.

Unless you like very slow reads, you should have indexes. Don't go overboard, but don't be afraid of being liberal about them either. EVERY FK should be indexed. You're going to do a lookup on each of these columns on inserts to other tables to make sure the references are set. The index helps, as does the fact that indexed columns are used often in joins and selects.
We have some tables that are inserted into rarely, with millions of records. Some of these tables are also quite wide. It's not uncommon for these tables to have 15+ indexes. Other tables with heavy inserting and low reads we might only have a handful of indexes- but one index per table is crazy.

Updating an index is once per insert (per index). Speed gain is for every select. So if you update infrequently and read often, then the extra work may be well worth it.
If you do different selects (meaning the columns you filter on are different), then maintaining an index for each type of query is very useful. Provided you have a limited set of columns that you query often.
But the usual advice holds: if you want to know which is fastest: profile!
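
In SQL Server, one way to start that profiling is the index usage DMV, which counts how often each index has been read versus maintained. A small sketch, with dbo.MyTable as a placeholder table name:

```sql
-- dbo.MyTable is a placeholder; run this in the database that owns the table.
SELECT  i.name AS index_name,
        s.user_seeks + s.user_scans + s.user_lookups AS reads,   -- times the index served a query
        s.user_updates AS writes                                 -- times the index had to be maintained
FROM    sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON s.object_id   = i.object_id
      AND s.index_id    = i.index_id
      AND s.database_id = DB_ID()
WHERE   i.object_id = OBJECT_ID('dbo.MyTable')
ORDER BY reads DESC;
```

An index with many writes and few or no reads is a candidate for removal. The counters are reset when the instance restarts, so they only reflect the workload since then, and indexes that have not been touched at all show NULL.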

You should of course be careful not to create too many indexes per table, but only ever using a single index per table is not a useful level.
How many indexes to use depends on how the table is used. A table that is updated often would generally have fewer indexes than one that is read much more often than it's updated.
We have some tables that are updated regularly by a job every two minutes, but they are read often by queries that vary a lot, so they have several indexes. One table, for example, has 24 indexes.

So much depends on your schema and the queries that you normally run. For example: if you normally need to select above 60% of the rows of your table, indexes won't help you and it will be cheaper to table scan than to index scan and then lookup rows. Focused queries that select a small number of rows in different parts of the table or which are used for joins in queries will probably benefit from indexes. The right index in the right place can make or break a feature.
Indexes take space so making too many indexes on a table can be counter productive for the same reasons listed above. Scanning 5 indexes and then performing row lookups may be much more expensive than simply table scanning.
Good design is about knowing when to normalise and when not to.
If you frequently join on a particular column, check the IO plan with the index and without. As a general rule I avoid tables with more than 20 columns. This is often a sign that the data should be normalised. More than about 5 indexes on a table and you may be using more space for the indexes than the main table, be sure that is worth it. These rules are only the lightest of guidance and so much depends on how the data will be used in queries and what your data update profile looks like.
Experiment with your query plans to see how your solution improves or degrades with an index.

Every table must have a PK, which is indexed of course (generally a clustered one), then every FK should be indexed as well.
Finally, you may want to index fields that you often sort on, if their data is well differentiated: for a field with only 5 possible values in a table with 1 million records, an index will not be of great benefit.
I tend to be minimalistic with indexes until the db starts being well filled and ... slower. It is easy to identify the bottlenecks and add just the right indexes at that point.

Optimizing the retrieval with indexes must be carefully designed to reflect actual query patterns. Surely, for a table with Primary Key, you will have at least one clustered index (that's how data is actually stored), then any additional indexes are taking advantage of the layout of the data (clustered index).
After analyzing queries that execute against the table, you want to design an index(s) that cover them. That may mean building one or more indexes but that heavily depends on the queries themselves. That decision cannot be made just by looking at column statistics only.
For tables that see mostly inserts, e.g. ETL tables, you might not create primary keys at all, or you might drop the indexes and re-create them if the data changes too quickly or the table is dropped and recreated entirely.
I personally would be scared to step into an environment that has a hard-coded rule of indexes per table ratio.

SQL Server DELETE is slower with indexes

I have an SQL Server 2005 database, and I tried putting indexes on the appropriate fields in order to speed up the DELETE of records from a table with millions of rows (big_table has only 3 columns), but now the DELETE execution time is even longer! (1 hour versus 13 min, for example)
I have a relationship between two tables, and the column that I filter my DELETE by is in the other table. For example:
DELETE FROM big_table
WHERE big_table.id_product IN (
SELECT small_table.id_product FROM small_table
WHERE small_table.id_category = 1)
Btw, I've also tried:
DELETE FROM big_table
WHERE EXISTS
(SELECT 1 FROM small_table
WHERE small_table.id_product = big_table.id_product
AND small_table.id_category = 1)
and while it seems to run slightly faster than the first, it's still a lot slower with the indexes than without.
I created indexes on these fields:
big_table.id_product
small_table.id_product
small_table.id_category
My .ldf file grows a lot during the DELETE.
Why are my DELETE queries slower when I have indexes on my tables? I thought they were supposed to run faster.
UPDATE
Okay, the consensus seems to be that indexes will slow down a huge DELETE because the index has to be updated. Still, I don't understand why it can't DELETE all the rows at once and just update the index once at the end.
I was under the impression from some of my reading that indexes would speed up DELETE by making searches for fields in the WHERE clause faster.
Odetocode.com says:
"Indexes work just as well when searching for a record in DELETE and UPDATE commands as they do for SELECT statements."
But later in the article, it says that too many indexes can hurt performance.
Answers to Bob's questions:
55 million rows in table
42 million rows being deleted
Similar SELECT statement would not run (Exception of type 'System.OutOfMemoryException' was thrown)
I tried the following 2 queries:
SELECT * FROM big_table
WHERE big_table.id_product IN (
SELECT small_table.id_product FROM small_table
WHERE small_table.id_category = 1)
SELECT * FROM big_table
INNER JOIN small_table
ON small_table.id_product = big_table.id_product
WHERE small_table.id_category = 1
Both failed after running for 25 min with this error message from SQL Server 2005:
An error occurred while executing batch. Error message is: Exception of type 'System.OutOfMemoryException' was thrown.
The database server is an older dual core Xeon machine with 7.5 GB ram. It's my toy test database :) so it's not running anything else.
Do I need to do something special with my indexes after I CREATE them to make them work properly?

Indexes make lookups faster - like the index at the back of a book.
Operations that change the data (like a DELETE) are slower, as they involve manipulating the indexes. Consider the same index at the back of the book. You have more work to do if you add, remove or change pages because you have to also update the index.

I agree with Bob's comment above: if you are deleting large volumes of data from large tables, maintaining the indexes can take a while on top of deleting the data. It's the cost of doing business, though. As the data is deleted, you are causing reindexing events to happen.
With regard to the log file growth: if you aren't doing anything with your log files, you could switch to the simple recovery model, but I urge you to read up on the impact that might have on your IT department before you change.
If you need to do the delete in real time, it's often a good workaround to flag the data as inactive, either directly on the table or in another table, and exclude that data from queries; then come back later and delete the data when the users aren't staring at an hourglass. There is a second reason for deferring it: if you are deleting lots of data out of the table (which is what I am supposing based on your log file issue), then you will likely want to run an index defrag to reorganise the index. Doing that out of hours is the way to go if you don't like users on the phone!

JohnB is deleting about 75% of the data. I think the following would have been a possible solution and probably one of the faster ones. Instead of deleting the data, create a new table and insert the data that you need to keep. Create the indexes on that new table after inserting the data. Now drop the old table and rename the new one to the same name as the old one.
The above of course assumes that sufficient disk space is available to temporarily store the duplicated data.
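
The copy-and-swap steps described above could look roughly like this. It is a sketch only: big_table_keep and IX_big_table_id_product are hypothetical names, while big_table, small_table, id_product and id_category come from the question.

```sql
-- big_table_keep and IX_big_table_id_product are hypothetical names.
-- 1. Copy only the rows that should survive (those the DELETE would not remove).
SELECT b.*
INTO   big_table_keep
FROM   big_table AS b
WHERE  NOT EXISTS (SELECT 1
                   FROM   small_table AS s
                   WHERE  s.id_product  = b.id_product
                     AND  s.id_category = 1);

-- 2. Create the indexes once, after the data is loaded.
CREATE INDEX IX_big_table_id_product ON big_table_keep (id_product);

-- 3. Swap the tables.
DROP TABLE big_table;
EXEC sp_rename 'big_table_keep', 'big_table';
```

SELECT ... INTO copies columns and data but not constraints, keys, triggers or permissions, so those would have to be recreated on the new table, and any foreign key that references big_table has to be dropped before the old table can be.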

Try something like this to avoid a bulk delete (and thereby avoid log file growth):
declare @continue bit = 1
-- delete all ids not between starting and ending ids
while @continue = 1
begin
    set @continue = 0
    -- delete in chunks of 10,000; READPAST skips rows locked by other sessions
    delete top (10000) u
    from <tablename> u WITH (READPAST)
    where <condition>
    -- keep looping while the last pass still deleted rows
    if @@ROWCOUNT > 0
        set @continue = 1
end

You can also try the T-SQL extension to the DELETE syntax and check whether it improves performance:
DELETE FROM big_table
FROM big_table AS b
INNER JOIN small_table AS s ON (s.id_product = b.id_product)
WHERE s.id_category =1
