Fast deletion of many rows in a data warehouse - sql

In SQL Server 2008 I have a few million rows of data which need to be deleted. They are scattered across a handful of tables. Deletion takes up to 20 seconds, which I think is way too slow! The data to be deleted is identified by a timestamp column. Here is what I have done so far in order to optimize:
Using isolation level read uncommitted. I don't care about transactions. If we fail the user will issue the delete operation again. And new data is ensured not to have the timestamp we are deleting.
Deleting leaf tables before parent tables.
The timestamp column is part of the PK clustered index; in fact, it's in the first position of the PK/index.
Each table is emptied using a loop which deletes the top 200,000 entries at a time in order to reduce the transaction log overhead (see the sketch below).
Neither I/O nor CPU is maxed out on the server.
What have I overlooked?
Also, I am unsure about the effect of moving the timestamp column to the first position in the PK. After doing so, must I reorganize the tables, or is SQL Server smart enough to do this itself? My understanding of a clustered index is that since it defines the physical layout of the rows, changing it forces the data to be reorganized. But we have had no complaints from the customer that changing the clustered index took a long time to perform.
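For reference, the batched delete loop described above typically looks something like this (a sketch only; the table, column, and @cutoff value are placeholders, not the actual schema):
-- Sketch of a batched delete driven by the timestamp being purged (placeholder names).
DECLARE @cutoff datetime = '20240101';   -- placeholder value
WHILE 1 = 1
BEGIN
    DELETE TOP (200000) FROM dbo.FactTable
    WHERE LoadTimestamp = @cutoff;
    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to delete
END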

Please make sure the tables you want to delete data from have a primary key explicitly declared.
Wrong: create table myTable (ID int)
Right: create table myTable (ID int PRIMARY KEY)
In addition to that, try adding "option (recompile)", which can help performance:
DELETE FROM myTable
WHERE timestamp in (select timestamp from other_table)
OPTION (RECOMPILE)

Related

Am I supposed to drop and recreate indexes on tables every time I add data to it?

I'm currently using MSSQL Server, I've created a table with indexes on 4 columns. I plan on appending 1mm rows every month end. Is it customary to drop the indexes, and recreate them every time you add data to the table?
Don't recreate the index. Instead, you can use update statistics to compute the statistics for the given index or for the whole table:
UPDATE STATISTICS mytable myindex; -- statistics for the table index
UPDATE STATISTICS mytable; -- statistics for the whole table
I don't think it is customary, but it is not uncommon. Presumably the database would not be used for other tasks during the data load, otherwise, well, you'll have other problems.
It could save time and effort if you just disabled the indexes:
ALTER INDEX IX_MyIndex ON dbo.MyTable DISABLE
More info on this non-trivial topic can be found here. Note especially that disabling the clustered index will block all access to the table (i.e. don't do that). If the data being loaded is ordered in [clustered index] order, that can help some.
A last note: do some testing. 1MM rows doesn't seem like that much; the time you save may get used up by recreating the indexes.
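Note that a disabled nonclustered index is not switched back on automatically; after the load it has to be rebuilt (same hypothetical index name as above):
ALTER INDEX IX_MyIndex ON dbo.MyTable REBUILD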

dropping filtered statistics causes deadlocks

We have a large data warehouse database where we continuously get new rows inserted into 5 different tables, at the right-hand side of the b-tree (edit: I originally wrote left), i.e. at the end of the table.
This means that ordinary statistics very quickly become outdated with regard to the new data.
So we've changed our insert procedure to also do a CREATE STATISTICS FST_xxx with a WHERE clause that covers the latest two million rows.
This has ensured that we do not get incorrect execution plans.
Now we are stranded with hundreds of these.
We have a cleanup job that runs once a day and drops the unneeded statistics, but this causes a lot of deadlocks.
Is there a way to disable Filtered Statistics or to drop Filtered Statistics without causing deadlocks?
(Edit:) The table is clustered on a bigint Identity(1,1), ascending.
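For reference, the filtered statistics described above are created and dropped with statements of roughly this shape (a sketch; the table, column, and boundary value are placeholders):
-- Filtered statistics covering only the newest ~2 million rows (placeholder names/values).
CREATE STATISTICS FST_xxx
ON dbo.FactTable (SomeColumn)
WHERE ID > 900000000
WITH FULLSCAN;
-- The daily cleanup job then drops the ones that are no longer needed.
DROP STATISTICS dbo.FactTable.FST_xxx;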
Can you clarify where the rows are getting inserted? You said left-hand side of the b-tree, but you also said end of the table. Is it correct to assume this a clustered index you're talking about? And whether it's clustered or not, is the index key ascending? Thanks.
Why not try to update the statistics instead of creating a new one every time?

How to speed up a slow UPDATE query

I have the following UPDATE query:
UPDATE Indexer.Pages SET LastError=NULL where LastError is not null;
Right now, this query takes about 93 minutes to complete. I'd like to find ways to make this a bit faster.
The Indexer.Pages table has around 506,000 rows, and about 490,000 of them contain a value for LastError, so I doubt I can take advantage of any indexes here.
The table (when uncompressed) has about 46 gigs of data in it, however the majority of that data is in a text field called html. I believe simply loading and unloading that many pages is causing the slowdown. One idea would be to make a new table with just the Id and the html field, and keep Indexer.Pages as small as possible. However, testing this theory would be a decent amount of work since I actually don't have the hard disk space to create a copy of the table. I'd have to copy it over to another machine, drop the table, then copy the data back which would probably take all evening.
Ideas? I'm using Postgres 9.0.0.
UPDATE:
Here's the schema:
CREATE TABLE indexer.pages
(
id uuid NOT NULL,
url character varying(1024) NOT NULL,
firstcrawled timestamp with time zone NOT NULL,
lastcrawled timestamp with time zone NOT NULL,
recipeid uuid,
html text NOT NULL,
lasterror character varying(1024),
missingings smallint,
CONSTRAINT pages_pkey PRIMARY KEY (id ),
CONSTRAINT indexer_pages_uniqueurl UNIQUE (url )
);
I also have two indexes:
CREATE INDEX idx_indexer_pages_missingings
ON indexer.pages
USING btree
(missingings )
WHERE missingings > 0;
and
CREATE INDEX idx_indexer_pages_null
ON indexer.pages
USING btree
(recipeid )
WHERE NULL::boolean;
There are no triggers on this table, and there is one other table that has a FK constraint on Pages.PageId.
What @kgrittn posted as a comment is the best answer so far. I am merely filling in details.
Before you do anything else, you should upgrade PostgreSQL to a current version, or at least to the latest security release of your major version. See the versioning guidelines on the project website.
I also want to stress what Kevin mentioned about indexes involving the column LastError. Normally, HOT updates can recycle dead rows on a data page and make UPDATEs a lot faster - effectively removing (most of) the need for vacuuming. Related:
Redundant data in update statements
If your column is used in any index in any way, HOT UPDATEs are disabled, because that would break the index(es). If that is the case, you should be able to speed up the query a lot by dropping all of these indexes before the UPDATE and recreating them afterwards.
In this context it would help to run multiple smaller UPDATEs:
If ...
... the updated column is not involved in any indexes (enabling HOT updates).
... the UPDATE is easily divided into multiple patches in multiple transactions.
... the rows in those patches are spread out over the table (physically, not logically).
... there are no other concurrent transactions keeping dead tuples from being reused.
Then you would not need to VACUUM in between multiple patches, because HOT updates can reuse dead tuples directly - only dead tuples from previous transactions, not from the same or concurrent ones. You may want to schedule a VACUUM at the end of the operation, or just let auto-vacuuming do its job.
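One such patch could look roughly like this (a sketch only; the uuid range boundaries are placeholders, and any criterion that splits the table into batches would work just as well):
-- One slice, in its own transaction (placeholder uuid range).
BEGIN;
UPDATE indexer.pages
SET    lasterror = NULL
WHERE  lasterror IS NOT NULL
AND    id >= '00000000-0000-0000-0000-000000000000'
AND    id <  '40000000-0000-0000-0000-000000000000';
COMMIT;
-- Repeat with the next id range in a new transaction.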
The same could be done with any other index that is not needed for the UPDATE - and judging from your numbers the UPDATE is not going to use an index anyway. If you update large parts of your table, building new indexes from scratch is much faster than incrementally updating indexes with every changed row.
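For the two partial indexes from the question, that would be something like this (recreate them afterwards with the original CREATE INDEX statements shown above):
-- Drop the partial indexes before the big UPDATE ...
DROP INDEX indexer.idx_indexer_pages_missingings;
DROP INDEX indexer.idx_indexer_pages_null;
-- ... run the UPDATE, then recreate the indexes with the definitions from the question.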
Also, your UPDATE is not likely to break any foreign key constraints. You could try to drop & recreate those, too. This does open a time slot where referential integrity is not enforced. If the integrity is violated during the UPDATE, you get an error when trying to recreate the FK. If you do it all within one transaction, concurrent transactions never get to see the dropped FK, but you take a write lock on the table - same as with dropping / recreating indexes or triggers.
Lastly, disable & enable triggers that are not needed for the update.
Be sure to do all of this in one transaction. Maybe do it in a number of smaller patches, so it does not block concurrent operations for too long.
So:
BEGIN;
ALTER TABLE tbl DISABLE TRIGGER user; -- disable all self-made triggers
-- DROP indexes (& fk constraints ?)
-- UPDATE ...
-- RECREATE indexes (& fk constraints ?)
ALTER TABLE tbl ENABLE TRIGGER user;
COMMIT;
You cannot run VACUUM inside a transaction block. Per documentation:
VACUUM cannot be executed inside a transaction block.
You could split your operation into a few big chunks and run in between:
VACUUM ANALYZE tbl;
If you don't have to deal with concurrent transactions you could (even more effectively):
ALTER TABLE tbl DISABLE TRIGGER user; -- disable all self-made triggers
-- DROP indexes (& fk constraints ?)
-- Multiple UPDATEs with logical slices of the table
-- each slice in its own transaction.
-- VACUUM ANALYZE tbl; -- optionally in between, or autovacuum kicks in
-- RECREATE indexes (& fk constraints ?)
ALTER TABLE tbl ENABLE TRIGGER user;
UPDATE Indexer.Pages
SET LastError=NULL
;
The where clause is not needed since the NULL fields are already NULL, so it won't harm to set them to NULL again (I don't think this would affect performance significantly).
Given your number_of_rows = 500K and your table size=46G, I conclude that your average rowsize is 90KB. That is huge. Maybe you could move {unused, sparse} columns of your table to other tables?
Your theory is probably correct. Reading the full table (and then doing anything) is probably causing the slow-down.
Why don't you just create another table that has PageId and LastError? Initialize this with the data in the table you have now (which should take less than 93 minutes). Then, use the LastError from the new table.
At your leisure, you can remove LastError from your existing table.
By the way, I don't normally recommend keeping two copies of a column in two separate tables. In this case, though, you sound like you are stuck and need a way to proceed.
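A sketch of that approach (the side-table name is made up here; the FK assumes the new table's PageId corresponds to Pages.id):
-- A narrow side table holding just the error state (placeholder name).
CREATE TABLE indexer.page_errors (
    pageid    uuid PRIMARY KEY REFERENCES indexer.pages (id),
    lasterror character varying(1024)
);
-- Initialize it from the existing data (only rows that actually have an error; absence means NULL).
INSERT INTO indexer.page_errors (pageid, lasterror)
SELECT id, lasterror
FROM   indexer.pages
WHERE  lasterror IS NOT NULL;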

Is ROWID internally indexed unique by an SQL DBMS?

It's my understanding that the quickest way to access a particular row is by its ROWID. In INFORMIX-SE 7.3, when I do a SELECT ROWID FROM table I notice that its values are type SERIAL[INT]. In Oracle, they are SERIAL[HEX]. Has anyone ever used ROWID for any practical use? If I wanted to locate the most recent row added to a table, would SELECT MAX(ROWID) FROM table be quicker and more reliable than say SELECT MAX(pk_id) FROM table, where pk_id is a user-created SERIAL column? What other practical use have you ever put ROWID to work for you?
Your understanding is not necessarily correct. The ROWID property in SQL Server is primarily intended for replication, as a way to guarantee that the table has a single-field unique index value. This way the replication system does not have to account for any specific primary key semantics that your design might employ, while still being able to identify every row by a single column. No table is required to have a ROWID column unless it is part of a merge replication publication, so it's not something that every table has, unlike Oracle. It also doesn't serve the same purpose (they're GUIDs - or uniqueidentifier in T-SQL parlance - on SQL Server, and are random, not sequential integers like they are on Oracle).
The quickest way to retrieve a row from a table is by accessing the row via the clustered index. A table can only have one clustered index, as it's what determines the physical ordering of the rows on disk. Furthermore, if the table has a primary key, the primary key is by default the clustered index. While it's possible to declare a table without a primary key and assign the clustered index to something else, I can't (off the top of my head) fathom a reason why you'd want to do this (or, for practical purposes, how you could justify having a table without a primary key).
In short, that means that the quickest way to retrieve a row is by using the primary key of the table. Unless the ROWID column is the primary key (which is certainly possible), it isn't the fastest way.
Well, I can only really tell you how it works in Oracle, having used it for 19+ years :-)
Put simply, ROWID is an internal identifier that acts like a physical address. It can be split into database file no, block no, and row no, so obtaining the ROWID lets the db engine look the data up in a single direct I/O.
In an index, the B* tree has ROWIDs on the leaf nodes pointing directly to the location of the data, e.g. in a primary key index.
Being a physical address, it is subject to change on relocation on disk, which can happen after restoring a backup, rebuilding a table, or an export/import of data.
The db engine can do some tricks, e.g. when moving a transportable tablespace from one instance to another, to avoid rebuilding indexes, but this is strictly db engine internals.
So to keep out of trouble, leave the ROWID for internal use by the db engine. Storing the ROWID for your own use will eventually lead to inconsistency.
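To illustrate the file/block/row decomposition, Oracle's DBMS_ROWID package can split a ROWID into its components (my_table is a placeholder):
-- Show the physical components encoded in each ROWID (first few rows only).
SELECT ROWID,
       DBMS_ROWID.ROWID_RELATIVE_FNO(ROWID) AS file_no,
       DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID) AS block_no,
       DBMS_ROWID.ROWID_ROW_NUMBER(ROWID)   AS row_no
FROM   my_table
WHERE  ROWNUM <= 5;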
In Informix-SE, the ROWID is basically the record number within the C-ISAM file that is used to hold the table. SE only deals in fixed size records, of course (no VARCHAR data).
In Informix Dynamic Server, the ROWID is (a) more complex (page number plus slot number) and (b) not always present (fragmented tables do not expose the ROWID, unless the table was created WITH ROWIDS, in which case the ROWID is a physical column that is indexed after all) - be aware!
If no data is ever deleted and you are using SE, then selecting the row with the maximum ROWID will be the most recently added row. If a row is deleted, then that space will eventually be reused, and then the most recently added row ceases to be the one with the maximum ROWID. (IDS does not make that promise for a variety of complex reasons.)
The SE implementation of ROWID does not store the ROWID in the table, and does not create an index on it, but it does not need an index because it knows the formula for where to go to find the data (offset in data file = ROWID * RowSize), give or take a plus one on RowSize or a minus one ROWID or both.
As to practical use for ROWID, the style that was used before fragmentation was added to IDS was to select a list of ROWID values for the records of interest in the table, maintaining that list in memory:
SELECT ROWID
FROM InterestingTable
WHERE SomeColumn = xxx
AND AnotherColumn < yyy;
Then, the program could present these rows one at a time, fetching the current data via the stored ROWID. The ROWID for a record would not change while a program was running. This ensured that the current data - whether edited by the current user or someone else - was shown when the record was displayed.
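The per-row fetch in that pattern is then simply (a sketch; the placeholder is bound to a ROWID saved from the earlier query):
SELECT *
FROM   InterestingTable
WHERE  ROWID = ?;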
There's a program you're familiar with, ISQL Perform, that behaves like this. And it does not work with fragmented tables (necessarily in IDS; SE does not support fragmented tables) unless they are created with a physical ROWID column with the WITH ROWIDS clause.
Perhaps the term "RDBMS" rather than "an SQL server"?
Attaching any purpose to a ROWID is a bad idea. Particularly if you're in the habit of dropping and recreating tables. If your table needs a SERIAL PK, then that's what it should have. No good can come of using ROWIDs within your application.

Slow bulk insert for table with many indexes

I am trying to insert millions of records into a table that has more than 20 indexes.
In the last run it took more than 4 hours per 100,000 rows, and the query was cancelled after 3½ days...
Do you have any suggestions on how to speed this up?
(I suspect the many indexes to be the cause. If you also think so, how can I automatically drop the indexes before the operation and then recreate the same indexes afterwards?)
Extra info:
The space used by the indexes is about 4 times the space used by the data alone
The inserts are wrapped in a transaction per 100,000 rows.
Update on status:
The accepted answer helped me make it much faster.
You can disable and re-enable the indexes. Note that disabling them can have unwanted side effects (such as duplicate values slipping into primary keys or unique indexes), which will only be discovered when re-enabling the indexes.
--Disable Index
ALTER INDEX [IXYourIndex] ON YourTable DISABLE
GO
--Enable Index
ALTER INDEX [IXYourIndex] ON YourTable REBUILD
GO
This sounds like a data warehouse operation.
It would be normal to drop the indexes before the insert and rebuild them afterwards.
When you rebuild the indexes, build the clustered index first, and conversely drop it last. They should all have fillfactor 100%.
Code should be something like this
-- Collect the index names on the Fact table into IndexList, then drop each index if it exists
if object_id('IndexList') is not null drop table IndexList
select name into IndexList from dbo.sysindexes where id = object_id('Fact')
if exists (select name from IndexList where name = 'id1') drop index Fact.id1
if exists (select name from IndexList where name = 'id2') drop index Fact.id2
if exists (select name from IndexList where name = 'id3') drop index Fact.id3
.
.
BIG INSERT
RECREATE THE INDEXES
As noted by another answer disabling indexes will be a very good start.
4 hours per 100,000 rows
[...]
The inserts are wrapped in a transaction per 100,000 rows.
You should look at reducing that number: the server has to maintain a huge amount of state while in a transaction (so that it can be rolled back), and this (along with the indexes) means adding data is very hard work.
Why not wrap each insert statement in its own transaction?
Also look at the nature of the SQL you are using: are you adding one row per statement (and network round trip), or many rows at once?
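If it is one row per statement, batching several rows into a single statement already saves round trips; for example (placeholder table and columns, using SQL Server 2008 row constructors):
-- Several rows in one statement / one round trip (placeholder names).
INSERT INTO dbo.Fact (Col1, Col2)
VALUES (1, 'a'),
       (2, 'b'),
       (3, 'c');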
Disabling and then re-enabling indices is frequently suggested in those cases. I have my doubts about this approach though, because:
(1) The application's DB user needs schema alteration privileges, which it normally should not possess.
(2) The chosen insert approach and/or index schema might be less than optimal in the first place; otherwise, rebuilding complete index trees should not be faster than some decent batch inserting (e.g. the client issuing one insert statement at a time, causing thousands of server round trips; or a poor choice of clustered index, leading to constant index node splits).
That's why my suggestions look a little bit different:
Increase ADO.NET BatchSize
Choose the target table's clustered index wisely, so that inserts won't lead to clustered index node splits. Usually an identity column is a good choice.
Let the client insert into a temporary heap table first (heap tables don't have any clustered index); then issue one big insert-into-select statement to push all of that staging table's data into the actual target table (see the sketch at the end of this answer)
Apply SqlBulkCopy
Decrease transaction logging by choosing the bulk-logged recovery model
You might find more detailed information in this article.
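A minimal sketch of the staging-table idea from the list above (table and column names are placeholders; the real target table will differ):
-- 1. Load into a heap staging table first (no clustered index, no constraints).
CREATE TABLE dbo.Fact_Staging (
    Col1 int      NOT NULL,
    Col2 datetime NOT NULL,
    Col3 money    NULL
);
-- ... bulk-load into dbo.Fact_Staging here (e.g. via SqlBulkCopy or BULK INSERT) ...
-- 2. Then push everything into the real table with one set-based statement.
INSERT INTO dbo.Fact (Col1, Col2, Col3)
SELECT Col1, Col2, Col3
FROM   dbo.Fact_Staging;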