Oracle select statement caching during large insert has erratic performance

In a mid-tier server that implements an API with granular store methods, we map each store call to either an insert or an update statement, depending on whether the entity already exists; to determine that, we issue a select across 3 tables joined on natural keys. (Our 2nd implementation uses a zero-caching architecture in the mid-tier server, for easy parallelism, so we always go back to the DB.)
For a given import, we load, for example, 180,000 entities into one of our leaf tables. Each entity has a parent foreign key into a parent table (roughly 9,000 entities), and each parent row is owned by a 3rd, top-level table with only 200 rows. All 3 tables use a GUID surrogate key as the primary key, and they have a (project, parent, name) composite natural key (the top-level table has no parent column/key).
The parent columns are properly indexed foreign keys, and the natural keys all have UNIQUE indexes. We are using Oracle 12c, via OCI in C/C++.
The ideal plan for that top-parent-leaf join select is to use the NK indexes, since they always yield a single row, but in the past we've seen bad performance when the query planner chose the less selective FK indexes.
The scalar select normally takes 0.25 ms (1/4 ms) on average. All those select statements are cached with bind variables, and we just update the bound values and execute() to get the result. When working normally (in schema A), this takes for example 500 s to import 240,000 rows across 6 tables. But in a different schema (B), it's taking forever, and under the debugger (Release mode) I can see that 1/4 ms query taking around 80 ms, way more than normal. (180,000 * 80 ms is over 4 hours... just for the selects that map NKs to the SK/PK and tell us whether we need an insert or an update.)
So obviously the query planner is using a wrong plan, not the normal highly selective one using the NK indexes.
My question is how do we avoid the terrible plan?
In particular, those select statements (6 of them, all cached) are prepared at the beginning of the import. At that time the schema could be empty (though it may have held plenty of rows before), and in any case it will end up filled with rows, sometimes in the millions. So a plan decided on an empty schema may be bad once the schema starts to be quite full, no?
The import is transactional, and all part of a single transaction.
Are we supposed to re-prepare the cached select statements as more rows are added? Would the plan be any different, i.e. would the stats be any different, given that we are inside a large transaction anyway?
Should we force the use of the NK indexes (always the best ones) with query hints, even though Tom Kyte says never to use them?
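For illustration, a hint-forced version of one of those lookups might look something like this (the table, column, and index names below are made up, not our real schema):
SELECT /*+ INDEX(t top_nk_ux) INDEX(p parent_nk_ux) INDEX(l leaf_nk_ux) */
       l.guid_id
FROM   top_table    t
JOIN   parent_table p ON p.top_fk    = t.guid_id
JOIN   leaf_table   l ON l.parent_fk = p.guid_id
WHERE  t.name = :top_name
AND    p.name = :parent_name
AND    l.name = :leaf_name;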
In general, when doing a mix of inserts/updates and selects in a large transaction that can significantly affect schema statistics, how do we keep the select plans optimal?
I guess updating stats is DDL and thus cannot be performed inside that transaction either. Is there a known strategy to avoid the performance chasm we sometimes fall into (from 500 s / ~9 min to several hours)?
Thanks for any insight about the above.

Related

Selecting one column from a table that has 100 columns

I have a table with 100 columns (yes, a code smell and arguably a suboptimal design). The table has an 'id' as PK. No other column is indexed.
So, if I fire a query like:
SELECT first_name from EMP where id = 10
Will SQL Server (or any other RDBMS) have to load the entire row (all columns) into memory and then return only the first_name?
(In other words, the page that contains the row with id = 10, if it isn't in memory already.)
I think the answer is yes, unless it has column markers within a row. I understand there might be optimization techniques, but is that the default behavior?
[EDIT]
After reading some of your comments, I realized I asked an XY question unintentionally. Basically, we have tables with 100s of millions of rows with 100 columns each and receive all sorts of SELECT queries on them. The WHERE clause also changes but no incoming request needs all columns. Many of those cell values are also NULL.
So, I was thinking of exploring a column-oriented database to achieve better compression and faster retrieval. My understanding is that column-oriented databases will load only the requested columns. Yes, compression will help too, saving space and hopefully improving performance as well.
For MySQL: indexes and data are stored in "blocks" of 16KB. Each level of the B+Tree holding the PRIMARY KEY (in your case) needs to be accessed; for example, with a million rows that is about 3 levels, i.e. 3 block accesses. Within the leaf block there are probably dozens of rows, with all their columns (unless a column is "too big", but that is a different discussion).
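As a rough sanity check on that figure: if each 16KB block holds on the order of 100 key entries, a million-row PRIMARY KEY tree is about log100(1,000,000) = 3 levels deep, i.e. roughly 3 block reads to reach the leaf.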
For MariaDB's Columnstore: the contents of one column for 64K rows are held in a packed, compressed structure that varies in size and layout. Before getting to that, the clump of 64K rows must be located; after getting it, it must be unpacked.
In both cases, the structure of the data on disk is a compromise between speed and space, for both simple and complex queries.
Your simple query is easy and efficient to do in a regular RDBMS, but messier to do in a Columnstore. Columnstore is a niche market in which your query is atypical.
Be aware that fetching blocks is typically the slowest part of performing the query, especially when I/O is required. There is a cache of blocks in RAM.
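For what it's worth, in a row-store you can also avoid touching the wide rows for that particular query with a covering index; a sketch in SQL Server syntax (the index name is made up):
CREATE INDEX IX_EMP_Id_FirstName
    ON EMP (id)
    INCLUDE (first_name);
With that in place, SELECT first_name FROM EMP WHERE id = 10 can be answered entirely from the narrower index pages.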

Is it possible to do usual atomic INSERT operation but update Indexes asynchronously?

Indexes make reads fast but writes slower. But why can't you have single writes and have the DB update indexes asynchronously over time, also caching the INSERT until it's indexed?
Is there any database like that?
Converting my comments to an answer:
Indexes make reads fast but writes slower
That's an oversimplification and it's also misleading.
Indexes make data lookups faster because the DBMS doesn't need to do a table-scan to find rows matching a predicate (the WHERE part of a query). Indexes don't make "reads" any faster (that's entirely dependent on the characteristics of your disk IO) and when used improperly they can sometimes even make queries slower (for reasons I won't get into).
I want to stress that the additional cost of writing to a single index, or even multiple indexes, when executing a DML statement (INSERT/UPDATE/DELETE/MERGE/etc) is negligible, really! (In actuality: foreign-key constraints are a much bigger culprit - and I note you can practically eliminate the cost of foreign-key constraint checking by adding additional indexes!). Indexes are primarily implemented using B-trees (a B-tree is essentially like a binary-tree, except rather than each node having only 2 children it can have many children because each tree-node comes with unused space for all those child node pointers, so inserting into the middle of a B-tree won't require data to be moved-around on-disk unlike with other kinds of trees, like a heap-tree).
Consider this QA where a Postgres user (like yourself) reports inserting 10,000 rows into a table. Without an index it took 78 ms; with an index it took 84 ms. That's only a ~7.5% increase, which at that scale (6 ms!) is so small it may as well be a rounding error or caused by IO scheduling. That should be proof enough that this isn't something to worry about without actual hard data showing it's a problem for you and your application.
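If you want to reproduce that kind of comparison yourself, a minimal sketch in PostgreSQL might be (table and index names are made up; EXPLAIN ANALYZE executes the statement and reports its run time):
CREATE TABLE bench_plain   (id int, payload text);
CREATE TABLE bench_indexed (id int, payload text);
CREATE INDEX bench_indexed_id ON bench_indexed (id);
EXPLAIN ANALYZE
INSERT INTO bench_plain
SELECT g, md5(g::text) FROM generate_series(1, 10000) g;
EXPLAIN ANALYZE
INSERT INTO bench_indexed
SELECT g, md5(g::text) FROM generate_series(1, 10000) g;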
I assume you have this negative impression about indexes after reading an article like this one, which certainly gives the impression that "indexes are bad" - but while the points mentioned in that article are not wrong, there are a LOT of problems with that article, so you shouldn't take it dogmatically. (I'll list my concerns with that article in the footer.)
But why can't you have single writes and have the DB update indexes asynchronously over time
By this I assume you mean you'd like a DBMS to do a single-row INSERT by simply appending a new record to the end of a table and immediately returning, with the DBMS's housekeeping system updating the indexes at some arbitrary later point.
The problem with that is that it breaks the A, C, and I parts of the A.C.I.D. model.
Indexes are used for more than just avoiding table-scans: they're also used to store copies of table data for the benefit of queries that use the index and also need (for example) a small subset of the table's data, which significantly reduces disk reads. For this reason, RDBMSs (and ISO SQL) allow indexes to include non-indexed data using the INCLUDE clause.
Consider this scenario:
CREATE INDEX IX_Owners ON cars ( ownerId ) INCLUDE ( colour );
CREATE INDEX IX_Names ON people ( name ) INCLUDE ( personId, hairColour );
GO
SELECT
    people.name,
    people.hairColour,
    cars.colour
FROM
    cars
    INNER JOIN people ON people.personId = cars.ownerId
WHERE
    people.name LIKE 'Steve%'
The above query will not need to read either the cars or people tables on-disk. The DBMS will be able to fully answer the query using data only in the index - which is great because indexes tend to exist in a small number of pages on-disk which tend to be in proximal-locality which is good for performance because it means it will use sequential IO which scales much better than random IO.
The RDBMS will perform a string-prefix index-scan of the people.IX_Names index to get all of the personId (and hairColour) values, then it will look up those personId values against the cars.IX_Owners index and get the cars.colour value from the copy of the data inside IX_Owners, without needing to read the tables directly.
Now, assume that another database client has just finished inserting a load of records into the cars and/or people table, with a COMMIT TRANSACTION just for good measure, and that the RDBMS uses your idea of only updating indexes later whenever it feels like it. If that same database client re-runs the query from above, it would return stale data (i.e. wrong data), because the query uses the index, but the index is old.
In addition to using index tree nodes to store copies of table data to avoid non-proximal disk IO, many RDBMSs also use index trees to store entire copies - even multiple copies - of table data, to enable other scenarios such as columnar data storage and indexed VIEWs; both of these features absolutely require that indexes be updated atomically with table data.
Is there any database like that?
Yes, they exist - but they're not widely used (or they're niche) because for the vast majority of applications it's entirely undesirable behaviour for the reasons described above.
There are distributed databases that are designed around eventual consistency, but clients (and the entire application code) need to be designed with that in mind, and it's a huge PITA to have to redesign a data-centric application to support eventual consistency, which is why you only really see them being used in truly massive systems (like Facebook, Google, etc.) where availability (uptime) is more important than users seeing stale data for a few minutes.
Footnote:
Regarding this article: https://use-the-index-luke.com/sql/dml/insert
The number of indexes on a table is the most dominant factor for insert performance. The more indexes a table has, the slower the execution becomes. The insert statement is the only operation that cannot directly benefit from indexing because it has no where clause.
I disagree. I'd argue that foreign-key constraints (and triggers) are far more likely to have a larger detrimental effect on DML operations.
Adding a new row to a table involves several steps. First, the database must find a place to store the row. For a regular heap table—which has no particular row order—the database can take any table block that has enough free space. This is a very simple and quick process, mostly executed in main memory. All the database has to do afterwards is to add the new entry to the respective data block.
I agree with this.
If there are indexes on the table, the database must make sure the new entry is also found via these indexes. For this reason it has to add the new entry to each and every index on that table. The number of indexes is therefore a multiplier for the cost of an insert statement.
This is true, but I don't know if I agree that it's a "multiplier" of the cost of an insert.
For example, consider a table with hundreds of nvarchar(1000) columns and several int columns - and there are separate indexes for each int column (with no INCLUDE columns). If you're inserting 100 rows of around a megabyte each, all at once (using an INSERT INTO ... SELECT FROM statement), the cost of updating those int indexes is very likely to require much less IO than the table data.
Moreover, adding an entry to an index is much more expensive than inserting one into a heap structure because the database has to keep the index order and tree balance. That means the new entry cannot be written to any block—it belongs to a specific leaf node. Although the database uses the index tree itself to find the correct leaf node, it still has to read a few index blocks for the tree traversal.
I strongly disagree with this, especially the first sentence: "adding an entry to an index is much more expensive than inserting one into a heap structure".
Indexes in RDBMSs today are invariably based on B-trees, not binary-trees or heap-trees. B-trees are essentially like binary-trees, except that each node has built-in space for dozens of child node pointers and B-trees are only rebalanced when a node fills its internal child pointer list, so a B-tree node insert will be considerably cheaper than the article is saying, because each node will have plenty of empty space for a new insertion without needing to re-balance itself or perform any other relatively expensive operation (besides, DBMSs can and do perform index maintenance separately and independently of any DML statement).
The article is correct about how the DBMS will need to traverse the B-tree to find the node to insert into, but index nodes are efficiently arranged on-disk, such as keeping related nodes in the same disk page, which minimizes index IO reads (assuming they aren't already loaded into memory first). If an index tree is too big to store in-memory, the RDBMS can always keep a "meta-index" in-memory so it could potentially find the correct B-tree node almost instantly, without needing to traverse the B-tree from the root.
Once the correct leaf node has been identified, the database confirms that there is enough free space left in this node. If not, the database splits the leaf node and distributes the entries between the old and a new node. This process also affects the reference in the corresponding branch node as that must be duplicated as well. Needless to say, the branch node can run out of space as well so it might have to be split too. In the worst case, the database has to split all nodes up to the root node. This is the only case in which the tree gains an additional layer and grows in depth.
In practice this isn't a problem, because the RDBMS's index maintenance will ensure there's sufficient free space in each index node.
The index maintenance is, after all, the most expensive part of the insert operation. That is also visible in Figure 8.1, “Insert Performance by Number of Indexes”: the execution time is hardly visible if the table does not have any indexes. Nevertheless, adding a single index is enough to increase the execute time by a factor of a hundred. Each additional index slows the execution down further.
I feel the article is being dishonest by suggesting (implying? stating?) that index-maintenance happens with every DML. This is not true. This may have been the case with some early dBase-era databases, but this is certainly not the case with modern RDBMS like Postgres, MS SQL Server, Oracle and others.
Considering insert statements only, it would be best to avoid indexes entirely—this yields by far the best insert performance.
Again, this claim in the article is not wrong, but it's basically saying if you want a clean and tidy house you should get rid of all of your possessions. Indexes are a fact of life.
However tables without indexes are rather unrealistic in real world applications. You usually want to retrieve the stored data again so that you need indexes to improve query speed. Even write-only log tables often have a primary key and a respective index.
Indeed.
Nevertheless, the performance without indexes is so good that it can make sense to temporarily drop all indexes while loading large amounts of data—provided the indexes are not needed by any other SQL statements in the meantime. This can unleash a dramatic speed-up which is visible in the chart and is, in fact, a common practice in data warehouses.
Again, with modern RDBMSs this isn't necessary. If you do a batch insert, an RDBMS won't update indexes until after the table data has finished being modified, as a batch index update is cheaper than many individual updates. Similarly, I expect that multiple DML statements and queries inside an explicit BEGIN TRANSACTION may cause an index-update deferral, provided no subsequent query in the transaction relies on an updated index.
But my biggest issue with that article is that the author is making these bold claims about detrimental IO performance without providing any citations or even benchmarks they've run themselves. It's even more galling that they posted a bar-chart with arbitrary numbers on it, again without any citation or raw benchmark data and instructions for how to reproduce their results. Always demand citations and evidence from anything you read making claims: the only claims anyone should accept without evidence are logical axioms - and a quantitative claim about database index IO cost is not a logical axiom :)
For PostgreSQL GIN indexes, there is the fastupdate feature. This stores new index entries in an unordered, unconsolidated area, waiting for some other process to file them away into the main index structure. But this doesn't directly match up with what you want: it is mostly designed so that the index updates are done in bulk (which can be more IO efficient), rather than in the background. Once the unconsolidated area gets large enough, a foreground process might take on the task of filing them away, and it can be hard to tune the settings so that this is always done by a background process instead of a foreground process. And it only applies to GIN indexes. (With the btree_gin extension, you can create GIN indexes on regular scalar columns rather than the array-like columns it usually works with.) While waiting for the entries to be consolidated, every query will have to sequentially scan the unconsolidated buffer area, so delaying the updates for the sake of INSERTs can come at a high cost for SELECTs.
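For what it's worth, a minimal sketch of that setup (table and column names are made up; the gin_pending_list_limit storage parameter needs PostgreSQL 9.5+):
CREATE TABLE docs (id int, tags text[]);
-- fastupdate is on by default for GIN indexes; shown explicitly here.
CREATE INDEX docs_tags_gin ON docs USING gin (tags) WITH (fastupdate = on);
-- Cap the per-index pending list (value is in kB) so foreground flushes stay small.
ALTER INDEX docs_tags_gin SET (gin_pending_list_limit = 4096);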
There are more general techniques to do something like this, such as fractal tree indexes. But these are not implemented in PostgreSQL, and wherever they are implemented they seem to be proprietary.

What is the best way to ensure consistent ordering in an Oracle query?

I have a program that needs to run queries on a number of very large Oracle tables (the largest with tens of millions of rows). The output of these queries is fed into another process which (as a side effect) can record the progress of the query (i.e., the last row fetched).
It would be nice if, in the event that the task stopped half way through for some reason, it could be restarted. For this to happen, the query has to return rows in a consistent order, so it has to be sorted. The obvious thing to do is to sort on the primary key; however, there is probably going to be a penalty for this in terms of performance (an index access) versus a non-sorted solution. Given that a restart may never happen this is not desirable.
Is there some trick to ensure consistent ordering in another way? Any other suggestions for maintaining performance in this case?
EDIT: I have been looking around and seen "order by rowid" mentioned. Is this useful or even possible?
EDIT2: I am adding some benchmarks:
With no order by: 17 seconds.
With order by PK: 46 seconds.
With order by rowid: 43 seconds.
So any order by has a savage effect on performance, and using rowid makes little difference. Accepted answer is - there is no easy way to do it.
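For concreteness, the restart-on-PK idea being weighed above would look something like this sketch (table and bind names are hypothetical):
SELECT *
FROM   big_table
WHERE  id > :last_fetched_id   -- restart point recorded by the downstream process
ORDER BY id;
That is, consistent PK ordering plus a remembered position; the ORDER BY is what costs the extra time measured above.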
The best advice I can think of is to reduce the chance of a problem occurring that might stop the process, and that means keeping the code simple. No cursors, no commits, no trying to move part of the data, just straight SQL statements.
Unless a complete restart would be a completely unacceptable disaster, I'd go for simplicity without any part-way restart code at all.
If you want some order and the queried data is unsorted, then you need to sort it anyway and spend some resources on that sorting.
So, there are at least two variants for optimization:
Minimize resources spent on sorting;
Query already sorted data.
For the first variant, Oracle itself calculates the best plan to minimize data access and overall query time. It may be possible to choose a sort order that matches a unique index already used by the optimizer, but it's a very questionable tactic.
The second variant is about index-organized tables and about forcing Oracle, with hints, to use a specific index. It seems OK if you need to process nearly all records in a specific table, but if the selectivity of the query is high it significantly slows the process down, even on a single table.
Think about a table with a surrogate primary key which holds a 10-year transaction history. If you need data only for the previous year and you force ordering by the primary key, then Oracle needs to process records from all 10 years one by one to find all the records that belong to a single year.
But if you need 9 years of data from this table, then a full table scan may be faster than the index-based approach.
So the selectivity of your query is the key to choosing between a full table scan and result sorting.
For storing results and restarting the query, a good solution is to use Oracle Streams Advanced Queuing to feed the other process.
All unprocessed messages in the queue are redirected to an exception queue, where they may be processed separately.
Because you don't specify an exact ordering for the selected messages, I suppose you need ordering only to keep track of the unprocessed part of the records. If that's true, then with AQ you don't need ordering at all and may even process records in parallel.
So, finally, from my point of view a buffered queue is what you really need.
You could skip ordering and just update the records you processed with something like SET is_processed = 'Y' or SET date_processed = sysdate. Complete restartability and no ordering.
For performance you can partition by is_processed. Yes, partition key changes might be slow, but it is all about trade-offs.
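A sketch of how that could look (Oracle syntax; the table and column names are hypothetical):
-- Feed the downstream process only the unprocessed rows.
SELECT id, payload
FROM   big_table
WHERE  is_processed = 'N';
-- As each row (or batch) is handed off and confirmed, mark it done.
UPDATE big_table
SET    is_processed = 'Y',
       date_processed = SYSDATE
WHERE  id = :processed_id;
COMMIT;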

SQL Server Efficiently dropping a group of rows with millions and millions of rows

I recently asked this question:
MS SQL share identity seed amongst tables
(Many people wondered why)
I have the following layout of a table:
Table: Stars
starId bigint
categoryId bigint
starname varchar(200)
But my problem is that I have millions and millions of rows. So when I want to delete stars from the table Stars it is too intense on SQL Server.
I cannot use built in partitioning for 2005+ because I do not have an enterprise license.
When I do delete though, I always delete a whole category Id at a time.
I thought of doing a design like this:
Table: Star_1
starId bigint
CategoryId bigint constraint rock=1
starname varchar(200)
Table: Star_2
starId bigint
CategoryId bigint constraint rock=2
starname varchar(200)
In this way I can delete a whole category and hence millions of rows in O(1) by doing a simple drop table.
My question is, is it a problem to have hundreds of thousands of tables in your SQL Server? The drop in O(1) is extremely desirable to me. Maybe there's a completely different solution I'm not thinking of?
Edit:
Is a star ever modified once it is inserted? No.
Do you ever have to query across star categories? I never have to query across star categories.
If you are looking for data on a particular star, would you know which table to query? Yes
When entering data, how will the application decide which table to put the data into? The insertion of star data is done all at once at the start when the categoryId is created.
How many categories will there be? You can assume there will be infinite star categories. Let's say up to 100 star categories per day and up to 30 star categories not needed per day.
Truly do you need to delete the whole category or only the star that the data changed for? Yes the whole star category.
Have you tried deleting in batches? Yes we do that today, but it is not good enough.
Another technique is mark the record for deletion? There is no need to mark a star as deleted because we know the whole star category is eligible to be deleted.
What proportion of them never get used? Typically we keep each star category data for a couple weeks but sometimes need to keep more.
When you decide one is useful is that good for ever or might it still need to be deleted later?
Not forever, but until a manual request to delete the category is issued.
If so what % of the time does that happen? Not that often.
What kind of disc arrangement are you using? Single filegroup storage and no partitioning currently.
Can you use SQL Enterprise? No. There are many people that run this software and they only have SQL Standard. It is outside of their budget to get MS SQL Enterprise.
My question is, is it a problem to have hundreds of thousands of tables in your SQL Server?
Yes. It is a huge problem to have this many tables in your SQL Server. Every object has to be tracked by SQL Server as metadata, and once you include indexes, referential constraints, primary keys, defaults, and so on, then you are talking about millions of database objects.
While SQL Server may theoretically be able to handle 2^32 objects, rest assured that it will start buckling under the load much sooner than that.
And if the database doesn't collapse, your developers and IT staff almost certainly will. I get nervous when I see more than a thousand tables or so; show me a database with hundreds of thousands and I will run away screaming.
Creating hundreds of thousands of tables as a poor-man's partitioning strategy will eliminate your ability to do any of the following:
Write efficient queries (how do you SELECT multiple categories?)
Maintain unique identities (as you've already discovered)
Maintain referential integrity (unless you like managing 300,000 foreign keys)
Perform ranged updates
Write clean application code
Maintain any sort of history
Enforce proper security (it seems evident that users would have to be able to initiate these create/drops - very dangerous)
Cache properly - 100,000 tables means 100,000 different execution plans all competing for the same memory, which you likely don't have enough of;
Hire a DBA (because rest assured, they will quit as soon as they see your database).
On the other hand, it's not a problem at all to have hundreds of thousands of rows, or even millions of rows, in a single table - that's the way SQL Server and other SQL RDBMSes were designed to be used and they are very well-optimized for this case.
The drop in O(1) is extremely desirable to me. Maybe there's a completely different solution I'm not thinking of?
The typical solution to performance problems in databases is, in order of preference:
Run a profiler to determine what the slowest parts of the query are;
Improve the query, if possible (i.e. by eliminating non-sargable predicates);
Normalize or add indexes to eliminate those bottlenecks;
Denormalize when necessary (not generally applicable to deletes);
If cascade constraints or triggers are involved, disable those for the duration of the transaction and blow out the cascades manually.
But the reality here is that you don't need a "solution."
"Millions and millions of rows" is not a lot in a SQL Server database. It is very quick to delete a few thousand rows from a table of millions by simply indexing on the column you wish to delete from - in this case CategoryID. SQL Server can do this without breaking a sweat.
In fact, deletions normally have an O(M log N) complexity (N = number of rows, M = number of rows to delete). In order to achieve an O(1) deletion time, you'd be sacrificing almost every benefit that SQL Server provides in the first place.
O(M log N) may not be as fast as O(1), but the kind of slowdowns you're talking about (several minutes to delete) must have a secondary cause. The numbers do not add up, and to demonstrate this, I've gone ahead and produced a benchmark:
Table Schema:
CREATE TABLE Stars
(
    StarID int NOT NULL IDENTITY(1, 1)
        CONSTRAINT PK_Stars PRIMARY KEY CLUSTERED,
    CategoryID smallint NOT NULL,
    StarName varchar(200)
)
CREATE INDEX IX_Stars_Category
    ON Stars (CategoryID)
Note that this schema is not even really optimized for DELETE operations; it's a fairly run-of-the-mill table schema you might see in SQL Server. If this table has no relationships, then we don't need the surrogate key or clustered index (or we could put the clustered index on the category). I'll come back to that later.
Sample Data:
This will populate the table with 10 million rows, using 500 categories (i.e. a cardinality of 1:20,000 per category). You can tweak the parameters to change the amount of data and/or cardinality.
SET NOCOUNT ON
DECLARE
    @BatchSize int,
    @BatchNum int,
    @BatchCount int,
    @StatusMsg nvarchar(100)
SET @BatchSize = 1000
SET @BatchCount = 10000
SET @BatchNum = 1
WHILE (@BatchNum <= @BatchCount)
BEGIN
    SET @StatusMsg =
        N'Inserting rows - batch #' + CAST(@BatchNum AS nvarchar(5))
    RAISERROR(@StatusMsg, 0, 1) WITH NOWAIT
    INSERT Stars (CategoryID, StarName)
    SELECT
        v.number % 500,
        CAST(RAND() * v.number AS varchar(200))
    FROM master.dbo.spt_values v
    WHERE v.type = 'P'
        AND v.number >= 1
        AND v.number <= @BatchSize
    SET @BatchNum = @BatchNum + 1
END
Profile Script
The simplest of them all...
DELETE FROM Stars
WHERE CategoryID = 50
Results:
This was tested on a 5-year-old workstation running, IIRC, a 32-bit dual-core AMD Athlon and a cheap 7200 RPM SATA drive.
I ran the test 10 times using different CategoryIDs. The slowest time (cold cache) was about 5 seconds. The fastest time was 1 second.
Perhaps not as fast as simply dropping the table, but nowhere near the multi-minute deletion times you mentioned. And remember, this isn't even on a decent machine!
But we can do better...
Everything about your question implies that this data isn't related. If you don't have relations, you don't need the surrogate key, and can get rid of one of the indexes, moving the clustered index to the CategoryID column.
Now, as a rule, clustered indexes on non-unique/non-sequential columns are not a good practice. But we're just benchmarking here, so we'll do it anyway:
CREATE TABLE Stars
(
    CategoryID smallint NOT NULL,
    StarName varchar(200)
)
CREATE CLUSTERED INDEX IX_Stars_Category
    ON Stars (CategoryID)
Run the same test data generator on this (incurring a mind-boggling number of page splits) and the same deletion took an average of just 62 milliseconds, and 190 ms from a cold cache (an outlier). For reference, if the index is made nonclustered (with no clustered index at all), then the delete time only goes up to an average of 606 ms.
Conclusion:
If you're seeing delete times of several minutes - or even several seconds then something is very, very wrong.
Possible factors are:
Statistics aren't up to date (shouldn't be an issue here, but if it is, just run sp_updatestats);
Lack of indexing (although, curiously, removing the IX_Stars_Category index in the first example actually leads to a faster overall delete, because the clustered index scan is faster than the nonclustered index delete);
Improperly-chosen data types. If you only have millions of rows, as opposed to billions, then you do not need a bigint on the StarID. You definitely don't need it on the CategoryID - if you have fewer than 32,768 categories then you can even do with a smallint. Every byte of unnecessary data in each row adds an I/O cost.
Lock contention. Maybe the problem isn't actually delete speed at all; maybe some other script or process is holding locks on Star rows and the DELETE just sits around waiting for them to let go.
Extremely poor hardware. I was able to run this without any problems on a pretty lousy machine, but if you're running this database on a '90s-era Presario or some similar machine that's preposterously unsuitable for hosting an instance of SQL Server, and it's heavily-loaded, then you're obviously going to run into problems.
Very expensive foreign keys, triggers, constraints, or other database objects which you haven't included in your example, which might be adding a high cost. Your execution plan should clearly show this (in the optimized example above, it's just a single Clustered Index Delete).
I honestly cannot think of any other possibilities. Deletes in SQL Server just aren't that slow.
If you're able to run these benchmarks and see roughly the same performance I saw (or better), then it means the problem is with your database design and optimization strategy, not with SQL Server or the asymptotic complexity of deletions. I would suggest, as a starting point, to read a little about optimization:
SQL Server Optimization Tips (Database Journal)
SQL Server Optimization (MSDN)
Improving SQL Server Performance (MSDN)
SQL Server Query Processing Team Blog
SQL Server Performance (particularly their tips on indexes)
If this still doesn't help you, then I can offer the following additional suggestions:
Upgrade to SQL Server 2008, which gives you a myriad of compression options that can vastly improve I/O performance;
Consider pre-compressing the per-category Star data into a compact serialized list (using the BinaryWriter class in .NET), and store it in a varbinary column. This way you can have one row per category. This violates 1NF rules, but since you don't seem to be doing anything with individual Star data from within the database anyway, I doubt you'd be losing much.
Consider using a non-relational database or storage format, such as db4o or Cassandra. Instead of implementing a known database anti-pattern (the infamous "data dump"), use a tool that is actually designed for that kind of storage and access pattern.
Must you delete them? Often it is better to just set an IsDeleted bit column to 1, and then do the actual deletion asynchronously during off hours.
Edit:
This is a shot in the dark, but adding a clustered index on CategoryId may speed up deletes. It may also impact other queries adversely. Is this something you can test?
This was the old technique in SQL 2000 - partitioned views - and it remains a valid option for SQL 2005. The problem comes from having a large quantity of tables and the maintenance overheads associated with them.
As you say, partitioning is an enterprise feature, but is designed for this large scale data removal / rolling window effect.
One other option would be running batched deletes to avoid creating one very large transaction, instead creating hundreds of far smaller transactions, to avoid lock escalation and keep each transaction small.
Having separate tables is partitioning - you are just managing it manually and do not get any management assistance or unified access (without a view or partitioned view).
Is the cost of Enterprise Edition more expensive than the cost of separately building and maintaining a partitioning scheme?
Alternatives to the long-running delete also include populating a replacement table with identical schema and simply excluding the rows to be deleted and then swapping the table out with sp_rename.
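A rough sketch of that swap-out approach (T-SQL; the category value is just an example, and SELECT INTO does not copy indexes or constraints, so those would need re-creating on the new table):
-- Build a replacement table without the rows being removed.
SELECT *
INTO dbo.Stars_new
FROM dbo.Stars
WHERE CategoryID <> 50;
-- Swap the tables, then drop the old one.
BEGIN TRAN;
    EXEC sp_rename 'dbo.Stars', 'Stars_old';
    EXEC sp_rename 'dbo.Stars_new', 'Stars';
COMMIT;
DROP TABLE dbo.Stars_old;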
I don't understand why whole categories of stars are being deleted on a regular basis. Presumably you have new categories created all the time, which means your number of categories must be huge, and partitioning on that (manually or not) would be very intensive.
Maybe on the Stars table set the PK to non-clustered and add a clustered index on categoryid.
Other than that, is the server setup well done regarding best practices for performance? That is using separate physical disks for data and logs, not using RAID5, etc.
When you say deleting millions of rows is "too intense for SQL server", what do you mean? Do you mean that the log file grows too much during the delete?
All you should have to do is execute the delete in batches of a fixed size:
DECLARE @i INT
SET @i = 1
WHILE @i > 0
BEGIN
    DELETE TOP (10000) FROM dbo.SuperBigTable
    WHERE CategoryID = 743
    SELECT @i = @@ROWCOUNT
END
If your database is in full recovery mode, you will have to run frequent transaction log backups during this process so that it can reuse the space in the log. If the database is in simple mode, you shouldn't have to do anything.
My only other recommendation is to make sure that you have an appropriate index on CategoryId. I might even recommend that this be the clustered index.
If you want to optimize for deleting by category, a clustered composite index with the category in the first position might do more good than harm.
Also, you could describe the relationships on the table.
It sounds like the transaction log is struggling with the size of the delete. The transaction log grows in units, and this takes time whilst it allocates more disk space.
It is not possible to delete rows from a table without enlisting a transaction, although it is possible to truncate a table using the TRUNCATE command. However this will remove all rows in the table without condition.
I can offer the following suggestions:
Switch to a non-transactional database or possibly flat files. It doesn't sound like you need atomicity of a transactional database.
Attempt the following. After every x deletes (depending on size) issue the following statement
BACKUP LOG YourDatabase WITH TRUNCATE_ONLY;
This simply truncates the transaction log; the space remains for the log to refill. However, I'm not sure how much time this will add to the operation.
What do you do with the star data? If you only look at data for one category at any given time this might work, but it is hard to maintain. Every time you have a new category, you will have to build a new table. If you want to query across categories, it becomes more complex and possibly more expensive in terms of time. If you do this and do want to query across categories, a view is probably best (but do not pile views on top of views). If you are looking for data on a particular star, would you know which table to query? If not, then how are you going to determine which table, or are you going to query them all? When entering data, how will the application decide which table to put the data into? How many categories will there be? And incidentally, relating to each having a separate id, use the bigint identities and combine the identity with the category type for your unique identifier.
Truly do you need to delete the whole category or only the star that the data changed for?
And do you need to delete at all, maybe you only need to update information.
Have you tried deleting in batches (1,000 records or so at a time in a loop)? This is often much faster than deleting a million records in one delete statement, and it often keeps the table from getting locked during the delete as well.
Another technique is to mark the records for deletion. Then you can run a batch process when usage is low to delete those records, and your queries can run on a view that excludes the records marked for deletion.
Given your answers, I think your proposal may be reasonable.
I know this is a bit of a tangent, but is SQL Server (or any relational database) really a good tool for this job? What relation database features are you actually using?
If you are dropping whole categories at a time, you can't have much referential integrity depending on it. The data is read only, so you don't need ACID for data updates.
Sounds to me like you are using basic SELECT query features?
Just taking your idea of many tables - how could you actually realise it?
What about using dynamic queries:
Create a categories table with an identity category_id column.
Create an insert trigger on this table; in it, create a stars table whose name is built dynamically from category_id (see the sketch after this list).
Create a delete trigger; in it, drop the corresponding stars table, again via dynamically created SQL.
To select the stars of a specific category, use a function that returns a table. It takes category_id as a parameter and returns its result, also through a dynamic query.
To insert stars of a new category, first insert a new row into the categories table and then insert the stars into the appropriate table.
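A rough sketch of the insert-trigger idea from the list above (T-SQL; all names are made up, and the cursor over the inserted pseudo-table is just to handle multi-row inserts):
CREATE TRIGGER trg_categories_insert ON dbo.categories
AFTER INSERT
AS
BEGIN
    DECLARE @id int, @sql nvarchar(max);
    DECLARE c CURSOR LOCAL FAST_FORWARD FOR SELECT category_id FROM inserted;
    OPEN c;
    FETCH NEXT FROM c INTO @id;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Build and run a CREATE TABLE for this category's stars.
        SET @sql = N'CREATE TABLE dbo.stars_' + CAST(@id AS nvarchar(10)) +
                   N' (star_id bigint NOT NULL, star_name varchar(200));';
        EXEC sp_executesql @sql;
        FETCH NEXT FROM c INTO @id;
    END;
    CLOSE c;
    DEALLOCATE c;
END;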
Another direction I would research is using an xml-typed column for storing the stars data. The main idea is: if you only ever operate on stars by category, why not store all the stars of a given category in one cell of the table, in XML format? Unfortunately, I cannot begin to imagine what the performance of such a design would be.
Both of these variants are just brainstorming ideas.
As Cade pointed out, adding a table for each category is manually partitioning the data, without the benefits of the unified access.
Without the use of partitions, no deletion of millions of rows will ever happen as fast as dropping a table.
Therefore, it seems like using a separate table for each category may be a valid solution. However, since you've stated that some of these categories are kept, and some are deleted, here is a solution:
Create a new stars table for each new category.
Wait for the time period to expire where you decide whether the stars for the category are kept or not.
Roll the records into the main stars table if you plan on keeping them.
Drop the table.
This way, you will have a finite number of tables, depending on the rate you add categories and the time period where you decide if you want them or not.
Ultimately, for the categories that you keep, you're doubling the work, but the extra work is distributed over time. Inserts at the end of the clustered index may be less noticeable to users than deletes from the middle. However, for those categories that you're not keeping, you're saving tons of time.
Even if you're not technically saving work, perception is often the bigger issue.
I didn't get an answer to my comment on the original post, so I am going under some assumptions...
Here's my idea: use multiple databases, one for each category.
You can use the managed ESE database that ships with every version of Windows, for free.
Use the PersistentDictionary object, and keep track of the starid, starname pairs that way. If you need to delete a category, just delete the PersistentDictionary object for that category.
PersistentDictionary<int, string> starsForCategory = new PersistentDictionary<int, string>("Category1");
This will create a database called "Category1", on which you can use standard .NET dictionary methods (add, exists, foreach, etc).

SQL - Optimizing performance of bulk inserts and large joins?

I am doing ETL for log files into a PostgreSQL database, and want to learn more about the various approaches used to optimize performance of loading data into a simple star schema.
To put the question in context, here's an overview of what I do currently:
Drop all foreign key and unique constraints
Import the data (~100 million records)
Re-create the constraints and run analyze on the fact table.
Importing the data is done by loading from files. For each file:
1) Load the data from the file into a temporary table using COPY (the PostgreSQL bulk upload tool)
2) Update each of the 9 dimension tables with any new data, using an insert for each such as:
INSERT INTO host (name)
SELECT DISTINCT host_name FROM temp_table
EXCEPT
SELECT name FROM host;
ANALYZE host;
The analyze is run at the end of the INSERT with the idea of keeping the statistics up to date over the course of tens of millions of updates (Is this advisable or necessary? At minimum it does not seem to significantly reduce performance).
3) The fact table is then updated with an unholy 9-way join:
INSERT INTO event (time, status, fk_host, fk_etype, ... )
SELECT t.time, t.status, host.id, etype.id ...
FROM temp_table as t
JOIN host ON t.host_name = host.name
JOIN etype ON t.etype = etype.name
... and 7 more joins, one for each dimension table
Are there better approaches I'm overlooking?
I've tried several different approaches to normalizing data coming in from a source like this, and generally I've found the approach you're using now to be my choice. It's easy to follow, and minor changes stay minor. Trying to return the generated id from one of the dimension tables during stage 2 only complicated things and usually generates far too many small queries to be efficient for large data sets. Postgres should be very efficient with your "unholy join" in modern versions, and using "select distinct except select" works well for me. Other folks may know better, but I've found your current method to be my preferred method.
During stage 2 you know the primary key of each dimension you're inserting data into (after you've inserted it), but you're throwing this information away and rediscovering it in stage 3 with your "unholy" 9-way join.
Instead I'd recommend creating one sproc to insert into your fact table, e.g. insertXXXFact(...), which calls a number of other sprocs (one per dimension) following the naming convention getOrInsertXXXDim, where XXX is the dimension in question. Each of these sprocs will either look up or insert a new row for the given dimension (thus ensuring referential integrity), and should return the primary key of the dimension your fact table should reference. This will significantly reduce the work you need to do in stage 3, which is now reduced to a call of the form insert into XXXFact values (DimPKey1, DimPKey2, ... etc.)
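As a sketch of what one of those per-dimension helpers could look like in PostgreSQL (PL/pgSQL; it uses the host dimension from the question and assumes host has an integer id generated on insert):
CREATE OR REPLACE FUNCTION get_or_insert_host(p_name text)
RETURNS integer AS $$
DECLARE
    v_id integer;
BEGIN
    -- Look up the dimension row first; insert it only if it is missing.
    SELECT id INTO v_id FROM host WHERE name = p_name;
    IF v_id IS NULL THEN
        INSERT INTO host (name) VALUES (p_name) RETURNING id INTO v_id;
    END IF;
    RETURN v_id;
    -- (No concurrency handling here; a real version might retry on unique violation.)
END;
$$ LANGUAGE plpgsql;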
The approach we've adopted in our getOrInsertXXX sprocs is to insert a dummy value if one is not available and have a separate cleanse process to identify and enrich these values later on.