Checking Duplicates before inserting into SQL database - sql

So I've been doing some research and I need to write up an INSERT statement to insert unique client names into a table on my server. However the default standard of the database already has thousands of clients in it, and when inserting new clients we need to check if they already exist before attempting to add it to the system.
My question is what would be the best/fastest way to do this? Would it be better to run a simple select query on the clients table (ordered by ASC), and do a binary search or something on the results, or perhaps just do a SQL query similar to the one below?
IF NOT EXISTS (SELECT 1 FROM clients AS c WHERE c.clientname = ?)
BEGIN
INSERT INTO clients (clientname, address, ...)
VALUES (?, ?, ...)
END
Is this a slow statement? I may have to run the insert several hundred times per each submission.

The standard advice is to create a UNIQUE constraint if you want a given column to be unique.
ALTER TABLE clients ADD UNIQUE KEY (clientname);
Then try to do the INSERT, and it'll succeed if there is no matching row, and it'll fail if there is a duplicate. No SELECT is necessary.

It is not too uncommon to calculate the cost of a SQL in query in terms of disk operations (usually that means reading/writing a block (typically 8 KB) is the unit for your costs). (In Memory-DBs should change something about this line of thought).
If you have hundreds, possible thousands of items and each item is... Say 20 Bytes, then your full database will possibly fit in a single block on disk (400 items/block). Maybe it needs a couple more blocks, but hurray: It is a neglectable small number. With such a small database, your database will probably lounge around in your database' memory cache and you will only need to pay for write access.
As your database grows the number of block accesses that you need can be exponentially reduced if you have an index.
Both your solution and Bill's solution will not cause any write access if an item is already present in the database and thus it should both be equally fast.
The interesting part would be:
I may have to run the insert several hundred times per each
submission.
That would mean that you might write one and the same block on disk hundreds of times. It would be faster if you could do this in a single step. However, this is indeed a problem as I am not aware of any SQL function that allows this behavior. MySQL's INSERT offers a way to specify a number of values in a single statment. This MIGHT be a considerable plus (I don't know how smart MySQL handles this situation) but it is specific to MySQL and it is not portable.
Another way to speed things up is to not wait until the blocks you have changed are written to disk. This comes at the risk of loosing data without notice, but can be a significant performance boost. Again this is specific to the DBMS that you use. E.g. if you use MySQL with InnoDB you can set the option innodb_flush_log_at_trx_commit=0 in your my.ini to archieve this behaviour.
Would it be better to run a simple select query on the clients table
(ordered by ASC), and do a binary search or something on the results
This would needlessly copy large amounts of data from your DBMS to a client (which may possibly be on different machines, communicating over a network protocol). This would still be OK for your small DB, but it does not scale well. It may only be of use if it helps you to save data in a single operation to disk.

Related

Does it affect performance to frequently repopulate a highly read database table?

I have a database table with about 2500 rows in, which is frequently read by my web application. Will it affect the performance of reading from that table if all of the data in it is frequently (e.g. every 1-5 minutes) deleted and re-inserted?
By that I mean:
DELETE FROM MyTable
INSERT INTO MyTable SELECT ...
Probably not, at the given numbers ...
However, if you have one or more index(es) on your table (to help with read/select, or automatically on any PK/UK ...) you should consider that every delete/insert may result in re-calculation of any such index (on top of the delete/insert as such), not directly affecting table-reads as such, but adding to the overall load on the DB server.
There is no sourcecode, but it appears you are using this table as intermediate/interface to sth. else, so while 'updating' you'd probably want to make sure to bundle your delete(s)/insert(s) in transactions, best you can, rather than e.g. executing them all individually, like in a loop. Or see if you can keep your PKs and rather just update ...?
This could also help reduce fragmentation in the underlying storage ...

Speed-up SQL Insert Statements

I am facing an issue with an ever slowing process which runs every hour and inserts around 3-4 million rows daily into an SQL Server 2008 Database.
The schema consists of a large table which contains all of the above data and has a clustered index on a datetime field (by day), a unique index on a combination of fields in order to exclude duplicate inserts, and a couple more indexes on 2 varchar fields.
The typical behavior as of late, is that the insert statements get suspended for a while before they complete. The overall process used to take 4-5 mins and now it's usually well over 40 mins.
The inserts are executed by a .net service which parses a series of xml files, performs some data transformations and then inserts the data to the DB. The service has not changed at all, it's just that the inserts take longer than they use to.
At this point I'm willing to try everything. Please, let me know whether you need any more info and feel free to suggest anything.
Thanks in advance.
Sounds like you have exhausted the buffer pools ability to cache all the pages needed for the insert process. Append-style inserts (like with your date table) have a very small working set of just a few pages. Random-style inserts have basically the entire index as their working set. If you insert a row at a random location the existing page that row is supposed to be written to must be read first.
This probably means tons of disk seeks for inserts.
Make sure to insert all rows in one statement. Use bulk insert or TVPs. This allows SQL Server to optimize the query plan by sorting the inserts by key value making IO much more efficient.
This will, however, not realize a big speedup (I have seen 5x in similar situations). To regain the original performance you must bring the working set back into memory. Add RAM, purge old data, or partition such that you only need to touch very few partitions.
drop index's before insert and set them up on completion

What Are the Performance Differences Between Running One vs Many Inserts

I'm currently in a situation where I'm building a script that I know will need to insert multiple rows. I'm doing this in Perl, so in terms of parameterization, it's much easier to insert each row individually. In terms of speed, I'm guessing running just one insert statement will be faster (although latency will be relatively low as I'm quite close to the database itself). I'm thinking the number of rows per run of the script will be about 20-40 on average. That said, what would be the approximate performance differences between running just 1 INSERT INTO statement v.s. running one for each row? Note: The server is running SQL 2008.
[EDIT]Since there seems to be a lot of confusion, I'd like to clarify that what I'm really asking for is the theory behind how a multi-row insert is handled by SQL Server 2008. Does it essentially just convert it internally into a bunch of individual insert statements and run those over one connection, or does it do something more intelligent?
Yes, I know I can run timed loops. No, that's not what I'm asking for. [/EDIT]
Combining multiple inserts into one command is always going to execute much more quickly than executing separate inserts. The reasons are:
A lot of work is done parsing the SQL - with multi version, there's only one parsing effort
More work is done checking permissions - again, only done once
Database connections are "chatty" - with multi version, handshaking only done once. You really notice this issue when using a poor network connection
Finally, multi version gives opportunity for server to optimize the operation
There is a general idea to let the SQL database do its thing and not try to treat the database as some sort of disk read. I've seen many times where a developer will read from one table, then another, or do a general query and then run through each row to see if it's the one they want. Generally, it's better to let the SQL database do its thing.
In this case, I can't really see an advantage of doing a single vs. multiple row insert. I guess there might be some because you don't have to do multiple prepares, and commits.
It shouldn't be too difficult to actual create a temporary database and try this out. Create a database with two columns, and have the program generate data to toss into the tables. Give yourself a decent amount to do. For example, how many items will this table have? And, how many do you think you'll be inserting at once? Say create a table of 1,000,000 items, and insert into this table 1000 items at a time, 100 items at a time, and one item at a time. Just generate data using the increment operator. There may be a "sweetspot" of the number of items you can insert at once.
In my unbiased, and always correct opinion, you'll probably find that the difference isn't worth fretting over, and you should instead employ the method that makes your code the easiest to maintain.
I've have a programming dictum: The place where you want to optimize your code is probably the wrong place. We like efficiency, but we usually attack the wrong item. And, whatever we've squeezed out in terms of efficiency, we end up wasting in maintenance.
So, just program what is the easiest to understand and don't fret about being overly efficient.
Just to add a couple of other performance differentiators to think about on insertion:
Foreign Keys - If the table you are inserting into has foreign keys, SQL Server effectively needs to join to the foreign key tables on insert. When you do your inserts in one query, SQL server can be more efficient in doing these joins.
Transactions - As you don't mention transactions, I assume you must be using SQL Server auto-commit mode. With such a small number of rows, it is likely that the overhead of creating 40 transactions vs. 1 transaction would be higher than maintaining the log to allow rollback. However, if you were inserting 400000 rows, it would likely be more expensive to insert in one statement/transaction than insert 400000 separate rows as the cost to be prepared to roll back up to 400000 rows is very high (if you were to insert 400000 rows, it usually is best to insert in batches -> the optimal batch size can be determined through testing). Also, above a certain row count, it may become more efficient to disable the foreign keys, insert the rows, then re-enable them.

Is this INSERT likely to cause any locking/concurrency issues?

In an effort to avoid auto sequence numbers and the like for one reason or another in this particular database, I wondered if anyone could see any problems with this:
INSERT INTO user (label, username, password, user_id)
SELECT 'Test', 'test', 'test', COALESCE(MAX(user_id)+1, 1) FROM user;
I'm using PostgreSQL (but also trying to be as database agnostic as possible)..
EDIT:
There's two reasons for me wanting to do this.
Keeping dependency on any particular RDBMS low.
Not having to worry about updating sequences if the data is batch-updated to a central database.
Insert performance is not an issue as the only tables where this will be needed are set-up tables.
EDIT-2:
The idea I'm playing with is that each table in the database have a human-generated SiteCode as part of their key, so we always have a compound key. This effectively partitions the data on SiteCode and would allow taking the data from a particular site and putting it somewhere else (obviously on the same database structure). For instance, this would allow backing up of various operational sites onto one central database, but also allow that central database to have operational sites using it.
I could still use sequences, but it seems messy. The actual INSERT would look more like this:
INSERT INTO user (sitecode, label, username, password, user_id)
SELECT 'SITE001', 'Test', 'test', 'test', COALESCE(MAX(user_id)+1, 1)
FROM user
WHERE sitecode='SITE001';
If that makes sense..
I've done something similar before and it worked fine, however the central database in that case was never operational (it was more of a way of centrally viewing data / analyzing) so it did not need to generate ids.
EDIT-3:
I'm starting to think it'd be simpler to only ever allow the centralised database to be either active-only or backup-only, thus avoiding the problem completely and allowing a more simple design.
Oh well back to the drawing board!
There are a couple of points:
Postgres uses Multi-Version Concurrency Control (MVCC) so Readers are never waiting on writers and vice versa. But there is of course a serialization that happens upon each write. If you are going to load a bulk of data into the system, then look at the COPY command. It is much faster than running a large swab of INSERT statements.
The MAX(user_id) can be answered with an index, and probably is, if there is an index on the user_id column. But the real problem is that if two transactions start at the same time, they will see the same MAX(user_id) value. It leads me to the next point:
The canonical way of handling numbers like user_id's is by using SEQUENCE's. These essentially are a place where you can draw the next user id from. If you are really worried about performance on generating the next sequence number, you can generate a batch of them per thread and then only request a new batch when it is exhausted (sometimes called a HiLo sequence).
You may be wanting to have user_id's packed up nice and tight as increasing numbers, but I think you should try to get rid of that. The reason is that deleting a user_id will create a hole anyway. So i'd not worry too much if the sequences were not strictly increasing.
Yes, I can see a huge problem. Don't do it.
Multiple connections can get the EXACT SAME id at the same time. I was going to add "under load" but it doesn't even need to be - just need the right timing between two queries.
To avoid it, you can use transactions or locking mechanisms or isolation levels specific to each DB, but once we get to that stage, you might as well use the dbms-specific sequence/identity/autonumber etc.
EDIT
For question edit2, there is no reason to fear gaps in the user_id, so you have one sequence across all sites. If gaps are ok, some options are
use guaranteed update statements, such as (in SQL Server)
update tblsitesequenceno set #nextnum = nextnum = nextnum + 1
Multiple callers to this statement are each guaranteed to get a unique number.
use a single table that produces the identity/sequence/autonumber (db specific)
If you cannot have gaps at all, consider using a transaction mechanism that will restrict access while you are running the max() query. Either that or use a proliferation of (sequences/tables with identity columns/tables with autonumber) that you manipulate using dynamic SQL using the same technique for a single sequence.
By all means use a sequence to generate unique numbers. They are fast, transaction safe and reliable.
Any self-written implemention of a "sequence generator" is either not scalable for a multi-user environment (because you need to do heavy locking) or simply not correct.
If you do need to be DBMS independent, then create an abstraction layer that uses sequences for those DBMS that support them (Posgres, Oracle, Firebird, DB2, Ingres, Informix, ...) and a self written generator on those that don't.
Trying to create a system than is DBMS independent, simply means it will run equally slow on all systems if you don't exploit the advantages of each DBMS.
Your goal is a good one. Avoiding IDENTITY and AUTOINCREMENT columns means avoiding a whole plethora of administration problems. Here is just one example of the many.
However most responders at SO will not appreciate it, the popular (as opposed to technical) response is "always stick an Id AUTOINCREMENT column on everything that moves".
A next-sequential number is fine, all vendors have optimised it.
As long as this code is inside a Transaction, as it should be, two users will not get the same MAX()+1 value. There is a concept called Isolation Level which needs to be understood when coding Transactions.
Getting away from user_id and onto a more meaningful key such as ShortName or State plus UserNo is even better (the former spreads the contention, latter avoids the next-sequential contention altogether, relevant for high volume systems).
What MVCC promises, and what it actually delivers, are two different things. Just surf the net or search SO to view the hundreds of problems re PostcreSQL/MVCC. In the realm of computers, the laws of physics applies, nothing is free. MVCC stores private copies of all rows touched, and resolves collisions at the end of the Transaction, resulting in far more Rollbacks. Whereas 2PL blocks at the beginning of the Transaction, and waits, without the massive storage of copies.
most people with actual experience of MVCC do not recommend it for high contention, high volume systems.
The first example code block is fine.
As per Comments, this item no longer applies: The second example code block has an issue. "SITE001" is not a compound key, it is a compounded column. Do not do that, separate "SITE" and "001" into two discrete columns. And if "SITE" is a fixed, repeatingvalue, it can be eliminated.
Different users can have the same user_id, concurrent SELECT-statements will see the same MAX(user_id).
If you don't want to use a SEQUENCE, you have to use an extra table with a single record and update this single record every time you need a new unique id:
CREATE TABLE my_sequence(id INT);
BEGIN;
UPDATE my_sequence SET id = COALESCE(id, 0) + 1;
INSERT INTO
user (label, username, password, user_id)
SELECT 'Test', 'test', 'test', id FROM my_sequence;
COMMIT;
I agree with maksymko, but not because I dislike sequences or autoincrementing numbers, as they have their place. If you need a value to be unique throughout your "various operational sites" i.e. not only within the confines of the single database instance, a globally unique identifier is a robust, simple solution.

SQL Server Efficiently dropping a group of rows with millions and millions of rows

I recently asked this question:
MS SQL share identity seed amongst tables
(Many people wondered why)
I have the following layout of a table:
Table: Stars
starId bigint
categoryId bigint
starname varchar(200)
But my problem is that I have millions and millions of rows. So when I want to delete stars from the table Stars it is too intense on SQL Server.
I cannot use built in partitioning for 2005+ because I do not have an enterprise license.
When I do delete though, I always delete a whole category Id at a time.
I thought of doing a design like this:
Table: Star_1
starId bigint
CategoryId bigint constaint rock=1
starname varchar(200)
Table: Star_2
starId bigint
CategoryId bigint constaint rock=2
starname varchar(200)
In this way I can delete a whole category and hence millions of rows in O(1) by doing a simple drop table.
My question is, is it a problem to have hundreds of thousands of tables in your SQL Server? The drop in O(1) is extremely desirable to me. Maybe there's a completely different solution I'm not thinking of?
Edit:
Is a star ever modified once it is inserted? No.
Do you ever have to query across star categories? I never have to query across star categories.
If you are looking for data on a particular star, would you know which table to query? Yes
When entering data, how will the application decide which table to put the data into? The insertion of star data is done all at once at the start when the categoryId is created.
How many categories will there be? You can assume there will be infinite star categories. Let's say up to 100 star categories per day and up to 30 star categories not needed per day.
Truly do you need to delete the whole category or only the star that the data changed for? Yes the whole star category.
Have you tried deleting in batches? Yes we do that today, but it is not good enough.
od enough.
Another technique is mark the record for deletion? There is no need to mark a star as deleted because we know the whole star category is eligible to be deleted.
What proportion of them never get used? Typically we keep each star category data for a couple weeks but sometimes need to keep more.
When you decide one is useful is that good for ever or might it still need to be deleted later?
Not forever, but until a manual request to delete the category is issued.
If so what % of the time does that happen? Not that often.
What kind of disc arrangement are you using? Single filegroup storage and no partitioning currently.
Can you use sql enterprise ? No. There are many people that run this software and they only have sql standard. It is outside of their budget to get ms sql enterprise.
My question is, is it a problem to have hundreds of thousands of tables in your SQL Server?
Yes. It is a huge problem to have this many tables in your SQL Server. Every object has to be tracked by SQL Server as metadata, and once you include indexes, referential constraints, primary keys, defaults, and so on, then you are talking about millions of database objects.
While SQL Server may theoretically be able to handle 232 objects, rest assured that it will start buckling under the load much sooner than that.
And if the database doesn't collapse, your developers and IT staff almost certainly will. I get nervous when I see more than a thousand tables or so; show me a database with hundreds of thousands and I will run away screaming.
Creating hundreds of thousands of tables as a poor-man's partitioning strategy will eliminate your ability to do any of the following:
Write efficient queries (how do you SELECT multiple categories?)
Maintain unique identities (as you've already discovered)
Maintain referential integrity (unless you like managing 300,000 foreign keys)
Perform ranged updates
Write clean application code
Maintain any sort of history
Enforce proper security (it seems evident that users would have to be able to initiate these create/drops - very dangerous)
Cache properly - 100,000 tables means 100,000 different execution plans all competing for the same memory, which you likely don't have enough of;
Hire a DBA (because rest assured, they will quit as soon as they see your database).
On the other hand, it's not a problem at all to have hundreds of thousands of rows, or even millions of rows, in a single table - that's the way SQL Server and other SQL RDBMSes were designed to be used and they are very well-optimized for this case.
The drop in O(1) is extremely desirable to me. Maybe there's a completely different solution I'm not thinking of?
The typical solution to performance problems in databases is, in order of preference:
Run a profiler to determine what the slowest parts of the query are;
Improve the query, if possible (i.e. by eliminating non-sargable predicates);
Normalize or add indexes to eliminate those bottlenecks;
Denormalize when necessary (not generally applicable to deletes);
If cascade constraints or triggers are involved, disable those for the duration of the transaction and blow out the cascades manually.
But the reality here is that you don't need a "solution."
"Millions and millions of rows" is not a lot in a SQL Server database. It is very quick to delete a few thousand rows from a table of millions by simply indexing on the column you wish to delete from - in this case CategoryID. SQL Server can do this without breaking a sweat.
In fact, deletions normally have an O(M log N) complexity (N = number of rows, M = number of rows to delete). In order to achieve an O(1) deletion time, you'd be sacrificing almost every benefit that SQL Server provides in the first place.
O(M log N) may not be as fast as O(1), but the kind of slowdowns you're talking about (several minutes to delete) must have a secondary cause. The numbers do not add up, and to demonstrate this, I've gone ahead and produced a benchmark:
Table Schema:
CREATE TABLE Stars
(
StarID int NOT NULL IDENTITY(1, 1)
CONSTRAINT PK_Stars PRIMARY KEY CLUSTERED,
CategoryID smallint NOT NULL,
StarName varchar(200)
)
CREATE INDEX IX_Stars_Category
ON Stars (CategoryID)
Note that this schema is not even really optimized for DELETE operations, it's a fairly run-of-the-mill table schema you might see in SQL server. If this table has no relationships, then we don't need the surrogate key or clustered index (or we could put the clustered index on the category). I'll come back to that later.
Sample Data:
This will populate the table with 10 million rows, using 500 categories (i.e. a cardinality of 1:20,000 per category). You can tweak the parameters to change the amount of data and/or cardinality.
SET NOCOUNT ON
DECLARE
#BatchSize int,
#BatchNum int,
#BatchCount int,
#StatusMsg nvarchar(100)
SET #BatchSize = 1000
SET #BatchCount = 10000
SET #BatchNum = 1
WHILE (#BatchNum <= #BatchCount)
BEGIN
SET #StatusMsg =
N'Inserting rows - batch #' + CAST(#BatchNum AS nvarchar(5))
RAISERROR(#StatusMsg, 0, 1) WITH NOWAIT
INSERT Stars2 (CategoryID, StarName)
SELECT
v.number % 500,
CAST(RAND() * v.number AS varchar(200))
FROM master.dbo.spt_values v
WHERE v.type = 'P'
AND v.number >= 1
AND v.number <= #BatchSize
SET #BatchNum = #BatchNum + 1
END
Profile Script
The simplest of them all...
DELETE FROM Stars
WHERE CategoryID = 50
Results:
This was tested on an 5-year old workstation machine running, IIRC, a 32-bit dual-core AMD Athlon and a cheap 7200 RPM SATA drive.
I ran the test 10 times using different CategoryIDs. The slowest time (cold cache) was about 5 seconds. The fastest time was 1 second.
Perhaps not as fast as simply dropping the table, but nowhere near the multi-minute deletion times you mentioned. And remember, this isn't even on a decent machine!
But we can do better...
Everything about your question implies that this data isn't related. If you don't have relations, you don't need the surrogate key, and can get rid of one of the indexes, moving the clustered index to the CategoryID column.
Now, as a rule, clustered indexes on non-unique/non-sequential columns are not a good practice. But we're just benchmarking here, so we'll do it anyway:
CREATE TABLE Stars
(
CategoryID smallint NOT NULL,
StarName varchar(200)
)
CREATE CLUSTERED INDEX IX_Stars_Category
ON Stars (CategoryID)
Run the same test data generator on this (incurring a mind-boggling number of page splits) and the same deletion took an average of just 62 milliseconds, and 190 from a cold cache (outlier). And for reference, if the index is made nonclustered (no clustered index at all) then the delete time only goes up to an average of 606 ms.
Conclusion:
If you're seeing delete times of several minutes - or even several seconds then something is very, very wrong.
Possible factors are:
Statistics aren't up to date (shouldn't be an issue here, but if it is, just run sp_updatestats);
Lack of indexing (although, curiously, removing the IX_Stars_Category index in the first example actually leads to a faster overall delete, because the clustered index scan is faster than the nonclustered index delete);
Improperly-chosen data types. If you only have millions of rows, as opposed to billions, then you do not need a bigint on the StarID. You definitely don't need it on the CategoryID - if you have fewer than 32,768 categories then you can even do with a smallint. Every byte of unnecessary data in each row adds an I/O cost.
Lock contention. Maybe the problem isn't actually delete speed at all; maybe some other script or process is holding locks on Star rows and the DELETE just sits around waiting for them to let go.
Extremely poor hardware. I was able to run this without any problems on a pretty lousy machine, but if you're running this database on a '90s-era Presario or some similar machine that's preposterously unsuitable for hosting an instance of SQL Server, and it's heavily-loaded, then you're obviously going to run into problems.
Very expensive foreign keys, triggers, constraints, or other database objects which you haven't included in your example, which might be adding a high cost. Your execution plan should clearly show this (in the optimized example above, it's just a single Clustered Index Delete).
I honestly cannot think of any other possibilities. Deletes in SQL Server just aren't that slow.
If you're able to run these benchmarks and see roughly the same performance I saw (or better), then it means the problem is with your database design and optimization strategy, not with SQL Server or the asymptotic complexity of deletions. I would suggest, as a starting point, to read a little about optimization:
SQL Server Optimization Tips (Database Journal)
SQL Server Optimization (MSDN)
Improving SQL Server Performance (MSDN)
SQL Server Query Processing Team Blog
SQL Server Performance (particularly their tips on indexes)
If this still doesn't help you, then I can offer the following additional suggestions:
Upgrade to SQL Server 2008, which gives you a myriad of compression options that can vastly improve I/O performance;
Consider pre-compressing the per-category Star data into a compact serialized list (using the BinaryWriter class in .NET), and store it in a varbinary column. This way you can have one row per category. This violates 1NF rules, but since you don't seem to be doing anything with individual Star data from within the database anyway anyway, I doubt you'd be losing much.
Consider using a non-relational database or storage format, such as db4o or Cassandra. Instead of implementing a known database anti-pattern (the infamous "data dump"), use a tool that is actually designed for that kind of storage and access pattern.
Must you delete them? Often it is better to just set an IsDeleted bit column to 1, and then do the actual deletion asynchronously during off hours.
Edit:
This is a shot in the dark, but adding a clustered index on CategoryId may speed up deletes. It may also impact other queries adversely. Is this something you can test?
This was the old technique in SQL 2000 , partitioned views and remains a valid option for SQL 2005. The problem does come in from having large quantity of tables and the maintenance overheads associated with them.
As you say, partitioning is an enterprise feature, but is designed for this large scale data removal / rolling window effect.
One other option would be running batched deletes to avoid creating 1 very large transaction, creating hundreds of far smaller transactions, to avoid lock escalations and keep each transaction small.
Having separate tables is partitioning - you are just managing it manually and do not get any management assistance or unified access (without a view or partitioned view).
Is the cost of Enterprise Edition more expensive than the cost of separately building and maintaining a partitioning scheme?
Alternatives to the long-running delete also include populating a replacement table with identical schema and simply excluding the rows to be deleted and then swapping the table out with sp_rename.
I'm not understanding why whole categories of stars are being deleted on a regular basis? Presumably you are having new categories created all the time, which means your number of categories must be huge and partitioning on (manually or not) that would be very intensive.
Maybe on the Stars table set the PK to non-clustered and add a clustered index on categoryid.
Other than that, is the server setup well done regarding best practices for performance? That is using separate physical disks for data and logs, not using RAID5, etc.
When you say deleting millions of rows is "too intense for SQL server", what do you mean? Do you mean that the log file grows too much during the delete?
All you should have to do is execute the delete in batches of a fixed size:
DECLARE #i INT
SET #i = 1
WHILE #i > 0
BEGIN
DELETE TOP 10000 FROM dbo.SuperBigTable
WHERE CategoryID = 743
SELECT #i = ##ROWCOUNT
END
If your database is in full recovery mode, you will have to run frequent transaction log backups during this process so that it can reuse the space in the log. If the database is in simple mode, you shouldn't have to do anything.
My only other recommendation is to make sure that you have an appropriate index in CategoryId. I might even recommend that this be the clustered index.
If you want to optimize on a category delete clustered composite index with category at the first place might do more good than damage.
Also you could describe the relationships on the table.
It sounds like the transaction log is struggling with the size of the delete. The transaction log grows in units, and this takes time whilst it allocates more disk space.
It is not possible to delete rows from a table without enlisting a transaction, although it is possible to truncate a table using the TRUNCATE command. However this will remove all rows in the table without condition.
I can offer the following suggestions:
Switch to a non-transactional database or possibly flat files. It doesn't sound like you need atomicity of a transactional database.
Attempt the following. After every x deletes (depending on size) issue the following statement
BACKUP LOG WITH TRUNCATE_ONLY;
This simply truncates the transaction log, the space remains for the log to refill. However Im not sure howmuch time this will add to the operation.
What do you do with the star data? If you only look at data for one category at any given time this might work, but it is hard to maintain. Every time you have a new category, you will have to build a new table. If you want to query across categories, it becomes more complex and possibly more expensive in terms of time. If you do this and do want to query across categories a view is probably best (but do not pile views on top of views). If you are looking for data on a particular star, would you know which table to query? If not then how are you going to determine which table or are you goign to query them all? When entering data, how will the application decide which table to put the data into? How many categories will there be? And incidentally relating to each having a separate id, use the bigint identities and combine the identity with the category type for your unique identifier.
Truly do you need to delete the whole category or only the star that the data changed for?
And do you need to delete at all, maybe you only need to update information.
Have you tried deleting in batches (1000 records or so at a time in a loop). This is often much faster than deleting a million records in one delete statement. It often keeps the table from getting locked during the delete as well.
Another technique is mark the record for deletion. Then you can run a batch process when usage is low to delete those records and your queries can run on a view that excludes the records marked for deletion.
Given your answers, I think your proposal may be reasonable.
I know this is a bit of a tangent, but is SQL Server (or any relational database) really a good tool for this job? What relation database features are you actually using?
If you are dropping whole categories at a time, you can't have much referential integrity depending on it. The data is read only, so you don't need ACID for data updates.
Sounds to me like you are using basic SELECT query features?
Just taking your idea of many tables - how can you realise that...
What about using dynamic queries.
create the table of categories that have identity category_id column.
create the trigger on insert for this tale - in it create table for stars with the name dynamically made from category_id.
create the trigger on delete - in it drop the corresponding stars table also with the help of dynamically created sql.
to select stars of concrete category you can use function that returns table. It will take category_id as a parameter and return result also through dynamic query.
to insert stars of new category you firstly insert new row in categories table and then insert stars to appropriate table.
Another direction in which I would make some researches is using xml typed column for storing stars data. The main idea here is if you need to operate stars only by categories than why not to store all stars of concrete category in one cell of the table in xml format. Unfortunately I absolutely cannot imaging what will be the performance of such decision.
Both this variants are just like ideas in brainstorm.
As Cade pointed out, adding a table for each category is manually partitioning the data, without the benefits of the unified access.
There will never be any deletions for millions of rows that happen as fast as dropping a table, without the use of partitions.
Therefore, it seems like using a separate table for each category may be a valid solution. However, since you've stated that some of these categories are kept, and some are deleted, here is a solution:
Create a new stars table for each new
category.
Wait for the time period to expire where you decide whether the stars for the category are kept or not.
Roll the records into the main stars table if you plan on keeping them.
Drop the table.
This way, you will have a finite number of tables, depending on the rate you add categories and the time period where you decide if you want them or not.
Ultimately, for the categories that you keep, you're doubling the work, but the extra work is distributed over time. Inserts to the end of the clustered index may be experienced less by the users than deletes from the middle. However, for those categories that you're not keeping, you're saving tons of time.
Even if you're not technically saving work, perception is often the bigger issue.
I didn't get an answer to my comment on the original post, so I am going under some assumptions...
Here's my idea: use multiple databases, one for each category.
You can use the managed ESE database that ships with every version of Windows, for free.
Use the PersistentDictionary object, and keep track of the starid, starname pairs that way. If you need to delete a category, just delete the PersistentDictionary object for that category.
PersistentDictionary<int, string> starsForCategory = new PersistentDictionary<int, string>("Category1");
This will create a database called "Category1", on which you can use standard .NET dictionary methods (add, exists, foreach, etc).