Your Opinion? - linking tables by LinkTypeId - sql

I have a web app that links files and notes to customers, users, projects, etc. Initially, I had tables such as customerNote, userNote, projectNote...etc.
Design considerations:
1: I don't want to manage N squared tables (over 100 customerNote, userNote, projectNote,... customerFile, projectFile...etc tables)
2: I don't want to use dynamic SQL (TableName from LinkType)
3: I don't see a clean way to use linktables (without 100+ N squared linktables)
Now, I have one Note table that has LinkId and LinkTypeId. LinkId of course, is the PK of the client|User|Project|etc tables; and LinkTypeId points to the type of link.
So this:
SELECT * FROM customerNote WHERE Id = 1210
SELECT * FROM userNote WHERE Id = 3281
Has now become this:
SELECT * FROM Note WHERE LinkId = 1210 AND LinkTypeId = 2 -- 2 = customer
SELECT * FROM Note WHERE LinkId = 3281 AND LinkTypeId = 3 -- 3 = user
I enjoy the simplicity of this approach, and I have wrapped them into functions that I call all over the place.
My questions are:
1: Without referential integrity what performance or other issues would I have?
2: Does this cause scalability problems?
3: Is there an elegant solution?
This is my first SO post, and I thank you all in advance for your help.

1: Without referential integrity what performance or other issues would I have?
You'll potentially have all the problems referential integrity is intended to eliminate. You'll have to either live with those problems, or implement an imitation of all the referential integrity constraints in application code or through administrative procedures (like reports).
You'll also greatly increase the size of your "notes" table. 100 thousand rows in each of 100 tables of free-form notes is pretty manageable. But 10 million rows in one table of notes might make you reconsider whether life is worth living.
Implementing an imitation of integrity constraints in application code means that, sooner or later, somebody (probably you) will side-step the application, and change rows through the dbms command-line client or gui client. You can do a lot of damage that way. The sane thing to do is to checkpoint or dump the database before you take such a risk, but 10 million rows of notes makes it less likely that you'll do that.
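One of the "reports" mentioned above could be a periodic orphan check. This is only a minimal sketch: it assumes the customer table is called Customer with primary key Id (neither name appears in the question) and that LinkTypeId = 2 means customer, as in the question.

```sql
-- Notes that claim to belong to a customer row that no longer exists.
-- Customer and Customer.Id are assumed names, used for illustration only.
SELECT n.LinkId, n.LinkTypeId
FROM Note AS n
LEFT JOIN Customer AS c
       ON c.Id = n.LinkId
WHERE n.LinkTypeId = 2
  AND c.Id IS NULL;
```

A similar query is needed for every link type, which is exactly the upkeep a real foreign key makes unnecessary.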
2: Does this cause scalability problems?
It can. If you have 100 separate notes tables, each of them can grow to 100,000 rows and still be fast to query. Put them all in one table, and now you've got 10 million rows. And because they're notes, fewer will fit on a page. That usually means slower speed. With this design, the single table of notes becomes a cold spot (or hot spot, depending on how you look at it), slowing every table that uses notes, not just one or two heavily annotated tables.
And after living with slower speed on all tables for a couple of months, you're likely to split that monster table up into the original tables again.
3: Is there an elegant solution?
If every note is supposed to have the same maximum length--a pretty unlikely requirement for 100 notes tables--then create a domain for the notes, and create one table of notes for each annotated table.
create domain note_text as varchar(1000) not null;
create table user_notes (
  user_id integer not null references users (user_id) on delete cascade,
  note_timestamp timestamp not null default current_timestamp,
  user_note note_text,
  primary key (user_id, note_timestamp)
);
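Each additional annotated table then repeats the same shape and reuses the domain. For instance, assuming a customers table keyed by customer_id (hypothetical names, not taken from the question), it might look like this:

```sql
-- Same pattern as user_notes; customers and customer_id are assumed names.
create table customer_notes (
  customer_id integer not null references customers (customer_id) on delete cascade,
  note_timestamp timestamp not null default current_timestamp,
  customer_note note_text,
  primary key (customer_id, note_timestamp)
);
```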
As an aside, you need to be really careful about allowing users to annotate rows. They will often (usually?) use note columns in lieu of putting data where it belongs. For example, if you have a table of users' phone numbers, then a note column will almost certainly end up with data like this.
Call 123-456-7890 between 8:00 am and 5:00 pm. (And that will match
none of this user's phone numbers.)
Toll-free orders at 1-800-123-4567.
He eats lunch at McDonald's on Tuesdays.


Why database designers do not make IDENTITY columns start from the min value rather than 1?

As we know, in SQL Server, IDENTITY(n, m) means that the values start at n and increase by m. But I have noticed that database designers almost always declare identity columns as IDENTITY(1, 1), without taking advantage of the full range of the int data type, which runs from -2,147,483,648 to 2,147,483,647.
I am planning to declare all identity columns as IDENTITY(-2147483648, 1) (the identity columns are hidden from the application user).
Is that a good idea?
If you find that 2 billion values isn't enough, you're going to find out that 4 billion isn't enough either (needing more than twice as many of anything over the lifetime of a project than it was first designed for is hardly rare*), so you need to take a different approach entirely (possibly long values, possibly something totally different).
Otherwise you're just being strange and unreadable for no gain.
Also, who doesn't have a database where they know that e.g. item 312 is the one with some nice characteristics for testing particular things? I know I have some arbitrary ids burned in my head. They may call it "so good they named it twice", but I'll always know New York as "city 657, covers most of our test cases". It's only a shorthand, but -2147482991 wouldn't be as handy.
*To add a bit to that: with some things you might say "ah, about 100" and find it's actually 110. Okay. With others you'll find it's actually 100,000, and you were out by orders of magnitude. The higher the number, the more often the mistake is of this sort, because the problems that end up with estimates in the billions are different from those that end up with answers in the dozens. If you estimate 200 is your max in a given case, you should probably leave room for maybe a few hundred more. If you estimate 2 billion in a given case, you should probably leave room for a few quadrillion more. That said, the only time I saw someone actually start an id at minus 2 billion, they ended up having about 3,000 rows.
If you have a class that represents your table in your code (which is very likely), every new object you create will be assigned the ID 0 by default. That could lead to mistakes that overwrite data in the database if the ID 0 is already assigned. Keeping 0 unused also makes it easy to determine whether an object is new or came from the database, just by checking if (myObject.ID != 0).
On the SQL Server side, negative IDs are OK; they are handled just like positive numbers, so you could do that.
The others are right that you should consider a different approach, but the major problem is the applications connected to your database.
Take the MS environment as an example:
The .NET DataSet uses negative IDs on auto-incremented ID columns to track changes in code, so you may run into trouble, because:
The negative keys are used for temporary instances of the rows.
Here is the reference: MSDN
So it is definitely not a good idea to design a database like this for MSSQL in an MS environment.
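For what it's worth, the seed itself is easy to try out. A small sketch follows; the table name dbo.SeedDemo is made up for illustration, and the multi-row VALUES syntax needs SQL Server 2008 or later.

```sql
-- Hypothetical demo table seeded at the minimum int value.
CREATE TABLE dbo.SeedDemo
(
    Id    int IDENTITY(-2147483648, 1) NOT NULL PRIMARY KEY,
    Label varchar(20) NOT NULL
);

INSERT INTO dbo.SeedDemo (Label) VALUES ('first'), ('second');

-- The first row gets Id = -2147483648, the second -2147483647.
SELECT Id, Label FROM dbo.SeedDemo ORDER BY Id;
```

This only shows that the server accepts the seed; it says nothing about how client libraries treat negative keys.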
One of the 'nice' side effects of working with integers close to zero is that they are easy on the eye and easy for devs, testers etc to remember, especially with debugging and unit testing in mind.
Also, surrogate keys have a nasty habit of creeping into business terminology, e.g. users may be able to see the PK sitting in the URL querystring at the top of the browser - the more digits in the number, the more likely they are to misquote something in a helpdesk query.
So this is one of the reasons why I'm quite happy to seed my identities at 1, and not at -2147483231, and will instead, as @Jon suggests, move up to a BIGINT if I ever need more than 2 billion rows in my table.
As you transition from negative to positive ids, you will cross zero. That means (assuming you are actually inserting a couple of billion rows) that you will eventually have an identifier of zero. This is not intrinsically bad, but it could be an edge case for ORM tools or for sloppy application code that has difficulty differentiating between a zero and a null.
Negative IDs can be useful for testing in a live environment where dummy data needs to be mixed with real data for testing purposes, then disposed of once the testing is complete. This should be done only with very good reason - I've only used the technique once.
Negative IDs can also be useful for Administrator purposes in single-row, read-only tables (i.e., no transactions, no executable SQL run on the table).
Aside from those specific purposes, identity values <= 0 will generate more heat than light.
And just to add: according to the MS docs (https://learn.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/implementing-identity-in-a-memory-optimized-table?view=sql-server-2014), IDENTITY(1, 1) is supported on a memory-optimized table. However, identity columns defined as IDENTITY(x, y) where x != 1 or y != 1 are NOT supported on memory-optimized tables.
So in my opinion, and for the reasons given by the other users, IDENTITY(1, 1) is the more practical choice, and on memory-optimized tables it is the only supported one.
Start with INT IDENTITY(1, 1); then, when you have maxed out, you can follow these steps to upgrade to BIGINT:
Drop the current PK constraint:
ALTER TABLE [dbo].[tbName] DROP CONSTRAINT [PK_tbName_Id]
GO
Upgrade the PK column to BIGINT:
ALTER TABLE [dbo].[tbName] ALTER COLUMN [Id] BIGINT NOT NULL
GO
Recreate the PK constraint:
ALTER TABLE [dbo].[tbName] ADD CONSTRAINT [PK_tbName_Id] PRIMARY KEY CLUSTERED
(
    [Id] ASC
)
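To judge how close a table is to the INT limit before starting the steps above, something like the following may help. It is a sketch that reuses the placeholder table name dbo.tbName from the steps and relies on the built-in IDENT_CURRENT function.

```sql
-- Last identity value generated for the table, and how many int values are left.
SELECT IDENT_CURRENT('dbo.tbName')              AS LastIdentityValue,
       2147483647 - IDENT_CURRENT('dbo.tbName') AS ValuesRemaining;
```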
Hope this helps, and good luck!

Database Design: Partitioned Table vs Normalized Table

I have two tables: tblIssue and tblIssueSubscriber for my newsletter application.
This is my normalized design:
tblIssues (newsletter issues masterlist)
--------------------
IssueId int PK
PublisherCode varchar(10)
IssueDesc varchar(50)
tblIssueSubscribers (newsletter subscribers)
-----------------
IssueId int FK
EmailAddress varchar(100)
But tblIssueSubscriber is expected to hold hundreds of thousands or even millions of records per week, and it will be accessed frequently; that's why I'm leaning towards table partitioning. My design is to partition tblIssueSubscriber by PublisherCode (we have 8 publisher codes on our masterlist).
tblIssues
--------------
IssueId int PK
PublisherCode varchar(10)
IssueDesc varchar(50)
tblIssueSubscribers
-----------------
IssueId int FK
PublisherCode varchar(10)
EmailAddress varchar(100)
and then partitioned it per PublisherCode
CREATE PARTITION FUNCTION [PartitionPublisher] (varchar(10)) AS RANGE RIGHT FOR VALUES ('PUBLISHER1', 'PUBLISHER2', 'PUBLISHER3', 'PUBLISHER4', 'PUBLISHER5', 'PUBLISHER6', 'PUBLISHER7', 'PUBLISHER8');
I know that table partitioning adds complexity, so my question is: is it worth partitioning tblIssueSubscriber, or should I stick to the normalized design?
First, I think size is a red herring. It's not a very useful argument, since all size is relative and there are reasons to use partitioning irrespective of size.
Performance is only part of the reason. Ronnis makes some good points but it doesn't stop there.
There are two reasons to partition a table. One is performance, one is maintenance.
Let's start with maintenance.
In general DELETE is a 'bad' thing to do in a database. Say you mistakenly insert 1 million rows and then delete 1 million rows. Each of those deletes is logged, generating UNDO and REDO records, which waste space and take time not only to write while deleting but again when 'played' for a point-in-time recovery. So what's better than delete? Truncate (or drop). When you have tables like the ones you describe, which are constantly growing, at some point you'd like to get rid of old records. This is why I say size is irrelevant: if you want to keep a year in that table, you'll need to remove records that are more than 12 months old - NO MATTER WHAT THAT SIZE IS. You could have a 300 MB table or a 500 GB table after 1 year of adding records; either way you'll need/want to start deleting. So you can always just delete the rows with insert_dt < sysdate - 365. Or you could just drop or truncate that month's or day's partition, which generates far less logging than deleting the rows one by one and is much less resource intensive.
There are other maintenance benefits, like individually backing up partitions, rebuilding indexes, or moving to new tablespaces. Not sure what RDBMS you're using, but you can load data via partition swaps in most. This allows you to make no changes to your final tables until all of the data is loaded and ready to go.
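The answer's example is date-based, but the same contrast can be sketched with the publisher partitioning from the question. This is T-SQL and rests on assumptions: tblIssueSubscribers has actually been created on a partition scheme built from PartitionPublisher, and the server is SQL Server 2016 or later, where partition-level TRUNCATE is available.

```sql
-- Option 1: ordinary delete; every removed row is written to the log.
DELETE FROM dbo.tblIssueSubscribers
WHERE PublisherCode = 'PUBLISHER3';

-- Option 2: first find the partition that holds that publisher...
SELECT $PARTITION.PartitionPublisher('PUBLISHER3') AS PartitionNumber;

-- ...then truncate only that partition.
-- 4 is what the query above returns for the RANGE RIGHT function in the question.
TRUNCATE TABLE dbo.tblIssueSubscribers
WITH (PARTITIONS (4));
```

On older versions the usual equivalent is to switch the partition out to an empty staging table and truncate or drop that table.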
As far as performance goes...
The key here is that any query that doesn't include the partition key in the WHERE clause will most likely perform worse than it did before partitioning. This isn't a GO_FASTER = TRUE type of setting. I've seen people implement partitioning and crush their systems. Ronnis' post covers the basic performance guidelines for a single partitioned table. If you have more than one table partitioned on the same key, some RDBMSs can parallelize the joins between them.
The query patterns will determine whether you will benefit from partitioning.
If your application is mostly about single row queries (typically primary key or indexed access), you will not see a performance gain from partitioning the table.
If your application is mostly about processing all the data publisher-wise, then you would benefit from partitioning by eliminating larger parts of the table when performing table scans.
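To make the two cases concrete, here is a sketch against the partitioned tblIssueSubscribers from the question (the email address is a made-up example value):

```sql
-- Filters on the partition key: only the matching partition needs to be read.
SELECT COUNT(*)
FROM dbo.tblIssueSubscribers
WHERE PublisherCode = 'PUBLISHER3';

-- No partition key in the predicate: every partition has to be checked.
SELECT IssueId
FROM dbo.tblIssueSubscribers
WHERE EmailAddress = 'someone@example.com';
```

Comparing the execution plans of the two statements should show whether partitions are actually being eliminated for your workload.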
It really depends on how large that database file is going to become and how many records you are going to have in there and what machine you are using. Do a rough calculation of how large you think it will become.
Roughly, let's say that the database file will grow to 300 MB?
That is nothing... I would personally not partition it. I know some of our database clients who use partitioning, and they started partitioning when they expected the database to grow beyond 500 GB and that it ultimately may reach 4 TB. In that case, yes partition. But I suspect you are not going to go anywhere near that.
Plus, you can always partition later, no?
I would recommend a 64-bit machine, running Linux or Windows server 2008/Win7. And more memory is always good.

Continuation - Viewing FIRST_ROWS before query completes

I have identified the query constructs my users normally use. Would it make sense for me to create composite indexes to support those constructs and provide FIRST_ROWS capability?
If I migrate from SE to IDS, I will lose the ability to write low-level functions with C-ISAM calls, but gain FIRST_ROWS along with other goodies like: SET-READS for index scans (onconfig USE_[KO]BATCHEDREAD), optimizer directives, parallel queries, etc.
Information from Comments
Pawnshop production tables are queried by customer.name char(30) using wildcards (LASSURF* to find LASTNAME SURNAME, FIRSTNAME) or by pawns.ticket_number INT. Customer and pawns are joined by customer.name = pawns.name, not customer.serial = pawns.fk. Pawns with a trx date older than 1 year are moved to a historical table (>500K nrows) in a different database, on another hard disk. The index on historical is by trx_date descending. This is where the ad-hoc composite query constructs come into play.
Once a customer's pawn transaction is found, the row is updated when an interest or redeem payment is made by the customer. If customers don't make a payment in 90 days, users will manually update which pawns they will forfeit. pawns.status changes to inactive when a customer redeems a pawn or forfeits it for lack of payment. Inactives are moved out of the pawns table into the historical table when their trx dates are older than 1 year, so no mass-updating occurs in this app. Pawnshops run this proc every morning before opening for business.
{ISQL 2.10.06E (SE-DOS16M protected mode) pawns table optimization -
once-daily, before start of business, procedure}
unload to "U:\UNL\ACTIVES.UNL"
select * from pawns where pawns.status = "A"
order by pawns.cust_name, pawns.trx_date;
unload to "U:\UNL\INACTIVE.UNL"
select * from pawns
where pawns.status <> "A"
and pawns.trx_date >= (today - 365)
order by pawns.cust_name, pawns.trx_date desc;
unload to "U:\UNL\HISTORIC.UNL"
select * from pawns
where pawns.status <> "A"
and pawns.trx_date < (today - 365)
order by pawns.trx_date desc;
drop table pawns;
create table pawns
(
trx_num serial,
cust_name char(30),
status char(1),
trx_date date,
. . . ) in "S:\PAWNSHOP.DBS\PAWNS";
load from "U:\UNL\ACTIVES.UNL" insert into pawns; {500:600 nrows avg.}
load from "U:\UNL\INACTIVE.UNL" insert into pawns; {6500:7000 nrows avg.}
load from "U:\UNL\HISTORIC.UNL" insert into dss:historic; {>500K nrows}
create cluster index pa_cust_idx on pawns (cust_name);
{this groups each customers pawns together, actives in
oldest trx_date order first, then inactive pawns within the last year in most
recent trx_date order. inactives older than 1 year are loaded into historic
table in a separate database, on a separate hard disk. historic table
optimization is done on a weekly basis for DSS queries.}
create unique index pa_trx_num_idx on pawns (trx_num);
create index pa_trx_date_idx on pawns (trx_date);
create index pa_status_idx on pawns (status);
{grant statements...}
update statistics;
There isn't a simple yes/no answer - it is a balancing act, as with so many performance issues.
There are two main costs associated with indexes which must be balanced against the benefits.
Indexes must be maintained as rows are added, deleted, modified in the table. The cost is not huge, but neither is it negligible.
Indexes occupy disk space.
There is also a small overhead when queries are optimized simply because there are more indexes to consider.
The primary benefit of good indexes is vastly improved performance on selecting data when the index can be used to good effect.
If your tables are not very volatile and are frequently searched with criteria where the indexes can help, then it probably makes sense to create the composite indexes, assuming that disk space is not an issue.
If your tables are very volatile, or if a specific index will seldom be used (but is beneficial on those few occasions when it is used), then you should perhaps weigh the almost one-off cost of a slower query against the cost of storing and maintaining the index for those few occasions when it can be used.
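As a concrete illustration of such a composite index, here is a sketch in Informix SQL. It assumes the historic table has the same cust_name and trx_date columns as pawns (it is loaded from a select * on pawns), and the index name is made up.

```sql
{ Hypothetical composite index: customer name first, newest transactions first. }
create index hi_name_date_idx on historic (cust_name, trx_date desc);

{ A query shaped like the users' searches, which this index could serve. }
select * from historic
 where cust_name matches "LASSURF*"
 order by cust_name, trx_date desc;
```

Because historic is only reloaded periodically, the maintenance cost of an extra index there is paid at load time rather than on every transaction.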
There is a quite good book on the subject of index design: Relational Database Index Design and the Optimizers by Lahdenmäki and Leach (it is also fairly expensive).
In the latest comment, Frank says:
[L]ooking for a couple of things. As its already been said, the simplest thing to do is to allow Informix to start returning rows once it has them. (Oracle does this by default.) The larger picture to what Frank is asking for is something similar to what Google has. Ok it really goes back to Alta Vista and the 90's when talking about search indexes on the web. The idea is that you can do a quick search, pick up the first n things while reporting a "number" of rows returned in the search. (As if the number reported by Google is accurate.)
This additional comment from Frank makes more sense in the context of the question for which this is a continuation.
Obviously, unless the SQL statement forces Informix to do a sort, it makes results available as soon as it has them; it always has. The FIRST_ROWS optimization hint indicates to IDS that if it has a choice of two query plans and one will let it produce the first rows more quickly than the other, then it should prefer the one that produces the first rows quickly, even if it is more expensive overall than the alternative. Even in the absence of the hint, IDS still tries to make the data available as quickly as possible; it just tries to do so as efficiently as possible as well.
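For reference, the hint can be given in IDS either for the whole session or for a single statement. A sketch, reusing the pawns table and wildcard search from the question:

```sql
{ Session-wide optimization goal. }
set optimization first_rows;

{ Per-statement optimizer directive. }
select {+FIRST_ROWS} *
  from pawns
 where cust_name matches "LASSURF*";
```

Neither form is available in SE, and as noted it is only a hint, so the chosen plan may well be the same with or without it.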
When the query is prepared, you get an estimate of how many rows may be returned - you could use that as an indicator (a few, quite a lot, very many). Separately, you can quickly and independently discover the number of rows in the main table you are searching. Given this metadata, you can certainly use a technique with a scroll cursor to give you a backing store in the database that contains the primary key values of the rows you are interested in. At any time, you can load an array with the display data for a set of interesting rows for display to the user. On user request, you can arrange to display another page full of information. At some point in the proceedings, you will find that you've reached the end of the data in the scroll cursor. Clearly, if you do FETCH LAST, you force that to happen. If you just do a few more FETCH NEXTs, then you will eventually get a NOTFOUND condition.
All of this has been possible with Informix (IDS and its prior incarnations, OnLine, Turbo, SE, plus I4GL) since the late 80s. The FIRST_ROWS optimization is more recent; it is still just a hint to the optimizer, and usually makes little difference to what the optimizer does.

SQL Server Efficiently dropping a group of rows with millions and millions of rows

I recently asked this question:
MS SQL share identity seed amongst tables
(Many people wondered why)
I have the following layout of a table:
Table: Stars
starId bigint
categoryId bigint
starname varchar(200)
But my problem is that I have millions and millions of rows. So when I want to delete stars from the table Stars it is too intense on SQL Server.
I cannot use built in partitioning for 2005+ because I do not have an enterprise license.
When I do delete though, I always delete a whole category Id at a time.
I thought of doing a design like this:
Table: Star_1
starId bigint
CategoryId bigint constraint rock=1
starname varchar(200)
Table: Star_2
starId bigint
CategoryId bigint constraint rock=2
starname varchar(200)
In this way I can delete a whole category and hence millions of rows in O(1) by doing a simple drop table.
My question is, is it a problem to have hundreds of thousands of tables in your SQL Server? The drop in O(1) is extremely desirable to me. Maybe there's a completely different solution I'm not thinking of?
Edit:
Is a star ever modified once it is inserted? No.
Do you ever have to query across star categories? I never have to query across star categories.
If you are looking for data on a particular star, would you know which table to query? Yes
When entering data, how will the application decide which table to put the data into? The insertion of star data is done all at once at the start when the categoryId is created.
How many categories will there be? You can assume there will be infinite star categories. Let's say up to 100 star categories per day and up to 30 star categories not needed per day.
Truly do you need to delete the whole category or only the star that the data changed for? Yes the whole star category.
Have you tried deleting in batches? Yes we do that today, but it is not good enough.
Another technique is mark the record for deletion? There is no need to mark a star as deleted because we know the whole star category is eligible to be deleted.
What proportion of them never get used? Typically we keep each star category data for a couple weeks but sometimes need to keep more.
When you decide one is useful is that good for ever or might it still need to be deleted later?
Not forever, but until a manual request to delete the category is issued.
If so what % of the time does that happen? Not that often.
What kind of disc arrangement are you using? Single filegroup storage and no partitioning currently.
Can you use SQL Enterprise? No. There are many people that run this software and they only have SQL Standard. It is outside of their budget to get MS SQL Enterprise.
My question is, is it a problem to have hundreds of thousands of tables in your SQL Server?
Yes. It is a huge problem to have this many tables in your SQL Server. Every object has to be tracked by SQL Server as metadata, and once you include indexes, referential constraints, primary keys, defaults, and so on, then you are talking about millions of database objects.
While SQL Server may theoretically be able to handle on the order of 2^31 (about 2 billion) objects per database, rest assured that it will start buckling under the load much sooner than that.
And if the database doesn't collapse, your developers and IT staff almost certainly will. I get nervous when I see more than a thousand tables or so; show me a database with hundreds of thousands and I will run away screaming.
Creating hundreds of thousands of tables as a poor-man's partitioning strategy will eliminate your ability to do any of the following:
Write efficient queries (how do you SELECT multiple categories?)
Maintain unique identities (as you've already discovered)
Maintain referential integrity (unless you like managing 300,000 foreign keys)
Perform ranged updates
Write clean application code
Maintain any sort of history
Enforce proper security (it seems evident that users would have to be able to initiate these create/drops - very dangerous)
Cache properly - 100,000 tables means 100,000 different execution plans all competing for the same memory, which you likely don't have enough of;
Hire a DBA (because rest assured, they will quit as soon as they see your database).
On the other hand, it's not a problem at all to have hundreds of thousands of rows, or even millions of rows, in a single table - that's the way SQL Server and other SQL RDBMSes were designed to be used and they are very well-optimized for this case.
The drop in O(1) is extremely desirable to me. Maybe there's a completely different solution I'm not thinking of?
The typical solution to performance problems in databases is, in order of preference:
Run a profiler to determine what the slowest parts of the query are;
Improve the query, if possible (i.e. by eliminating non-sargable predicates);
Normalize or add indexes to eliminate those bottlenecks;
Denormalize when necessary (not generally applicable to deletes);
If cascade constraints or triggers are involved, disable those for the duration of the transaction and blow out the cascades manually.
But the reality here is that you don't need a "solution."
"Millions and millions of rows" is not a lot in a SQL Server database. It is very quick to delete a few thousand rows from a table of millions by simply indexing on the column you wish to delete from - in this case CategoryID. SQL Server can do this without breaking a sweat.
In fact, deletions normally have an O(M log N) complexity (N = number of rows, M = number of rows to delete). In order to achieve an O(1) deletion time, you'd be sacrificing almost every benefit that SQL Server provides in the first place.
O(M log N) may not be as fast as O(1), but the kind of slowdowns you're talking about (several minutes to delete) must have a secondary cause. The numbers do not add up, and to demonstrate this, I've gone ahead and produced a benchmark:
Table Schema:
CREATE TABLE Stars
(
StarID int NOT NULL IDENTITY(1, 1)
CONSTRAINT PK_Stars PRIMARY KEY CLUSTERED,
CategoryID smallint NOT NULL,
StarName varchar(200)
)
CREATE INDEX IX_Stars_Category
ON Stars (CategoryID)
Note that this schema is not even really optimized for DELETE operations, it's a fairly run-of-the-mill table schema you might see in SQL server. If this table has no relationships, then we don't need the surrogate key or clustered index (or we could put the clustered index on the category). I'll come back to that later.
Sample Data:
This will populate the table with 10 million rows, using 500 categories (i.e. a cardinality of 1:20,000 per category). You can tweak the parameters to change the amount of data and/or cardinality.
SET NOCOUNT ON
DECLARE
    @BatchSize int,
    @BatchNum int,
    @BatchCount int,
    @StatusMsg nvarchar(100)
SET @BatchSize = 1000
SET @BatchCount = 10000
SET @BatchNum = 1
WHILE (@BatchNum <= @BatchCount)
BEGIN
    -- Report progress immediately instead of waiting for the batch to finish
    SET @StatusMsg =
        N'Inserting rows - batch #' + CAST(@BatchNum AS nvarchar(5))
    RAISERROR(@StatusMsg, 0, 1) WITH NOWAIT
    -- spt_values type 'P' supplies a run of consecutive integers
    INSERT Stars (CategoryID, StarName)
    SELECT
        v.number % 500,
        CAST(RAND() * v.number AS varchar(200))
    FROM master.dbo.spt_values v
    WHERE v.type = 'P'
        AND v.number >= 1
        AND v.number <= @BatchSize
    SET @BatchNum = @BatchNum + 1
END
Profile Script
The simplest of them all...
DELETE FROM Stars
WHERE CategoryID = 50
Results:
This was tested on a 5-year-old workstation machine running, IIRC, a 32-bit dual-core AMD Athlon and a cheap 7200 RPM SATA drive.
I ran the test 10 times using different CategoryIDs. The slowest time (cold cache) was about 5 seconds. The fastest time was 1 second.
Perhaps not as fast as simply dropping the table, but nowhere near the multi-minute deletion times you mentioned. And remember, this isn't even on a decent machine!
But we can do better...
Everything about your question implies that this data isn't related. If you don't have relations, you don't need the surrogate key, and can get rid of one of the indexes, moving the clustered index to the CategoryID column.
Now, as a rule, clustered indexes on non-unique/non-sequential columns are not a good practice. But we're just benchmarking here, so we'll do it anyway:
CREATE TABLE Stars
(
CategoryID smallint NOT NULL,
StarName varchar(200)
)
CREATE CLUSTERED INDEX IX_Stars_Category
ON Stars (CategoryID)
Run the same test data generator on this (incurring a mind-boggling number of page splits) and the same deletion took an average of just 62 milliseconds, and 190 from a cold cache (outlier). And for reference, if the index is made nonclustered (no clustered index at all) then the delete time only goes up to an average of 606 ms.
Conclusion:
If you're seeing delete times of several minutes, or even several seconds, then something is very, very wrong.
Possible factors are:
Statistics aren't up to date (shouldn't be an issue here, but if it is, just run sp_updatestats);
Lack of indexing (although, curiously, removing the IX_Stars_Category index in the first example actually leads to a faster overall delete, because the clustered index scan is faster than the nonclustered index delete);
Improperly-chosen data types. If you only have millions of rows, as opposed to billions, then you do not need a bigint on the StarID. You definitely don't need it on the CategoryID - if you have fewer than 32,768 categories then you can even do with a smallint. Every byte of unnecessary data in each row adds an I/O cost.
Lock contention. Maybe the problem isn't actually delete speed at all; maybe some other script or process is holding locks on Star rows and the DELETE just sits around waiting for them to let go.
Extremely poor hardware. I was able to run this without any problems on a pretty lousy machine, but if you're running this database on a '90s-era Presario or some similar machine that's preposterously unsuitable for hosting an instance of SQL Server, and it's heavily-loaded, then you're obviously going to run into problems.
Very expensive foreign keys, triggers, constraints, or other database objects which you haven't included in your example, which might be adding a high cost. Your execution plan should clearly show this (in the optimized example above, it's just a single Clustered Index Delete).
I honestly cannot think of any other possibilities. Deletes in SQL Server just aren't that slow.
If you're able to run these benchmarks and see roughly the same performance I saw (or better), then it means the problem is with your database design and optimization strategy, not with SQL Server or the asymptotic complexity of deletions. I would suggest, as a starting point, to read a little about optimization:
SQL Server Optimization Tips (Database Journal)
SQL Server Optimization (MSDN)
Improving SQL Server Performance (MSDN)
SQL Server Query Processing Team Blog
SQL Server Performance (particularly their tips on indexes)
If this still doesn't help you, then I can offer the following additional suggestions:
Upgrade to SQL Server 2008, which gives you a myriad of compression options that can vastly improve I/O performance;
Consider pre-compressing the per-category Star data into a compact serialized list (using the BinaryWriter class in .NET), and storing it in a varbinary column. This way you can have one row per category. This violates 1NF rules, but since you don't seem to be doing anything with individual Star data from within the database anyway, I doubt you'd be losing much.
Consider using a non-relational database or storage format, such as db4o or Cassandra. Instead of implementing a known database anti-pattern (the infamous "data dump"), use a tool that is actually designed for that kind of storage and access pattern.
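As a rough illustration of the "one serialized row per category" suggestion above, here is a minimal sketch in Python, with sqlite3 and the struct module standing in for SQL Server's varbinary column and .NET's BinaryWriter. The CategoryStars table, its Payload column, the byte layout, and the sample star names are all assumptions made up for this example.

```python
import sqlite3
import struct


def pack_stars(stars):
    """Serialize [(star_id, name), ...] into a single bytes value."""
    parts = [struct.pack("<I", len(stars))]  # 4-byte count of stars
    for star_id, name in stars:
        raw = name.encode("utf-8")
        # 4-byte id followed by a 2-byte name length, then the name bytes
        parts.append(struct.pack("<iH", star_id, len(raw)))
        parts.append(raw)
    return b"".join(parts)


def unpack_stars(blob):
    """Inverse of pack_stars."""
    (count,) = struct.unpack_from("<I", blob, 0)
    offset = 4
    stars = []
    for _ in range(count):
        star_id, size = struct.unpack_from("<iH", blob, offset)
        offset += 6
        stars.append((star_id, blob[offset:offset + size].decode("utf-8")))
        offset += size
    return stars


# Hypothetical one-row-per-category table (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CategoryStars (CategoryId INTEGER PRIMARY KEY, Payload BLOB NOT NULL)"
)
conn.execute(
    "INSERT INTO CategoryStars VALUES (?, ?)",
    (743, pack_stars([(1, "Sirius"), (2, "Vega")])),
)

# Reading a category back is one row fetch; deleting it is one row delete.
(payload,) = conn.execute(
    "SELECT Payload FROM CategoryStars WHERE CategoryId = ?", (743,)
).fetchone()
stars = unpack_stars(payload)
conn.execute("DELETE FROM CategoryStars WHERE CategoryId = ?", (743,))
remaining = conn.execute("SELECT COUNT(*) FROM CategoryStars").fetchone()[0]
```

The trade-off is the one already mentioned: the database can no longer query or index individual stars, so this only makes sense if stars are always read and discarded a whole category at a time.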
Must you delete them? Often it is better to just set an IsDeleted bit column to 1, and then do the actual deletion asynchronously during off hours.
Edit:
This is a shot in the dark, but adding a clustered index on CategoryId may speed up deletes. It may also impact other queries adversely. Is this something you can test?
This was the old technique in SQL 2000 (partitioned views), and it remains a valid option for SQL 2005. The problem comes from having a large number of tables and the maintenance overhead associated with them.
As you say, partitioning is an enterprise feature, but is designed for this large scale data removal / rolling window effect.
Another option would be to run batched deletes: instead of creating one very large transaction, create hundreds of far smaller ones, which avoids lock escalation and keeps each transaction small.
Having separate tables is partitioning - you are just managing it manually and do not get any management assistance or unified access (without a view or partitioned view).
Is Enterprise Edition really more expensive than separately building and maintaining your own partitioning scheme?
Alternatives to the long-running delete also include populating a replacement table with identical schema and simply excluding the rows to be deleted and then swapping the table out with sp_rename.
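A minimal sketch of that swap idea, written in Python against sqlite3 so it can be run as-is. SQLite's ALTER TABLE ... RENAME TO stands in for sp_rename here, and the Star table layout and the Star_new name are assumptions for illustration only.

```python
import sqlite3


def swap_out_category(conn, category_id):
    """Rebuild Star without one category, then swap the new table into place."""
    # Replacement table with the same (assumed) schema as Star.
    conn.execute(
        "CREATE TABLE Star_new ("
        " StarId INTEGER PRIMARY KEY,"
        " CategoryId INTEGER NOT NULL,"
        " Name TEXT)"
    )
    # Copy only the rows that should survive.
    conn.execute(
        "INSERT INTO Star_new (StarId, CategoryId, Name)"
        " SELECT StarId, CategoryId, Name FROM Star WHERE CategoryId <> ?",
        (category_id,),
    )
    # Swap: drop the old table and give the new one its name.
    conn.execute("DROP TABLE Star")
    conn.execute("ALTER TABLE Star_new RENAME TO Star")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Star (StarId INTEGER PRIMARY KEY, CategoryId INTEGER NOT NULL, Name TEXT)"
)
conn.executemany(
    "INSERT INTO Star (CategoryId, Name) VALUES (?, ?)",
    [(743, "a"), (743, "b"), (1, "c")],
)
conn.commit()

swap_out_category(conn, 743)
rows = conn.execute("SELECT CategoryId, Name FROM Star").fetchall()
```

This tends to pay off only when the rows being removed are a large share of the table, since every surviving row is copied. Any indexes, constraints, and permissions on the original table would also have to be recreated on the replacement.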
I don't understand why whole categories of stars are being deleted on a regular basis. Presumably you are having new categories created all the time, which means your number of categories must be huge, and partitioning on that (manually or not) would be very intensive.
Maybe on the Stars table set the PK to non-clustered and add a clustered index on categoryid.
Other than that, is the server set up according to best practices for performance? That is, using separate physical disks for data and logs, not using RAID 5, etc.
When you say deleting millions of rows is "too intense for SQL server", what do you mean? Do you mean that the log file grows too much during the delete?
All you should have to do is execute the delete in batches of a fixed size:
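DECLARE @i INT
SET @i = 1
WHILE @i > 0
BEGIN
    -- Delete one batch; TOP needs parentheses in a DELETE statement
    DELETE TOP (10000) FROM dbo.SuperBigTable
    WHERE CategoryID = 743
    -- Stop once a batch deletes nothing
    SELECT @i = @@ROWCOUNT
END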
If your database is in full recovery mode, you will have to run frequent transaction log backups during this process so that it can reuse the space in the log. If the database is in simple mode, you shouldn't have to do anything.
My only other recommendation is to make sure that you have an appropriate index on CategoryId. I might even recommend that this be the clustered index.
If you want to optimize for deleting by category, a clustered composite index with the category as its first column might do more good than harm.
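For anyone who wants to try the batching pattern without a SQL Server instance at hand, here is a rough equivalent in Python using sqlite3. SQLite has no DELETE TOP, so the sketch limits each batch through a rowid subquery; the index name and the sample row counts are assumptions.

```python
import sqlite3


def delete_category_in_batches(conn, category_id, batch_size=10000):
    """Delete one category in fixed-size batches, committing after each batch."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM SuperBigTable WHERE rowid IN ("
            " SELECT rowid FROM SuperBigTable WHERE CategoryID = ? LIMIT ?)",
            (category_id, batch_size),
        )
        conn.commit()  # one small transaction per batch
        if cur.rowcount == 0:  # nothing left to delete
            return total
        total += cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SuperBigTable (Id INTEGER PRIMARY KEY, CategoryID INTEGER NOT NULL)"
)
# Index on the column used to pick the rows to delete.
conn.execute("CREATE INDEX IX_SuperBigTable_CategoryID ON SuperBigTable (CategoryID)")
conn.executemany(
    "INSERT INTO SuperBigTable (CategoryID) VALUES (?)",
    [(743,)] * 25000 + [(1,)] * 500,
)
conn.commit()

deleted = delete_category_in_batches(conn, 743)
remaining = conn.execute("SELECT COUNT(*) FROM SuperBigTable").fetchone()[0]
```

The shape is the same as the T-SQL loop: delete a bounded number of rows, commit, and repeat until a batch affects no rows.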
Also you could describe the relationships on the table.
It sounds like the transaction log is struggling with the size of the delete. The transaction log grows in units, and this takes time whilst it allocates more disk space.
It is not possible to delete rows from a table without enlisting a transaction, although it is possible to truncate a table using the TRUNCATE command. However this will remove all rows in the table without condition.
I can offer the following suggestions:
Switch to a non-transactional database or possibly flat files. It doesn't sound like you need atomicity of a transactional database.
Attempt the following. After every x deletes (depending on size) issue the following statement
BACKUP LOG WITH TRUNCATE_ONLY;
This simply truncates the transaction log; the space remains allocated for the log to refill. However, I'm not sure how much time this will add to the operation.
What do you do with the star data? If you only look at data for one category at any given time this might work, but it is hard to maintain. Every time you have a new category, you will have to build a new table. If you want to query across categories, it becomes more complex and possibly more expensive in terms of time. If you do this and do want to query across categories, a view is probably best (but do not pile views on top of views). If you are looking for data on a particular star, would you know which table to query? If not, how are you going to determine which table, or are you going to query them all? When entering data, how will the application decide which table to put the data into? How many categories will there be? Incidentally, regarding each table having a separate id: use bigint identities and combine the identity with the category type for your unique identifier.
Truly do you need to delete the whole category or only the star that the data changed for?
And do you need to delete at all? Maybe you only need to update information.
Have you tried deleting in batches (1000 records or so at a time in a loop)? This is often much faster than deleting a million records in one delete statement. It often keeps the table from getting locked during the delete as well.
Another technique is to mark the record for deletion. Then you can run a batch process when usage is low to delete those records, and your queries can run on a view that excludes the records marked for deletion.
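The mark-then-purge approach might look something like the following sketch, again in Python with sqlite3 as a stand-in. The IsDeleted column matches the bit column suggested earlier in the thread; the Star layout, the ActiveStar view name, and the function names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Star (
        StarId     INTEGER PRIMARY KEY,
        CategoryId INTEGER NOT NULL,
        Name       TEXT NOT NULL,
        IsDeleted  INTEGER NOT NULL DEFAULT 0
    );
    -- Queries read from this view, so marked rows disappear immediately.
    CREATE VIEW ActiveStar AS
        SELECT StarId, CategoryId, Name FROM Star WHERE IsDeleted = 0;
""")


def mark_category_deleted(conn, category_id):
    """Cheap 'delete': only flips the flag."""
    conn.execute("UPDATE Star SET IsDeleted = 1 WHERE CategoryId = ?", (category_id,))
    conn.commit()


def purge_deleted(conn):
    """Physical delete, meant to run when usage is low. Returns rows removed."""
    cur = conn.execute("DELETE FROM Star WHERE IsDeleted = 1")
    conn.commit()
    return cur.rowcount


conn.executemany(
    "INSERT INTO Star (CategoryId, Name) VALUES (?, ?)",
    [(743, "a"), (743, "b"), (1, "c")],
)
conn.commit()

mark_category_deleted(conn, 743)
active = conn.execute("SELECT COUNT(*) FROM ActiveStar").fetchone()[0]
purged = purge_deleted(conn)
```

On a large table the purge step could itself be run in batches, so the off-hours job never holds one huge transaction open.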
Given your answers, I think your proposal may be reasonable.
I know this is a bit of a tangent, but is SQL Server (or any relational database) really a good tool for this job? What relational database features are you actually using?
If you are dropping whole categories at a time, you can't have much referential integrity depending on it. The data is read only, so you don't need ACID for data updates.
Sounds to me like you are using basic SELECT query features?
Just taking your idea of many tables, here is how you could realise it...
What about using dynamic queries?
Create the table of categories with an identity category_id column.
Create an insert trigger on this table; in it, create a table for stars whose name is built dynamically from category_id.
Create a delete trigger; in it, drop the corresponding stars table, also using dynamically built SQL.
To select the stars of a particular category, you can use a function that returns a table. It takes category_id as a parameter and returns the result, also through a dynamic query.
To insert stars for a new category, first insert a new row into the categories table and then insert the stars into the appropriate table.
Another direction I would research is using an xml-typed column for storing the stars data. The main idea is that if you only need to work with stars by category, then why not store all the stars of a given category in one cell of the table in XML format? Unfortunately, I really cannot imagine what the performance of such a solution would be.
Both of these variants are just brainstorming ideas.
As Cade pointed out, adding a table for each category is manually partitioning the data, without the benefits of the unified access.
There will never be any deletions for millions of rows that happen as fast as dropping a table, without the use of partitions.
Therefore, it seems like using a separate table for each category may be a valid solution. However, since you've stated that some of these categories are kept, and some are deleted, here is a solution:
Create a new stars table for each new category.
Wait for the time period to expire where you decide whether the stars for the category are kept or not.
Roll the records into the main stars table if you plan on keeping them.
Drop the table.
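The four steps above can be sketched as follows, in Python with sqlite3 so the flow can be run directly. The Stars_<id> naming convention, the main Stars table layout, and the function names are assumptions; rolled-in rows get fresh StarId values in this sketch.

```python
import sqlite3


def staging_name(category_id):
    """Per-category table name. Table names cannot be bound as parameters,
    so only non-negative integers are allowed into the identifier."""
    category_id = int(category_id)
    if category_id < 0:
        raise ValueError("category_id must be non-negative")
    return f"Stars_{category_id}"


def create_staging(conn, category_id):
    # Step 1: a new stars table for each new category.
    conn.execute(
        f"CREATE TABLE {staging_name(category_id)}"
        " (StarId INTEGER PRIMARY KEY, Name TEXT NOT NULL)"
    )


def keep_category(conn, category_id):
    # Step 3: roll the records into the main table, then step 4: drop.
    table = staging_name(category_id)
    conn.execute(
        f"INSERT INTO Stars (CategoryId, Name) SELECT ?, Name FROM {table}",
        (category_id,),
    )
    conn.execute(f"DROP TABLE {table}")
    conn.commit()


def discard_category(conn, category_id):
    # Step 4 only: dropping the table replaces a large DELETE.
    conn.execute(f"DROP TABLE {staging_name(category_id)}")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Stars (StarId INTEGER PRIMARY KEY, CategoryId INTEGER NOT NULL, Name TEXT NOT NULL)"
)
for cat in (1, 2):
    create_staging(conn, cat)
    conn.executemany(
        f"INSERT INTO {staging_name(cat)} (Name) VALUES (?)", [("a",), ("b",)]
    )
conn.commit()

keep_category(conn, 1)     # decided to keep category 1
discard_category(conn, 2)  # decided to throw category 2 away
kept = conn.execute("SELECT CategoryId, Name FROM Stars ORDER BY StarId").fetchall()
```

Step 2 (waiting out the decision period) is left to whatever scheduling the application already has.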
This way, you will have a finite number of tables, depending on the rate you add categories and the time period where you decide if you want them or not.
Ultimately, for the categories that you keep, you're doubling the work, but the extra work is distributed over time. Inserts to the end of the clustered index may be experienced less by the users than deletes from the middle. However, for those categories that you're not keeping, you're saving tons of time.
Even if you're not technically saving work, perception is often the bigger issue.
I didn't get an answer to my comment on the original post, so I am going under some assumptions...
Here's my idea: use multiple databases, one for each category.
You can use the managed ESE database that ships with every version of Windows, for free.
Use the PersistentDictionary object, and keep track of the starid, starname pairs that way. If you need to delete a category, just delete the PersistentDictionary object for that category.
PersistentDictionary<int, string> starsForCategory = new PersistentDictionary<int, string>("Category1");
This will create a database called "Category1", on which you can use standard .NET dictionary methods (add, exists, foreach, etc).

SQL Data Normalisation / Performance

I am working on a web API for the insurance industry and trying to work out a suitable data structure for the quoting of insurance.
The database already contains a "ratings" table which is basically:
sysID (PK, INT IDENTITY)
goods_type (VARCHAR(16))
suminsured_min (DECIMAL(9,2))
suminsured_max (DECIMAL(9,2))
percent_premium (DECIMAL(9,6))
[Unique Index on goods_type, suminsured_min and suminsured_max]
[edit]
Each type of goods typically has 3 - 4 ranges for suminsured
[/edit]
The list of goods_types rarely changes and most queries for insurance will involve goods worth less than $100. Because of this, I was considering de-normalising using tables in the following format (for all values from $0.00 through to $100.00):
Table Name: tblRates[goodstype]
suminsured (DECIMAL(9,2)) Primary Key
premium (DECIMAL(9,2))
Denormalising this data should be easy to maintain as the rates are generally only updated once per month at most. All requests for values >$100 will always be looked up in the primary tables and calculated.
My question(s) are:
1. Am I better off storing the suminsured values as DECIMAL(9,2) or as a value in cents stored in a BIGINT?
2. This de-normalisation method involves storing 10,001 values ($0.00 to $100.00 in $0.01 increments) in possibly 20 tables. Is this likely to be more efficient than looking up the percent_premium and performing a calculation? - Or should I stick with the main tables and do the calculation?
Don't create new tables. You already have an index on the goods type and the min and max values, so this SQL (for a known goods type and its value):
SELECT percent_premium
FROM ratings
WHERE goods_type='PRECIOUST' and :PREC_VALUE BETWEEN suminsured_min AND suminsured_max
will use your index efficiently.
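To make the lookup-and-calculate route concrete, here is a small sketch in Python with sqlite3. Since SQLite has no DECIMAL type, the sketch stores the sum-insured bounds as integer cents and the rate as text, then does the arithmetic with Python's Decimal. The 'ELECTRONICS' goods type, the sample rates, the reading of percent_premium as a percentage of the sum insured, and the half-up rounding are all assumptions.

```python
import sqlite3
from decimal import Decimal, ROUND_HALF_UP


def make_ratings(conn):
    conn.execute("""
        CREATE TABLE ratings (
            sysID           INTEGER PRIMARY KEY,
            goods_type      TEXT    NOT NULL,
            suminsured_min  INTEGER NOT NULL,  -- cents
            suminsured_max  INTEGER NOT NULL,  -- cents
            percent_premium TEXT    NOT NULL   -- exact decimal kept as text
        )""")
    conn.execute(
        "CREATE UNIQUE INDEX ux_ratings"
        " ON ratings (goods_type, suminsured_min, suminsured_max)"
    )


def quote_premium(conn, goods_type, sum_insured):
    """Look up the rate band for sum_insured (a Decimal) and compute the premium."""
    cents = int((sum_insured * 100).to_integral_value())
    row = conn.execute(
        "SELECT percent_premium FROM ratings"
        " WHERE goods_type = ? AND ? BETWEEN suminsured_min AND suminsured_max",
        (goods_type, cents),
    ).fetchone()
    if row is None:
        return None  # no band covers this goods type and value
    premium = sum_insured * Decimal(row[0]) / Decimal(100)
    return premium.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


conn = sqlite3.connect(":memory:")
make_ratings(conn)
conn.executemany(
    "INSERT INTO ratings (goods_type, suminsured_min, suminsured_max, percent_premium)"
    " VALUES (?, ?, ?, ?)",
    [
        ("ELECTRONICS", 0, 10000, "1.250000"),      # $0.00 to $100.00
        ("ELECTRONICS", 10001, 50000, "1.000000"),  # $100.01 to $500.00
    ],
)
conn.commit()

small = quote_premium(conn, "ELECTRONICS", Decimal("80.00"))
large = quote_premium(conn, "ELECTRONICS", Decimal("250.00"))
```

With three or four bands per goods type, the query touches a handful of index entries, and the multiplication afterwards is trivial next to the round trip to the database.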
The data type you are looking for is smallmoney. Use it.
The plan you suggest will use a binary search on 10001 rows instead of 3 or 4.
It's hardly a performance improvement, don't do that.
As for arithmetic, BIGINT will be slightly faster, though I think you will hardly notice that.
I am not entirely sure exactly what calculations we are talking about, but unless they are obnoxiously complicated, they will more than likely be much quicker than looking up data in several different tables. If possible, perform the calculations in the db (i.e. use stored procedures) to minimize the data traffic between your application layers too.
And even if the data loading would be quicker, I think the idea of having to update de-normalized data as often as once a month (or even once a quarter) is pretty scary. You can probably do the job pretty quickly, but what about the next person handling the system? Would you require them to learn the db structure, remember which of the 20-some tables need to be updated each time, and do it correctly? I would say the possible performance gain from de-normalizing is not worth the risk of contaminating the data with incorrect information.