SQL: Insert Into Table 1 From Table 2 then Update Table 2 - Performance Increase

I am working to increase the speed and performance of a database process that I have inherited. The basic flow is that a utility first uploads about a million or more records into an Upload Table. That step is pretty quick, but things start to slow down once we start adding/updating/moving items from the Upload Table into other tables in the database.
I have read a few articles stating that using NOT EXISTS may be quicker than SELECT DISTINCT, so I was thinking about refactoring the following code to do so, but I was also wondering if there is a way to combine these two queries in order to increase the speed.
The Upload Table contains many columns; I am only showing the Product portion, but there are also Store columns (the same number of columns as the Product set) and many other details that are not a one-to-one relationship between tables.
The first query inserts the product into the Product table if it does not already exist, then the next step updates the Upload Table with Product IDs for all the records in the Upload Table.
INSERT INTO Product (ProductCode, ProductDescription, ProductCodeQualifier, UnitOfMeasure)
SELECT DISTINCT
ut.ProductCode, ut.ProductDescription, ut.ProductCodeQualifier, ut.UnitOfMeasure
FROM
Upload_Table ut
LEFT JOIN
Product p ON (ut.ProductCode = p.ProductCode)
AND (ut.ProductDescription = p.ProductDescription)
AND (ut.ProductCodeQualifier = p.ProductCodeQualifier)
AND (ut.UnitOfMeasure = p.UnitOfMeasure)
WHERE
p.Id IS NULL
AND ut.UploadId = 123456;
UPDATE Upload_Table
SET ProductId = Product.Id
FROM Upload_Table
INNER JOIN Product ON Upload_Table.ProductCode = Product.ProductCode
AND Upload_Table.ProductDescription = Product.ProductDescription
AND Upload_Table.ProductCodeQualifier = Product.ProductCodeQualifier
AND Upload_Table.UnitOfMeasure = Product.UnitOfMeasure
WHERE (Upload_Table.UploadId = 123456)
Any help or suggestions would be greatly appreciated. I am decent with my understanding of SQL but I am not an expert.
Thanks!
I have not yet made any changes to this part, as I am trying to find the best approach for speed improvements and a better understanding of how this process can be improved.
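The NOT EXISTS version I was considering looks something like this (just a sketch against the same tables and columns as above; I have not benchmarked it yet):
INSERT INTO Product (ProductCode, ProductDescription, ProductCodeQualifier, UnitOfMeasure)
SELECT DISTINCT
    ut.ProductCode, ut.ProductDescription, ut.ProductCodeQualifier, ut.UnitOfMeasure
FROM Upload_Table ut
WHERE ut.UploadId = 123456
  AND NOT EXISTS (SELECT 1
                  FROM Product p
                  WHERE p.ProductCode = ut.ProductCode
                    AND p.ProductDescription = ut.ProductDescription
                    AND p.ProductCodeQualifier = ut.ProductCodeQualifier
                    AND p.UnitOfMeasure = ut.UnitOfMeasure);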

Recommendations:
You can disable triggers, foreign keys, constraints, and non-clustered indexes before inserting and updating, and re-enable them afterwards, because indexes, triggers, and foreign keys hurt insert/update performance very badly.
It is not recommended to use auto-commit transaction mode during the update or insert process. It performs very poorly, because in auto-commit mode every inserted record is committed in its own transaction. For better performance, commit only after every 1,000 (or 10,000) records; see the sketch below.
If you can, run this process (insert or update) periodically several times a day; you could also drive it with triggers. I don't know your business logic, so this variant may not suit you.
And don't forget to analyze the execution plans for your queries.
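A rough sketch of that kind of batching (assuming an arbitrary 10,000-row batch size and that ProductId starts out NULL in Upload_Table; adjust to your schema):
DECLARE @BatchSize int = 10000;
DECLARE @Rows int = 1;

WHILE @Rows > 0
BEGIN
    BEGIN TRANSACTION;

    -- Stamp ProductId on the next batch of unprocessed upload rows
    UPDATE TOP (@BatchSize) ut
    SET ProductId = p.Id
    FROM Upload_Table ut
    INNER JOIN Product p ON ut.ProductCode = p.ProductCode
        AND ut.ProductDescription = p.ProductDescription
        AND ut.ProductCodeQualifier = p.ProductCodeQualifier
        AND ut.UnitOfMeasure = p.UnitOfMeasure
    WHERE ut.UploadId = 123456
      AND ut.ProductId IS NULL;

    SET @Rows = @@ROWCOUNT;

    COMMIT TRANSACTION;
END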

Related

Delete with inner join takes more than 40 seconds to delete 500 rows

I have this request :
delete L
from L
inner join M on L.id = M.ref_id
or L.id = M.news_id
And it takes between 39 and 50 seconds to delete 588 rows.
The select is very fast (0.015 sec) if in my query I replace delete L from L with select * from L.
Here is my trace in xml
and here is my explain plan file .SQLPlan
Hope you can help me figure out what I'm missing here; what is the problem?
Deleting rows from a large clustered table is usually slow.
The famous "BETWEEN" discussed in the comments is a range over the clustered index; it takes the min and max values from your join. That's why it is recommended to delete in batches.
I guess there could be a few general recommendations, but it always depends on different things like the amount of data, hardware, memory, log file size, etc.:
- Delete rows in batches. The optimal batch size depends on the environment, but generally it's around 10K to 100K (sketch below).
- Delete in the order of the clustered index.
- If possible, switch your recovery model to BULK_LOGGED/SIMPLE for minimal logging.
- If possible, drop constraints like foreign key constraints from the child table.
- If possible, disable non-clustered indexes.
- Depending on the amount of data you are going to delete, it may be better to export the data to a new table and then drop the old table, or keep it under a different name for future reference.
As you can see, it mainly depends on the situation and environment.
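A batched delete along those lines could look something like this (a sketch only, reusing the L and M tables from the question and an arbitrary 10,000-row batch size):
DECLARE @Rows int = 1;

WHILE @Rows > 0
BEGIN
    -- Delete the next chunk of matching rows; the loop ends when nothing is left
    DELETE TOP (10000) L
    FROM L
    INNER JOIN M ON L.id = M.ref_id
                 OR L.id = M.news_id;

    SET @Rows = @@ROWCOUNT;
END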
This is the best answer: "maybe you need to tune the system, not the query".
But from a delete statement, I would prefer to see most of the cost coming from the delete itself. Understand, a clustered index is where the data is stored in a table that has a clustered index. When you delete, you're deleting from the cluster. That's not to say that you might not be able to tune this further, but you're seeing very normal, even desirable behavior. Without seeing exactly what you're doing, it's hard to make suggestions, but it could be that you're doing it all right. In which case, maybe you need to tune the system, not the query. Get faster disks. Make sure you're not under memory pressure. Check your wait statistics to see where most of the work is being done on the server and make adjustments from there.
If you cannot do any of those things, it will probably be faster to use a loop and delete row by row.
source: https://ask.sqlservercentral.com/questions/90223/how-to-reduce-the-cost-of-clustered-index-delete.html
If it's truly slow because of the join, and not because the delete itself is slow, I've found it useful to trick SQL into generating a more optimized query by calculating intermediate results into a table variable, and then working off that table variable.
DECLARE @temp TABLE
(
    id int
);

INSERT INTO @temp (id)
SELECT L.id
FROM L
INNER JOIN M ON L.id = M.ref_id
             OR L.id = M.news_id;

DELETE FROM L WHERE id IN (SELECT id FROM @temp);

How to merge 500 million table with another 500 million table

I have to merge two 500M+ row tables.
What is the best method to merge them?
I just need to display the records from these two SQL-Server tables if somebody searches on my webpage.
These are fixed tables, no one will ever change data in these tables once they are live.
create a view myview as select * from table1 union select * from table2
Is there any harm using the above method?
If I start merging 500M rows it will run for days, and if the machine reboots the database will go into recovery mode, and then I have to start from the beginning again.
Why am I merging these tables?
I have a website which provides a search on the person table.
This table has columns like Name, Address, Age, etc.
We got 500 million similar records in .txt files, which we loaded into some other table.
Now we want the website search page to query both tables to see if a person exists in the table.
We get similar .txt files of 100 million or 20 million records, which we load into this huge table.
How are we currently doing it?
We import the .txt files into separate tables (some columns are different in the .txt files).
Then we arrange the columns and do the data type conversions.
Then we insert this staging table into the liveCopy huge table (in the test environment).
We have SQL server 2008 R2
Can we use table partitioning for performance benefits?
Is it OK to create small monthly tables and create a view on top of them?
How can indexing be done in this case?
We only load new data once a month and then run selects.
Does replication help?
Biggest issue I am facing is managing huge tables.
I hope I explained the situation.
Thanks & Regards
1) Usually, to get more performance, developers split large tables into smaller ones and call this partitioning (horizontal partitioning, to be precise, because there is also vertical partitioning). Your view is an example of such partitions joined together. Of course, it is mostly used to split a large amount of data by ranges of values (for example, table1 contains records with column [col1] < 0, while table2 has [col1] >= 0). But it is fine for unsorted data too, because you get more room for speed improvements, for example parallel reads if the tables are placed on different storage. So this is a good choice.
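A minimal sketch of such a view (assuming both tables share the person columns from the question; UNION ALL avoids the duplicate-elimination cost that a plain UNION adds):
CREATE VIEW dbo.PersonSearch
AS
SELECT Name, Address, Age FROM dbo.table1
UNION ALL
SELECT Name, Address, Age FROM dbo.table2;
If each table also carries a CHECK constraint on a partitioning column, SQL Server can treat this as a partitioned view and skip tables that cannot match a given search.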
2) Another way is to use the MERGE statement, supported in SQL Server 2008 and higher - http://msdn.microsoft.com/en-us/library/bb510625(v=sql.100).aspx.
3) Of course you can copy using INSERT+DELETE, but in that case, or when using the MERGE command, do it in small batches. Something like:
SET ROWCOUNT 10000
DECLARE @Count [int] = 1
WHILE @Count > 0 BEGIN
    ... INSERT+DELETE/MERGE transaction ...
    SET @Count = @@ROWCOUNT
END
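As one concrete way to fill in that placeholder (a sketch that assumes a hypothetical PersonId key shared by both tables; a MERGE form would be analogous):
DECLARE @Count int = 1;
WHILE @Count > 0
BEGIN
    -- Copy the next 10,000 rows that are not in table1 yet
    INSERT INTO dbo.table1 (PersonId, Name, Address, Age)
    SELECT TOP (10000) s.PersonId, s.Name, s.Address, s.Age
    FROM dbo.table2 AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.table1 AS t WHERE t.PersonId = s.PersonId);

    SET @Count = @@ROWCOUNT;
END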
If your purpose is truly just to move the data from the two tables into one table, you will want to do it in batches - 100K records at a time, or something like that. I'd guess you crashed before because your T-Log got full, although that's just speculation. Make sure to throw in a checkpoint after each batch if you are in Full recovery mode.
That said, I agree with all the comments that you should provide why you are doing this - it may not be necessary at all.
You may want to have a look at an Indexed View.
In this way, you can set up indexes on your view and get the best performance out of it. The expensive part of using Indexed Views is in the CRUD operations - but for read performance it would be your best solution.
http://www.brentozar.com/archive/2013/11/what-you-can-and-cant-do-with-indexed-views/
https://www.simple-talk.com/sql/learn-sql-server/sql-server-indexed-views-the-basics/
If the two tables are linked one to one, then you are wasting a lot of CPU time on each read, especially since you mentioned that the tables don't change at all. You should have only one table in this case.
Try creating a new table including (at least) the two columns from the two tables.
You can do this by:
SELECT * INTO newTable
FROM A LEFT JOIN B ON A.x = B.y
or (if some people don't have the information from the text file):
SELECT * INTO newTable
FROM A INNER JOIN B ON A.x = B.y
And note that you need to have indexes on the join fields at least (to speed up the process).
More details about the fields would help in giving a more precise answer as well.

How can I efficiently manipulate 500k records in SQL Server 2005?

I am getting a large text file of updated information from a customer that contains updates for 500,000 users. However, as I am processing this file, I often am running into SQL Server timeout errors.
Here's the process I follow in my VB application that processes the data (in general):
Delete all records from the temporary table (to remove last month's data) (e.g. DELETE FROM tempTable)
Rip text file into the temp table
Fill in extra information into the temp table, such as their organization_id, their user_id, group_code, etc.
Update the data in the real tables based on the data computed in the temp table
The problem is that I often run commands like UPDATE tempTable SET user_id = (SELECT user_id FROM myUsers WHERE external_id = tempTable.external_id) and these commands frequently time out. I have tried bumping the timeouts up to as far as 10 minutes, but they still fail. Now, I realize that 500k rows is no small number of rows to manipulate, but I would think that a database purported to be able to handle millions and millions of rows should be able to cope with 500k pretty easily. Am I doing something wrong with how I am going about processing this data?
Please help. Any and all suggestions welcome.
subqueries like the one you give us in the question:
UPDATE tempTable SET user_id = (SELECT user_id FROM myUsers WHERE external_id = tempTable.external_id)
are only good on one row at a time, so you must be looping. Think set based:
UPDATE t
SET user_id = u.user_id
FROM tempTable t
inner join myUsers u ON t.external_id=u.external_id
and remove your loops, this will update all rows in one statement and be significantly faster!
Needs more information. I regularly manipulate 3-4 million rows in a 150 million row table and I do NOT think this is a lot of data. I have a "products" table that contains about 8 million entries - including full text search. No problems either.
Can you elaborate on your hardware? I assume "normal desktop PC" or "low end server", both with an absolutely non-optimal disc layout, and thus tons of IO problems - on updates.
Make sure you have indexes on your tables that you are doing the selects from. In your example UPDATE command, you select the user_id from the myUsers table. Do you have an index with the user_id column on the myUsers table? The downside of indexes is that they increase time for inserts/updates. Make sure you don't have indexes on the tables you are trying to update. If the tables you are trying to update do have indexes, consider dropping them and then rebuilding them after your import.
Finally, run your queries in SQL Server Management Studio and have a look at the execution plan to see how the query is being executed. Look for things like table scans to see where you might be able to optimize.
Look at KM's answer and don't forget about indexes and primary keys.
Are you indexing your temp table after importing the data?
temp_table.external_id should definitely have an index since it is in the where clause.
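For example (a sketch; the table and column names come from the question, the index names are made up):
CREATE INDEX IX_tempTable_external_id ON tempTable (external_id);
CREATE INDEX IX_myUsers_external_id ON myUsers (external_id) INCLUDE (user_id);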
There are more efficient ways of importing large blocks of data. Look in SQL Server Books Online under bcp (the bulk copy program).

SQL Server Efficiently dropping a group of rows with millions and millions of rows

I recently asked this question:
MS SQL share identity seed amongst tables
(Many people wondered why)
I have the following layout of a table:
Table: Stars
starId bigint
categoryId bigint
starname varchar(200)
But my problem is that I have millions and millions of rows. So when I want to delete stars from the table Stars it is too intense on SQL Server.
I cannot use built in partitioning for 2005+ because I do not have an enterprise license.
When I do delete though, I always delete a whole category Id at a time.
I thought of doing a design like this:
Table: Star_1
starId bigint
CategoryId bigint constraint rock=1
starname varchar(200)
Table: Star_2
starId bigint
CategoryId bigint constraint rock=2
starname varchar(200)
In this way I can delete a whole category and hence millions of rows in O(1) by doing a simple drop table.
My question is, is it a problem to have hundreds of thousands of tables in your SQL Server? The drop in O(1) is extremely desirable to me. Maybe there's a completely different solution I'm not thinking of?
Edit:
Is a star ever modified once it is inserted? No.
Do you ever have to query across star categories? I never have to query across star categories.
If you are looking for data on a particular star, would you know which table to query? Yes
When entering data, how will the application decide which table to put the data into? The insertion of star data is done all at once at the start when the categoryId is created.
How many categories will there be? You can assume there will be infinite star categories. Let's say up to 100 star categories per day and up to 30 star categories not needed per day.
Truly do you need to delete the whole category or only the star that the data changed for? Yes the whole star category.
Have you tried deleting in batches? Yes we do that today, but it is not good enough.
Another technique is mark the record for deletion? There is no need to mark a star as deleted because we know the whole star category is eligible to be deleted.
What proportion of them never get used? Typically we keep each star category data for a couple weeks but sometimes need to keep more.
When you decide one is useful is that good for ever or might it still need to be deleted later?
Not forever, but until a manual request to delete the category is issued.
If so what % of the time does that happen? Not that often.
What kind of disc arrangement are you using? Single filegroup storage and no partitioning currently.
Can you use sql enterprise ? No. There are many people that run this software and they only have sql standard. It is outside of their budget to get ms sql enterprise.
My question is, is it a problem to have hundreds of thousands of tables in your SQL Server?
Yes. It is a huge problem to have this many tables in your SQL Server. Every object has to be tracked by SQL Server as metadata, and once you include indexes, referential constraints, primary keys, defaults, and so on, then you are talking about millions of database objects.
While SQL Server may theoretically be able to handle 2^32 objects, rest assured that it will start buckling under the load much sooner than that.
And if the database doesn't collapse, your developers and IT staff almost certainly will. I get nervous when I see more than a thousand tables or so; show me a database with hundreds of thousands and I will run away screaming.
Creating hundreds of thousands of tables as a poor-man's partitioning strategy will eliminate your ability to do any of the following:
Write efficient queries (how do you SELECT multiple categories?)
Maintain unique identities (as you've already discovered)
Maintain referential integrity (unless you like managing 300,000 foreign keys)
Perform ranged updates
Write clean application code
Maintain any sort of history
Enforce proper security (it seems evident that users would have to be able to initiate these create/drops - very dangerous)
Cache properly - 100,000 tables means 100,000 different execution plans all competing for the same memory, which you likely don't have enough of;
Hire a DBA (because rest assured, they will quit as soon as they see your database).
On the other hand, it's not a problem at all to have hundreds of thousands of rows, or even millions of rows, in a single table - that's the way SQL Server and other SQL RDBMSes were designed to be used and they are very well-optimized for this case.
The drop in O(1) is extremely desirable to me. Maybe there's a completely different solution I'm not thinking of?
The typical solution to performance problems in databases is, in order of preference:
Run a profiler to determine what the slowest parts of the query are;
Improve the query, if possible (i.e. by eliminating non-sargable predicates);
Normalize or add indexes to eliminate those bottlenecks;
Denormalize when necessary (not generally applicable to deletes);
If cascade constraints or triggers are involved, disable those for the duration of the transaction and blow out the cascades manually.
But the reality here is that you don't need a "solution."
"Millions and millions of rows" is not a lot in a SQL Server database. It is very quick to delete a few thousand rows from a table of millions by simply indexing on the column you wish to delete from - in this case CategoryID. SQL Server can do this without breaking a sweat.
In fact, deletions normally have an O(M log N) complexity (N = number of rows, M = number of rows to delete). In order to achieve an O(1) deletion time, you'd be sacrificing almost every benefit that SQL Server provides in the first place.
O(M log N) may not be as fast as O(1), but the kind of slowdowns you're talking about (several minutes to delete) must have a secondary cause. The numbers do not add up, and to demonstrate this, I've gone ahead and produced a benchmark:
Table Schema:
CREATE TABLE Stars
(
StarID int NOT NULL IDENTITY(1, 1)
CONSTRAINT PK_Stars PRIMARY KEY CLUSTERED,
CategoryID smallint NOT NULL,
StarName varchar(200)
)
CREATE INDEX IX_Stars_Category
ON Stars (CategoryID)
Note that this schema is not even really optimized for DELETE operations; it's a fairly run-of-the-mill table schema you might see in SQL Server. If this table has no relationships, then we don't need the surrogate key or clustered index (or we could put the clustered index on the category). I'll come back to that later.
Sample Data:
This will populate the table with 10 million rows, using 500 categories (i.e. a cardinality of 1:20,000 per category). You can tweak the parameters to change the amount of data and/or cardinality.
SET NOCOUNT ON

DECLARE
    @BatchSize int,
    @BatchNum int,
    @BatchCount int,
    @StatusMsg nvarchar(100)

SET @BatchSize = 1000
SET @BatchCount = 10000
SET @BatchNum = 1

WHILE (@BatchNum <= @BatchCount)
BEGIN
    SET @StatusMsg =
        N'Inserting rows - batch #' + CAST(@BatchNum AS nvarchar(5))
    RAISERROR(@StatusMsg, 0, 1) WITH NOWAIT

    INSERT Stars (CategoryID, StarName)
    SELECT
        v.number % 500,
        CAST(RAND() * v.number AS varchar(200))
    FROM master.dbo.spt_values v
    WHERE v.type = 'P'
    AND v.number >= 1
    AND v.number <= @BatchSize

    SET @BatchNum = @BatchNum + 1
END
Profile Script
The simplest of them all...
DELETE FROM Stars
WHERE CategoryID = 50
Results:
This was tested on a 5-year-old workstation running, IIRC, a 32-bit dual-core AMD Athlon and a cheap 7200 RPM SATA drive.
I ran the test 10 times using different CategoryIDs. The slowest time (cold cache) was about 5 seconds. The fastest time was 1 second.
Perhaps not as fast as simply dropping the table, but nowhere near the multi-minute deletion times you mentioned. And remember, this isn't even on a decent machine!
But we can do better...
Everything about your question implies that this data isn't related. If you don't have relations, you don't need the surrogate key, and can get rid of one of the indexes, moving the clustered index to the CategoryID column.
Now, as a rule, clustered indexes on non-unique/non-sequential columns are not a good practice. But we're just benchmarking here, so we'll do it anyway:
CREATE TABLE Stars
(
CategoryID smallint NOT NULL,
StarName varchar(200)
)
CREATE CLUSTERED INDEX IX_Stars_Category
ON Stars (CategoryID)
Run the same test data generator on this (incurring a mind-boggling number of page splits) and the same deletion took an average of just 62 milliseconds, and 190 from a cold cache (outlier). And for reference, if the index is made nonclustered (no clustered index at all) then the delete time only goes up to an average of 606 ms.
Conclusion:
If you're seeing delete times of several minutes - or even several seconds then something is very, very wrong.
Possible factors are:
Statistics aren't up to date (shouldn't be an issue here, but if it is, just run sp_updatestats);
Lack of indexing (although, curiously, removing the IX_Stars_Category index in the first example actually leads to a faster overall delete, because the clustered index scan is faster than the nonclustered index delete);
Improperly-chosen data types. If you only have millions of rows, as opposed to billions, then you do not need a bigint on the StarID. You definitely don't need it on the CategoryID - if you have fewer than 32,768 categories then you can even do with a smallint. Every byte of unnecessary data in each row adds an I/O cost.
Lock contention. Maybe the problem isn't actually delete speed at all; maybe some other script or process is holding locks on Star rows and the DELETE just sits around waiting for them to let go.
Extremely poor hardware. I was able to run this without any problems on a pretty lousy machine, but if you're running this database on a '90s-era Presario or some similar machine that's preposterously unsuitable for hosting an instance of SQL Server, and it's heavily-loaded, then you're obviously going to run into problems.
Very expensive foreign keys, triggers, constraints, or other database objects which you haven't included in your example, which might be adding a high cost. Your execution plan should clearly show this (in the optimized example above, it's just a single Clustered Index Delete).
I honestly cannot think of any other possibilities. Deletes in SQL Server just aren't that slow.
If you're able to run these benchmarks and see roughly the same performance I saw (or better), then it means the problem is with your database design and optimization strategy, not with SQL Server or the asymptotic complexity of deletions. I would suggest, as a starting point, to read a little about optimization:
SQL Server Optimization Tips (Database Journal)
SQL Server Optimization (MSDN)
Improving SQL Server Performance (MSDN)
SQL Server Query Processing Team Blog
SQL Server Performance (particularly their tips on indexes)
If this still doesn't help you, then I can offer the following additional suggestions:
Upgrade to SQL Server 2008, which gives you a myriad of compression options that can vastly improve I/O performance;
Consider pre-compressing the per-category Star data into a compact serialized list (using the BinaryWriter class in .NET), and store it in a varbinary column. This way you can have one row per category. This violates 1NF rules, but since you don't seem to be doing anything with individual Star data from within the database anyway, I doubt you'd be losing much.
Consider using a non-relational database or storage format, such as db4o or Cassandra. Instead of implementing a known database anti-pattern (the infamous "data dump"), use a tool that is actually designed for that kind of storage and access pattern.
Must you delete them? Often it is better to just set an IsDeleted bit column to 1, and then do the actual deletion asynchronously during off hours.
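For instance (a sketch; assumes an added IsDeleted bit column defaulting to 0, which queries would then filter on):
-- Cheap soft delete of a whole category
UPDATE Stars SET IsDeleted = 1 WHERE CategoryID = 50;
-- Off-hours physical cleanup, in batches
DELETE TOP (10000) FROM Stars WHERE IsDeleted = 1;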
Edit:
This is a shot in the dark, but adding a clustered index on CategoryId may speed up deletes. It may also impact other queries adversely. Is this something you can test?
Partitioned views were the old technique for this in SQL 2000 and remain a valid option for SQL 2005. The problem comes from having a large quantity of tables and the maintenance overhead associated with them.
As you say, partitioning is an enterprise feature, but is designed for this large scale data removal / rolling window effect.
One other option would be running batched deletes to avoid creating one very large transaction, instead creating hundreds of far smaller transactions, to avoid lock escalation and keep each transaction small.
Having separate tables is partitioning - you are just managing it manually and do not get any management assistance or unified access (without a view or partitioned view).
Is the cost of Enterprise Edition more expensive than the cost of separately building and maintaining a partitioning scheme?
Alternatives to the long-running delete also include populating a replacement table with identical schema and simply excluding the rows to be deleted and then swapping the table out with sp_rename.
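Roughly (a sketch only; Stars_New and Stars_Old are made-up names, and a real script would also recreate indexes and constraints on the new table before the swap):
-- Copy only the rows you are keeping into a new table
SELECT * INTO Stars_New FROM Stars WHERE CategoryID <> 50;
-- Recreate indexes on Stars_New here, then swap the names
EXEC sp_rename 'Stars', 'Stars_Old';
EXEC sp_rename 'Stars_New', 'Stars';
DROP TABLE Stars_Old;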
I don't understand why whole categories of stars are being deleted on a regular basis. Presumably you are having new categories created all the time, which means your number of categories must be huge, and partitioning on that (manually or not) would be very intensive.
Maybe on the Stars table set the PK to non-clustered and add a clustered index on categoryid.
Other than that, is the server setup well done regarding best practices for performance? That is using separate physical disks for data and logs, not using RAID5, etc.
When you say deleting millions of rows is "too intense for SQL server", what do you mean? Do you mean that the log file grows too much during the delete?
All you should have to do is execute the delete in batches of a fixed size:
DECLARE @i INT
SET @i = 1
WHILE @i > 0
BEGIN
    DELETE TOP (10000) FROM dbo.SuperBigTable
    WHERE CategoryID = 743
    SELECT @i = @@ROWCOUNT
END
If your database is in full recovery mode, you will have to run frequent transaction log backups during this process so that it can reuse the space in the log. If the database is in simple mode, you shouldn't have to do anything.
My only other recommendation is to make sure that you have an appropriate index on CategoryId. I might even recommend that it be the clustered index.
If you want to optimize for deletes by category, a clustered composite index with the category in the first position might do more good than harm.
Also, could you describe the relationships on the table?
It sounds like the transaction log is struggling with the size of the delete. The transaction log grows in units, and this takes time whilst it allocates more disk space.
It is not possible to delete rows from a table without enlisting a transaction, although it is possible to truncate a table using the TRUNCATE command. However this will remove all rows in the table without condition.
I can offer the following suggestions:
Switch to a non-transactional database or possibly flat files. It doesn't sound like you need atomicity of a transactional database.
Attempt the following. After every x deletes (depending on size) issue the following statement
BACKUP LOG YourDatabase WITH TRUNCATE_ONLY;   -- substitute your database name
This simply truncates the transaction log; the space remains for the log to refill. However, I'm not sure how much time this will add to the operation.
What do you do with the star data? If you only look at data for one category at any given time this might work, but it is hard to maintain. Every time you have a new category, you will have to build a new table. If you want to query across categories, it becomes more complex and possibly more expensive in terms of time. If you do this and do want to query across categories, a view is probably best (but do not pile views on top of views).
If you are looking for data on a particular star, would you know which table to query? If not, how are you going to determine which table, or are you going to query them all?
When entering data, how will the application decide which table to put the data into?
How many categories will there be?
And incidentally, relating to each having a separate id: use the bigint identities and combine the identity with the category type for your unique identifier.
Truly do you need to delete the whole category or only the star that the data changed for?
And do you need to delete at all, maybe you only need to update information.
Have you tried deleting in batches (1000 records or so at a time in a loop)? This is often much faster than deleting a million records in one delete statement. It often keeps the table from getting locked during the delete as well.
Another technique is to mark the records for deletion. Then you can run a batch process when usage is low to delete those records, and your queries can run on a view that excludes the records marked for deletion.
Given your answers, I think your proposal may be reasonable.
I know this is a bit of a tangent, but is SQL Server (or any relational database) really a good tool for this job? What relation database features are you actually using?
If you are dropping whole categories at a time, you can't have much referential integrity depending on it. The data is read only, so you don't need ACID for data updates.
Sounds to me like you are using basic SELECT query features?
Just taking your idea of many tables, here is how you could realise it...
What about using dynamic queries.
Create a table of categories that has an identity category_id column.
Create a trigger on insert for this table: in it, create the table for stars with a name dynamically built from category_id (see the sketch below).
Create a trigger on delete: in it, drop the corresponding stars table, also with dynamically created SQL.
To select stars of a concrete category you can use a function that returns a table. It will take category_id as a parameter and return the result, also through a dynamic query.
To insert stars of a new category, you first insert a new row into the categories table and then insert the stars into the appropriate table.
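A rough dynamic-SQL sketch of that insert trigger (purely illustrative; the trigger and table names are made up, and it assumes categories are inserted one row at a time):
CREATE TRIGGER trg_Categories_Insert ON Categories
AFTER INSERT
AS
BEGIN
    -- Build and run a CREATE TABLE for the newly inserted category
    DECLARE @category_id bigint, @sql nvarchar(max);
    SELECT @category_id = category_id FROM inserted;

    SET @sql = N'CREATE TABLE dbo.Stars_' + CAST(@category_id AS nvarchar(20)) +
               N' (starId bigint NOT NULL, starname varchar(200));';
    EXEC sp_executesql @sql;
END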
Another direction I would research is using an xml-typed column for storing the stars data. The main idea is that if you only need to operate on stars by category, then why not store all stars of a given category in one cell of the table in XML format. Unfortunately, I absolutely cannot imagine what the performance of such a decision would be.
Both of these variants are just brainstorming ideas.
As Cade pointed out, adding a table for each category is manually partitioning the data, without the benefits of the unified access.
There will never be any deletions for millions of rows that happen as fast as dropping a table, without the use of partitions.
Therefore, it seems like using a separate table for each category may be a valid solution. However, since you've stated that some of these categories are kept, and some are deleted, here is a solution:
Create a new stars table for each new category.
Wait for the time period to expire where you decide whether the stars for the category are kept or not.
Roll the records into the main stars table if you plan on keeping them.
Drop the table.
This way, you will have a finite number of tables, depending on the rate you add categories and the time period where you decide if you want them or not.
Ultimately, for the categories that you keep, you're doubling the work, but the extra work is distributed over time. Inserts to the end of the clustered index may be experienced less by the users than deletes from the middle. However, for those categories that you're not keeping, you're saving tons of time.
Even if you're not technically saving work, perception is often the bigger issue.
I didn't get an answer to my comment on the original post, so I am going under some assumptions...
Here's my idea: use multiple databases, one for each category.
You can use the managed ESE database that ships with every version of Windows, for free.
Use the PersistentDictionary object, and keep track of the starid, starname pairs that way. If you need to delete a category, just delete the PersistentDictionary object for that category.
PersistentDictionary<int, string> starsForCategory = new PersistentDictionary<int, string>("Category1");
This will create a database called "Category1", on which you can use standard .NET dictionary methods (add, exists, foreach, etc).

What's the best way to update data in a table while it's in use without locking the table?

I have a table in a SQL Server 2005 Database that is used a lot. It has our product on hand availability information. We get updates every hour from our warehouse and for the past few years we've been running a routine that truncates the table and updates the information. This only takes a few seconds and has not been a problem, until now. We have a lot more people using our systems that query this information now, and as a result we're seeing a lot of timeouts due to blocking processes.
... so ...
We researched our options and have come up with an idea to mitigate the problem.
We would have two tables. Table A (active) and table B (inactive).
We would create a view that points to the active table (table A).
All things needing this tables information (4 objects) would now have to go through the view.
The hourly routine would truncate the inactive table, update it with the latest information then update the view to point at the inactive table, making it the active one.
This routine would determine which table is active and basically switch the view between them.
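The switch itself would just re-point the view, something like this (a sketch; AvailabilityView, Table_A, and Table_B are placeholder names):
ALTER VIEW dbo.AvailabilityView
AS
SELECT * FROM dbo.Table_B;   -- previously selected from dbo.Table_A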
What's wrong with this? Will switching the view mid query cause problems? Can this work?
Thank you for your expertise.
Extra Information
the routine is an SSIS package that performs many steps and eventually truncates/updates the table in question
The blocking processes are two other stored procedures that query this table.
Have you considered using snapshot isolation? It would allow you to begin a big fat transaction for your SSIS stuff and still read from the table.
This solution seems much cleaner than switching the tables.
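Enabling it is a database-level setting, roughly like this (a sketch; substitute your own database name):
ALTER DATABASE YourDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;
-- or, to have existing readers use row versioning without code changes:
ALTER DATABASE YourDatabase SET READ_COMMITTED_SNAPSHOT ON;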
I think this is going about it the wrong way - updating a table has to lock it, although you can limit that locking to per page or even per row.
I'd look at not truncating the table and refilling it. That's always going to interfere with users trying to read it.
If you did update rather than replace the table you could control this the other way - the reading users shouldn't block the table and may be able to get away with optimistic reads.
Try adding the with(nolock) hint to the reading SQL View statement. You should be able to get very large volumes of users reading even with the table being regularly updated.
Personally, if you are always going to be introducing down time to run a batch process against the table, I think you should manage the user experience at the business/data access layer. Introduce a table management object that monitors connections to that table and controls the batch processing.
When new batch data is ready, the management object stops all new query requests (maybe even queueing?), allows the existing queries to complete, runs the batch, then reopens the table for queries. The management object can raise an event (BatchProcessingEvent) that the UI layer can interpret to let people know that the table is currently unavailable.
My $0.02,
Nate
Just read that you are using SSIS.
You could use the TableDifference component from: http://www.sqlbi.eu/Home/tabid/36/ctl/Details/mid/374/ItemID/0/Default.aspx
This way you can apply the changes to the table one by one, but of course this will be much slower and, depending on the table size, will require more RAM on the server; however, the locking problem will be completely avoided.
Why not use transactions to update the information rather than a truncate operation?
Truncate is non logged so it cannot be done in a transaction.
If your operation is done in a transaction then existing users will not be affected.
How this is done would depend on things like the size of the table and how radically the data changes. If you give more detail perhaps I could advise further.
One possible solution would be to minimize the time needed to update the table.
I would first create a staging table to download the data from the warehouse.
If you have to do "inserts, updates and deletes" in the final table:
Let's suppose the final table looks like this:
Table Products:
ProductId int
QuantityOnHand Int
And you need to update QuantityOnHand from the warehouse.
First create a staging table like:
Table Products_WareHouse
ProductId int
QuantityOnHand Int
And then create an "Actions" table like this:
Table Products_Actions
ProductId int
QuantityOnHand Int
Action Char(1)
The update process should then be something like this:
1. Truncate table Products_WareHouse
2. Truncate table Products_Actions
3. Fill the Products_WareHouse table with the data from the warehouse
4. Fill the Products_Actions table with this:
Inserts:
INSERT INTO Products_Actions (ProductId, QuantityOnHand, Action)
SELECT SRC.ProductId, SRC.QuantityOnHand, 'I' AS Action
FROM Products_WareHouse AS SRC LEFT OUTER JOIN
     Products AS DEST ON SRC.ProductId = DEST.ProductId
WHERE (DEST.ProductId IS NULL)
Deletes
INSERT INTO Products_Actions (ProductId, QuantityOnHand, Action)
SELECT DEST.ProductId, DEST.QuantityOnHand, 'D' AS Action
FROM Products_WareHouse AS SRC RIGHT OUTER JOIN
     Products AS DEST ON SRC.ProductId = DEST.ProductId
WHERE (SRC.ProductId IS NULL)
Updates
INSERT INTO Products_Actions (ProductId, QuantityOnHand, Action)
SELECT SRC.ProductId, SRC.QuantityOnHand, 'U' AS Action
FROM Products_WareHouse AS SRC INNER JOIN
     Products AS DEST ON SRC.ProductId = DEST.ProductId AND SRC.QuantityOnHand <> DEST.QuantityOnHand
Until now you haven't locked the final table.
5. In a transaction, update the final table:
BEGIN TRAN

DELETE Products FROM Products INNER JOIN
       Products_Actions ON Products.ProductId = Products_Actions.ProductId
WHERE (Products_Actions.Action = 'D')

INSERT INTO Products (ProductId, QuantityOnHand)
SELECT ProductId, QuantityOnHand FROM Products_Actions WHERE Action = 'I';

UPDATE Products SET QuantityOnHand = SRC.QuantityOnHand
FROM Products INNER JOIN
     Products_Actions AS SRC ON Products.ProductId = SRC.ProductId
WHERE (SRC.Action = 'U')

COMMIT TRAN
With the process above, you reduce the number of records to be updated to the minimum necessary, and therefore the time the final table will be locked while updating.
You can even skip the transaction in the final step, so the table is released between commands.
If you have the Enterprise Edition of SQL Server at your disposal then may I suggest that you use SQL Server Partitioning technology.
You could have your currently required data reside within the 'Live' partition and the updated version of the data in the 'Secondary' partition (which is not available for querying but rather for administering data).
Once the data has been imported into the 'Secondary' partition you can instantly SWITCH the 'LIVE' partition OUT and the 'Secondary' partition IN, thereby incurring zero downtime and no blocking.
Once you have made the switch, you can go about truncating the no longer needed data without adversely affecting users of the newly live data (previously the Secondary partition).
Each time you need to do an import job, you simply repeat/reverse the process.
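At its core the swap is a metadata-only operation, roughly like this (a sketch; the table and partition names are made up, and the real setup also needs a partition function/scheme and aligned indexes):
-- Swap the stale partition out to an empty archive table ...
ALTER TABLE dbo.Availability SWITCH PARTITION 1 TO dbo.Availability_Old;
-- ... then swap the freshly loaded staging table into the now-empty partition
ALTER TABLE dbo.Availability_Staging SWITCH TO dbo.Availability PARTITION 1;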
To learn more about SQL Server Partitioning see:
http://msdn.microsoft.com/en-us/library/ms345146(SQL.90).aspx
Or you can just ask me :-)
EDIT:
On a side note, in order to address any blocking issues, you could use SQL Server Row Versioning technology.
http://msdn.microsoft.com/en-us/library/ms345124(SQL.90).aspx
We do this on our high-usage systems and haven't had any problems. However, as with all things database, the only way to be sure it would help is to make the change in dev and then load test it. Not knowing what else your SSIS package does, it may still cause blocks.
If the table is not very large you could cache the data in your application for a short time. It may not eliminate blocking altogether, but it would reduce the chances that the table would be queried when an update occurs.
Perhaps it would make sense to do some analysis of the processes which are blocking, since they seem to be the part of your landscape which has changed. It only takes one poorly written query to create the blocks you are seeing. Barring a poorly written query, maybe the table needs one or more covering indexes to speed up those queries and get you back on your way without having to re-engineer your already working code.
Hope this helps,
Bill