I have some database tables that contain aggregated data. Their records (a few thousand per table) are recomputed periodically by an external .NET app, so the old data has to be deleted and the new data inserted periodically. Update is not an option in this case.
Between the delete and the insert there is an intermediate window in which the data is inconsistent (the old rows are already deleted, the new ones are not in the table yet), so a SELECT issued during that window returns incorrect results.
I use SubSonic SimpleRepository to handle database access.
What is the best practice / pattern to work around or handle this state?
Three options come to my mind:
Create a transaction that locks out reads until it is done. This only works if the process is relatively fast. A few thousand records shouldn't be too bad if you transact/lock one table at a time -- if you lock for the whole process, that could be costly! But if the data is related, that is what you'd have to do.
Write to temporary versions of the table, then drop old tables and rename temp tables.
Same as above, except bulk copy from the temp tables (not necessarily SQL temporary tables; ancillary holding tables would suffice) into the correct tables, first deleting from the main table. You'd still want to use a transaction for this; see the sketch below this list.
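A minimal T-SQL sketch of that last option, assuming a live table dbo.Aggregates and a holding table dbo.Aggregates_Staging that the .NET app fills first (table and column names are illustrative):
-- The .NET app has already loaded the freshly computed rows into dbo.Aggregates_Staging.
-- Readers are only blocked for the duration of this short swap, not during recomputation.
BEGIN TRANSACTION;
    DELETE FROM dbo.Aggregates;
    INSERT INTO dbo.Aggregates (CustomerId, Total, ComputedAt)
    SELECT CustomerId, Total, ComputedAt
    FROM dbo.Aggregates_Staging;
COMMIT TRANSACTION;
Because the delete and insert commit together, a SELECT sees either the complete old data or the complete new data, never the empty in-between state (assuming readers aren't using NOLOCK).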
Our app has a few very large tables in SQL Server. 500 million rows and 1 billion rows in two tables that we'd like to clean up to reclaim some disk space.
In our testing environment, I tried running chunked deletes in a loop but I don't think this is a feasible solution in prod.
So the other alternative is to select/insert the data we want to keep into a temp table, truncate/drop the old table, and then:
recreate indexes
recreate foreign key constraints
recreate table permissions
rename the temp table back to the original table name
My question is, am I missing anything from my list? Are there any other objects / structures that we will lose which we need to re-create or restore? It would be a disastrous situation if something went wrong, so I am playing it extremely safe.
Resizing the DB / adding more space is not a possible solution. Our SQL Server app is near end of life and is being decommissioned, so we are just keeping the lights on until then.
While you are doing this operation, will new records be added to the original table? I mean, will the app that writes to this table still be live? If that is the case, maybe it would be better to change the order of the steps, like this:
First, rename the original table to the temp name.
Create a new table with the original name so that new records can be added from the writing app.
In parallel, you can move the data you want to keep from the temp table to the new original table (see the sketch below).
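A rough sketch of that reordering, assuming a hypothetical dbo.BigTable with a date column used to decide which rows to keep (all names, columns, and the cut-off date are illustrative):
-- 1) Swap the live table out and put an empty one in its place so the writing app keeps working.
BEGIN TRANSACTION;
    EXEC sp_rename 'dbo.BigTable', 'BigTable_Temp';
    -- Recreate the table under the original name; script the real columns,
    -- indexes, constraints, and permissions from the original definition.
    CREATE TABLE dbo.BigTable
    (
        Id        INT          NOT NULL PRIMARY KEY,
        Payload   VARCHAR(100) NULL,
        CreatedAt DATETIME     NOT NULL
    );
COMMIT TRANSACTION;
-- 2) In parallel / afterwards, copy back only the rows you want to keep.
INSERT INTO dbo.BigTable (Id, Payload, CreatedAt)
SELECT Id, Payload, CreatedAt
FROM dbo.BigTable_Temp
WHERE CreatedAt >= '20230101';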
Let's say I have a database with many tables in it. I want to perform data archiving on certain tables, that is, create a new table with the same structure (same constraints, indexes, columns, triggers, etc.) and insert specific data into the new table from the old table.
For example, the current table has data from 2008-2017 and I want to move only the data from 2010-2017 into the new table. After that, I can drop the old table and rename the new table following a naming convention similar to the old table's.
How should I approach this?
For the sort of clone-rename-drop logic you're talking about, the basics are pretty straightforward. Really the only time this is a good idea is if you have a table with a large amount of data, which you can't afford downtime or blocking on, and you only plan to do this once. The process looks something like this (a T-SQL sketch of the rename step follows the list):
Insert all the data from your original table into the clone table
In a single transaction, sp_rename the original table from (for example) myTable to myTable_OLD (just something to distinguish it from the real table). Then sp_rename the clone table from (for example) myTable_CLONE to myTable
Drop myTable_OLD when you're happy everything has worked how you want. If it didn't work how you want, just sp_rename the objects back.
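In T-SQL, steps 2 and 3 might look like this sketch, using the example names from the list above (the clone has already been loaded in step 1):
-- Step 2: swap the names in a single transaction.
BEGIN TRANSACTION;
    EXEC sp_rename 'dbo.myTable', 'myTable_OLD';
    EXEC sp_rename 'dbo.myTable_CLONE', 'myTable';
COMMIT TRANSACTION;
-- Step 3: once you're happy the swap worked, drop the old copy.
-- DROP TABLE dbo.myTable_OLD;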
A couple of considerations to think about if you go that route:
Identity columns: if your table has any identity columns on it, you'll have to SET IDENTITY_INSERT ON for the copy, then reseed the identity so it picks up where the old identity left off (see the sketch after these considerations).
Do you have the luxury of blocking the table while you do this? Generally if you need to do this sort of thing, the answer is no. What I find works well is to insert all the rows I need using (nolock), or however you need to do it so the impact of the select from the original table is mitigated. Then, after I've moved 99% of the data, I will then open a transaction, block the original table, insert just the new data that's come in since the bulk of the data movement, then do the sp_rename stuff
That way you don't lock anything for the bulk of the data movement, and you only block the table for the very last bit of data that came into the original table between your original insert and your sp_rename
How you determine what's come in "since you started" will depend on how your table is structured. If you have an identity or a datestamp column, you can probably just pick rows which came in after the max of those fields you moved over. If your table does NOT have something you can easily hook into, you might need to get creative.
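A sketch of that final catch-up pass, including the identity handling mentioned above; it assumes myTable has an identity column Id, that the clone was bulk loaded earlier, and that rows are only ever appended (column names are illustrative):
BEGIN TRANSACTION;
    -- Block the original table only for this last delta, not for the bulk of the copy.
    SET IDENTITY_INSERT dbo.myTable_CLONE ON;
    INSERT INTO dbo.myTable_CLONE (Id, SomeColumn, CreatedAt)
    SELECT Id, SomeColumn, CreatedAt
    FROM dbo.myTable WITH (TABLOCKX, HOLDLOCK)
    WHERE Id > (SELECT MAX(Id) FROM dbo.myTable_CLONE);
    SET IDENTITY_INSERT dbo.myTable_CLONE OFF;
    -- Make sure new inserts continue past the highest copied identity value.
    DBCC CHECKIDENT ('dbo.myTable_CLONE', RESEED);
    EXEC sp_rename 'dbo.myTable', 'myTable_OLD';
    EXEC sp_rename 'dbo.myTable_CLONE', 'myTable';
COMMIT TRANSACTION;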
Alternatives
A couple other alternatives that came to mind:
Table Partitioning:
This shards a single table across multiple partitions (which can be managed sort of like individual tables). You can, say, partition your data by year, then when you want to purge the trailing year of data, you "switch out" that partition to a special table which you can then truncate. All those operations are metadata-only, so they're super fast. This also works really well for huge amounts of data where deletes and all their pesky transaction logging aren't feasible.
The downside to table partitioning is it's kind of a pain to set up and manage.
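A very rough sketch of the switch-out pattern, assuming a hypothetical fact table partitioned by year on an OrderDate column (boundaries and names are illustrative, and a real setup needs matching filegroups, indexes, and an identically structured holding table):
-- One partition per year; ALL TO ([PRIMARY]) keeps the sketch simple.
CREATE PARTITION FUNCTION pfOrderYear (DATETIME)
    AS RANGE RIGHT FOR VALUES ('20150101', '20160101', '20170101');
CREATE PARTITION SCHEME psOrderYear
    AS PARTITION pfOrderYear ALL TO ([PRIMARY]);
-- dbo.Orders would be created ON psOrderYear(OrderDate);
-- dbo.Orders_Purge must have the identical structure, on the same filegroup.
-- Purging the oldest year is then a metadata-only switch followed by a truncate:
ALTER TABLE dbo.Orders SWITCH PARTITION 1 TO dbo.Orders_Purge;
TRUNCATE TABLE dbo.Orders_Purge;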
Batched Deletes:
If your data isn't too big, you could just do batched deletes on the trailing end of your data. If you can find a way to get clustered index seeks for your deletes, they should be reasonably lightweight. As long as you're not accumulating data faster than you can get rid of it, the benefit of this kind of thing is that you just run it semi-continuously and it nibbles away at the trailing end of your data.
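For instance, a nibbling delete might look like this sketch (table name, key, and retention window are illustrative):
-- Delete the trailing end of the data in small batches so each delete stays cheap.
-- Works best when CreatedAt is (the leading column of) the clustered index.
WHILE 1 = 1
BEGIN
    DELETE TOP (5000)
    FROM dbo.Orders
    WHERE CreatedAt < DATEADD(YEAR, -2, GETDATE());
    IF @@ROWCOUNT = 0
        BREAK;   -- nothing left in the trailing range
END;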
Snapshot Isolation:
If deletes cause too much blocking, you can also set up something like snapshot isolation, which basically stores historical versions of rows in tempdb. Once read committed snapshot is enabled for the database, queries will read those pre-change rows instead of contending for locks on the "real" table. You can then do batched deletes to your heart's content and know that any queries that hit the table will never get blocked by a delete (or any other DML operation): they'll either read the pre-delete snapshot or the post-delete snapshot, and they won't wait for an in-flight delete to figure out whether it's going to commit or roll back. This is not without its drawbacks, unfortunately. For large data sets it can put a big burden on tempdb, it can be a bit of a black box, and it's also going to require buy-in from your DBAs.
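Enabling it is a database-level setting; a minimal sketch (the database name is illustrative, and both statements need to be coordinated with your DBAs):
-- Allow explicit SNAPSHOT transactions...
ALTER DATABASE MyDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;
-- ...and make plain READ COMMITTED readers use row versions from tempdb
-- instead of waiting on locks held by the batched deletes.
ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;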
We have an internal software based on an SQL-server DB with a master table and multiple joined tables. The nature of data we store is quite difficult to describe, but suppose we have a customers table with some joined tables: orders, shipments, phone-logs, complaints, etc.
We need to sync this software with an external one that has its own DB (with the very same structure) and produces an XML file with updated information about our "customers" (one file per customer). Updates may be in the master table and/or in 0 to n joined tables.
To import these files, one option is to query all the involved tables and compare them with the XML file, possibly adding-updating-deleting rows.
This would require a lot of coding.
Another option is to completely delete all data for the given customer (at least from the joined tables) and insert them again.
This would not be so efficient.
Please consider that the master table has 13 fields and there are about 6 tables with 3 to 15 fields.
In this app, we mainly use LINQ.
How would you proceed?
PS: I noticed a few answers on this subject here on StackOverflow, but almost all concern (single rows in) single tables.
For a scenario with a lot of joins and lots of rows, I prefer to update and use logical deletes. For example, I have millions of customers, and it happens that I have dozens of tables with millions of rows whose FKs point to the customer ID. Trying to delete a customer can take several minutes.
For your particular scenario I would use a flag in each pertinent table to tell me: this row was already synchronized, the row was inserted and is pending export, the row is pending a delete, or the row was exported to XML in the past but has since been updated.
For exports:
The flag makes it easy to query just the rows pending insert, update, or delete, and ignore the rows that are up to date.
For imports:
If the other system doesn't have this facility, there's a little trick you can do: add an "external ID" column so you can quickly search your database and identify the rows that originated from that external source.
Even with this trick, it can be a pain to find out whether only a phone number was updated in that large table. For those extreme cases you can use a hash computed column to quickly tell whether two rows differ, and then update the entire row (or at least the common columns).
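A sketch of the two ideas combined, assuming a Customers table with a handful of synced columns (all names are illustrative):
-- Key from the external system, so imported rows can be matched quickly.
ALTER TABLE dbo.Customers ADD ExternalId VARCHAR(50) NULL;
CREATE INDEX IX_Customers_ExternalId ON dbo.Customers (ExternalId);
-- Persisted hash over the synced columns: compare one value instead of every field.
ALTER TABLE dbo.Customers ADD RowHash AS
    HASHBYTES('SHA1',
        ISNULL(Name, '') + '|' + ISNULL(Phone, '') + '|' + ISNULL(Email, '')) PERSISTED;
During import you can then join on ExternalId and only update rows whose RowHash differs from the hash computed over the incoming XML values.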
An idea (assuming you do this on the database server side):
Build tables out of the customer XML. They can be temporary tables or in-memory tables.
Create SELECT queries to find new, updated, and deleted data. These SELECT queries would join the tables in your database with the tables built from the customer XML. The output of the joins would tell you whether you have new records, updated records, deleted records, or a mix of them.
Run the inserts, updates, and deletes accordingly (sketched below).
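A sketch of the diff queries for one joined table, assuming the XML for a single customer has been loaded into a temp table #XmlPhoneLogs (all names are illustrative):
-- New rows: present in the XML, missing from the database.
INSERT INTO dbo.PhoneLogs (CustomerId, PhoneNumber, CalledAt)
SELECT x.CustomerId, x.PhoneNumber, x.CalledAt
FROM #XmlPhoneLogs x
LEFT JOIN dbo.PhoneLogs p
    ON p.CustomerId = x.CustomerId AND p.CalledAt = x.CalledAt
WHERE p.CustomerId IS NULL;
-- Deleted rows: present in the database, no longer in the XML for this customer.
DELETE p
FROM dbo.PhoneLogs p
LEFT JOIN #XmlPhoneLogs x
    ON x.CustomerId = p.CustomerId AND x.CalledAt = p.CalledAt
WHERE x.CustomerId IS NULL
  AND p.CustomerId IN (SELECT CustomerId FROM #XmlPhoneLogs);
-- Updated rows would be a join on the keys with a WHERE on the changed columns (or a hash).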
First off let me say I am running on SQL Server 2005 so I don't have access to MERGE.
I have a table with ~150k rows that I am updating daily from a text file. As rows fall out of the text file I need to delete them from the database and if they change or are new I need to update/insert accordingly.
After some testing I've found that, performance-wise, it is dramatically faster to do a full delete and then a bulk insert from the text file, rather than reading through the file line by line doing an update/insert. However, I recently came across some posts discussing mimicking the MERGE functionality of SQL Server 2008 using a temp table and the OUTPUT clause of the UPDATE statement.
I was interested in this because I am looking into how I can eliminate the window in my delete / bulk insert method during which the table has no rows. I still think this method will be the fastest, so I am looking for the best way to solve the empty-table problem.
Thanks
I think your fastest method would be to:
Drop all foreign keys and indexes from your table.
Truncate your table.
Bulk insert your data.
Recreate your foreign keys and indexes.
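A sketch of that sequence, with illustrative object names and file path:
-- 1) Drop the foreign key and index so the load isn't slowed down by index maintenance.
ALTER TABLE dbo.DailyData DROP CONSTRAINT FK_DailyData_Customers;
DROP INDEX IX_DailyData_CustomerId ON dbo.DailyData;
-- 2) Truncate.
TRUNCATE TABLE dbo.DailyData;
-- 3) Bulk insert the ~150k rows from the text file.
BULK INSERT dbo.DailyData
FROM N'C:\datafile.txt'
WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n');
-- 4) Recreate the index and foreign key.
CREATE INDEX IX_DailyData_CustomerId ON dbo.DailyData (CustomerId);
ALTER TABLE dbo.DailyData ADD CONSTRAINT FK_DailyData_Customers
    FOREIGN KEY (CustomerId) REFERENCES dbo.Customers (Id);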
Is the problem that Joe's solution is not fast enough, or that you cannot have any activity against the target table while your process runs? If you just need to prevent users from running queries against your target table, you should contain your process within a transaction block. This way, when your TRUNCATE TABLE executes, it will take a table lock that will be held for the duration of the transaction, like so:
begin tran;
truncate table stage_table
bulk insert stage_table
from N'C:\datafile.txt'
commit tran;
An alternative solution, which would satisfy your requirement of not having "down time" for the table you are updating.
It sounds like originally you were reading the file and doing an INSERT/UPDATE/DELETE one row at a time. A more performant approach, one that does not involve clearing down the table, is as follows:
1) bulk load the file into a new, separate table (no indexes)
2) then create the PK on it
3) Run 3 statements to update the original table from this new (temporary) table:
DELETE rows in the main table that don't exist in the new table
UPDATE rows in the main table where there is a matching row in the new table
INSERT rows into main table from the new table where they don't already exist
This will perform better than row-by-row operations and should hopefully satisfy your overall requirements; a sketch of the three statements follows.
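A sketch of those three statements on SQL Server 2005, assuming the file was bulk loaded into dbo.Staging and both tables share a key column Id (table and column names are illustrative):
-- 1) Delete rows that fell out of the file.
DELETE m
FROM dbo.MainTable m
LEFT JOIN dbo.Staging s ON s.Id = m.Id
WHERE s.Id IS NULL;
-- 2) Update rows that still exist but have changed.
UPDATE m
SET m.Value = s.Value
FROM dbo.MainTable m
JOIN dbo.Staging s ON s.Id = m.Id
WHERE m.Value <> s.Value;
-- 3) Insert brand new rows.
INSERT INTO dbo.MainTable (Id, Value)
SELECT s.Id, s.Value
FROM dbo.Staging s
LEFT JOIN dbo.MainTable m ON m.Id = s.Id
WHERE m.Id IS NULL;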
There is a way to update the table with zero downtime: keep two days' data in the table, and delete the old rows after loading the new ones!
Add a DataDate column representing the date for which your ~150K rows are valid.
Create a one-row, one-column table with "today's" DataDate.
Create a view of the two tables that selects only rows matching the row in the DataDate table. Index it if you like. Readers will now refer to this view, not the table.
Bulk insert the rows. (You'll obviously need to add the DataDate to each row.)
Update the DataDate table. The view updates instantly!
Delete yesterday's rows at your leisure.
SELECT performance won't suffer; joining one row to 150,000 rows along the primary key should present no problem to any server less than 15 years old.
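A sketch of the moving parts, using illustrative names and a DATETIME DataDate to stay 2005-friendly:
-- One-row table holding the "current" data date.
CREATE TABLE dbo.CurrentDataDate (DataDate DATETIME NOT NULL);
INSERT INTO dbo.CurrentDataDate (DataDate) VALUES ('20240101');
GO
-- Readers use this view instead of the table.
CREATE VIEW dbo.vDailyData
AS
SELECT d.*
FROM dbo.DailyData d
JOIN dbo.CurrentDataDate c ON c.DataDate = d.DataDate;
GO
-- After bulk inserting tomorrow's rows tagged with the new DataDate:
UPDATE dbo.CurrentDataDate SET DataDate = '20240102';   -- readers flip over instantly
-- At your leisure:
DELETE FROM dbo.DailyData WHERE DataDate < '20240102';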
I have used this technique often, and have also struggled with processes that relied on sp_rename. Production processes that modify the schema are a headache. Don't.
For raw speed, I think with ~150K rows in the table, I'd just drop the table, recreate it from scratch (without indexes) and then bulk load afresh. Once the bulk load has been done, then create the indexes.
This assumes of course that having a period of time when the table is empty/doesn't exist is acceptable which it does sound like could be the case.
I have a very large table (more than 300 million records) that will need to be cleaned up. Roughly 80% of it will need to be deleted. The database software is MS SQL 2005. There are several indexes and statistics on the table but no external relationships.
The best solution I came up with, so far, is to put the database into "simple" recovery mode, copy all the records I want to keep to a temporary table, truncate the original table, set identity insert to on and copy back the data from the temp table.
It works but it's still taking several hours to complete. Is there a faster way to do this ?
As per the comments, my suggestion would be to simply dispense with the copy-back step and promote the table containing the records to be kept to become the new main table by renaming it.
It should be quite straightforward to script out the index/statistics creation to be applied to the new table before it gets swapped in.
The clustered index should be created before the non clustered indexes.
A couple of points I'm not sure about though.
Whether it would be quicker to insert into a heap and then create the clustered index afterwards. (I guess not, if the insert can be done in clustered index order.)
Whether the original table should be truncated before being dropped (I guess yes)
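A sketch of the promote-by-rename approach (the table name, keep filter, and indexes are illustrative):
-- 1) Copy only the ~20% of rows to keep into a new table (a heap at this point).
SELECT *
INTO dbo.BigTable_Keep
FROM dbo.BigTable
WHERE CreatedAt >= '20090101';
-- 2) Build the clustered index first, then the nonclustered indexes and statistics.
CREATE UNIQUE CLUSTERED INDEX CIX_BigTable_Id ON dbo.BigTable_Keep (Id);
CREATE INDEX IX_BigTable_CreatedAt ON dbo.BigTable_Keep (CreatedAt);
-- 3) Swap the names, then drop (or keep around for a while) the old table.
BEGIN TRANSACTION;
    EXEC sp_rename 'dbo.BigTable', 'BigTable_Old';
    EXEC sp_rename 'dbo.BigTable_Keep', 'BigTable';
COMMIT TRANSACTION;
DROP TABLE dbo.BigTable_Old;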
#uriDium -- chunking using batches of 50,000 will escalate to a table lock, unless you have disabled lock escalation via ALTER TABLE (SQL Server 2008) or other various locking tricks.
I am not sure what the structure of your data is. When does a row become eligible for deletion? If eligibility is purely ID-based or date-based, then you can create a new table for each day, insert your new data into the new tables, and when it comes to cleaning simply drop the required tables. Then for any selects, construct a view over all the tables. Just an idea.
EDIT: (In response to comments)
If you are maintaining a view over all the tables, then no, it won't be complicated at all. The complex part is coding the dropping and recreating of the view.
I am assuming that you don't want your data to be locked down too much during deletes. Why not chunk the delete operations? Create a stored procedure that will delete the data in chunks, 50,000 rows at a time. This should make sure that SQL Server keeps row locks instead of a table lock. Use
WAITFOR DELAY 'x'
in your WHILE loop so that you can give other queries a bit of breathing room (see the sketch below). Your problem is the age-old computer science trade-off: space vs. time.
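A sketch of that chunked delete, with an illustrative table name and eligibility filter:
-- Delete in chunks inside a WHILE loop, pausing between batches.
DECLARE @rows INT;
SET @rows = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (50000)
    FROM dbo.BigTable
    WHERE CreatedAt < '20080101';   -- whatever marks rows as eligible for deletion
    SET @rows = @@ROWCOUNT;
    WAITFOR DELAY '00:00:05';       -- give other queries a bit of breathing room
END;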