Drop/Rebuild indexes during Bulk Insert - sql

I have got tables which has got more than 70 million records in it; what I just found that developers were dropping indexes before bulk insert and then creating again after the bulk insert is over. Execution time for the stored procedure is nearly 30 mins (do drop index, do bulk insert, then recreate index from scratch
Advice: Is this a good practice to drop INDEXs from table which has more than 70+ millions records and increasing by 3-4 million everyday.
Would it be help to improve performance by not dropping index before bulk insert ?
What is the best practice to be followed while doing BULK insert in BIG TABLE.
Thanks and Regards

Like everything in SQL Server, "It Depends"
There is overhead in maintaining indexes during the insert and there is overhead in rebuilding the indexes after the insert. The only way to definitively determine which method incurs less overhead is to try them both and benchmark them.
If I were a betting man I would put my wager that leaving the indexes in place would edge out the full rebuild but I don't have the full picture to make an educated guess. Again, the only way to know for sure is to try both options.
One key optimization is to make sure your bulk insert is in clustered key order.
If I'm reading your question correctly, that table is pretty much off limits (locked) for the duration of the load and this is a problem.
If your primary goal is to increase availability/decrease blocking, try taking the A/B table approach.
The A/B approach breaks down as follows:
Given a table called "MyTable" you would actually have two physical tables (MyTable_A and MyTable_B) and one view (MyTable).
If MyTable_A contains the current "active" dataset, your view (MyTable) is selecting all columns from MyTable_A. Meanwhile you can have carte blanche on MyTable_B (which contains a copy of MyTable_A's data and the new data you're writing.) Once MyTable_B is loaded, indexed and ready to go, update your "MyTable" view to point to MyTable_B and truncate MyTable_A.
This approach assumes that you're willing to increase I/O and storage costs (dramatically, in your case) to maintain availability. It also assumes that your big table is also relatively static. If you do follow this approach, I would recommend a second view, something like MyTable_old which points to the non-live table (i.e. if MyTable_A is the current presentation table and is referenced by the MyTable view, MyTable_old will reference MyTable_B) You would update the MyTable_old view at the same time you update the MyTable view.
Depending on the nature of the data you're inserting (and your SQL Server version/edition), you may also be able to take advantage of partitioning (MSDN blog on this topic.)

Related

Does it affect performance to frequently repopulate a highly read database table?

I have a database table with about 2500 rows in, which is frequently read by my web application. Will it affect the performance of reading from that table if all of the data in it is frequently (e.g. every 1-5 minutes) deleted and re-inserted?
By that I mean:
DELETE FROM MyTable
INSERT INTO MyTable SELECT ...
Probably not, at the given numbers ...
However, if you have one or more index(es) on your table (to help with read/select, or automatically on any PK/UK ...) you should consider that every delete/insert may result in re-calculation of any such index (on top of the delete/insert as such), not directly affecting table-reads as such, but adding to the overall load on the DB server.
There is no sourcecode, but it appears you are using this table as intermediate/interface to sth. else, so while 'updating' you'd probably want to make sure to bundle your delete(s)/insert(s) in transactions, best you can, rather than e.g. executing them all individually, like in a loop. Or see if you can keep your PKs and rather just update ...?
This could also help reduce fragmentation in the underlying storage ...

How to perform data archive in SQL Server with many tables?

Let's say I have a database with many tables in it. I want to perform data archiving on certain tables, that is create a same table with same structures (same constraint, indexes, columns, triggers, etc) as a new table and insert specific data into the new table from the old table.
Example, current table has data from 2008-2017 and I want to move only data from 2010-2017 into the new table. Then after that, I can delete the old table and rename the new table with naming conventions similar to old table.
How should I approach this?
For the sort of clone-rename-drop logic you're talking about, the basics are pretty straight forward. Really the only time this is a good idea is if you have a table with a large amount of data, which you can't afford down time or blocking on, and you only plan to do this one. The process looks something like this:
Insert all the data from your original table into the clone table
In a single transaction, sp_rename the original table from (for example) myTable to myTable_OLD (just something to distinguish it from the real table). Then sp_rename the clone table from (for example) myTable_CLONE to myTable
Drop myTable_OLD when you're happy everything has worked how you want. If it didn't work how you want, just sp_rename the objects back.
Couple considerations to think about if you go that route
Identity columns: If your table has any identities on it, you'll have to use identity_insert on then reseed the identity to pick up at where the old identity left off
Do you have the luxury of blocking the table while you do this? Generally if you need to do this sort of thing, the answer is no. What I find works well is to insert all the rows I need using (nolock), or however you need to do it so the impact of the select from the original table is mitigated. Then, after I've moved 99% of the data, I will then open a transaction, block the original table, insert just the new data that's come in since the bulk of the data movement, then do the sp_rename stuff
That way you don't lock anything for the bulk of the data movement, and you only block the table for the very last bit of data that came into the original table between your original insert and your sp_rename
How you determine what's come in "since you started" will depend on how your table is structured. If you have an identity or a datestamp column, you can probably just pick rows which came in after the max of those fields you moved over. If your table does NOT have something you can easily hook into, you might need to get creative.
Alternatives
A couple other alternatives that came to mind:
Table Partitioning:
This shards a single table across multiple partitions (which can be managed sort of like individual tables). You can, say, partition you data by year, then when you want to purge the trailing year of data, you "switch out" that partition to a special table which you can then truncate. All those operations are meta-data only, so they're super fast. This also works really well for huge amounts of data where deletes and all their pesky transaction logging aren't feasible
The downside to table partitioning is it's kind of a pain to set up and manage.
Batched Deletes:
If you're data isn't too big, you could just do batched deletes on the trailing end of your data. If you can find a way to get clustered index seeks for your deletes, they should be reasonably lightweight. As long as you're not accumulating data faster than you can get rid of it, the benefit of this kind of thing is you just run it semi-continuously and it just nibbles away at the trailing end of your data
Snapshot Isolation:
If deletes cause too much blocking, you can also set up something like snapshot isolation, which basically stores historical versions of rows in tempdb. Any query which sets isolation level read committed snapshot will then read those pre-change rows instead of contend for locks on the "real" table. You can then do batched deletes to your hearts content and know that any queries that hit the table will never get blocked by a delete (or any other DML operation) because they'll either read the pre-delete snapshot, or they'll read the post-delete snapshot. They won't wait for an in-process delete to figure out whether it's going to commit or rollback. This is not without its drawbacks as well unfortunately. For large data sets, it can put a big burden on tempdb and it too can be a little bit of a black box. It's also going to require buy-in from your DBAs.

SQL insert caching

I have a qustion regarding an sql insert. The problem is that if you have a big table with a lot of indexes and a lot of inserts then the inserting of the data is slow. Is it a good approach if I have table A without any indexes and table B with indexes (A and B have the exact same scheme - they are equal) and if I insert everything in table A first and a separate service will work on background and will move all the data to table B where it will be indexed?
Using a staging table like this is not inconceivable. In fact, operational data stores essentially do this.
However, such database structures are usually used for storing incoming data close to the "transactional" format. The final load into history involves more than just building indexes.
Before going down that path, you need to investigate why inserts are so slow and what is the volume of inserts. Can you replace some of the inserts with bulk inserts? Reducing the number of transactions can improve performance. Are all the indexes needed? Do you have the optimal data model for your problem?

Speed-up SQL Insert Statements

I am facing an issue with an ever slowing process which runs every hour and inserts around 3-4 million rows daily into an SQL Server 2008 Database.
The schema consists of a large table which contains all of the above data and has a clustered index on a datetime field (by day), a unique index on a combination of fields in order to exclude duplicate inserts, and a couple more indexes on 2 varchar fields.
The typical behavior as of late, is that the insert statements get suspended for a while before they complete. The overall process used to take 4-5 mins and now it's usually well over 40 mins.
The inserts are executed by a .net service which parses a series of xml files, performs some data transformations and then inserts the data to the DB. The service has not changed at all, it's just that the inserts take longer than they use to.
At this point I'm willing to try everything. Please, let me know whether you need any more info and feel free to suggest anything.
Thanks in advance.
Sounds like you have exhausted the buffer pools ability to cache all the pages needed for the insert process. Append-style inserts (like with your date table) have a very small working set of just a few pages. Random-style inserts have basically the entire index as their working set. If you insert a row at a random location the existing page that row is supposed to be written to must be read first.
This probably means tons of disk seeks for inserts.
Make sure to insert all rows in one statement. Use bulk insert or TVPs. This allows SQL Server to optimize the query plan by sorting the inserts by key value making IO much more efficient.
This will, however, not realize a big speedup (I have seen 5x in similar situations). To regain the original performance you must bring the working set back into memory. Add RAM, purge old data, or partition such that you only need to touch very few partitions.
drop index's before insert and set them up on completion

Delete large portion of huge tables

I have a very large table (more than 300 millions records) that will need to be cleaned up. Roughly 80% of it will need to be deleted. The database software is MS SQL 2005. There are several indexes and statistics on the table but not external relationships.
The best solution I came up with, so far, is to put the database into "simple" recovery mode, copy all the records I want to keep to a temporary table, truncate the original table, set identity insert to on and copy back the data from the temp table.
It works but it's still taking several hours to complete. Is there a faster way to do this ?
As per the comments my suggestion would be to simply dispense with the copy back step and promote the table containing records to be kept to become the new main table by renaming it.
It should be quite straightforward to script out the index/statistics creation to be applied to the new table before it gets swapped in.
The clustered index should be created before the non clustered indexes.
A couple of points I'm not sure about though.
Whether it would be quicker to insert into a heap then create the clustered index afterwards. (I guess no if the insert can be done in clustered index order)
Whether the original table should be truncated before being dropped (I guess yes)
#uriDium -- Chunking using batches of 50,000 will escalate to a table lock, unless you have disabled lock escalation via alter table (sql2k8) or other various locking tricks.
I am not sure what the structure of your data is. When does a row become eligible for deletion? If it is a purely ID based on date based thing then you can create a new table for each day, insert your new data into the new tables and when it comes to cleaning simply drop the required tables. Then for any selects construct a view over all the tables. Just an idea.
EDIT: (In response to comments)
If you are maintaining a view over all the tables then no it won't be complicated at all. The complex part is coding the dropping and recreating of the view.
I am assuming that you don't want you data to be locked down too much during deletes. Why not chunk the delete operations. Created a SP that will delete the data in chunks, 50 000 rows at a time. This should make sure that SQL Server keeps a row lock instead of a table lock. Use the
WAITFOR DELAY 'x'
In your while loop so that you can give other queries a bit of breathing room. Your problem is the old age computer science, space vs time.