SQL Index - Recreate after DELETE

I have a temp table, #Data, that I populate inside a stored procedure.
It contains roughly 15M rows.
I then create a clustered index, say IX_Data, on a couple of columns of the temp table #Data.
Then I delete from #Data, which removes about 1M rows (leaving roughly 14M rows).
My question: at this point, should I drop IX_Data and recreate it?
#Data is referenced in just one more place later in the stored procedure.
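For concreteness, here is a minimal sketch of the flow (the column names and the DELETE filter are just made-up examples):

-- Sketch only: column names and the filter below are placeholders.
CREATE TABLE #Data (CustomerId INT, OrderDate DATE, Amount MONEY);

-- ... populate #Data with roughly 15M rows ...

CREATE CLUSTERED INDEX IX_Data ON #Data (CustomerId, OrderDate);

DELETE FROM #Data WHERE OrderDate < '2015-01-01';   -- removes roughly 1M rows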

You should not. Indexes are maintained by the DBMS automatically and are always kept in sync.
That's why it's not recommended to create more indexes than you need: each one adds a performance penalty to every DML query.

You don't specify which database you are on (maybe it's Oracle?), but your question seems to be about fragmentation, since data integrity is maintained in every database even if you delete a million rows (so there's no need to drop and recreate an index for data integrity).
So, how do you know whether your index is fragmented (and therefore needs to be recreated, or better, rebuilt)? Every database has its own method. Oracle, for example, has internal tables from which you can determine whether an object is fragmented.
It depends on which database you are using.
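If you happen to be on SQL Server (the #Data temp table suggests you are), a query along these lines would show how fragmented the clustered index is; treat it as a sketch, not a recipe:

SELECT index_id, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID('tempdb'),                 -- temp tables live in tempdb
         OBJECT_ID('tempdb..#Data'),
         NULL, NULL, 'LIMITED');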

Related

SQL Server : large data import with clustered index

Performance-wise, does a clustered index help or not when bulk inserting hundreds of millions of rows in a table?
Later edit: after the INSERTs I have to put the database into production, so I will have to create one or more indexes.
A clustered index specifies that the data is ordered on the data pages.
When you are inserting data, the new data has to be sorted and compared to existing values. This is going to incur overhead.
The one exception is when you have an identity column -- that is being generated during the insert. Then the database knows that the new data goes "at the end" of the table.
Indexes are meant for speeding up retrieval (SELECT) of rows. They only slow down INSERT, DELETE and UPDATE. In your case, if INSERT is the predominant operation in your system, don't go for indexes at all. Even in your production system, assess the ratio between retrieval operations and insert/update operations; if retrieval turns out to be dominant, then you can think about indexes.
Note: whenever we define a primary key on a table, a basic index structure is already created for that table. So, without any specific need for retrieval optimization, there is no actual need to design and implement further indexes.
You can read more here: https://www.geeksforgeeks.org/sql-indexes/
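For illustration, a hypothetical table along these lines shows the identity-column case mentioned above:

-- Sketch only (made-up table): the IDENTITY clustered primary key means new rows
-- always land "at the end" of the clustered index, so inserts stay cheap.
CREATE TABLE dbo.ImportTarget (
    Id      INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    Payload VARCHAR(200) NOT NULL
);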

ORACLE 11g SET COLUMN NULL for specific Partition of large table

I have a Composite-List-List partitioned table with 19 Columns and about 400 million rows. Once a week new data is inserted in this table and before the insert I need to set the values of 2 columns to null for specific partitions.
The obvious approach would be something like the following, where COLUMN_1 is the partition criterion:
UPDATE BLABLA_TABLE
SET COLUMN_18 = NULL, COLUMN_19 = NULL
WHERE COLUMN_1 IN (VALUE1, VALUE2…)
Of course this would be awfully slow.
My second thought was to use CTAS for every partition in which I need to set those two columns to null, and then use EXCHANGE PARTITION to update the data in my big table. Unfortunately that wouldn't work because it's a composite partition.
I could use the same approach with subpartitions, but then I would have to run CTAS about 8,000 times and drop those tables afterwards every week. I guess that would not pass the upcoming code review.
Maybe somebody has another idea of how to solve this performantly?
PS: I’m using ORACLE 11g as database.
PPS: Sorry for my bad English.
You've ruled out updating through DDL (partition exchange), so that leaves us with only DML to consider.
I don't think the update is actually that bad with a table so heavily partitioned. You can easily split it into 8k mini updates, each touching a single tiny subpartition:
UPDATE BLABLA_TABLE SUBPARTITION (subpartition1) SET COLUMN_18 = NULL, COLUMN_19 = NULL
Each subpartition would contain about 15k rows to be updated on average, so each individual update would be relatively small.
While it still represents a very large amount of work, it should be easy to set up to run in parallel, hopefully during hours when database activity is very light. The individual updates are also easy to restart if one of them fails (rows locked?), whereas a 120M-row update would take a very long time to roll back in case of error.
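For example, an anonymous block along these lines could drive the 8k mini updates (the partition names in the IN list are placeholders; adjust to your own naming):

-- Sketch only: loop over the subpartitions of the partitions you need to touch
-- and null the two columns one subpartition at a time, committing after each one.
BEGIN
  FOR sp IN (SELECT subpartition_name
               FROM user_tab_subpartitions
              WHERE table_name = 'BLABLA_TABLE'
                AND partition_name IN ('P_VALUE1', 'P_VALUE2'))   -- placeholders
  LOOP
    EXECUTE IMMEDIATE
      'UPDATE BLABLA_TABLE SUBPARTITION (' || sp.subpartition_name || ')
          SET COLUMN_18 = NULL, COLUMN_19 = NULL';
    COMMIT;   -- each mini update can be restarted on its own if it fails
  END LOOP;
END;
/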
If I were to update almost 90% of the rows in a table, I would check the feasibility/duration of simply inserting into another table of the same structure (less redo, no row chaining/migration, bypassing the cache and so on via direct-path insert; drop indexes and triggers first; exclude the columns you want left NULL in the target table), renaming the tables to "swap" them, rebuilding indexes and triggers, then dropping the old table.
From my experience in data warehousing, a plain direct-path insert is better than update/delete. More steps are needed, but it's done in less time overall. I agree that a partition swap is easier said than done when you have to process most of the table, and it just makes things more complex for the ETL developer (the logic/algorithm becomes bound to what's in the physical layer); we haven't encountered a need to do partition swaps so far.
I would also isolate this table in its own tablespaces, then alternate storage between two tablespaces (insert into the 2nd, drop the table from the 1st, and vice versa in the next run, resizing the empty tablespace to reclaim space).
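A rough sketch of that insert-and-swap idea (the datatypes are guesses and most of the column list is omitted for brevity, so treat it as an outline rather than finished code):

-- COLUMN_2 .. COLUMN_17 are omitted here for brevity; list them all in practice.
CREATE TABLE BLABLA_TABLE_NEW NOLOGGING AS
SELECT COLUMN_1,
       CAST(NULL AS VARCHAR2(100)) AS COLUMN_18,   -- assumed datatype
       CAST(NULL AS VARCHAR2(100)) AS COLUMN_19    -- assumed datatype
FROM   BLABLA_TABLE;

-- Recreate indexes and triggers on BLABLA_TABLE_NEW here, then swap the names.
ALTER TABLE BLABLA_TABLE     RENAME TO BLABLA_TABLE_OLD;
ALTER TABLE BLABLA_TABLE_NEW RENAME TO BLABLA_TABLE;
DROP TABLE BLABLA_TABLE_OLD;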

Drop/Rebuild indexes during Bulk Insert

I have tables with more than 70 million records in them. What I just found is that developers were dropping indexes before a bulk insert and then creating them again after the bulk insert is over. Execution time for the stored procedure is nearly 30 minutes (drop the indexes, do the bulk insert, then recreate the indexes from scratch).
Is it good practice to drop indexes from a table which has 70+ million records and is growing by 3-4 million every day?
Would it help performance to leave the indexes in place during the bulk insert?
What is the best practice to follow when doing a bulk insert into a big table?
Thanks and Regards
Like everything in SQL Server, "It Depends"
There is overhead in maintaining indexes during the insert and there is overhead in rebuilding the indexes after the insert. The only way to definitively determine which method incurs less overhead is to try them both and benchmark them.
If I were a betting man I would put my wager on leaving the indexes in place edging out the full rebuild, but I don't have the full picture to make an educated guess. Again, the only way to know for sure is to try both options.
One key optimization is to make sure your bulk insert is in clustered key order.
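For example, something along these lines (the table, key column, file path and batch size are all placeholders); the ORDER hint tells SQL Server the data file is already sorted by the clustered key, so it can skip the sort, assuming that is actually true of your file:

BULK INSERT dbo.BigTable
FROM 'C:\loads\bigtable.dat'            -- hypothetical data file
WITH (ORDER (Id ASC),                   -- Id assumed to be the clustered key
      TABLOCK,
      BATCHSIZE = 100000);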
If I'm reading your question correctly, that table is pretty much off limits (locked) for the duration of the load and this is a problem.
If your primary goal is to increase availability/decrease blocking, try taking the A/B table approach.
The A/B approach breaks down as follows:
Given a table called "MyTable" you would actually have two physical tables (MyTable_A and MyTable_B) and one view (MyTable).
If MyTable_A contains the current "active" dataset, your view (MyTable) is selecting all columns from MyTable_A. Meanwhile you can have carte blanche on MyTable_B (which contains a copy of MyTable_A's data and the new data you're writing.) Once MyTable_B is loaded, indexed and ready to go, update your "MyTable" view to point to MyTable_B and truncate MyTable_A.
This approach assumes that you're willing to increase I/O and storage costs (dramatically, in your case) to maintain availability. It also assumes that your big table is relatively static. If you do follow this approach, I would recommend a second view, something like MyTable_old, which points to the non-live table (i.e. if MyTable_A is the current presentation table and is referenced by the MyTable view, MyTable_old will reference MyTable_B). You would update the MyTable_old view at the same time you update the MyTable view.
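A rough sketch of the moving parts (the dbo schema is assumed):

-- Initially the view points at the A table.
CREATE VIEW dbo.MyTable AS SELECT * FROM dbo.MyTable_A;
GO

-- Load and index MyTable_B off to the side, then repoint the view and clear A.
ALTER VIEW dbo.MyTable AS SELECT * FROM dbo.MyTable_B;
GO
TRUNCATE TABLE dbo.MyTable_A;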
Depending on the nature of the data you're inserting (and your SQL Server version/edition), you may also be able to take advantage of partitioning (MSDN blog on this topic.)

Delete large portion of huge tables

I have a very large table (more than 300 million records) that will need to be cleaned up. Roughly 80% of it will need to be deleted. The database software is MS SQL 2005. There are several indexes and statistics on the table but no external relationships.
The best solution I came up with, so far, is to put the database into "simple" recovery mode, copy all the records I want to keep to a temporary table, truncate the original table, set identity insert to on and copy back the data from the temp table.
It works, but it's still taking several hours to complete. Is there a faster way to do this?
As per the comments my suggestion would be to simply dispense with the copy back step and promote the table containing records to be kept to become the new main table by renaming it.
It should be quite straightforward to script out the index/statistics creation to be applied to the new table before it gets swapped in.
The clustered index should be created before the non-clustered indexes.
A couple of points I'm not sure about, though:
Whether it would be quicker to insert into a heap and then create the clustered index afterwards. (I guess not, if the insert can be done in clustered index order.)
Whether the original table should be truncated before being dropped. (I guess yes.)
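A rough sketch of that approach, with made-up table, column and index names:

SELECT *
INTO   dbo.BigTable_keep
FROM   dbo.BigTable
WHERE  KeepFlag = 1;                    -- whatever identifies the ~20% to keep

-- Clustered index first, then the non-clustered ones and statistics.
CREATE CLUSTERED INDEX IX_BigTable_keep_Id ON dbo.BigTable_keep (Id);

EXEC sp_rename 'dbo.BigTable',      'BigTable_old';
EXEC sp_rename 'dbo.BigTable_keep', 'BigTable';
DROP TABLE dbo.BigTable_old;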
@uriDium -- chunking using batches of 50,000 will escalate to a table lock, unless you have disabled lock escalation via ALTER TABLE (SQL 2008) or various other locking tricks.
I am not sure what the structure of your data is. When does a row become eligible for deletion? If it is purely an ID-based or date-based thing, then you can create a new table for each day, insert your new data into the new tables and, when it comes to cleaning, simply drop the tables you no longer need. Then, for any selects, construct a view over all the tables. Just an idea.
EDIT: (In response to comments)
If you are maintaining a view over all the tables, then no, it won't be complicated at all. The complex part is coding the dropping and recreating of the view.
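A rough sketch of the daily-table idea, with made-up names:

-- One physical table per day, stitched together by a view.
CREATE VIEW dbo.EventsView AS
SELECT * FROM dbo.Events_20240101
UNION ALL
SELECT * FROM dbo.Events_20240102;
GO

-- Cleanup: redefine the view without the expired day, then drop that table.
ALTER VIEW dbo.EventsView AS
SELECT * FROM dbo.Events_20240102;
GO
DROP TABLE dbo.Events_20240101;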
I am assuming that you don't want your data to be locked down too much during deletes. Why not chunk the delete operations? Create a stored procedure that deletes the data in chunks, 50,000 rows at a time. This should help SQL Server keep row locks instead of taking a table lock. Use
WAITFOR DELAY 'x'
in your WHILE loop so that you can give other queries a bit of breathing room. Your problem is the age-old computer science trade-off: space vs. time.
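A rough sketch of what such a chunked delete could look like (the table name, filter and delay are placeholders):

WHILE 1 = 1
BEGIN
    DELETE TOP (50000)
    FROM dbo.BigTable
    WHERE IsObsolete = 1;               -- whatever marks rows for deletion

    IF @@ROWCOUNT = 0 BREAK;            -- nothing left to delete

    WAITFOR DELAY '00:00:05';           -- breathing room for other queries
END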

Insertion of data after creating index on empty table or creating unique index after inserting data on oracle?

Which option is better and faster?
Inserting data after creating the index on an empty table, or creating the unique index after inserting the data? I have around 10M rows to insert. Which option would be better so that I have the least downtime?
Insert your data first, then create your index.
Every time you do an UPDATE, INSERT or DELETE operation, any indexes on the table have to be updated as well. So if you create the index first, and then insert 10M rows, the index will have to be updated 10M times as well (unless you're doing bulk operations).
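For example, something along these lines (all names are placeholders):

-- Load first, index afterwards.
INSERT INTO target_table (id, payload)
SELECT id, payload
FROM   staging_table;                   -- the ~10M rows to be loaded
COMMIT;

CREATE UNIQUE INDEX target_table_uk ON target_table (id);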
It is faster and better to insert the records and then create the index after rows have been imported. It's faster because you don't have the overhead of index maintenance as the rows are inserted and it is better from a fragmentation standpoint on your indexes.
Obviously for a unique index, be sure that the data you are importing is unique so you don't have failures when trying to create the index.
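For example, a quick pre-check along these lines (names are placeholders) would surface any duplicates before the index creation fails on them:

SELECT id, COUNT(*) AS dup_count
FROM   target_table
GROUP  BY id
HAVING COUNT(*) > 1;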
As others have said, insert first and add the index later. If the table already exists and you have to insert a pile of data like this, drop all indexes and constraints, insert the data, then re-apply first your indexes and then your constraints. You'll certainly want to do intermediate commits to help avoid running out of rollback segment space or something similar. If you're inserting this much data, it might prove useful to look at using SQL*Loader to save yourself time and aggravation.
I hope this helps.