Performance of inserting into a big table while checking for duplicates - SQL

I have a simple table that contains a varchar(100) column. I am trying to populate it with 1 billion unique records. I have a stored proc that takes a table-type parameter containing 1000 records at a time and inserts them into the table while checking that no duplicate exists. After about 50 million rows the performance degrades. I tried sharding the table and using SQL table partitioning with a balanced distribution, but no gain was observed.
How can I build this solution in SQL with reasonable performance?

You might want to try de-duping the data before you put it into the database, then disabling the unique key while inserting so you don't have to deal with rebuilding it as you go.
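A minimal T-SQL sketch of that idea, assuming the incoming feed has already been de-duplicated outside the database (including against what's already in the table) and that the unique key is a nonclustered index (all object names here are hypothetical; disabling a clustered index would make the table unreadable):

ALTER INDEX UX_BigTable_Value ON dbo.BigTable DISABLE;   -- skip per-row index maintenance during the load

INSERT INTO dbo.BigTable (Value)
SELECT Value
FROM dbo.Staging;                                        -- batches already de-duplicated before they reach the database

ALTER INDEX UX_BigTable_Value ON dbo.BigTable REBUILD;   -- one rebuild at the end re-enables the unique index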

Related

Update time of a big table in PostgreSQL

I have a question about the performance of an UPDATE on a big table that is around 8 to 10 GB in size.
I have a task where I'm supposed to detect distinct values in a table of the mentioned size, with about 4.3 million rows, and insert them into another table. That part is not really a problem, but the update that follows afterwards is. I need to update a column based on the id of the rows created in the table I imported into. An example of the query I'm executing is:
UPDATE billinglinesstagingaws as s
SET product_id = p.id
FROM product AS p
WHERE p.key=(s.data->'product'->>'sku')::varchar(75)||'-'||(s.data->'lineitem'->>'productcode')::varchar(75) and cloudplatform_id = 1
So, as mentioned, the staging table is around 4.3 million rows and 8-10 GB and, as can be seen from the query, it has a JSONB field; the product table has around 1500 rows.
This takes about 12 minutes, which I'm not really sure is OK, and I'm wondering what I can do to speed it up. There are no foreign key constraints; there is a unique constraint on two columns together. There are no indexes on the staging table.
I attached the query plan of the query, so any advice would be helpful. Thanks in advance.
This is a bit too long for a comment.
Updating 4.3 million rows in a table is going to take some time. Updates take time because the ACID properties of databases require that something be committed to disk for each update -- typically log records. And that doesn't count the time for reading the records, updating indexes, and other overhead.
So, about 6,000 updates per second isn't so bad.
There might be ways to speed up your query. However, you describe these as new rows. That makes me wonder whether you can just insert the correct values when loading the table. Can you look up the appropriate value during the insert, rather than doing so afterwards in an update?
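For example, something along these lines (a sketch only; raw_import stands in for wherever the staged rows actually come from, which isn't shown in the question, and the column names follow the query above):

INSERT INTO billinglinesstagingaws (data, cloudplatform_id, product_id)
SELECT r.data,
       1,
       p.id
FROM raw_import AS r
LEFT JOIN product AS p
  ON p.key = (r.data->'product'->>'sku')::varchar(75) || '-' ||
             (r.data->'lineitem'->>'productcode')::varchar(75);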

Does the size of a table impact the performance of insert query in SQL Server?

I need to improve performance of INSERT INTO query to a table which has 1 billion rows. This table contains a clustered index primary key.
One suggestion is to reduce the table data by deleting (copying to an archive table) old records and keeping only the most recent records in the table. This would reduce the data from 1 billion to 2 million rows. Will this approach speed up the data-write process?
Are there any other ways to speed up record inserts?
Note: this INSERT INTO query is in a complex stored procedure and the execution plan points to this INSERT statement taking a certain amount of time.
Simplistically, reducing the size of the table will not have much impact on performance. There are some cases where it could make a difference.
If the clustered index primary key is not ordered, then you have a fragmentation problem. That means that inserts are likely to be splitting pages and rewriting them.
The "good" news is that in the existing table, your pages are probably already fragmented, so you probably have few full pages. So splitting is less likely. This is "good" in quotes because it means that you have lots of wasted space, which is inefficient for queries.
If you remove the excess rows and compact (defrag) the table, then you will have some advantages. The biggest is that the data will probably fit into memory -- a big performance advantage.
I would recommend that you fix the table because the extra rows are probably hurting query performance. Given the volume of data, I would suggest a truncate/re-insert approach:
select t.*
into temp_t
from t
where <rows to keep logic here>;
truncate table t; -- be sure you have a backup!
insert into t
select *
from temp_t;
This will be much faster than trying to delete 99.9% of the rows (unless you happen to have a partitioned table where you can simply drop partitions).
If you want to keep the old data, you might find a way to partition the table. Of course, your queries would have to use the partitioning key to access the "valid" rows rather than the archive.

SQL Index - Recreate after DELETE

I have a temp table #Data, that I populate inside a stored procedure.
It contains like 15M rows.
Then I create a clustered index, say IX_Data, on a couple of columns of the temp table #Data.
Then I delete from #Data which deletes like 1M rows (keeping the total rows 14M now).
My question: at this point, should I drop IX_Data and recreate it?
#Data is referenced in just one place further on in the stored procedure.
You should not. Indexes are maintained by the DBMS automatically and are always kept in sync.
That's why it's not recommended to create more indexes than you need, since they are a performance penalty for every DML query.
You don't specify which database you are on (maybe it's Oracle?), but your question seems to be about fragmentation, since data integrity is maintained in every database even if you delete a million rows (so there's no need to drop and recreate an index for data integrity).
So, how do you know if your index is fragmented (so that you need to recreate, or better, rebuild it)? Every database has its own method. Oracle has internal tables from which you can determine whether an object is fragmented.
But it depends on your type of database.
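In SQL Server, for example (which the #Data temp table in the question suggests), fragmentation can be checked with a standard DMV; a sketch:

SELECT index_id,
       avg_fragmentation_in_percent,
       page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID('tempdb'),
         OBJECT_ID('tempdb..#Data'),
         NULL, NULL, 'LIMITED');   -- rebuilding is usually only worth it at high fragmentation on large indexes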

ORACLE 11g SET COLUMN NULL for specific Partition of large table

I have a Composite-List-List partitioned table with 19 columns and about 400 million rows. Once a week new data is inserted into this table, and before the insert I need to set the values of 2 columns to NULL for specific partitions.
The obvious approach would be something like the following, where COLUMN_1 is the partition criterion:
UPDATE BLABLA_TABLE
SET COLUMN_18 = NULL, COLUMN_19 = NULL
WHERE COLUMN_1 IN (VALUE1, VALUE2…)
Of course this would be awfully slow.
My second thought was to use CTAS for every partition where I need to set those two columns to NULL and then use EXCHANGE PARTITION to update the data in my big table. Unfortunately that wouldn't work because it's a composite partition.
I could use the same approach with subpartitions, but then I would have to use CTAS about 8000 times and drop those tables afterwards every week. I guess that would not pass the upcoming code review.
Maybe somebody has another idea how to solve this performantly?
PS: I’m using ORACLE 11g as database.
PPS: Sorry for my bad English…..
You've ruled out updating through DDL (exchanging partitions), so that leaves us with only DML to consider.
I don't think that it's actually that bad an update with a table so heavily partitioned. You can easily split the update into 8k mini-updates (each on a single tiny subpartition):
UPDATE BLABLA_TABLE SUBPARTITION (partition1) SET COLUMN_18 = NULL...
Each subpartition would contain about 15k rows to be updated on average, so each update would be relatively tiny.
While it still represents a very large amount of work, it should be easy to set up to run in parallel, hopefully during hours when database activity is very light. The individual updates are also easy to restart if one of them fails (rows locked?), whereas a 120M-row update would take a very long time to roll back in case of error.
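A sketch of how the 8k mini-updates could be driven from the data dictionary (PL/SQL; the dictionary view is standard, but the filter on which partitions actually need clearing is left as an assumption since the exact values aren't given):

BEGIN
  FOR sp IN (SELECT subpartition_name
             FROM   user_tab_subpartitions
             WHERE  table_name = 'BLABLA_TABLE'
             -- AND partition_name IN (...)   -- restrict to the partitions that need clearing
            ) LOOP
    EXECUTE IMMEDIATE
      'UPDATE BLABLA_TABLE SUBPARTITION (' || sp.subpartition_name || ')' ||
      ' SET COLUMN_18 = NULL, COLUMN_19 = NULL';
    COMMIT;   -- small, restartable units of work
  END LOOP;
END;
/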
If I were to update almost 90% of the rows in a table, I would check the feasibility/duration of just inserting into another table of the same structure (less redo, no row chaining/migration, bypassing the cache and so on via a direct-path insert; drop indexes and triggers first; exclude the two columns so they are left NULL in the target table), then rename the tables to "swap" them, rebuild indexes and triggers, and drop the old table.
From my experience in data warehousing, a plain direct-path insert is better than update/delete. More steps are needed, but it's done in less time overall. I agree that a partition swap is easier said than done when you have to process most of the table, and it just makes things more complex for the ETL developer (logic/algorithm bound to what's in the physical layer); we haven't encountered the need to do partition swaps so far.
I would also isolate this table in its own tablespaces, then alternate storage between these two tablespaces (insert into the 2nd and drop the table from the 1st, vice versa in the next run, and resize the empty tablespace to reclaim space).
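A rough sketch of that insert-and-swap approach (Oracle; the new table name is a placeholder, the column list is abbreviated, the datatypes of the two cleared columns are assumed, and the composite partitioning clause is omitted for brevity):

CREATE TABLE BLABLA_TABLE_NEW
  -- PARTITION BY ... SUBPARTITION BY ...   (same composite scheme as the original)
  NOLOGGING
AS
SELECT COLUMN_1, /* ... the other unchanged columns ... */
       CAST(NULL AS NUMBER) AS COLUMN_18,   -- datatypes assumed; match the real ones
       CAST(NULL AS NUMBER) AS COLUMN_19
FROM   BLABLA_TABLE;                        -- CTAS is a direct-path load

ALTER TABLE BLABLA_TABLE     RENAME TO BLABLA_TABLE_OLD;
ALTER TABLE BLABLA_TABLE_NEW RENAME TO BLABLA_TABLE;
-- recreate indexes, triggers and grants on the new table, then:
DROP TABLE BLABLA_TABLE_OLD;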

Delete large portion of huge tables

I have a very large table (more than 300 million records) that will need to be cleaned up. Roughly 80% of it will need to be deleted. The database software is MS SQL 2005. There are several indexes and statistics on the table but no foreign key relationships.
The best solution I came up with, so far, is to put the database into "simple" recovery mode, copy all the records I want to keep to a temporary table, truncate the original table, turn IDENTITY_INSERT on, and copy the data back from the temp table.
It works but it's still taking several hours to complete. Is there a faster way to do this ?
As per the comments, my suggestion would be to simply dispense with the copy-back step and promote the table containing the records to be kept to become the new main table by renaming it.
It should be quite straightforward to script out the index/statistics creation to be applied to the new table before it gets swapped in.
The clustered index should be created before the non clustered indexes.
A couple of points I'm not sure about though:
Whether it would be quicker to insert into a heap and then create the clustered index afterwards. (I guess not, if the insert can be done in clustered index order.)
Whether the original table should be truncated before being dropped. (I guess yes.)
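A sketch of that swap in T-SQL (object names, the indexed column, and the keep-predicate are placeholders):

SELECT *
INTO dbo.BigTable_keep
FROM dbo.BigTable
WHERE <rows to keep logic here>;

CREATE CLUSTERED INDEX CIX_BigTable_keep ON dbo.BigTable_keep (Id);   -- clustered index first
-- ...then script in the nonclustered indexes and statistics...

BEGIN TRAN;
EXEC sp_rename 'dbo.BigTable',      'BigTable_old';
EXEC sp_rename 'dbo.BigTable_keep', 'BigTable';
COMMIT;

DROP TABLE dbo.BigTable_old;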
#uriDium -- Chunking using batches of 50,000 will escalate to a table lock, unless you have disabled lock escalation via ALTER TABLE (SQL Server 2008) or various other locking tricks.
I am not sure what the structure of your data is. When does a row become eligible for deletion? If it is purely an ID-based or date-based thing, then you can create a new table for each day, insert your new data into the new tables, and when it comes to cleaning, simply drop the required tables. Then, for any selects, construct a view over all the tables. Just an idea.
EDIT: (In response to comments)
If you are maintaining a view over all the tables, then no, it won't be complicated at all. The complex part is coding the dropping and recreating of the view.
I am assuming that you don't want your data to be locked down too much during deletes. Why not chunk the delete operations? Create an SP that will delete the data in chunks, 50,000 rows at a time. This should make sure that SQL Server keeps row locks instead of a table lock. Use the
WAITFOR DELAY 'x'
in your while loop so that you can give other queries a bit of breathing room. Your problem is the age-old computer science trade-off: space vs. time.
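Something along these lines, as a sketch (the table name and the eligibility predicate are placeholders):

DECLARE @rows INT;
SET @rows = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (50000) FROM dbo.BigTable
    WHERE <eligible for deletion>;

    SET @rows = @@ROWCOUNT;

    WAITFOR DELAY '00:00:05';   -- give other queries a bit of breathing room between batches
END;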