Optimize DELETE - SQL

I have an importer system which updates columns of already existing rows in a table. Since UPDATE was taking too much time, I changed it to DELETE and BULK INSERT.
Here is my database setup snippet
Table: ParameterDefinition
Columns: Id, Name, Other Cols
Table: ParameterValue
Columns: Id, CustId, ParameterDefId, Value
I get the values associated with ParameterDefinition.Name from my XML source, so to import I first delete all the existing ParameterValue rows for all the ParameterDefinition.Name values passed in the XML, and finally do a bulk insert of all the values from the XML. Here is my query
DELETE FROM ParameterValue WHERE CustId = ? AND ParameterDefId IN (?,?...?);
For 1000 customers the above DELETE statement is called 1000 times, which is now very time consuming: approximately 64 seconds.
Is there any better way to handle DELETE of 1000 customers?
Thanks,
Sheeju

Create a temporary table for the bulk-insert (ParameterValue_Import). Do the bulk-inserts to this table, then update/insert/delete based on the imported data.
INSERT INTO .. SELECT .. WHERE NOT EXISTS ( .. ) for the new rows
UPDATE .. FROM for the updates
DELETE FROM .. WHERE NOT EXISTS ( .. ) for the deletion
Bulk operations have better performance than standalone operations. Most DBMSs are designed to handle set based operations instead of record based ones.
Edit
To delete or update one record based on a WHERE clause which refers to only that record, the DBMS must either do a full table scan (if there is no index for the WHERE condition) or do an index lookup. Only after the record is successfully identified does the DBMS proceed with the original request (the update or delete). Depending on the number of records in the table and/or the size/depth of the index, this can be really expensive, and the process is repeated for each and every command in the batch. Summing up the total cost, it can be more than updating/deleting records based on another table (especially if the operations update/delete nearly all records in the target table).
When you delete/update several records at once (e.g. based on another table), the DBMS can do the lookups with only one table scan/index lookup and perform a logical join while processing your request.
The cost of actually updating a record is the same in each case; it is only the total cost of the lookups that can differ significantly.
Furthermore, deleting and then inserting a record in order to update it can require more resources: when you delete a record, all related indexes are updated, and when you insert the new record, the indexes are updated once more, while with an update only the indexes related to an updated column need to be maintained (and that index update is done only once).

I am giving the exact syntax for the idea given by @Pred above.
After the bulk insert, let's say you have the data in "ParameterValue_Import".
To INSERT the records in "ParameterValue_Import" which are not in "ParameterValue":
INSERT INTO ParameterValue (
    CustId, ParameterDefId, Value
)
SELECT
    CustId, ParameterDefId, Value
FROM
    ParameterValue_Import
WHERE
    NOT EXISTS (
        SELECT null
        FROM ParameterValue
        WHERE ParameterValue.CustId = ParameterValue_Import.CustId
          AND ParameterValue.ParameterDefId = ParameterValue_Import.ParameterDefId
    );
To UPDATE the records in "ParameterValue" which are also in "ParameterValue_Import":
UPDATE
    ParameterValue
SET
    Value = ParameterValue_Import.Value
FROM
    ParameterValue_Import
WHERE
    ParameterValue.ParameterDefId = ParameterValue_Import.ParameterDefId
    AND ParameterValue.CustId = ParameterValue_Import.CustId;
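For completeness, here is a sketch of the DELETE step mentioned above. The extra CustId filter is an assumption, so that only customers present in this import lose their stale rows; adjust it if the import always contains every customer.
DELETE FROM ParameterValue
WHERE CustId IN (SELECT CustId FROM ParameterValue_Import)
  AND NOT EXISTS (
      SELECT null
      FROM ParameterValue_Import
      WHERE ParameterValue_Import.CustId = ParameterValue.CustId
        AND ParameterValue_Import.ParameterDefId = ParameterValue.ParameterDefId
  );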

Related

COPY FROM / INSERT Millions of Rows Into Same PostgreSQL table

I have a table with hundreds of millions of rows, and I need to essentially create a "duplicate" of each existing row, doubling the table's row count. I'm currently using an insert operation (and unlogging the table prior to inserting), which still takes a long while as one transaction. Looking for guidance on whether there may be a more efficient way to execute the query below.
INSERT INTO costs(
parent_record, is_deleted
)
SELECT id, is_deleted
FROM costs;

How to do update partitioned table [duplicate]

So I have a main table in Hive; it will store all my data.
I want to be able to load an incremental data update about every month, with a large amount of data (a couple billion rows). There will be new data as well as updated entries.
What is the best way to approach this? I know Hive was recently upgraded and supports update/insert/delete.
What I've been thinking is to somehow find the entries that will be updated, remove them from the main table, and then just insert the new incremental update. However, after trying this, the inserts are very fast, but the deletes are very slow.
The other way is to use the UPDATE statement to match the key values between the main table and the incremental update and update their fields. I haven't tried this yet. This also sounds painfully slow, since Hive would have to update each entry one by one.
Does anyone have any ideas on how to do this most efficiently and effectively?
I'm pretty new to Hive and databases in general.
If merge in ACID mode is not applicable, then it's possible to update using FULL OUTER JOIN or using UNION ALL + row_number.
To find all entries that will be updated you can join increment data with old data:
insert overwrite table target_data [partition() if applicable]
SELECT
--select new if exists, old if not exists
case when i.PK is not null then i.PK else t.PK end as PK,
case when i.PK is not null then i.COL1 else t.COL1 end as COL1,
...
case when i.PK is not null then i.COL_n else t.COL_n end as COL_n
FROM
target_data t --restrict partitions if applicable
FULL JOIN increment_data i on (t.PK=i.PK);
It's possible to optimize this by restricting the partitions in target_data that will be overwritten and joined, using WHERE partition_col IN (SELECT DISTINCT partition_col FROM increment_data), or, if possible, by passing the partition list as a parameter and using it in the WHERE clause; that will work even faster.
Also, if you want to update all columns with new data, you can apply this solution with UNION ALL + row_number(); it works faster than a full join: https://stackoverflow.com/a/44755825/2700344
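As a rough sketch of that UNION ALL + row_number() variant (reusing the PK/COL1/COL_n names from the example above; the src flag is an assumption used to prefer the increment row whenever both old and new versions exist):
INSERT OVERWRITE TABLE target_data
SELECT PK, COL1, COL_n
FROM (
    SELECT PK, COL1, COL_n,
           ROW_NUMBER() OVER (PARTITION BY PK ORDER BY src DESC) AS rn
    FROM (
        SELECT PK, COL1, COL_n, 1 AS src FROM increment_data   -- new/changed rows win
        UNION ALL
        SELECT PK, COL1, COL_n, 0 AS src FROM target_data      -- existing rows
    ) all_rows
) ranked
WHERE rn = 1;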
Here is my solution/workaround if you are using an old Hive version. It works better when you have large data in the target table which you can't drop and recreate with the full data every time.
Create one more table, say a delete_keys table. This will hold all the keys from the main table which are deleted, along with their surrogate keys.
While loading incremental data into the main table, do a left join with the main table. For all the matching records, we ideally should update the main table. But instead, we take the keys (along with the surrogate key) from the main table for all matching records and insert them into the delete_keys table. Now we can insert all delta records into the main table as-is, irrespective of whether they are to be updated or inserted.
Create a view on the main table using the delete_keys table so that keys matching the delete_keys table are not fetched. This view will be the final target table; it will not show records from the main table which have been superseded by the latest records.
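A sketch of that view (the table and surrogate key names are assumptions):
CREATE VIEW main_table_current AS
SELECT m.*
FROM main_table m
LEFT JOIN delete_keys d
       ON m.surrogate_key = d.surrogate_key
WHERE d.surrogate_key IS NULL;   -- hide main-table rows whose keys were logged as deleted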

How to delete all data then insert new data

I have a process that runs every 60 minutes. On one table I need to remove all data then insert records from a different table. The problem is it takes a long time to delete and reinsert the data. When the table has no data I am afraid the users will see this. Is there a way to refresh the data without users seeing this?
If you want to remove all data from the table then use TRUNCATE TABLE instead of DELETE - it'll do it faster.
As for the insert, it is a bit hard to say because you did not give any details, but what you can try is:
Option 1 - Using temp table
create table table_temp as select * from original_table where rownum < 1;
-- insert the new data into table_temp here
drop table original_table;
EXEC sp_rename 'table_temp', 'original_table';
Option 2 - Use 2 tables "Active-Passive" -
Have 2 tables for the data and a view to select over them. The view will join with a third table that specifies from which of the two tables to select - kind of an "active-passive" concept.
To demonstrate concept:
with active_table as ( select 'table1_active' active_table )
select 1 data
where 'table1_active' in (select * from active_table)
union all
select 2
where 'table2_active' in (select * from active_table)
-- This returns only one record, with the "1"
Are you truncating instead of deleting? A truncate (while logged) is much, much faster than a delete.
If you cannot truncate, try deleting 1000-10000 rows at a time (smaller log buildup and, when deleting large numbers of rows, a great increase in speed).
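A minimal sketch of such a batched delete (T-SQL is assumed; the table name and batch size are placeholders):
DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (5000) FROM TargetTable;   -- delete in small chunks to keep the log small
    SET @rows = @@ROWCOUNT;               -- stop once a pass deletes nothing
END;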
If you really want fast performance you can create a second table, fill it with data, and then drop the first table and rename the second table as the first table. You will lose all the permissions on the table when you do this so be sure to reapply the permissions to the renamed table.
If you are deleting all rows in a table, you can consider using a TRUNCATE statement against the table instead of a DELETE. It will speed up part of your process. Keep in mind that this will reset any identity seeds you may have on the table.
As suggested, you can wrap this process in a transaction and, depending on how you set your transaction isolation level, control what your users will see if they query the data during the transaction.
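A minimal sketch of that transactional swap (SQL Server syntax is assumed; the table and column names are placeholders):
BEGIN TRANSACTION;
    TRUNCATE TABLE TargetTable;              -- TRUNCATE can be rolled back inside a transaction
    INSERT INTO TargetTable (Col1, Col2)
    SELECT Col1, Col2 FROM SourceTable;
COMMIT TRANSACTION;
-- Depending on the isolation level, readers either wait or keep seeing consistent data
-- instead of an empty table while this runs.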
Make it series based: your copied-in records all have a series number (the same for every record in a given copy), and another table holds which series is active; you always select with a join to this table. When you copy in new records they get a new series number that is not yet active; once they are all copied in, the series table is updated to the new series, and the redundant series' records are deleted at your leisure.
Example
Let's suppose your table has field SeriesNo added and table ActiveSeries has field SeriesNo.
All queries of your table:
SELECT *
FROM YourTable Y
JOIN ActiveSeries A
ON A.SeriesNo = Y.SeriesNo
then updating SeriesNo in ActiveSeries makes the new series of records available instantly.
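A minimal sketch of the flip (the staging source and SomeColumn are placeholders):
INSERT INTO YourTable (SeriesNo, SomeColumn)
SELECT 2, SomeColumn FROM StagingTable;   -- new batch loaded under series 2, not visible yet

UPDATE ActiveSeries SET SeriesNo = 2;     -- readers joining on ActiveSeries now see series 2

DELETE FROM YourTable WHERE SeriesNo = 1; -- the old series can be removed at leisure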
I would follow the approach below while I troubleshoot why the delete and reinsert is taking so long.
Create a new table (t1) which has the same data as the old table (maintable).
Now do your stuff on t1.
When your stuff is done, rename t1 to maintable.
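A minimal sketch of that swap (SQL Server syntax is assumed; the object names are placeholders):
SELECT * INTO t1 FROM maintable;          -- copy the current data
-- ... do the heavy delete/reinsert work on t1 here ...
BEGIN TRANSACTION;
    EXEC sp_rename 'maintable', 'maintable_old';
    EXEC sp_rename 't1', 'maintable';
COMMIT TRANSACTION;
DROP TABLE maintable_old;                 -- drop the old copy once the swap is verified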

Limit Rows in ETL Without Date Column for Cue

We have two large tables (Clients and Contacts) which undergo an ETL process every night, being inserted into a single "People" table in the data warehouse. This table is used in many places and cannot be significantly altered without a lot of work.
The source tables are populated by third party software; we used to assume that we could identify the rows that had been updated since last night by using the "UpdateDate" column in each, but more recently identified some rows that were not touched by the ETL, as the "UpdateDate" column was not behaving as we had thought; the software company do not see this as a bug, so we have to live with this fact.
As a result, we now take all source rows, transform them into a temp staging table and then MERGE that into the data warehouse, using the MERGE to identify any changed values. We have noticed that this process is taking too long on some days and would like to limit the number of rows that the ETL process looks at, as we believe that the reason for the hold-up is principally the sheer volume of data that is examined and stored in the temp database. We can see no way to look purely at the source data and identify when each row last changed.
Here is simplified pseudocode of the ETL stored procedure, although what the procedure actually does is not really relevant to the question (included just in case you disagree with me!)
CREATE #TempTable (ClientOrContact BIT NOT NULL, Id INT NOT NULL, [Some_Other_Columns])
INSERT #TempTable
SELECT 1 AS ClientOrContact, C.Id, [SomeColumns] FROM
(SELECT [SomeColumns]
FROM Source_ClientsTable C
JOIN FieldsTable F JOIN [SomeOtherTables])
PIVOT (MAX(F.FieldValue) FOR F.FieldName IN ([SomeFieldNames]));
INSERT #TempTable
SELECT 0 AS ClientOrContact, C.Id, [SomeColumns] FROM
(SELECT [SomeColumns]
FROM Source_ContactsTable C
JOIN FieldsTable F JOIN [SomeOtherTables])
PIVOT (MAX(F.FieldValue) FOR F.FieldName IN ([SomeFieldNames]));
ALTER #TempTable ADD PRIMARY KEY (ClientOrContact, Id);
MERGE Target_PeopleTable AS Tgt
USING (SELECT [SomeColumns] FROM #TempTable JOIN [SomeOtherTables]) AS Src
ON Tgt.ClientOrContact = Src.ClientOrContact AND Tgt.Id = Src.Id
WHEN MATCHED AND NOT EXISTS (SELECT Tgt.* INTERSECT SELECT Src.*)
THEN UPDATE SET ([All_NonKeyTargetColumns] = [All_NonKeySourceColumns])
WHEN NOT MATCHED BY Target THEN INSERT [All_TargetColumns] VALUES [All_SourceColumns]
OUTPUT $Action INTO #Changes;
RETURN COUNT(*) FROM #Changes;
GO
The source tables have about 1.5M rows each, but each day only a relatively small number of rows are inserted or updated (never deleted). There are about 50 columns in each table; of those, about 40 columns can have changed values each night. Most columns are VARCHAR and each table contains an independent incremental primary key column. We can add indexes to the source tables, but not alter them in any other way (they have already been indexed by a predecessor). The source tables and target table are on the same server, but in different databases. Edit: The target table has a composite primary key on the ClientOrContact and Id columns, matching that shown on the temp table in the script above.
So, my question is this - please could you suggest any general possible strategies that might be useful to limit the number of rows we look at or copy across each night? If we only touched the rows that we needed to each night, we would be touching less than 1% of the data we do at the moment...
Before you try the following suggestion, just one thing to check is that the Target_PeopleTable has an index or primary key on the id column. It probably does but without schema information to verify I am making no assumptions and this might speed up the merge stage.
As you've identified if you could somehow limit the records in TempTable to just the changed rows then this could offer a performance win for the actual MERGE statement (depending on how expensive determining just the changed rows is).
As a general strategy I would consider some kind of checksum to try and identify the changed records only. The T-SQL CHECKSUM function can calculate a checksum across the required columns by specifying the columns as a comma-separated list, or there are related functions such as BINARY_CHECKSUM.
Since you cannot change the source schema, you would have to maintain a list of record ids and associated checksums in your target database so that you can readily compare the source checksums to the target checksums from the last run in order to identify a difference.
You can then insert into the temp table only where there is a checksum difference between target and source, or where the id does not exist in the target db.
This might just be moving the performance problem to the temp insert part but I think it's worth a try.
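As a rough illustration only (the checksum table, the Col1/Col2 column names and the simplified SELECT are assumptions, and the PIVOT from the pseudocode above is omitted):
-- Hypothetical checksum-tracking table kept in the target database.
CREATE TABLE dbo.SourceChecksums (
    ClientOrContact BIT NOT NULL,
    Id INT NOT NULL,
    RowChecksum INT NOT NULL,
    PRIMARY KEY (ClientOrContact, Id)
);

-- Stage only new rows or rows whose checksum changed since the last run.
INSERT INTO #TempTable (ClientOrContact, Id /* , other columns */)
SELECT 1, C.Id /* , other columns */
FROM Source_ClientsTable C
LEFT JOIN dbo.SourceChecksums K
       ON K.ClientOrContact = 1 AND K.Id = C.Id
WHERE K.Id IS NULL
   OR K.RowChecksum <> BINARY_CHECKSUM(C.Col1, C.Col2 /* , remaining columns */);
-- After the MERGE, refresh dbo.SourceChecksums for the staged ids.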
Have you considered triggers? I avoid them like the plague, but they really are the solution to some problems.
Put an INSERT/UPDATE [/DELETE?] trigger on your two source tables. Program it such that when rows are added or updated, the trigger logs the IDs of these rows in an audit table (you'll have to create this), where that table contains the ID, the type of change (update or insert - and delete, if you have to worry about those) and when the change was made. When you run the ETL, join this list of "to be merged" items with the source tables. When you're done, delete those rows from the audit table and it's reset for the next run. (Use the "added on" datetime column to make sure you don't delete rows that may have been added while you were running the ETL.)
There’s lots of details behind proper use and implementation, but overall this idea should do what you need.
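A minimal sketch of one such trigger and audit table (T-SQL; all names are placeholders, and the DELETE case is left out):
CREATE TABLE dbo.SourceChangeLog (
    SourceTable VARCHAR(20) NOT NULL,
    Id INT NOT NULL,
    ChangeType CHAR(1) NOT NULL,                 -- 'I' = insert, 'U' = update
    ChangedOn DATETIME2 NOT NULL DEFAULT SYSDATETIME()
);
GO
CREATE TRIGGER trg_SourceClients_Audit
ON Source_ClientsTable
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Log the id of every inserted/updated row; 'U' if an old version existed, else 'I'.
    INSERT INTO dbo.SourceChangeLog (SourceTable, Id, ChangeType)
    SELECT 'Clients',
           i.Id,
           CASE WHEN EXISTS (SELECT 1 FROM deleted d WHERE d.Id = i.Id) THEN 'U' ELSE 'I' END
    FROM inserted i;
END;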

Incremental load for Updates into Warehouse

I am planning for an incremental load into warehouse (especially for updates of source tables in RDBMS).
I am capturing the updated rows in staging tables from the RDBMS based on the update datetime. But how do I determine which columns of a particular row need to be updated in the target warehouse tables?
Or do I just delete a particular row in the warehouse table (based on the primary key of the row in staging table) and insert the new updated row?
Which is the best way to implement the incremental load between the RDBMS and Warehouse using PL/SQL and SQL coding?
In my opinion, the easiest way to accomplish this is as follows:
Create a stage table identical to your host table. When you do your incremental/net-change load, load all changed records into this table (based on whatever your "last updated" field is)
Delete the records from your actual table based on the primary key. For example, if your primary key is customer, part, the query might look like this:
delete from main_table m
where exists (
select null
from stage_table s
where
m.customer = s.customer and
m.part = s.part
);
Insert the records from the stage to the main table.
You could also do an update of existing records / insert of new records, but either way that's two steps. The advantage of the method I listed is that it will work even if your tables have partitions and the newly updated data violates one of the original partition rules, whereas an update would not accomplish that. Also, the syntax is much simpler, as your update would have to list every single field, whereas the delete from / insert into allows you to list only the primary key fields.
Oracle also has a merge clause that will update if it exists or insert if it does not. I honestly don't know how that would be impacted if you had partitions.
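A sketch of that MERGE for the same (customer, part) key; the non-key columns (qty, price) are assumptions and would be replaced by your actual columns:
MERGE INTO main_table m
USING stage_table s
   ON (m.customer = s.customer AND m.part = s.part)
WHEN MATCHED THEN
    UPDATE SET m.qty = s.qty, m.price = s.price        -- every non-key column must be listed
WHEN NOT MATCHED THEN
    INSERT (customer, part, qty, price)
    VALUES (s.customer, s.part, s.qty, s.price);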
One major caveat: if your updates include deletes (records that need to be removed from the main table), none of these approaches will handle that and you will need some other way to deal with it. It may not be necessary, depending on your circumstances, but it's something to consider.