I have a table with hundreds of millions of rows in which I need to create a "duplicate" of each existing row, doubling its row count. I'm currently using an insert operation (and unlogging the table prior to inserting), which still takes a long while as one transaction. I'm looking for guidance on whether there may be a more efficient way to execute the query below.
INSERT INTO costs(
parent_record, is_deleted
)
SELECT id, is_deleted
FROM costs;
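For reference, the approach described above looks roughly like this (a minimal sketch, assuming PostgreSQL, where "unlogging" corresponds to ALTER TABLE ... SET UNLOGGED):
-- Skip WAL writes for the copy; this sacrifices crash safety for the table while it is unlogged.
ALTER TABLE costs SET UNLOGGED;
INSERT INTO costs (parent_record, is_deleted)
SELECT id, is_deleted
FROM costs;
-- Switching back rewrites the table into the WAL.
ALTER TABLE costs SET LOGGED;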
Related
I have a base_table and a final_table with the same columns, where plan and date are the primary key columns. The data flows from the base table to the final table.
Initially the final table will look like below:
After that the base table will have:
Now the data needs to flow from the base table to the final table based on the primary key columns (plan, date), and with distinct rows the final_table should have:
The first two rows get updated with new values in percentage from the base table to the final table.
How do we write a SQL query for this?
I am looking to write this query in Redshift SQL.
Pseudo code tried:
insert into final_table (plan, date, percentage)
select b.plan, b.date, b.percentage
from base_table b
inner join final_table f
  on b.plan = f.plan and b.date = f.date;
First you need to understand that clustered (distributed) columnar databases like Redshift and Snowflake don't enforce uniqueness constraints (enforcing them would be a performance killer). So your pseudo code is incorrect, as it will create duplicate rows in final_table.
You could use UPDATE to change the values in the rows with matching PKs. However, this won't work in the case where there are new values to be added to final_table. I expect you need a more general solution that works in the case of updated values AND new values.
The general way to address this is to create an "upsert" transaction that deletes the matching rows and then inserts rows into the target table. A transaction is needed so no other session can see the table where the rows are deleted but not yet inserted. It looks like:
begin;
delete from final_table
using base_table
where final_table.plan = base_table.plan
and final_table.date = base_table.date;
insert into final_table
select * from base_table;
commit;
Things to remember: 1) autocommit mode can break the transaction, and 2) you should vacuum and analyze the table if the number of rows changed is large.
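For example (a minimal sketch in Redshift syntax), the post-upsert maintenance would be:
-- Reclaim space from the deleted rows and refresh planner statistics.
VACUUM final_table;
ANALYZE final_table;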
Based on your description it is not clear that I have captured the full intent of your situation ("distinct rows from two tables"). If I have missed the intent please update.
You don't need an INSERT statement but an UPDATE statement -
UPDATE final_table f
SET percentage = b.percentage
FROM base_table b
WHERE b.plan = f.plan
  AND b.date = f.date;
I have an importer system which updates columns of already existing rows in a table. Since UPDATE was taking time, I changed it to DELETE and BULK INSERT.
Here is my database setup snippet
Table: ParameterDefinition
Columns: Id, Name, Other Cols
Table: ParameterValue
Columns: Id, CustId, ParameterDefId, Value
I get the values associated with ParameterDefinition.Name from my XML source, so to import I first delete all the existing ParameterValue rows for all the ParameterDefinition.Name values passed in the XML, and finally bulk insert all the values from the XML. Here is my query:
DELETE FROM ParameterValue WHERE CustId = ? AND ParameterDefId IN (?,?...?);
For 1000 customers the above DELETE statement is called 1000 times, which is very time consuming: approximately 64 seconds.
Is there any better way to handle DELETE of 1000 customers?
Thanks,
Sheeju
Create a temporary table for the bulk-insert (ParameterValue_Import). Do the bulk-inserts to this table, then update/insert/delete based on the imported data.
INSERT INTO .. SELECT .. WHERE NOT EXISTS ( .. ) for the new rows
UPDATE .. FROM for the updates
DELETE FROM WHERE NOT EXISTS ( .. ) for the deletion
Bulk operations have better performance than standalone operations. Most DBMSs are designed to handle set based operations instead of record based ones.
Edit
To delete or update one record based on a WHERE clause which refers to only one record, the DBMS must either do a full table scan (if there is no index for the WHERE condition) or an index lookup. Only after the record is successfully identified does the DBMS carry out the original request (update or delete). Depending on the number of records in the table and/or the size/depth of the index, this can be really expensive, and the process is repeated for each and every command in the batch. Summing up the total cost, it can be more than updating/deleting records based on another table (especially if the operations update/delete nearly all records in the target table).
When you delete/update several records at once (e.g. based on another table), the DBMS can do the lookups with only one table scan/index lookup and perform a logical join when processing your request.
The cost of actually updating a record is the same in each case; it is the total cost of the lookups that can differ significantly.
Furthermore, deleting and then inserting a record to update it can require more resources: when you delete a record, all related indexes are updated, and when you insert the new record, the indexes are updated once more, whereas with an update only the indexes on the updated columns need to be maintained (and only once).
Here is the exact syntax for the idea given by Pred above.
After the bulk insert, let's say you have the data in "ParameterValue_Import".
To INSERT the records in "ParameterValue_Import" which are not in "ParameterValue":
INSERT INTO ParameterValue (
CustId, ParameterDefId, Value
)
SELECT
CustId, ParameterDefId, Value
FROM
ParameterValue_Import
WHERE
NOT EXISTS (
SELECT null
FROM ParameterValue
WHERE ParameterValue.CustId = ParameterValue_Import.CustId
AND ParameterValue.ParameterDefId = ParameterValue_Import.ParameterDefId
);
To UPDATE the records in "ParameterValue" which are also in "ParameterValue_Import":
UPDATE
ParameterValue
SET
Value = ParameterValue_Import.Value
FROM
ParameterValue_Import
WHERE
ParameterValue.ParameterDefId = ParameterValue_Import.ParameterDefId
AND ParameterValue.CustId = ParameterValue_Import.CustId;
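For completeness, the third step from the list above (the DELETE ... WHERE NOT EXISTS) could look something like this; it is only a sketch, and the assumption that rows missing from the import should be removed for the imported customers may not match your requirements:
-- Delete rows belonging to imported customers whose parameter no longer appears in the import.
DELETE FROM ParameterValue
WHERE EXISTS (
    SELECT null
    FROM ParameterValue_Import
    WHERE ParameterValue_Import.CustId = ParameterValue.CustId
)
AND NOT EXISTS (
    SELECT null
    FROM ParameterValue_Import
    WHERE ParameterValue_Import.CustId = ParameterValue.CustId
    AND ParameterValue_Import.ParameterDefId = ParameterValue.ParameterDefId
);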
I am trying to execute a query within a SQL trigger.
I have 4 tables A, B, C, D. Table A is a lookup list and contains roughly 1400 rows of data. Table B holds values being input through an HMI, with a timestamp. Table C is the table where my values are intended to go. Table D is a list of multipliers used to multiply values from table A and table B together (I am only using one multiplier from table D at the moment).
When a user inputs data into table B, that should fire the trigger, which gets the values that were inserted (including the itemnumber), relates the itemnumber to table A, and uses table D to multiply a few things together and send values to table C. If I only input 3 rows of data into table B, for example, I should only get three rows of data in table C. I am merely using table A to match the item number and get some data. But for some reason I am inserting way more records than intended: over 1600 rows.
Table D multipliers have a timestamp that does not match or have any correlation with any other table, so I am using a timestamp and selecting the multipliers that are closest to the timestamp from table B (some multipliers change over time and I need a historical multiplier to correctly multiply the right things together).
Your help is most appreciated. Thank you.
Insert into TableC( ItemNumber, Cases, [Description], [Type], Wic, Elc, TotalElc, LbsPerCase, TotalLbs, PeopleRequired, ScheduleHours, Rated, Capacity, [TimeStamp])
Select
b.ItemNumber, b.CaseCount, a.ItemDescription, a.DivisionCode, a.workcenter,
a.LaborPercase as ELC, b.CaseCount * a.LaborPerCase * d.IpCg,
a.LbsPerCase, a.LaborPerCase * b.CaseCount as TotalLbs,
a.PersonReqd, b.Schedulehours, a.PoundRating,
b.ScheduleHours * a.PoundRating as Capacity, b.shift, GETDATE()
from
TableA a, TableB b, TableD d
Where
a.itemnumber = b.itemnumber
and d.IpCG < b.TimeStamp
and b.CaseCount > 0
You do not reference the inserted or deleted tables that are available only in the trigger, so of course you are returning more records than you need in your query.
When first writing a trigger, what I do is create a temp table called #inserted (and/or #deleted) and populate it with several records. It should match the design of the table that the trigger will be on. It is important that your temp table contains several input records that meet the various criteria that affect your query (so in your case you want some where the case count would be 0 and some where it would not, for instance) and that are typical of data inserted into the table or updated in it. SQL Server triggers operate on sets of data, so this also ensures that your trigger can properly handle multi-record inserts or updates. A properly written trigger has test cases you need to run to make sure everything happens correctly, so your #inserted table should include records that cover all those test cases.
Then write the query in a transaction (and roll it back while you are testing), joining to #inserted. If you are doing an insert with a select, only write the select part until you get that right, then add the insert. For testing, write a select from the table you are inserting into in order to see the data you inserted before you roll back.
Once you get everything working, change the #inserted references to inserted, remove any testing code and of course the rollback (possibly the whole transaction, depending on what you are doing), and add the drop and create trigger parts of the code. Now you can test your trigger as a trigger, and you are in good shape because you know from your earlier testing that it is likely to work.
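A minimal sketch of that workflow, with table and column shapes borrowed loosely from the question (the exact data types and test values here are assumptions):
-- 1) Stand-in for the trigger's inserted table, shaped like TableB and seeded with test cases.
CREATE TABLE #inserted (ItemNumber int, CaseCount int, ScheduleHours decimal(9,2), Shift int, [TimeStamp] datetime);
INSERT INTO #inserted VALUES
(100, 10, 8.0, 1, GETDATE()),  -- a normal row
(200, 0, 8.0, 1, GETDATE());   -- a row the trigger should filter out (CaseCount = 0)
-- 2) Develop the SELECT against #inserted inside a transaction you can roll back.
BEGIN TRANSACTION;
SELECT i.ItemNumber, i.CaseCount, a.ItemDescription
FROM #inserted i
INNER JOIN TableA a ON a.ItemNumber = i.ItemNumber
WHERE i.CaseCount > 0;
-- (once this is right, turn it into the INSERT INTO TableC ... and select from TableC here to verify)
ROLLBACK TRANSACTION;
-- 3) Finally, replace #inserted with inserted, drop the test code, and wrap the statement in
--    CREATE TRIGGER ... ON TableB AFTER INSERT.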
I have over 60M rows to delete from 2 separate tables (38M and 19M). I have never deleted this many rows before, and I'm aware that it'll cause things like rollback errors etc. and probably won't complete.
What's the best way to delete this amount of rows?
You can delete some number of rows at a time and do it repeatedly.
delete from *your_table*
where *conditions*
and rownum <= 1000000
The above SQL statement will remove up to 1M rows at a time, and you can execute it 38 times for the larger table, either by hand or in a PL/SQL block.
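A sketch of the PL/SQL-block variant, keeping the placeholders from the statement above (the batch size and commit-per-batch behaviour are assumptions you may want to tune):
begin
  loop
    delete from *your_table*
    where *conditions*
    and rownum <= 1000000;        -- delete in batches so undo/rollback stays small
    exit when sql%rowcount = 0;   -- stop once no matching rows remain
    commit;                       -- commit each batch
  end loop;
  commit;
end;
/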
The other way I can think of: if a large portion of the data should be removed, you can negate the condition and insert the data that should remain into a new table; after inserting, drop the original table and rename the new one.
create table *new_table* as
select * from *your_table*
where *conditions_of_remaining_data*
After the above, you can drop the old table and rename the new one.
drop table *your_table*;
alter table *new_table* rename to *your_table*;
We have around 2,080,000 records in the table.
We needed to add a new column to it, and we added it.
Since this new column needs to be the primary key, we want to update all rows with values from a sequence.
Here's the query
BEGIN
FOR loop_counter IN 1 .. 211 LOOP
update user_char set id = USER_CHAR__ID_SEQ.nextval where user_char.id is null and rownum<100000;
commit;
END LOOP;
end;
But almost 1 day has now passed and the query is still running.
Note: I am not db developer/programmer.
Is there anything wrong with this query, or is there another (quicker) way to do the same job?
First, there does not appear to be any reason to use PL/SQL here. It would be more efficient to simply issue a single SQL statement to update every row
UPDATE user_char
SET id = USER_CHAR__ID_SEQ.nextval
WHERE id IS NULL;
Depending on the situation, it may also be more efficient to create a new table and move the data from the old table to the new table in order to avoid row migration, i.e.
ALTER TABLE user_char
RENAME TO user_char_old;
CREATE TABLE user_char
AS
SELECT USER_CHAR__ID_SEQ.nextval AS id, <<list of other columns>>
FROM user_char_old;
<<Build indexes on user_char>>
<<Drop and recreate any foreign key constraints involving user_char>>
If this was a large table, you could use parallelism in the CREATE TABLE statement. It's not obvious that you'd get a lot of benefit from parallelism with a small 2 million row table but that might shave a few seconds off the operation.
Second, if it is taking a day to update a mere 2 million rows, there must be something else going on. A 2 million row table is pretty small these days; I can populate and update a 2 million row table on my laptop in somewhere between a few seconds and a few minutes. Are there triggers on this table? Are there foreign keys? Are there other sessions updating the rows? What is the query waiting on?
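To answer that last question, one quick check is the session's current wait event; a sketch, assuming you have access to the V$ views:
-- Show what active sessions running SQL are currently waiting on.
SELECT sid, sql_id, event, wait_class, seconds_in_wait
FROM v$session
WHERE status = 'ACTIVE'
AND sql_id IS NOT NULL;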