Selecting distinct rows from two tables and replacing the values

I have a base_table and a final_table with the same columns, where plan and date are the primary key columns. Data flows from the base table to the final table.
Initially final table will look like below:
After that the base table will have
Now the data needs to flow from the base table to the final table. Based on the primary key columns (plan, date) and distinct rows, final_table should then contain:
The first two rows get updated with the new percentage values from the base table.
How do we write a SQL query for this?
I am looking to write this query in Redshift SQL.
Pseudo code tried:
insert into final_table
(plan, date, percentage)
select
b.plan, b.date, b.percentage from base_table b
inner join final_table f on b.plan = f.plan and b.date = f.date;

First you need to understand that clustered (distributed) columnar databases like Redshift and Snowflake don't enforce uniqueness constraints (would be a performance killer). So your pseudo code is incorrect as this will create duplicate rows in the final_table.
You could use UPDATE to change the values in the rows with matching PKs. However, this won't work in the case where there are new values to be added to final_table. I expect you need a more general solution that works in the case of updated values AND new values.
The general way to address this is to create an "upsert" transaction that deletes the matching rows and then inserts rows into the target table. A transaction is needed so no other session can see the table where the rows are deleted but not yet inserted. It looks like:
begin;
delete from final_table
using base_table
where final_table.plan = base_table.plan
and final_table.date = base_table.date;
insert into final_table
select * from base_table;
commit;
Things to remember - 1) autocommit mode can break the transaction 2) you should vacuum and analyze the table if the number of rows changed is large.
Based on your description, it is not clear that I have captured the full intent of your situation ("distinct rows from two tables"). If I have missed the intent, please update the question.
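The delete-then-insert pattern above can be tried out locally. The following is a minimal sketch in Python using the built-in sqlite3 module as a stand-in for Redshift, so it is an illustration of the idea rather than Redshift code: SQLite has no DELETE ... USING, so the delete is written with WHERE EXISTS instead, and the sample rows are invented for the demonstration.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# so the BEGIN/COMMIT below are the only transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
cur = conn.cursor()

# Invented sample data: A and B exist in both tables, C only in final, D only in base.
cur.executescript("""
CREATE TABLE final_table (plan TEXT, date TEXT, percentage REAL);
CREATE TABLE base_table  (plan TEXT, date TEXT, percentage REAL);
INSERT INTO final_table VALUES
    ('A', '2021-01-01', 10), ('B', '2021-01-01', 20), ('C', '2021-01-01', 30);
INSERT INTO base_table VALUES
    ('A', '2021-01-01', 15), ('B', '2021-01-01', 25), ('D', '2021-01-01', 40);
""")

cur.execute("BEGIN")
# Step 1: remove rows in final_table whose (plan, date) also appears in base_table.
cur.execute("""
    DELETE FROM final_table
    WHERE EXISTS (
        SELECT 1 FROM base_table b
        WHERE b.plan = final_table.plan AND b.date = final_table.date
    )
""")
# Step 2: insert every base_table row (updated rows and brand-new rows alike).
cur.execute("INSERT INTO final_table SELECT plan, date, percentage FROM base_table")
cur.execute("COMMIT")

result = cur.execute(
    "SELECT plan, date, percentage FROM final_table ORDER BY plan"
).fetchall()
for row in result:
    print(row)
```

Rows A and B end up with the base_table percentages, C is left untouched, and D is added, with no duplicate (plan, date) pairs.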

You don't need an INSERT statement but an UPDATE statement -
UPDATE final_table
SET percentage = b.percentage
FROM base_table b
WHERE final_table.plan = b.plan AND final_table.date = b.date;

Related

How do you merge two tables with multiple unique identifiers?

I have two tables with the same columns. The first table is the main table, and I want to upsert new unique entries from the _tmp_ table into it.
For example, the columns are:
id, text_id, last_sent, recent_sent, updated_at, date_created
I want to merge the communicated_tmp_ table, which is created from another table, into the communicated table, but only for rows where the communicated table doesn't already have an identical id, text_id, last_sent and recent_sent.
The query I'm using now is posted below but doesn't work. This query inserts all the data from the _tmp_ table.
I have checked and both the types of the tables are the same. And I just don't know what I'm doing wrong.
Help much appreciated
MERGE
`project.map.communicated` CURRENT_TABLE
USING
`project.map.communicated_tmp_` NEW_OR_UPDATED
ON
(CURRENT_TABLE.id = NEW_OR_UPDATED.id
AND CURRENT_TABLE.text_id = NEW_OR_UPDATED.text_id
AND CURRENT_TABLE.last_sent = NEW_OR_UPDATED.last_sent
AND CURRENT_TABLE.recent_sent = NEW_OR_UPDATED.recent_sent)
WHEN NOT MATCHED
THEN
INSERT
(`id`,
`text_id`,
`last_sent`,
`recent_sent`,
`updated_at`,
`date_created`)
VALUES
(`id`,`text_id`,`last_sent`,`recent_sent`,`updated_at`,`date_created`)

The MERGE statement uses join logic to find matches. The only reason this should fail is if some rows have NULLs in any of the fields used in the join, because NULL = NULL does not evaluate to true and such rows are therefore never matched. Make sure to exclude the NULLs, or build a composite key that works around the NULL values.
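The NULL behaviour is easy to reproduce. Here is a small sketch in Python with the built-in sqlite3 module (not BigQuery), using cut-down, invented versions of the two tables. It compares a plain equality join with a NULL-safe one; SQLite's IS operator is used for the NULL-safe comparison, whereas in another dialect you would need something like COALESCE on both sides or an explicit IS NULL check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Both tables hold the same two rows; the second row has a NULL last_sent.
cur.executescript("""
CREATE TABLE communicated     (id INTEGER, text_id INTEGER, last_sent TEXT);
CREATE TABLE communicated_tmp (id INTEGER, text_id INTEGER, last_sent TEXT);
INSERT INTO communicated     VALUES (1, 10, '2020-01-01'), (2, 20, NULL);
INSERT INTO communicated_tmp VALUES (1, 10, '2020-01-01'), (2, 20, NULL);
""")

# Plain "=" comparison: the NULL row never matches, so it looks "new".
unmatched_with_equals = cur.execute("""
    SELECT t.id
    FROM communicated_tmp t
    LEFT JOIN communicated c
      ON c.id = t.id AND c.text_id = t.text_id AND c.last_sent = t.last_sent
    WHERE c.id IS NULL
""").fetchall()

# NULL-safe comparison (SQLite's IS): the NULL row is recognised as already present.
unmatched_null_safe = cur.execute("""
    SELECT t.id
    FROM communicated_tmp t
    LEFT JOIN communicated c
      ON c.id = t.id AND c.text_id = t.text_id AND c.last_sent IS t.last_sent
    WHERE c.id IS NULL
""").fetchall()

print(unmatched_with_equals)
print(unmatched_null_safe)
```

With the plain equality join, the row with id 2 is reported as unmatched even though an identical row exists, which is exactly the situation in which a WHEN NOT MATCHED branch would insert it again.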

Remove duplicate SQL rows by looking at all columns

I have this table, where every column is a VARCHAR (or equivalent):
field001 field002 field003 field004 field005 .... field500
500 VARCHAR columns. No primary keys. And no column is guaranteed to be unique. So the only way to know for sure if two rows are the same is to compare the values of all columns.
(Yes, this should be in TheDailyWTF. No, it's not my fault. Bear with me here).
I inserted a duplicate set of rows by mistake, and I need to find them and remove them.
There are 12 million rows in this table, so I'd rather not recreate it.
However, I do know what rows were mistakenly inserted (I have the .sql file).
So I figured I'd create another table and load it with those. And then I'd do some sort of join that would compare all columns on both tables and then delete the rows that are equal from the first table. I tried a NATURAL JOIN as that looked promising, but nothing was returned.
What are my options?
I'm using Amazon Redshift (so PostgreSQL 8.4 if I recall), but I think this is a general SQL question.

You can treat the whole row as a single record in Postgres (and thus I think in Redshift).
The following works in Postgres, and will keep one of the duplicates
delete from the_table
where ctid not in (select min(ctid)
from the_table
group by the_table); --<< Yes, the group by is correct!
This is going to be slow!
Grouping over so many columns and then deleting with a NOT IN will take quite some time. Especially if a lot of rows are going to be deleted.
If you want to delete all duplicate rows (not keeping any of them), you can use the following:
delete from the_table
where the_table in (select the_table
from the_table
group by the_table
having count(*) > 1);
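To see the "keep one copy of each duplicate" idea in action without a Postgres server, here is a rough sketch in Python with the built-in sqlite3 module. It is an analogue, not the query above: SQLite has no ctid and cannot group by a whole row, so the sketch uses SQLite's rowid and lists the columns explicitly, and the three-column table is an invented miniature of the 500-column one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Miniature stand-in for the wide table: ('a','b','c') was inserted twice.
cur.executescript("""
CREATE TABLE the_table (field001 TEXT, field002 TEXT, field003 TEXT);
INSERT INTO the_table VALUES
    ('a', 'b', 'c'),
    ('a', 'b', 'c'),
    ('x', 'y', 'z'),
    ('a', 'b', 'd');
""")

# Keep the row with the smallest rowid in each group of identical rows,
# delete every other member of the group.
cur.execute("""
    DELETE FROM the_table
    WHERE rowid NOT IN (
        SELECT MIN(rowid)
        FROM the_table
        GROUP BY field001, field002, field003
    )
""")
conn.commit()

remaining = cur.execute("""
    SELECT field001, field002, field003
    FROM the_table
    ORDER BY field001, field002, field003
""").fetchall()
print(remaining)
```

Only the second copy of ('a', 'b', 'c') is removed; rows that differ in any column are kept.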

You should be able to identify all the mistakenly inserted rows using CREATEXID. If you group by CREATEXID on your table as below and get the counts, you can see how many rows were inserted in each transaction and then remove the offending rows with a DELETE command.
SELECT CREATEXID,COUNT(1)
FROM yourtable
GROUP BY 1;

One simplistic solution is to recreate the table, e.g.
CREATE TABLE my_temp_table (
-- add column definitions here, just like the original table
);
INSERT INTO my_temp_table SELECT DISTINCT * FROM original_table;
DROP TABLE original_table;
ALTER TABLE my_temp_table RENAME TO original_table;
or even
CREATE TABLE my_temp_table AS SELECT DISTINCT * FROM original_table;
DROP TABLE original_table;
ALTER TABLE my_temp_table RENAME TO original_table;

It is a trick, but it may help.
Each row in the table carries the ID of the transaction in which it was inserted or updated (see the System Columns documentation); this is the xmin column. Using it, you can find the ID of the transaction in which you inserted the wrong data. Then just delete the rows using
delete from my_table where xmin = <the_wrong_transaction_id>;
PS: Be careful and try it on a test table first.

Update a column value for 500 million rows in Interval Partitioned table

We have a table with 10 billion rows. The table is interval partitioned on date. In one subpartition we need to update the date to a new value for the 500 million rows that match the criteria. This will definitely affect the creation of new partitions, because the table is partitioned on that same date. Could anyone give me pointers on the best approach to follow?
Thanks in advance!

If you are going to update the partitioning key and the source rows are in a single (sub)partition, then a reasonable approach would be to:
1. Create a temporary table for the updated rows. If possible, perform the update on the fly:
CREATE TABLE updated_rows
AS
SELECT add_months(partition_key, 1) AS partition_key, other_columns...
FROM original_table PARTITION (xxx)
WHERE ...;
2. Drop the original (sub)partition:
ALTER TABLE original_table DROP PARTITION xxx;
3. Reinsert the updated rows:
INSERT /*+append*/ INTO original_table
SELECT * FROM updated_rows;
In case you have issues with CTAS or INSERT INTO SELECT for 500M rows, consider partitioning the temporary table and moving the data in batches.

If you have enough space, I would create a "copy" of the source table containing the correctly updated rows, check the results, drop the source table, and finally rename the "copy" to the source name. Yes, this takes a long time to execute, but it can be a painless way to do it. A parallel hint is needed, of course.

You may consider adding a new flag column, 'updated', to your table, with a default of NULL (or 0; I prefer NULL). Using the date criteria for the rows you need to update, you can update the data group by group in the same way described by Kombajn; once a group has been updated, set its 'updated' flag to 1.
For example, let's split the data into groups and take the year as the grouping criterion, so the data is processed year by year.
1. Create a temporary table for the first year:
CREATE TABLE updated_rows
AS
SELECT columns...
FROM original_table PARTITION (2001)
WHERE YEAR = 2001
...;
2. Drop the original (sub)partition:
ALTER TABLE original_table DROP PARTITION 2001;
3. Reinsert the updated rows:
INSERT /*+append*/ INTO original_table(columns....,updated)
SELECT columns...,1 FROM updated_rows;
I hope this helps you process the data step by step, so you don't have to wait for the whole table to be updated at once. You may consider a cursor that loops over the years.

SQL Trigger Inserting from Multiple tables

I am trying to execute a query within a SQL trigger.
I have 4 tables A, B, C, D. Table A is a lookup list and contains roughly 1400 rows of data. Table B are values being input through an HMI with a timestamp. Table C is the table where my values are intended to go. Table D is a list of multipliers to use to multiply values from table A to table B (I am only using one multiplier from table D at the moment).
When a user inputs data into table B, that should trigger the procedure to get the values that were inserted (including the itemnumber) and relate the itemnumber to table A and use table D to multiply a few things together to send values to Table C. If I only input 3 rows of data in table B for example, I should only get three rows of data in table C. I am merely using table A to match the item number and get some data. But for some reason I am inserting way more records than intended, over 1600 rows.
Table D multipliers have a timestamp that does not match or have any correlation with any other table. So I am using a timestamp and selecting the multipliers that are closest to the timestamp from table B (some multipliers will change throughout time and I need a historical multiplier to correctly multiply the right things together)
Your help is most appreciated. Thank you.
Insert into TableC( ItemNumber, Cases, [Description], [Type], Wic, Elc, TotalElc, LbsPerCase, TotalLbs, PeopleRequired, ScheduleHours, Rated, Capacity, [TimeStamp])
Select
b.ItemNumber, b.CaseCount, a.ItemDescription, a.DivisionCode, a.workcenter,
a.LaborPercase as ELC, b.CaseCount * a.LaborPerCase * d.IpCg,
a.LbsPerCase, a.LaborPerCase * b.CaseCount as TotalLbs,
a.PersonReqd, b.Schedulehours, a.PoundRating,
b.ScheduleHours * a.PoundRating as Capactity, b.shift, GETDATE()
from
TableA a, TableB b, TableD d
Where
a.itemnumber = b.itemnumber
and d.IpCG < b.TimeStamp
and b.CasesCount > 0

You do not reference the inserted or deleted tables that are available only in the trigger, so of course your query returns more records than you need.
When first writing a trigger, what I do is create a temp table called #inserted (and/or #deleted) and populate it with several records. It should match the design of the table that the trigger will be on. It is important to give your temp table several input records that meet the various criteria that affect your query (so in your case you want some where the case count is 0 and some where it is not, for instance) and that are typical of the data inserted into or updated in the table. SQL Server triggers operate on sets of data, so this also ensures that your trigger can properly handle multi-record inserts or updates. A properly written trigger has test cases you need to run to make sure everything happens correctly, and your #inserted table should include records that cover all those test cases.
Then write the query in a transaction (and roll it back while you are testing), joining to #inserted. If you are doing an insert with a select, write only the select part until you get that right, then add the insert. For testing, write a select from the table you are inserting into, so you can see the data you inserted before you roll back.
Once you get everything working, change the #inserted references to inserted, remove any testing code and of course the rollback (possibly the whole transaction, depending on what you are doing), and add the drop and create trigger part of the code. Now you can test your trigger as a trigger, but you are in good shape because you know from your earlier testing that it is likely to work.
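The testing workflow described above can be sketched as follows. This is a minimal illustration in Python with the built-in sqlite3 module, not SQL Server: the temp table inserted_test is a hypothetical stand-in for #inserted, and the table layouts are cut down to a few assumed columns. The point it shows is that the insert-select joins only to the newly "inserted" rows, runs inside a transaction, is inspected, and is then rolled back.

```python
import sqlite3

# Autocommit mode, so the explicit BEGIN/ROLLBACK below control the transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
cur = conn.cursor()

cur.executescript("""
CREATE TABLE TableA (ItemNumber INTEGER PRIMARY KEY, LaborPerCase REAL);
CREATE TABLE TableC (ItemNumber INTEGER, Cases INTEGER, TotalElc REAL);
-- Hypothetical stand-in for the trigger's `inserted` pseudo-table.
CREATE TEMP TABLE inserted_test (ItemNumber INTEGER, CaseCount INTEGER);

INSERT INTO TableA VALUES (1, 0.5), (2, 1.5), (3, 2.0), (4, 4.0);
-- Several test rows, including one with CaseCount = 0 that must be filtered out.
INSERT INTO inserted_test VALUES (1, 10), (2, 4), (3, 0);
""")

cur.execute("BEGIN")
# The lookup table is joined to the inserted rows only, never to all of TableB.
cur.execute("""
    INSERT INTO TableC (ItemNumber, Cases, TotalElc)
    SELECT i.ItemNumber, i.CaseCount, i.CaseCount * a.LaborPerCase
    FROM inserted_test AS i
    JOIN TableA AS a ON a.ItemNumber = i.ItemNumber
    WHERE i.CaseCount > 0
""")
# Inspect what the statement would have written before undoing it.
rows_in_transaction = cur.execute(
    "SELECT ItemNumber, Cases, TotalElc FROM TableC ORDER BY ItemNumber"
).fetchall()
cur.execute("ROLLBACK")

rows_after_rollback = cur.execute("SELECT COUNT(*) FROM TableC").fetchone()[0]
print(rows_in_transaction)
print(rows_after_rollback)
```

Three test rows go in, two pass the CaseCount filter, and only those two reach TableC, one output row per qualifying input row. After the rollback TableC is empty again, so the test can be repeated as often as needed.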

delete old values of a table and update the table with results of same query

My question is simple, but I can't find a way to delete the old values of a table and update the same table with the results of the same query.
UPDATE
The query is a SELECT on Table A, and its results become Table B. Table B should contain nothing other than the result of the latest query on Table A.
I have a very big table, and I need to process the records and create a new table regularly. The old values of this table are not important, only the new ones.
I will appreciate any help.

What about a view, if you only need table B to query on? You said you have a select on table A. Let's say your select is SELECT * FROM TableA WHERE X = Y. Then your statement would be
CREATE VIEW vwTableB AS
SELECT * FROM TableA WHERE X = Y
And then instead of querying tableB you would query vwTableB. Any changes to the data in table A would be reflected in the view, so you don't have to keep running a script yourself.
This way the data in vwTableB is always up to date, and you don't have to keep deleting from and inserting into the second table.
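A quick way to convince yourself that the view tracks the base table is the following sketch, written in Python with the built-in sqlite3 module since the question does not name a database. The status column and its values are invented for the demonstration; they play the role of the X = Y filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Invented filter: the view exposes only the rows whose status is 'open'.
cur.executescript("""
CREATE TABLE TableA (id INTEGER PRIMARY KEY, status TEXT);
INSERT INTO TableA VALUES (1, 'open'), (2, 'closed');
CREATE VIEW vwTableB AS
    SELECT id, status FROM TableA WHERE status = 'open';
""")

before = cur.execute("SELECT id FROM vwTableB ORDER BY id").fetchall()

# Change the base table only; the view is never touched directly.
cur.execute("INSERT INTO TableA VALUES (3, 'open')")
cur.execute("UPDATE TableA SET status = 'closed' WHERE id = 1")
conn.commit()

after = cur.execute("SELECT id FROM vwTableB ORDER BY id").fetchall()
print(before)
print(after)
```

The view's contents change from row 1 to row 3 without any delete or insert against a second table.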

You can use a temporary table to store the results you are working with, if you only need them for one session. It will automatically be dropped when the session ends.
You didn't say what database you are using, but try this:
create temp table tableB as select * from tableA
