Oracle SQL Merge Statement with Conditions - sql

I'm relatively new to SQL, and I'm having an issue where the target table is not being updated.
I have duplicate account numbers (the key) with different contact information in the associated columns. I'm attempting to consolidate the contact information (source) into a single row per account number, with the non-duplicate contact information going into extended columns in the target.
I constructed a MERGE statement with a CASE condition to check whether the data exists in the target table. If the data is not in the target table, the information should be added to the extended columns. The issue is that the target table doesn't get updated. Both source and target tables are similarly defined.
**Merge SQL- reduced query**
MERGE INTO target tgt
USING (select accountno, cell, site, contact, email1 from (select w.accountno, w.cell, w.site, w.contact, email1, row_number() over (PARTITION BY w.accountno order by accountno desc) acct
from source w) inn where inn.acct =1) src
ON (tgt.accountno = src.accountno)
WHEN MATCHED
THEN
UPDATE SET
tgt.phone4 =
CASE WHEN src.cell <> tgt.cell
THEN src.cell
END,
tgt.phone5 =
CASE WHEN src.site <> tgt.site
THEN src.site
END
I have validated that there is contact information in the source table for an accountno that should be added to the target table. I greatly appreciate any insight as to why the target table is not being updated.
I saw a similar question on Stack, but it didn't have a response.

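The src subquery in your USING clause returns just one arbitrary row for each accountno: the ROW_NUMBER() is ordered by accountno, which is also the partition column, so the order within each partition is undefined.
You need to aggregate the rows instead, for example using PIVOT: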
with source(accountno, cell, site, contact) as ( --test data:
select 1,8881234567,8881235678,8881236789 from dual union all
select 1,8881234567,8881235678,8881236789 from dual
)
select accountno, contact,
r1_cell, r1_site,
r2_cell, r2_site
from (select s.*,row_number()over(partition by accountno order by cell) rn
from source s
)
pivot (
max(cell) cell,max(site) site
FOR rn
IN (1 R1,2 R2)
)
So finally you can compare r1_cell, r1_site, r2_cell, r2_site with destination values and use required ones:
MERGE INTO target tgt
USING (
select accountno, contact,
r1_cell, r1_site,
r2_cell, r2_site
from (select s.*,row_number()over(partition by accountno order by cell) rn
from source s
)
pivot (
max(cell) cell,max(site) site
FOR rn
IN (1 R1,2 R2)
)
) src
ON (tgt.accountno = src.accountno)
WHEN MATCHED
THEN
UPDATE SET
tgt.phone4 =
CASE
WHEN src.r1_cell <> tgt.cell
THEN src.r1_cell
ELSE src.r2_cell
END,
tgt.phone5 =
CASE WHEN src.r1_site <> tgt.site
THEN src.r1_site
ELSE src.r2_site
END
/
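The pivoting step can be tried outside Oracle as well. Below is a minimal sketch using Python's built-in sqlite3 module; since SQLite has no PIVOT clause, conditional aggregation stands in for it. The sample phone numbers are invented for illustration, and window functions require SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (accountno INTEGER, cell TEXT, site TEXT);
    -- two rows for the same account, with made-up contact numbers
    INSERT INTO source VALUES (1, '8881110000', '8882220000');
    INSERT INTO source VALUES (1, '8883330000', '8884440000');
""")

# Number the rows per account in a deterministic order (by cell), then
# fold rows 1 and 2 into columns. MAX(CASE ...) plays the role of PIVOT.
rows = conn.execute("""
    SELECT accountno,
           MAX(CASE WHEN rn = 1 THEN cell END) AS r1_cell,
           MAX(CASE WHEN rn = 1 THEN site END) AS r1_site,
           MAX(CASE WHEN rn = 2 THEN cell END) AS r2_cell,
           MAX(CASE WHEN rn = 2 THEN site END) AS r2_site
    FROM (SELECT s.*,
                 ROW_NUMBER() OVER (PARTITION BY accountno ORDER BY cell) AS rn
          FROM source s)
    GROUP BY accountno
""").fetchall()

print(rows)
```

Each account now yields a single row carrying both sets of contact columns, which is what the MERGE needs in its USING clause.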

The issue lies in the logic used to number the rows that share the same account number.
MERGE
INTO target tgt
USING (select accountno, cell, site, contact, email1
from (select w.accountno, w.cell, w.site, w.contact, email1
, row_number() over (PARTITION BY w.accountno order by w.accountno desc) acct
from source w
left join target w2
on w.accountno=w2.accountno
where w2.cell is null /* get records which are not in target*/
) inn
where inn.acct =1
) src
ON (tgt.accountno = src.accountno)
WHEN MATCHED THEN
UPDATE
SET tgt.phone4 = src.cell,
tgt.phone5 = src.site

INSERT or UPDATE the table from SELECT in sql server

I have a requirement: if a record for the business date already exists in the table, I need to update the values for that business date from the select statement; otherwise I have to insert a row for that business date from the select statement. Below is my full query, where I am only inserting at the moment:
INSERT INTO
gstl_calculated_daily_fee(business_date,fee_type,fee_total,range_id,total_band_count)
select
@tlf_business_date,
'FEE_LOCAL_CARD',
SUM(C.settlement_fees),
C.range_id,
Count(1)
From
(
select
*
from
(
select
rowNumber = @previous_mada_switch_fee_volume_based_count + (ROW_NUMBER() OVER(PARTITION BY DATEPART(MONTH, x_datetime) ORDER BY x_datetime)),
tt.x_datetime
from gstl_trans_temp tt where (message_type_mapping = '0220') and card_type ='GEIDP1' and response_code IN('00','10','11') and tran_amount_req >= 5000 AND merchant_type NOT IN(5542,5541,4829)
) A
CROSS APPLY
(
select
rtt.settlement_fees,
rtt.range_id
From gstl_mada_local_switch_fee_volume_based rtt
where A.rowNumber >= rtt.range_start
AND (A.rowNumber <= rtt.range_end OR rtt.range_end IS NULL)
) B
) C
group by CAST(C.x_datetime AS DATE),C.range_id
I have tried to use IF EXISTS but could not fit it into the full query above.
if exists (select
business_date
from gstl_calculated_daily_fee
where
business_date = @tlf_business_date)
UPDATE gstl_calculated_daily_fee
SET fee_total = @total_mada_local_switch_fee_low
WHERE fee_type = 'FEE_LOCAL_CARD'
AND business_date = @tlf_business_date
else
INSERT INTO
Please help.
You need a MERGE statement with a join.
Basically, our issue with MERGE is going to be that we only want to merge against a subset of the target table. To do this, we pre-filter the table as a CTE. We can also put the source table as a CTE.
Be very careful when you write a MERGE against a CTE. You must make sure the CTE filters the target down to exactly the rows you want to merge against, and then match the rows using ON:
;with source as (
select
business_date = @tlf_business_date,
fee_total = SUM(B.settlement_fees),
B.range_id,
total_band_count = Count(1)
From
(
select
rowNumber = @previous_mada_switch_fee_volume_based_count + (ROW_NUMBER() OVER(PARTITION BY DATEPART(MONTH, x_datetime) ORDER BY x_datetime)),
tt.x_datetime
from gstl_trans_temp tt where (message_type_mapping = '0220') and card_type ='GEIDP1' and response_code IN('00','10','11') and tran_amount_req >= 5000 AND merchant_type NOT IN(5542,5541,4829)
) A
CROSS APPLY
(
select
rtt.settlement_fees,
rtt.range_id
From gstl_mada_local_switch_fee_volume_based rtt
where A.rowNumber >= rtt.range_start
AND (A.rowNumber <= rtt.range_end OR rtt.range_end IS NULL)
) B
group by CAST(A.x_datetime AS DATE), B.range_id
),
target as (
select
business_date,fee_type,fee_total,range_id,total_band_count
from gstl_calculated_daily_fee
where business_date = @tlf_business_date AND fee_type = 'FEE_LOCAL_CARD'
)
MERGE INTO target t
USING source s
ON t.business_date = s.business_date AND t.range_id = s.range_id
WHEN NOT MATCHED BY TARGET THEN INSERT
(business_date,fee_type,fee_total,range_id,total_band_count)
VALUES
(s.business_date,'FEE_LOCAL_CARD', s.fee_total, s.range_id, s.total_band_count)
WHEN MATCHED THEN UPDATE SET
fee_total = @total_mada_local_switch_fee_low
;
A MERGE statement essentially performs a FULL JOIN between the source and target tables, using the ON clause to match rows. It then applies various conditions to the resulting join and executes statements based on them.
There are three possible conditions:
WHEN MATCHED THEN
WHEN NOT MATCHED [BY TARGET] THEN
WHEN NOT MATCHED BY SOURCE THEN
And three possible statements, all of which act on the target table: UPDATE, INSERT, DELETE (not all of them are applicable in every branch, obviously).
A common problem is that we only want to consider a subset of the target table. There are a number of possible solutions to this:
We could filter the matching inside the WHEN MATCHED clause, e.g. WHEN MATCHED AND target.somefilter = @somefilter. This can often cause a full table scan, though.
Instead, we put the filtered target table inside a CTE and then MERGE into that. The CTE must follow the updatable view rules, and it must select every column we wish to insert or update. We must also make sure we filter the target fully; otherwise, if we issue a DELETE, every target row without a match in the source will be deleted.
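To make the three branches concrete, here is a small sketch in plain Python that classifies keys the way the join inside MERGE does. It is only a model of the semantics, not how SQL Server implements MERGE; the function name classify_merge and the sample dictionaries are made up for illustration.

```python
def classify_merge(target, source):
    """Split keys into the three groups a MERGE can act on.

    Both arguments are dicts keyed by the ON-clause columns.
    """
    matched = sorted(target.keys() & source.keys())
    not_matched_by_target = sorted(source.keys() - target.keys())
    not_matched_by_source = sorted(target.keys() - source.keys())
    return matched, not_matched_by_target, not_matched_by_source


# Hypothetical data: range_id -> fee_total
target = {1: 10.0, 2: 20.0}
source = {2: 25.0, 3: 30.0}

matched, to_insert, source_missing = classify_merge(target, source)

for key in matched:       # WHEN MATCHED THEN UPDATE
    target[key] = source[key]
for key in to_insert:     # WHEN NOT MATCHED BY TARGET THEN INSERT
    target[key] = source[key]

print(matched, to_insert, source_missing)  # [2] [3] [1]
print(target)  # {1: 10.0, 2: 25.0, 3: 30.0}
```

Key 1 lands in the "not matched by source" group. A DELETE in that branch would remove it, which is why the target must be pre-filtered to the rows that really belong to the merge.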

My SQL MERGE statement runs for too long

I have this Hive MERGE statement:
MERGE INTO destination dst
USING (
SELECT
-- DISTINCT fields
company
, contact_id as id
, ct.cid as cid
-- other fields
, email
, timestamp_utc
-- there are actually about 6 more
-- deduplication
, ROW_NUMBER() OVER (
PARTITION BY company
, ct.id
, contact_id
ORDER BY timestamp_utc DESC
) as r
FROM
source
LATERAL VIEW explode(campaign_id) ct AS cid
) src
ON
dst.company = src.company
AND dst.campaign_id = src.cid
AND dst.id = src.id
-- On match: keep latest loaded
WHEN MATCHED
AND dst.updated_on_utc < src.timestamp_utc
AND src.r = 1
THEN UPDATE SET
email = src.email
, updated_on_utc = src.timestamp_utc
WHEN NOT MATCHED AND src.r = 1 THEN INSERT VALUES (
src.id
, src.email
, src.timestamp_utc
, src.license_name
, src.cid
)
;
Which runs for a very long time (30 minutes for 7GB of avro compressed data on disk).
I wonder if there are any SQL ways to improve it.
ROW_NUMBER() is here to deduplicate the source table, so that in the MATCHED clause we only select the latest row (r = 1 with ORDER BY timestamp_utc DESC).
One thing I am not sure of is this note in the Hive documentation:
SQL Standard requires that an error is raised if the ON clause is such
that more than 1 row in source matches a row in target. This check is
computationally expensive and may affect the overall runtime of a
MERGE statement significantly. hive.merge.cardinality.check=false may
be used to disable the check at your own risk. If the check is
disabled, but the statement has such a cross join effect, it may lead
to data corruption.
I do indeed disable the cardinality check: although the ON clause might match 2 rows in the source, those rows are limited to 1 thanks to the r = 1 condition in the MATCHED clause.
Overall I like this MERGE statement but it is just too slow and any help would be appreciated.
Note that the destination table is partitioned. The source table is not, as it is an external table which must be fully merged, and therefore fully scanned, on every run (in the background, already merged data files are removed and new files are added before the next run). I am not sure that partitioning would help in that case.
What I have done:
Played with the HDFS/Hive/YARN configuration.
Tried a temporary table (2 steps) instead of a single MERGE; the run time jumped to more than 2 hours.
Option 1: Move the src.r = 1 filter inside the src subquery (wrap the ROW_NUMBER() query and filter on r = 1 there) and check the merge performance. This reduces the number of source rows before the merge.
The other two options do not require ACID mode; they rewrite the target table in full.
Option 2: Rewrite using UNION ALL + row_number (this should be the fastest one):
insert overwrite table destination
select
company
, id
, cid
, email
, timestamp_utc
-- add more fields
from
(
select --dedupe, select last updated rows using row_number
s.*
, ROW_NUMBER() OVER (PARTITION BY company, cid, id ORDER BY timestamp_utc DESC) as rn
from
(
select --union all source and target
company
, contact_id as id
, ct.cid as cid
, email
, timestamp_utc
-- add more fields
from source LATERAL VIEW explode(campaign_id) ct AS cid
UNION ALL
select
company
, id
, campaign_id as cid
, email
, updated_on_utc as timestamp_utc
-- add more fields
from destination
) s --union all
) s --numbered rows
where rn=1 --filter duplicates
If source contains a lot of duplicates, you can apply additional row_number filtering to the src subquery as well before union.
One more approach using full join: https://stackoverflow.com/a/37744071/2700344
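The UNION ALL + row_number idea can be checked on a small data set. The sketch below uses Python's sqlite3 module rather than Hive, so there is no LATERAL VIEW or INSERT OVERWRITE; the table layouts and sample rows are simplified assumptions, and window functions require SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE destination (company TEXT, id INTEGER, email TEXT, updated_on_utc TEXT);
    CREATE TABLE source (company TEXT, id INTEGER, email TEXT, timestamp_utc TEXT);
    INSERT INTO destination VALUES ('acme', 1, 'old@example.com',  '2020-01-01');
    INSERT INTO destination VALUES ('acme', 2, 'keep@example.com', '2020-01-05');
    INSERT INTO source VALUES ('acme', 1, 'new@example.com',   '2020-02-01');
    INSERT INTO source VALUES ('acme', 1, 'newer@example.com', '2020-03-01');
    INSERT INTO source VALUES ('acme', 3, 'first@example.com', '2020-02-15');
""")

# Stack source and destination, number the rows per key from newest to
# oldest, and keep only the newest row of each key.
latest = conn.execute("""
    SELECT company, id, email, ts
    FROM (SELECT u.*,
                 ROW_NUMBER() OVER (PARTITION BY company, id
                                    ORDER BY ts DESC) AS rn
          FROM (SELECT company, id, email, timestamp_utc AS ts FROM source
                UNION ALL
                SELECT company, id, email, updated_on_utc AS ts FROM destination) u)
    WHERE rn = 1
    ORDER BY company, id
""").fetchall()

for row in latest:
    print(row)
```

The result is the full new content of the destination: updated rows, untouched rows and new rows in one pass, with no join between source and target.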

SQL - Returning CTE with Top 1

I am trying to return a set of results and decided to try my luck with CTE, the first table "Vendor", has a list of references, the second table "TVView", has ticket numbers that were created using a reference from the "Vendor" table. There may be one or more tickets using the same ticket number depending on the state of that ticket and I am wanting to return the last entry for each ticket found in "TVView" that matches a selected reference from "Vendor". Also, the "TVView" table has a seed field that is incremented.
I got this to return the right amount of entries (meaning not showing the duplicate tickets but only once) but I cannot figure out how to add an additional layer to go back through and select the last entry for that ticket and return some other fields. I can figure out how to sum which is actually easy, but I really need the Top 1 of each ticket entry in "TVView" regardless if its a duplicate or not while returning all references from "Vendor". Would be nice if SQL supported "Last"
How do you do that?
Here is what I have done so far:
with cteTickets as (
Select s.Mth2, c.Ticket, c.PyRt from Vendor s
Inner join
TVView c on c.Mth1 = s.Mth1 and c.Vendor = s.Vendor
)
Select Mth2, Ticket, PayRt from cteTickets
Where cteTickets.Vendor >='20'
and cteTickets.Vendor <='40'
and cteTickets.Mth2 ='8/15/2014'
Group by cteTickets.Ticket
order by cteTickets.Ticket
Several RDBMSs that support Common Table Expressions (CTEs) also support analytic functions, including the very useful ROW_NUMBER(), so the following should work in Oracle, T-SQL (MSSQL/Sybase), DB2 and PostgreSQL, as far as I am aware.
In the suggestions below, the intention is to return just the most recent entry for each ticket found in TVView. This is done with ROW_NUMBER() and PARTITION BY Ticket, which instructs ROW_NUMBER() to restart numbering at each change of the Ticket value. The ORDER BY Mth1 DESC then determines which record within each partition is assigned 1; here it will be the one with the most recent date.
The output of row_number() needs to be referenced by a column alias, so using it in a CTE or derived table permits selection of just the most recent records by RN = 1 which you will see used in both options below:
-- using a CTE
WITH
TVLatest
AS (
SELECT
* -- specify the fields
, ROW_NUMBER() OVER (PARTITION BY Ticket
ORDER BY Mth1 DESC) AS RN
FROM TVView
)
SELECT
Mth2
, Ticket
, PayRt
FROM Vendor v
INNER JOIN TVLatest l ON v.Mth1 = l.Mth1
AND v.Vendor = l.Vendor
AND l.RN = 1
WHERE v.Vendor >= '20'
AND v.Vendor <= '40'
AND v.Mth2 = '2014-08-15'
ORDER BY
l.Ticket
;
-- using a derived table instead
SELECT
Mth2
, Ticket
, PayRt
FROM Vendor v
INNER JOIN (
SELECT
* -- specify the fields
, ROW_NUMBER() OVER (PARTITION BY Ticket
ORDER BY Mth1 DESC) AS RN
FROM TVView
) l ON v.Mth1 = l.Mth1
AND v.Vendor = l.Vendor
AND l.RN = 1
WHERE v.Vendor >= '20'
AND v.Vendor <= '40'
AND v.Mth2 = '2014-08-15'
ORDER BY
l.Ticket
;
;
Please note: "SELECT *" is a convenience, used as an abbreviation when the full details are unknown. The queries above may not run without an explicit field list (e.g. as written they would fail in Oracle).
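The question mentions an incrementing seed field in TVView, which is a natural stand-in for "Last". A minimal sketch with Python's sqlite3 module follows; the column name Seq and the sample rows are assumptions, since the real table layout is not shown, and window functions require SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Seq is a hypothetical name for the incrementing seed column
    CREATE TABLE TVView (Seq INTEGER, Ticket TEXT, Vendor TEXT, PayRt REAL);
    INSERT INTO TVView VALUES (1, 'T1', '20', 10.0);
    INSERT INTO TVView VALUES (2, 'T1', '20', 12.5);
    INSERT INTO TVView VALUES (3, 'T2', '30', 7.0);
""")

# RN = 1 marks the row with the highest Seq, i.e. the last entry,
# within each ticket.
rows = conn.execute("""
    WITH TVLatest AS (
        SELECT Ticket, Vendor, PayRt,
               ROW_NUMBER() OVER (PARTITION BY Ticket ORDER BY Seq DESC) AS RN
        FROM TVView
    )
    SELECT Ticket, Vendor, PayRt
    FROM TVLatest
    WHERE RN = 1
    ORDER BY Ticket
""").fetchall()

print(rows)  # [('T1', '20', 12.5), ('T2', '30', 7.0)]
```

Ordering by the seed rather than by a date avoids ties when two entries for a ticket share the same Mth1 value.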

SQL UPDATE row Number

I have a table serviceClusters with a column identity (1590 values). Then I have another table serviceClustersNew with the columns ID, text and comment. In this table, I have some values for text and comment; the ID is always 1. Here is an example for the table:
[1, dummy1, hello1;
1, dummy2, hello2;
1, dummy3, hello3;
etc.]
What I want now for the values in the column ID is the continuing index of the table serviceClusters plus the current row number. In our case, this would be 1591, 1592 and 1593.
I tried to solve the problem like this: first I updated the column ID with the maximum value, then I tried to add the row number, but this doesn't work:
-- Update ID to the maximum value 1590
UPDATE serviceClustersNew
SET ID = (SELECT MAX(ID) FROM serviceClusters);
-- This command returns the correct values 1591, 1592 and 1593
SELECT ID+ROW_NUMBER() OVER (ORDER BY Text_ID) AS RowNumber
FROM serviceClustersNew
-- But I'm not able to update the table with this command
UPDATE serviceClustersNew
SET ID = (SELECT ID+ROW_NUMBER() OVER (ORDER BY Text_ID) AS RowNumber FROM
serviceClustersNew)
By sending the last command, I get the error "Syntax error: Ordered Analytical Functions are not allowed in subqueries.". Do you have any suggestions, how I could solve the problem? I know I could do it with a volatile table or by adding a column, but is there a way without creating a new table / altering the current table?
You have to rewrite it using UPDATE FROM, the syntax is just a bit bulky:
UPDATE serviceClustersNew
FROM
(
SELECT text_id,
(SELECT MAX(ID) FROM serviceClusters) +
ROW_NUMBER() OVER (ORDER BY Text_ID) AS newID
FROM serviceClustersNew
) AS src
SET ID = newID
WHERE serviceClustersNew.Text_ID = src.Text_ID
You are not dealing with a lot of data, so a correlated subquery can serve the same purpose:
UPDATE serviceClustersNew
SET ID = (select max(ID) from serviceClustersNew) +
(select count(*)
from serviceClustersNew scn2
where scn2.Text_Id <= serviceClustersNew.Text_Id
)
This assumes that Text_Id is unique across rows.
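The counting trick can be verified on a toy table. The sketch below uses Python's sqlite3 module instead of Teradata and takes the maximum from serviceClusters directly, so it does not depend on a preceding update; the three rows per table are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE serviceClusters (ID INTEGER);
    CREATE TABLE serviceClustersNew (ID INTEGER, Text_ID INTEGER, Comment TEXT);
    INSERT INTO serviceClusters VALUES (1588), (1589), (1590);
    INSERT INTO serviceClustersNew VALUES
        (1, 10, 'hello1'), (1, 20, 'hello2'), (1, 30, 'hello3');

    -- New ID = current maximum + number of rows whose Text_ID is not
    -- larger than this row's Text_ID (a row number without ROW_NUMBER()).
    UPDATE serviceClustersNew
    SET ID = (SELECT MAX(ID) FROM serviceClusters)
           + (SELECT COUNT(*)
              FROM serviceClustersNew scn2
              WHERE scn2.Text_ID <= serviceClustersNew.Text_ID);
""")

rows = conn.execute(
    "SELECT ID, Text_ID FROM serviceClustersNew ORDER BY Text_ID"
).fetchall()
print(rows)  # [(1591, 10), (1592, 20), (1593, 30)]
```

If two rows shared a Text_ID, both would receive the same count, which is why the uniqueness assumption matters.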
Apparently you can update a base table through a CTE... had no idea. So, just change your last UPDATE statement to this, and you should be good. Just be sure to include any fields in the CTE that you desire to update.
;WITH cte_TEST AS
( SELECT
ID,
ID+ROW_NUMBER() OVER (ORDER BY TEXT_ID) AS RowNumber FROM serviceClustersNew)
UPDATE cte_TEST
SET cte_TEST.ID = cte_TEST.RowNumber
Source:
http://social.msdn.microsoft.com/Forums/sqlserver/en-US/ee06f451-c418-4bca-8288-010410e8cf14/update-table-using-rownumber-over

Update based on subquery fails

I am trying to do the following update in Oracle 10gR2:
update
(select voyage_port_id, voyage_id, arrival_date, port_seq,
row_number() over (partition by voyage_id order by arrival_date) as new_seq
from voyage_port) t
set t.port_seq = t.new_seq
Voyage_port_id is the primary key, voyage_id is a foreign key. I'm trying to assign a sequence number based on the dates within each voyage.
However, the above fails with ORA-01732: data manipulation operation not legal on this view.
What is the problem and how can I avoid it?
Since you can't update subqueries with row_number, you'll have to calculate the row number in the set part of the update. At first I tried this:
update voyage_port a
set a.port_seq = (
select
row_number() over (partition by voyage_id order by arrival_date)
from voyage_port b
where b.voyage_port_id = a.voyage_port_id
)
But that doesn't work, because the subquery only selects one row, and then the row_number() is always 1. Using another subquery allows a meaningful result:
update voyage_port a
set a.port_seq = (
select c.rn
from (
select
voyage_port_id
, row_number() over (partition by voyage_id
order by arrival_date) as rn
from voyage_port b
) c
where c.voyage_port_id = a.voyage_port_id
)
It works, but it is more complex than I'd expect for this task.
You can update some views, but there are restrictions, and one is that the view must not contain analytic functions. See the SQL Language Reference on UPDATE and search for the first occurrence of "analytic".
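The difference between the two attempts is easy to reproduce. Here is a sketch using Python's sqlite3 module rather than Oracle, with three invented rows; window functions require SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE voyage_port (
        voyage_port_id INTEGER PRIMARY KEY,
        voyage_id      INTEGER,
        arrival_date   TEXT,
        port_seq       INTEGER
    );
    INSERT INTO voyage_port VALUES
        (1, 100, '2009-03-02', NULL),
        (2, 100, '2009-03-01', NULL),
        (3, 200, '2009-04-01', NULL);
""")

# First attempt: the WHERE clause leaves a single row before the window
# function runs, so ROW_NUMBER() can only ever return 1.
naive = conn.execute("""
    SELECT a.voyage_port_id,
           (SELECT ROW_NUMBER() OVER (PARTITION BY b.voyage_id
                                      ORDER BY b.arrival_date)
            FROM voyage_port b
            WHERE b.voyage_port_id = a.voyage_port_id)
    FROM voyage_port a
    ORDER BY a.voyage_port_id
""").fetchall()

# Second attempt: number all rows first, then pick this row's number.
conn.execute("""
    UPDATE voyage_port
    SET port_seq = (
        SELECT c.rn
        FROM (SELECT voyage_port_id,
                     ROW_NUMBER() OVER (PARTITION BY voyage_id
                                        ORDER BY arrival_date) AS rn
              FROM voyage_port) c
        WHERE c.voyage_port_id = voyage_port.voyage_port_id)
""")
fixed = conn.execute(
    "SELECT voyage_port_id, port_seq FROM voyage_port ORDER BY voyage_port_id"
).fetchall()

print(naive)  # [(1, 1), (2, 1), (3, 1)]
print(fixed)  # [(1, 2), (2, 1), (3, 1)]
```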
This will work, provided no voyage visits more than one port on the same day (or the dates include a time component that makes them unique):
update voyage_port vp
set vp.port_seq =
( select count(*)
from voyage_port vp2
where vp2.voyage_id = vp.voyage_id
and vp2.arrival_date <= vp.arrival_date
)
I think this handles the case where a voyage visits more than 1 port per day and there is no time component (though the order of ports visited on the same day is then decided by voyage_port_id rather than by actual arrival order):
update voyage_port vp
set vp.port_seq =
( select count(*)
from voyage_port vp2
where vp2.voyage_id = vp.voyage_id
and ( vp2.arrival_date < vp.arrival_date
or ( vp2.arrival_date = vp.arrival_date
and vp2.voyage_port_id <= vp.voyage_port_id
)
)
)
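For the same-day case, a sketch of count-based numbering with a tie-break on the primary key is shown below, again using Python's sqlite3 module instead of Oracle and invented rows. It counts earlier dates with a strict comparison and keeps both date conditions inside one parenthesised group, so the voyage_id filter applies to every branch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE voyage_port (
        voyage_port_id INTEGER PRIMARY KEY,
        voyage_id      INTEGER,
        arrival_date   TEXT,
        port_seq       INTEGER
    );
    -- voyage 100 visits two ports on the same day
    INSERT INTO voyage_port VALUES
        (1, 100, '2009-03-01', NULL),
        (2, 100, '2009-03-01', NULL),
        (3, 100, '2009-03-02', NULL),
        (4, 200, '2009-03-01', NULL);

    -- Count the rows of the same voyage that sort before or at this row:
    -- strictly earlier dates, or the same date with a smaller or equal id.
    UPDATE voyage_port
    SET port_seq = (
        SELECT COUNT(*)
        FROM voyage_port vp2
        WHERE vp2.voyage_id = voyage_port.voyage_id
          AND (vp2.arrival_date < voyage_port.arrival_date
               OR (vp2.arrival_date = voyage_port.arrival_date
                   AND vp2.voyage_port_id <= voyage_port.voyage_port_id)));
""")

seqs = conn.execute(
    "SELECT voyage_port_id, port_seq FROM voyage_port ORDER BY voyage_port_id"
).fetchall()
print(seqs)  # [(1, 1), (2, 2), (3, 3), (4, 1)]
```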
I don't think you can update a derived table; I'd rewrite it as:
update voyage_port
set port_seq = t.new_seq
from
voyage_port p
inner join
(select voyage_port_id, voyage_id, arrival_date, port_seq,
row_number() over (partition by voyage_id order by arrival_date) as new_seq
from voyage_port) t
on p.voyage_port_id = t.voyage_port_id
The first token after UPDATE should be the name of the table to update, followed by the columns to update. I'm not sure what you are trying to achieve with the select statement where it is, but you can't legally update the result set of a select.
A version of the SQL, guessing at what you have in mind, might look like this:
update voyage_port t
set t.port_seq = (<select statement that generates new value of port_seq>)
NOTE: to use a select statement to set a value like this, you must make sure it returns only one row!
EDIT: modified the statement above to reflect what I was trying to explain. The question has been answered very nicely by Andomar above.