Rewrite this loop as a recursive CTE? - sql

Link to demo code and data: http://sqlfiddle.com/#!18/5811b/9
This while loop iterates over data and inserts data based on a merge condition, which is (roughly) "find the MergeRecordId in the earliest row that matches the ActionId, ParentActionId and Category, then insert relevant rows that have this Merge Id, using the current row's Merge Id". It ignores rows that have no "predecessors". Idx and OriginalIdx are helper columns.
Is it possible to rewrite this using a recursive CTE? Or should I be using a different technique?
This is currently what I have, but it obviously doesn't work because it doesn't match the earliest item (iRank = 1 over the Date):
; WITH cte AS
(
SELECT Idx
, ActionId
, ParentActionId
, Category
, ActionDateUpdated
, MergeRecordId
, OriginalIdx
FROM #Example
UNION ALL
SELECT -ex1.Idx
, ex2.ActionId
, ex2.ParentActionId
, ex2.Category
, ex1.ActionDateUpdated
, ex1.MergeRecordId
, ex2.Idx
FROM #Example ex1
JOIN cte ex2
ON ex1.Category = ex2.Category
AND ex1.ActionDateUpdated > ex2.ActionDateUpdated
WHERE EXISTS
(
SELECT 1
FROM #Example ex3
WHERE ex1.ActionId = ex3.ActionId
AND ex1.ParentActionId = ex3.ParentActionId
AND ex1.Category = ex3.Category
AND ex1.ActionDateUpdated > ex3.ActionDateUpdated
)
)
SELECT *
FROM cte ORDER BY ABS(idx), ABS(OriginalIdx);
Under usual circumstances it should be trivial to get the required MergeRecordId using a sub-query, but you can't use sub-queries containing CTEs in a recursive CTE. Without this filter it makes the query useless with large datasets.
(Another quirk is the deletion of the current row in the loop, but I'm not too concerned about that.)

Related

INSERT or UPDATE the table from SELECT in sql server

I have a requirement where I have to check if the record for the business date already exists in the table then I need to update the values for that business date from the select statement otherwise I have to insert for that business date from the select statement. Below is my full query where I am only inserting at the moment:
INSERT INTO
gstl_calculated_daily_fee(business_date,fee_type,fee_total,range_id,total_band_count)
select
#tlf_business_date,
'FEE_LOCAL_CARD',
SUM(C.settlement_fees),
C.range_id,
Count(1)
From
(
select
*
from
(
select
rowNumber = #previous_mada_switch_fee_volume_based_count + (ROW_NUMBER() OVER(PARTITION BY DATEPART(MONTH, x_datetime) ORDER BY x_datetime)),
tt.x_datetime
from gstl_trans_temp tt where (message_type_mapping = '0220') and card_type ='GEIDP1' and response_code IN('00','10','11') and tran_amount_req >= 5000 AND merchant_type NOT IN(5542,5541,4829)
) A
CROSS APPLY
(
select
rtt.settlement_fees,
rtt.range_id
From gstl_mada_local_switch_fee_volume_based rtt
where A.rowNumber >= rtt.range_start
AND (A.rowNumber <= rtt.range_end OR rtt.range_end IS NULL)
) B
) C
group by CAST(C.x_datetime AS DATE),C.range_id
I have tried to use the if exists but could not fit in the above full query.
if exists (select
business_date
from gstl_calculated_daily_fee
where
business_date = #tlf_business_date)
UPDATE gstl_calculated_daily_fee
SET fee_total = #total_mada_local_switch_fee_low
WHERE fee_type = 'FEE_LOCAL_CARD'
AND business_date = #tlf_business_date
else
INSERT INTO
Please help.
You need a MERGE statement with a join.
Basically, our issue with MERGE is going to be that we only want to merge against a subset of the target table. To do this, we pre-filter the table as a CTE. We can also put the source table as a CTE.
Be very careful when you write MERGE when using a CTE. You must make sure you fully filter the target within the CTE to what rows you want to merge against, and then match the rows using ON
;with source as (
select
business_date = #tlf_business_date,
fee_total = SUM(C.settlement_fees),
C.range_id,
total_band_count = Count(1)
From
(
select
rowNumber = #previous_mada_switch_fee_volume_based_count + (ROW_NUMBER() OVER(PARTITION BY DATEPART(MONTH, x_datetime) ORDER BY x_datetime)),
tt.x_datetime
from gstl_trans_temp tt where (message_type_mapping = '0220') and card_type ='GEIDP1' and response_code IN('00','10','11') and tran_amount_req >= 5000 AND merchant_type NOT IN(5542,5541,4829)
) A
CROSS APPLY
(
select
rtt.settlement_fees,
rtt.range_id
From gstl_mada_local_switch_fee_volume_based rtt
where A.rowNumber >= rtt.range_start
AND (A.rowNumber <= rtt.range_end OR rtt.range_end IS NULL)
) B
group by CAST(A.x_datetime AS DATE), B.range_id
),
target as (
select
business_date,fee_type,fee_total,range_id,total_band_count
from gstl_calculated_daily_fee
where business_date = #tlf_business_date AND fee_type = 'FEE_LOCAL_CARD'
)
MERGE INTO target t
USING source s
ON t.business_date = s.business_date AND t.range_id = s.range_id
WHEN NOT MATCHED BY TARGET THEN INSERT
(business_date,fee_type,fee_total,range_id,total_band_count)
VALUES
(s.business_date,'FEE_LOCAL_CARD', s.fee_total, s.range_id, s.total_band_count)
WHEN MATCHED THEN UPDATE SET
fee_total = #total_mada_local_switch_fee_low
;
The way a MERGE statement works, is that it basically does a FULL JOIN between the source and target tables, using the ON clause to match. It then applies various conditions to the resulting join and executes statements based on them.
There are three possible conditions you can do:
WHEN MATCHED THEN
WHEN NOT MATCHED [BY TARGET] THEN
WHEN NOT MATCHED BY SOURCE THEN
And three possible statements, all of which refer to the target table: UPDATE, INSERT, DELETE (not all are applicable in all cases obviously).
A common problem is that we would only want to consider a subset of a target table. There a number of possible solutions to this:
We could filter the matching inside the WHEN MATCHED clause e.g. WHEN MATCHED AND target.somefilter = #somefilter. This can often cause a full table scan though.
Instead, we put the filtered target table inside a CTE, and then MERGE into that. The CTE must follow Updatable View rules. We must also select all columns we wish to insert or update to. But we must make sure we are fully filtering the target, otherwise if we issue a DELETE then all rows in the target table will get deleted.

Convert subselect to a join

I seem to understand that Join is preferred to sub-select.
I'm unable to see how to turn the 3 sub-selects to joins.
My sub-selects fetch the first row only
I'm perfectly willing to leave this alone if it is not offensive SQL.
This is my query, and yes, those really are the table and column names
select x1.*, x2.KTNR, x3.J6NQ
from
(select D0HONB as HONB, D0HHNB as HHNB,
(
select DHHHNB
from ECDHREP
where DHAOEQ = D0ATEQ and DHJRCD = D0KNCD
order by DHEJDT desc
FETCH FIRST 1 ROW ONLY
) as STC_HHNB,
(
select FIQ9NB
from DCFIREP
where FIQ7NB = D0Q7NB
AND FIBAEQ = D0ATEQ
and FISQCD = D0KNCD
and FIGZSZ in ('POS', 'ACT', 'MAN', 'HLD')
order by FIYCNB desc
FETCH FIRST 1 ROW ONLY
) as BL_Q9NB,
(
select AAKPNR
from C1AACPP
where AACEEQ = D0ATEQ and AARCCE = D0KNCD and AARDCE = D0KOCD
order by AAHMDT desc, AANENO desc
FETCH FIRST 1 ROW ONLY
) as NULL_KPNR
from ECD0REP
) as x1
left outer join (
select AAKPNR as null_kpnr, max(ABKTNR) as KTNR
from C1AACPP
left outer join C1ABCPP on AAKPNR = ABKPNR
group by AAKPNR
) as X2 on x1.NULL_KPNR = x2.null_KPNR
left outer join (
select ACKPNR as KPNR, count(*) as J6NQ
from C1ACCPP
WHERE ACJNDD = 'Y'
group by ACKPNR
) as X3 on x1.NULL_KPNR = x3.KPNR
You've got a combination of correlated subselects and nested table expressions (NTE).
Personally, I'd call it offensive if I had to maintain it. ;)
Consider common table expressions & joins...without your data and tabvle structure, I can't give you the real statement, but the general form would look like
with
STC_HHNB as (
select DHHHNB, DHAOEQ, DHJRCD, DHEJDT
from ECDHREP )
, BL_Q9NB as ( <....>
where FIGZSZ in ('POS', 'ACT', 'MAN', 'HLD'))
<...>
select <...>
from stc_hhb
join blq9nb on <...>
Two important reasons to favor CTE over NTE...the results of a CTE can be reused Also it's easy to build a statement with CTE's incrementally.
By re-used, I mean you can have
with
cte1 as (<...>)
, cte2 as (select <...> from cte1 join <...>)
, cte3 as (select <...> from cte1 join <...>)
, cte4 as (select <...> from cte2 join cte3 on <...>)
select * from cte4;
The optimizer can choose to build a temporary results set for cte1 and use it multiple times. From a building standpoint, you can see I'm builing on each preceding cte.
Here's a good article
https://www.mcpressonline.com/programming/sql/simplify-sql-qwithq-common-table-expressions
Edit
Let's dig into your first correlated sub-query.
select D0HONB as HONB, D0HHNB as HHNB,
(
select DHHHNB
from ECDHREP
where DHAOEQ = D0ATEQ and DHJRCD = D0KNCD
order by DHEJDT desc
FETCH FIRST 1 ROW ONLY
) as STC_HHNB
from ECD0REP
What you asking the DB to do is for every row read in ECD0REP, go out and get a row from ECDHREP. If you're unlucky, the DB will have to read lots of records in ECDHREP to find that one row. Generally, consider that with correlated sub-query the inner query would need to read every row. So if there's M rows in the outer and N rows in the inner...then you're looking at MxN rows being read.
I've seen this before, especially on the IBM i. As that's how an RPG developer would do it
read ECD0REP;
dow not %eof(ECD0REP);
//need to get DHHHNB from ECDHREP
chain (D0ATEQ, D0KNCD) ECDHREP;
STC_HHNB = DHHHNB;
read ECD0REP;
enddo;
But that's not the way to do it in SQL. SQL is (supposed to be) set based.
So what you need to do is think of how to select the set of records out of ECDHREP that will match up to the set of record you want from ECD0REP.
with cte1 as (
select DHHHNB, DHAOEQ, DHJRCD
from ECDHREP
)
select D0HONB as HONB
, D0HHNB as HHNB
, DHHHBN as STC_HHNB
from ECD0REP join cte1
on DHAOEQ = D0ATEQ and DHJRCD = D0KNCD
Now maybe that's not quite correct. Perhaps there's multiple rows in ECDHREP with the same values (DHAOEQ, DHJRCD); thus you needed the FETCH FIRST in your correlated sub-query. Fine you can focus on the CTE and figure out what needs to be done to get that 1 row you want. Perhaps MAX(DHHHNB) or MIN(DHHHNB) would work. If nothing else, you could use ROW_NUMBER() to pick out just one row...
with cte1 as (
select DHHHNB, DHAOEQ, DHJRCD
, row_number() over(partition by DHAOEQ, DHJRCD
order by DHAOEQ, DHJRCD)
as rowNbr
from ECDHREP
), cte2 as (
select DHHHNB, DHAOEQ, DHJRCD
from cte1
where rowNbr = 1
)
select D0HONB as HONB
, D0HHNB as HHNB
, DHHHBN as STC_HHNB
from ECD0REP join cte2
on DHAOEQ = D0ATEQ and DHJRCD = D0KNCD
Now you're dealing with sets of records, joining them together for your final results.
Worse case, the DB has to read M + N records.
It's not really about performance, it's about thinking in sets.
Sure with a simple statement using a correlated sub-query, the optimizer will probably be able to re-write it into a join.
But it's best to write the best code you can, rather then hope the optimizer can correct it.
I've seen and rewritten queries with 100's of correlated & regular sub-queries....in fact I've seen a query that had to be broken into 2 because there were two many sub-queries. The DB has a limit of 256 per statement.
I'm going to have to differ with Charles here if the FETCH FIRST 1 ROW ONLY clauses are necessary. In this case you likely can't pull those sub-selects out into a CTE because that CTE would only have a single row in it. I suspect you could pull the outer sub-select into a CTE, but you would still need the sub-selects in the CTE. Since there appears to be no sharing, I would call this personal preference. BTW, I don't think pulling the sub-selects into a join will work for you either, in this case, for the same reason.
What is the difference between a sub-select and a CTE?
with mycte as (
select field1, field2
from mytable
where somecondition = true)
select *
from mycte
vs.
select *
from (select field1, field2
from mytable
where somecondition = true) a
It's really just a personal preference, though depending on the specific requirements, a CTE can be used multiple times within the SQL statement, but a sub-select will be more correct in other cases like the FETCT FIRST clause in your question.
EDIT
Let's look at the first sub-query. With the appropriate index:
(
select DHHHNB
from ECDHREP
where DHAOEQ = D0ATEQ and DHJRCD = D0KNCD
order by DHEJDT desc
FETCH FIRST 1 ROW ONLY
) as STC_HHNB,
only has to read one record per row in the output set. I don't think that is terribly onerous. This is the same for the third correlated sub-query as well.
That index on the first correlated sub-query would be:
create index ECDHREP_X1
on ECDHREP (DHAOEQ, DHJRCD, DHEJDT);
The second correlated sub-query might need more than one read per row, just because of the IN predicate, but it is far from needing a full table scan.

My SQL MERGE statement runs for too long

I have this Hive MERGE statement:
MERGE INTO destination dst
USING (
SELECT
-- DISTINCT fields
company
, contact_id as id
, ct.cid as cid
-- other fields
, email
, timestamp_utc
-- there are actually about 6 more
-- deduplication
, ROW_NUMBER() OVER (
PARTITION BY company
, ct.id
, contact_id
ORDER BY timestamp_utc DESC
) as r
FROM
source
LATERAL VIEW explode(campaign_id) ct AS cid
) src
ON
dst.company = src.company
AND dst.campaign_id = src.cid
AND dst.id = src.id
-- On match: keep latest loaded
WHEN MATCHED
AND dst.updated_on_utc < src.timestamp_utc
AND src.r = 1
THEN UPDATE SET
email = src.email
, updated_on_utc = src.timestamp_utc
WHEN NOT MATCHED AND src.r = 1 THEN INSERT VALUES (
src.id
, src.email
, src.timestamp_utc
, src.license_name
, src.cid
)
;
Which runs for a very long time (30 minutes for 7GB of avro compressed data on disk).
I wonder if there are any SQL ways to improve it.
ROW_NUMBER() is here to deduplicate the source table, so that in the MATCH clause we only select the earliest row.
One thing I am not sure of, is that hive says:
SQL Standard requires that an error is raised if the ON clause is such
that more than 1 row in source matches a row in target. This check is
computationally expensive and may affect the overall runtime of a
MERGE statement significantly. hive.merge.cardinality.check=false may
be used to disable the check at your own risk. If the check is
disabled, but the statement has such a cross join effect, it may lead
to data corruption.
I do indeed disable the cardinality check, as although the ON statement might give 2 rows in source, those rows are limited to 1 only thanks to the r=1 later in the MATCH clause.
Overall I like this MERGE statement but it is just too slow and any help would be appreciated.
Note that the destination table is partitioned. The source table is not as it is an external table which for every run must be fully merged, so fully scanned (in the background already merged data files are removed and new files are added before next run). Not sure that partitioning would help in that case
What I have done:
play with hdfs/hive/yarn configuration
try with a temporary table (2 steps) instead of a single MERGE, the run time jumped to more than 2 hours.
Option 1: Move where filter where src.r = 1 inside the src subquery and check the merge performance. This will reduce the number of source rows before merge.
Other two options do not require ACID mode. Do full target rewrite.
Option 2: Rewrite using UNION ALL + row_number (this should be the fastest one):
insert overwrite table destination
select
company
, contact_id as id
, ct.cid as cid
, email
, timestamp_utc
, -- add more fields
from
(
select --dedupe, select last updated rows using row_number
s.*
, ROW_NUMBER() OVER (PARTITION BY company, ct.id , contact_id ORDER BY timestamp_utc DESC) as rn
from
(
select --union all source and target
company
, contact_id as id
, ct.cid as cid
, email
, timestamp_utc
, -- add more fields
from source LATERAL VIEW explode(campaign_id) ct AS cid
UNION ALL
select
company
, contact_id as id
, ct.cid as cid
, email
, timestamp_utc
,-- add more fields
from destination
)s --union all
where rn=1 --filter duplicates
)s-- filtered dups
If source contains a lot of duplicates, you can apply additional row_number filtering to the src subquery as well before union.
One more approach using full join: https://stackoverflow.com/a/37744071/2700344

SQL Server reuse aliases in subqueries

This might sound like a dumb question - apologies, I'm new to SQL Server and I just want to confirm my understanding.
I've got a query that is aggregating values in a table as a subquery in different ways for different columns, e.g. for a transaction on a given day, transactions in the previous month, previous 6 months, before that, after that.
I aliased the main table as tx, then the subquery alias as tx1 so I could use for example:
tx1.TransactionDate < tx.TransactionDate
I created one column, copied it and amended the WHERE conditions.
I assumed that the scope of an alias in the subquery is bound to that subquery, so it didn't matter that the alias was the same in each case.
It seems to work, but then as neither the main table tx is altered nor the subquery tables tx1 I wouldn't know if the scope of the alias tx1 was bound to each subquery or if the initial tx1 was being reused.
Am I correct in my assumption?
Query:
SELECT tr.transaction_value ,
Isnull(
(
SELECT Sum(tr1.transaction_value)
FROM [MyDB].[dbo].[Transactions] tr1
WHERE tr1.client_ref = tr.client_ref),0)
and tr1.transaction_date > tr.transaction_date ),0) AS 'Future_Transactions' ,isnull(
(
SELECT sum(tr1.transaction_value)
FROM [MyDB].[dbo].[Transactions] tr1
WHERE tr1.client_ref = tr.client_ref),0)
AND
tr1.transaction_date < tr.transaction_date ),0) AS 'Prior_Transactions' FROM [MyDB].[dbo].[Transactions]
I think that following script can explain you everything.
SELECT 1,1,GETDATE()
INSERT INTO #t ( Id, UserId, TranDate )
SELECT 2,1,GETDATE()
INSERT INTO #t ( Id, UserId, TranDate )
SELECT 3,1,GETDATE()
SELECT tx.Id/*main alias*/,
tx1.Id /*First subquery alias*/,
tx2.Id /*Second subquery alias*/,
(SELECT Id FROM #t txs /*alias only in this one subquery/must be different from main if you want use main alias in it...*/
WHERE txs.Id = tx.Id+2 /*here is used main value = subquery value+2*/) AS Id
FROM #t tx /*main*/
JOIN (SELECT *
FROM #t tx
WHERE tx.Id = 1 /*this one using subquery values + you are not able to use here main value*/
) tx1 --alias of subquery
ON tx.Id = tx1.Id /*here is used main value = subquery value*/
CROSS APPLY (SELECT TOP 1 *
FROM #t txc /*This one must be different from main if you want use it to comparison with main*/
WHERE txc.Id > tx.Id /*this one using subquery value > main value*/
) tx2 --alias of subquery
WHERE tx.Id = 1 AND /*Subquery alias canot reference on First subquery value*/
tx1.Id = 1 AND/*Subquery alias*/
tx2.Id = 2 /*Subquery alias*/
It means that yea, it could be reused, but only if you dont want compare main / sub, because if you reuse it and for example you try to do folowing statement in subquery tx.Id > tx.Id It causes that only values in subquery will be compared. In our example it causes that you dont get anything because you comaring values in same row...

Fastest way to check if values in one table does not exist in another based on multiple values

This is probably easy but i have been racking my brain. I have two tables (TEMP_PARTY and GTT_PARTY).
TEMP_PARTY AND GTT_PARTY have the following columns => system, case_num, party_id, party, role.
What I am trying to do is find all values from TEMP_PARTY that do not exist in GTT_PARTY based on a unique combination of System, case_num and party_id.
Hence, i want to pull all parties from TEMP_PARTY where I have no record in GTT_PARTY. A record is uniquely identified by data in the three columns (System, case_num and party_id).
Speed is a HUGE concern for me. Can someone please help.
Use NOT EXISTS like below
select * from TEMP_PARTY
where not exists
(
select 1 from GTT_PARTY
where System = some_val
and case_num = 3
and party_id = 7
)
Also, You can use MINUS operator (If I am not wrong Oracle has it and it's same as EXCEPT in SQL Server) like
select * from TEMP_PARTY
where System = some_val
and case_num = 3
and party_id = 7
MINUS
select * from GTT_PARTY
where System = some_val
and case_num = 3
and party_id = 7
One approach is an anti-join pattern:
SELECT t.system
, t.case_num
, t.party_id
, t.party
, t.role
FROM TEMP_PARTY t
LEFT
JOIN GTT_PARTY g
ON g.system = t.system
AND g.case_num = t.case_num
AND g.party_id = t.party_id
WHERE g.system IS NULL
This is an "outer" join, returning all rows from t, along with all matching rows from g, except we've added a predicate in the WHERE clause which effectively excludes all rows that had a match. So what we're left with is rows from t that don't have any matching row(s) in g.
This isn;t the only way to get the result. There are several other approaches, for example, you can return an equivalent result with a query using a NOT EXISTS and correlated subquery.
SELECT t.system
, t.case_num
, t.party_id
, t.party
, t.role
FROM TEMP_PARTY t
WHERE NOT EXISTS
( SELECT 1
FROM GTT_PARTY g
WHERE g.system = t.system
AND g.case_num = t.case_num
AND g.party_id = t.party_id
)
For best performance, we'd want to see an index (with leading columns of)
... ON GTT_PARTY (system,case_num,party_id)
EXPLAIN output will show the execution plan.
(I was thinking this was for MySQL; these same queries will work in Oracle as well.)