My Hive MERGE statement runs too long

I have this Hive MERGE statement:
MERGE INTO destination dst
USING (
    SELECT
        -- DISTINCT fields
        company
        , contact_id AS id
        , ct.cid AS cid
        -- other fields
        , email
        , timestamp_utc
        -- there are actually about 6 more
        -- deduplication
        , ROW_NUMBER() OVER (
            PARTITION BY company
                , ct.cid
                , contact_id
            ORDER BY timestamp_utc DESC
        ) AS r
    FROM source
    LATERAL VIEW explode(campaign_id) ct AS cid
) src
ON  dst.company = src.company
AND dst.campaign_id = src.cid
AND dst.id = src.id
-- On match: keep latest loaded
WHEN MATCHED
    AND dst.updated_on_utc < src.timestamp_utc
    AND src.r = 1
THEN UPDATE SET
    email = src.email
    , updated_on_utc = src.timestamp_utc
WHEN NOT MATCHED AND src.r = 1 THEN INSERT VALUES (
    src.id
    , src.email
    , src.timestamp_utc
    , src.license_name  -- one of the elided fields
    , src.cid
)
;
This runs for a very long time (30 minutes for 7 GB of Avro-compressed data on disk).
I wonder if there are any SQL ways to improve it.
ROW_NUMBER() is here to deduplicate the source table, so that the MATCHED clauses only see the latest row per key (hence ORDER BY timestamp_utc DESC and r = 1).
One thing I am not sure of is this warning from Hive:
SQL Standard requires that an error is raised if the ON clause is such
that more than 1 row in source matches a row in target. This check is
computationally expensive and may affect the overall runtime of a
MERGE statement significantly. hive.merge.cardinality.check=false may
be used to disable the check at your own risk. If the check is
disabled, but the statement has such a cross join effect, it may lead
to data corruption.
I do indeed disable the cardinality check: although the ON clause might match two source rows, those are reduced to one by the r = 1 condition in the MATCHED clauses.
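For reference, the check is disabled per session like this (the property name comes from the warning quoted above):
SET hive.merge.cardinality.check=false;
-- ...then run the MERGE in the same session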
Overall I like this MERGE statement but it is just too slow and any help would be appreciated.
Note that the destination table is partitioned. The source table is not, as it is an external table that must be fully merged, and therefore fully scanned, on every run (in the background, already-merged data files are removed and new files are added before the next run). I am not sure partitioning would help in that case.
What I have done:
played with HDFS/Hive/YARN configuration
tried a temporary table (2 steps) instead of a single MERGE; the run time jumped to more than 2 hours.

Option 1: Move the where src.r = 1 filter inside the src subquery and check the merge performance. This will reduce the number of source rows before the merge.
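A minimal sketch of that reshaping (column list abbreviated): because r is a window-function alias it has to be filtered from one query level up, and the src.r = 1 conditions can then be dropped from the WHEN clauses:
MERGE INTO destination dst
USING (
    SELECT * FROM (
        SELECT
            company
            , contact_id AS id
            , ct.cid AS cid
            , email
            , timestamp_utc
            -- ...other fields...
            , ROW_NUMBER() OVER (
                PARTITION BY company, ct.cid, contact_id
                ORDER BY timestamp_utc DESC
            ) AS r
        FROM source
        LATERAL VIEW explode(campaign_id) ct AS cid
    ) t
    WHERE t.r = 1  -- dedupe here, before the join
) src
ON dst.company = src.company
AND dst.campaign_id = src.cid
AND dst.id = src.id
-- WHEN clauses as before, minus the "AND src.r = 1" conditions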
The other two options do not require ACID mode; both do a full rewrite of the target.
Option 2: Rewrite using UNION ALL + row_number (this should be the fastest one):
insert overwrite table destination
select
    company
    , id
    , cid
    , email
    , timestamp_utc
    -- add more fields
from
(
    select --dedupe, select last updated rows using row_number
        s.*
        , ROW_NUMBER() OVER (PARTITION BY company, cid, id ORDER BY timestamp_utc DESC) as rn
    from
    (
        select --union all source and target
            company
            , contact_id as id
            , ct.cid as cid
            , email
            , timestamp_utc
            -- add more fields
        from source LATERAL VIEW explode(campaign_id) ct AS cid
        UNION ALL
        select
            company
            , id
            , campaign_id as cid
            , email
            , updated_on_utc as timestamp_utc -- destination columns mapped back to the source names
            -- add more fields
        from destination
    ) s --union all
) s --row_number computed
where rn = 1 --filter duplicates
If the source contains a lot of duplicates, you can also apply additional row_number filtering to the source branch before the union.
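A sketch of that pre-filtering (same column list as above); this subquery would replace the plain from source ... branch of the union:
select company, id, cid, email, timestamp_utc -- add more fields
from
(
    select
        company
        , contact_id as id
        , ct.cid as cid
        , email
        , timestamp_utc
        , ROW_NUMBER() OVER (PARTITION BY company, ct.cid, contact_id
                             ORDER BY timestamp_utc DESC) as rn
    from source LATERAL VIEW explode(campaign_id) ct AS cid
) s
where rn = 1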
One more approach using full join: https://stackoverflow.com/a/37744071/2700344
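For reference, a minimal sketch of that full-join idea adapted to these tables (deduped_source is a placeholder for the r = 1 subquery from Option 1; plain COALESCE always prefers the source row, so the "update only when newer" rule would need CASE expressions comparing the timestamps instead):
insert overwrite table destination
select
    COALESCE(s.company, d.company)
    , COALESCE(s.id, d.id)
    , COALESCE(s.cid, d.campaign_id)
    , COALESCE(s.email, d.email)
    , COALESCE(s.timestamp_utc, d.updated_on_utc)
    -- add more fields
from destination d
full outer join deduped_source s
    on d.company = s.company
   and d.campaign_id = s.cid
   and d.id = s.id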

Related

Rewrite this loop as a recursive CTE?

Link to demo code and data: http://sqlfiddle.com/#!18/5811b/9
This while loop iterates over data and inserts data based on a merge condition, which is (roughly) "find the MergeRecordId in the earliest row that matches the ActionId, ParentActionId and Category, then insert relevant rows that have this Merge Id, using the current row's Merge Id". It ignores rows that have no "predecessors". Idx and OriginalIdx are helper columns.
Is it possible to rewrite this using a recursive CTE? Or should I be using a different technique?
This is currently what I have, but it obviously doesn't work because it doesn't match the earliest item (iRank = 1 over the Date):
; WITH cte AS
(
    SELECT Idx
         , ActionId
         , ParentActionId
         , Category
         , ActionDateUpdated
         , MergeRecordId
         , OriginalIdx
    FROM #Example
    UNION ALL
    SELECT -ex1.Idx
         , ex2.ActionId
         , ex2.ParentActionId
         , ex2.Category
         , ex1.ActionDateUpdated
         , ex1.MergeRecordId
         , ex2.Idx
    FROM #Example ex1
    JOIN cte ex2
        ON ex1.Category = ex2.Category
        AND ex1.ActionDateUpdated > ex2.ActionDateUpdated
    WHERE EXISTS
    (
        SELECT 1
        FROM #Example ex3
        WHERE ex1.ActionId = ex3.ActionId
        AND ex1.ParentActionId = ex3.ParentActionId
        AND ex1.Category = ex3.Category
        AND ex1.ActionDateUpdated > ex3.ActionDateUpdated
    )
)
SELECT *
FROM cte
ORDER BY ABS(Idx), ABS(OriginalIdx);
Under usual circumstances it would be trivial to get the required MergeRecordId using a sub-query, but you can't use sub-queries containing CTEs inside a recursive CTE. Without that filter, the query is useless on large datasets.
(Another quirk is the deletion of the current row in the loop, but I'm not too concerned about that.)

INSERT or UPDATE the table from SELECT in SQL Server

I have a requirement to check whether a record for the business date already exists in the table: if it does, I need to update the values for that business date from the select statement; otherwise, I have to insert them. Below is my full query, which currently only inserts:
INSERT INTO gstl_calculated_daily_fee (business_date, fee_type, fee_total, range_id, total_band_count)
select
    @tlf_business_date,
    'FEE_LOCAL_CARD',
    SUM(C.settlement_fees),
    C.range_id,
    Count(1)
From
(
    select *
    from
    (
        select
            rowNumber = @previous_mada_switch_fee_volume_based_count
                        + (ROW_NUMBER() OVER (PARTITION BY DATEPART(MONTH, x_datetime) ORDER BY x_datetime)),
            tt.x_datetime
        from gstl_trans_temp tt
        where (message_type_mapping = '0220')
          and card_type = 'GEIDP1'
          and response_code IN ('00','10','11')
          and tran_amount_req >= 5000
          AND merchant_type NOT IN (5542,5541,4829)
    ) A
    CROSS APPLY
    (
        select
            rtt.settlement_fees,
            rtt.range_id
        From gstl_mada_local_switch_fee_volume_based rtt
        where A.rowNumber >= rtt.range_start
          AND (A.rowNumber <= rtt.range_end OR rtt.range_end IS NULL)
    ) B
) C
group by CAST(C.x_datetime AS DATE), C.range_id
I have tried to use IF EXISTS, but could not fit it into the full query above:
if exists (select business_date
           from gstl_calculated_daily_fee
           where business_date = @tlf_business_date)
    UPDATE gstl_calculated_daily_fee
    SET fee_total = @total_mada_local_switch_fee_low
    WHERE fee_type = 'FEE_LOCAL_CARD'
    AND business_date = @tlf_business_date
else
    INSERT INTO
Please help.
You need a MERGE statement with a join.
Basically, our issue with MERGE is going to be that we only want to merge against a subset of the target table. To do this, we pre-filter the table as a CTE. We can also put the source table as a CTE.
Be very careful when you write a MERGE using a CTE: you must fully filter the target within the CTE to the rows you want to merge against, and then match the rows using ON.
;with source as (
    select
        business_date = @tlf_business_date,
        fee_total = SUM(B.settlement_fees),
        B.range_id,
        total_band_count = Count(1)
    From
    (
        select
            rowNumber = @previous_mada_switch_fee_volume_based_count
                        + (ROW_NUMBER() OVER (PARTITION BY DATEPART(MONTH, x_datetime) ORDER BY x_datetime)),
            tt.x_datetime
        from gstl_trans_temp tt
        where (message_type_mapping = '0220')
          and card_type = 'GEIDP1'
          and response_code IN ('00','10','11')
          and tran_amount_req >= 5000
          AND merchant_type NOT IN (5542,5541,4829)
    ) A
    CROSS APPLY
    (
        select
            rtt.settlement_fees,
            rtt.range_id
        From gstl_mada_local_switch_fee_volume_based rtt
        where A.rowNumber >= rtt.range_start
          AND (A.rowNumber <= rtt.range_end OR rtt.range_end IS NULL)
    ) B
    group by CAST(A.x_datetime AS DATE), B.range_id
),
target as (
    select business_date, fee_type, fee_total, range_id, total_band_count
    from gstl_calculated_daily_fee
    where business_date = @tlf_business_date
      AND fee_type = 'FEE_LOCAL_CARD'
)
MERGE INTO target t
USING source s
    ON t.business_date = s.business_date
   AND t.range_id = s.range_id
WHEN NOT MATCHED BY TARGET THEN INSERT
    (business_date, fee_type, fee_total, range_id, total_band_count)
    VALUES
    (s.business_date, 'FEE_LOCAL_CARD', s.fee_total, s.range_id, s.total_band_count)
WHEN MATCHED THEN UPDATE SET
    fee_total = @total_mada_local_switch_fee_low
;
The way a MERGE statement works, is that it basically does a FULL JOIN between the source and target tables, using the ON clause to match. It then applies various conditions to the resulting join and executes statements based on them.
There are three possible match conditions:
WHEN MATCHED THEN
WHEN NOT MATCHED [BY TARGET] THEN
WHEN NOT MATCHED BY SOURCE THEN
And three possible statements, all of which refer to the target table: UPDATE, INSERT, DELETE (not all are applicable in all cases obviously).
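A generic skeleton showing all three (table and column names are hypothetical):
MERGE INTO dbo.TargetTable t
USING dbo.SourceTable s
    ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET t.val = s.val              -- row exists in both
WHEN NOT MATCHED BY TARGET THEN
    INSERT (id, val) VALUES (s.id, s.val) -- row only in source
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;                               -- row only in target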
A common problem is that we only want to consider a subset of the target table. There are a number of possible solutions to this:
We could filter the matching inside the WHEN MATCHED clause, e.g. WHEN MATCHED AND target.somefilter = @somefilter. This can often cause a full table scan, though.
Instead, we put the filtered target table inside a CTE and then MERGE into that. The CTE must follow updatable-view rules, and we must select all columns we wish to insert or update. We must also make sure we fully filter the target; otherwise, if we issue a DELETE, all rows in the target table will be deleted.
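A minimal sketch of the pattern (table, column, and variable names are hypothetical), including the DELETE case that makes full filtering essential:
;with target as (
    -- fully filter the target to the slice being merged
    select id, val, batch_date
    from dbo.BigTable
    where batch_date = @batch_date
)
MERGE INTO target t
USING dbo.Staging s
    ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET t.val = s.val
WHEN NOT MATCHED BY TARGET THEN
    INSERT (id, val, batch_date) VALUES (s.id, s.val, @batch_date)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;  -- deletes only within the filtered slice, thanks to the CTE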

Oracle SQL Merge Statement with Conditions

I"m relatively new to SQL, and I'm having an issue where the target table is not being updated.
I have duplicate account # (key) with different contact information in the associated columns. I’m attempting to consolidate the contact information (source) into a single row / account number with the non duplicate contact information going into (target) extended columns.
I constructed a MERGE statement with a CASE condition to check whether the data exists in the target table. If the data is not in the target table, the information is added to the extended columns. The issue is that the target table doesn't get updated. Both the source and target tables are similarly defined.
Merge SQL - reduced query:
MERGE INTO target tgt
USING (select accountno, cell, site, contact, email1
       from (select w.accountno, w.cell, w.site, w.contact, email1
                  , row_number() over (PARTITION BY w.accountno order by accountno desc) acct
             from source w) inn
       where inn.acct = 1) src
ON (tgt.accountno = src.accountno)
WHEN MATCHED THEN
    UPDATE SET
        tgt.phone4 =
            CASE WHEN src.cell <> tgt.cell
                 THEN src.cell
            END,
        tgt.phone5 =
            CASE WHEN src.site <> tgt.site
                 THEN src.site
            END
I have validated that there is contact information in the source table for an accountno that should be added to the target table. I greatly appreciate any insight as to why the target table is not being updated.
I saw a similar question on Stack, but it didn't have a response.
Your src subquery in the USING clause returns just one arbitrary row for each accountno.
You need to aggregate them, for example using PIVOT:
with source(accountno, cell, site, contact) as ( --test data:
    select 1, 8881234567, 8881235678, 8881236789 from dual union all
    select 1, 8881234567, 8881235678, 8881236789 from dual
)
select accountno, contact,
       r1_cell, r1_site,
       r2_cell, r2_site
from (select s.*, row_number() over (partition by accountno order by cell) rn
      from source s
     )
pivot (
    max(cell) cell, max(site) site
    FOR rn IN (1 R1, 2 R2)
)
So finally you can compare r1_cell, r1_site, r2_cell, and r2_site with the destination values and use the required ones:
MERGE INTO target tgt
USING (
    select accountno, contact,
           r1_cell, r1_site,
           r2_cell, r2_site
    from (select s.*, row_number() over (partition by accountno order by cell) rn
          from source s
         )
    pivot (
        max(cell) cell, max(site) site
        FOR rn IN (1 R1, 2 R2)
    )
) src
ON (tgt.accountno = src.accountno)
WHEN MATCHED THEN
    UPDATE SET
        tgt.phone4 =
            CASE WHEN src.r1_cell <> tgt.cell
                 THEN src.r1_cell
                 ELSE src.r2_cell
            END,
        tgt.phone5 =
            CASE WHEN src.r1_site <> tgt.site
                 THEN src.r1_site
                 ELSE src.r2_site
            END
/
The issue is with the logic you used to row-number the rows with identical account numbers.
MERGE INTO target tgt
USING (select accountno, cell, site, contact, email1
       from (select w.accountno, w.cell, w.site, w.contact, email1
                  , row_number() over (PARTITION BY w.accountno order by w.accountno desc) acct
             from source w
             left join target w2
               on w.accountno = w2.accountno
             where w2.cell is null /* get records which are not in target */
            ) inn
       where inn.acct = 1
      ) src
ON (tgt.accountno = src.accountno)
WHEN MATCHED THEN
    UPDATE
       SET tgt.phone4 = src.cell,
           tgt.phone5 = src.site

Modify my SQL Server query -- returns too many rows sometimes

I need to update the following query so that it only returns one child record (remittance) per parent (claim).
Table Remit_To_Activate contains exactly one date/timestamp per claim, which is what I wanted.
But when I join the full Remittance table to it, since some claims have multiple remittances with the same date/timestamp, the outermost query returns more than one row per claim for those claim IDs.
SELECT *
FROM REMITTANCE
WHERE BILLED_AMOUNT > 0 AND ACTIVE = 0
AND REMITTANCE_UUID IN (
    SELECT REMITTANCE_UUID
    FROM Claims_Group2 G2
    INNER JOIN Remit_To_Activate t ON (
        (t.ClaimID = G2.CLAIM_ID) AND
        (t.DATE_OF_LATEST_REGULAR_REMIT = G2.CREATE_DATETIME)
    )
    where ACTIVE = 0 and BILLED_AMOUNT > 0
)
I believe the problem would be resolved if I included REMITTANCE_UUID as a column in Remit_To_Activate. That's the REAL issue. This is how I created the Remit_To_Activate table (trying to get the most recent remittance for a claim):
SELECT MAX(create_datetime) AS DATE_OF_LATEST_REMIT,
       MAX(claim_id) AS ClaimID
INTO Latest_Remit_To_Activate
FROM Claims_Group2
WHERE BILLED_AMOUNT > 0
GROUP BY Claim_ID
ORDER BY Claim_ID
Claims_Group2 contains these fields:
REMITTANCE_UUID,
CLAIM_ID,
BILLED_AMOUNT,
CREATE_DATETIME
Here are the two rows that are currently giving me the problem: they're both remittances for the SAME CLAIM, with the SAME TIMESTAMP. I only want one of them in the Remits_To_Activate table, so only ONE remittance will be "activated" per claim:
(screenshot of the two rows: identical CLAIM_ID and CREATE_DATETIME, different remittance UUIDs)
You can change your query like this:
SELECT p.*, latest_remit.DATE_OF_LATEST_REMIT
FROM Remittance AS p
INNER JOIN
    (SELECT MAX(create_datetime) AS DATE_OF_LATEST_REMIT,
            claim_id
     FROM Claims_Group2
     WHERE BILLED_AMOUNT > 0
     GROUP BY Claim_ID) AS latest_remit
    ON latest_remit.claim_id = p.claim_id;
This will give you only one row. Untested (so please run and make changes).
Without more information on the structure of your database (especially the structure of Claims_Group2 and REMITTANCE, and the relationship between them), it's not really possible to advise you on how to introduce a remittance UUID into DATE_OF_LATEST_REMIT.
Since you are using SQL Server, however, it is possible to use a window function to introduce a synthetic means to choose among remittances having the same timestamp. For example, it looks like you could approach the problem something like this:
select *
from (
select
r.*,
row_number() over (partition by cg2.claim_id order by cg2.create_datetime desc) as rn
from
remittance r
join claims_group2 cg2
on r.remittance_uuid = cg2.remittance_uuid
where
r.active = 0
and r.billed_amount > 0
and cg2.active = 0
and cg2.billed_amount > 0
) t
where t.rn = 1
Note that this does not depend on your DATE_OF_LATEST_REMIT table at all, it having been subsumed into the inline view. Note also that this will introduce one extra column into your results, though you could avoid that by enumerating the columns of table remittance in the outer select clause.
It also seems odd to be filtering on two sets of active and billed_amount columns, but that appears to follow from what you were doing in your original queries. In that vein, I urge you to check the results carefully, as lifting the filter conditions on cg2 columns up to the level of the join to remittance yields a result that may return rows that the original query did not (but never more than one per claim_id).
A co-worker offered me this elegant demonstration of a solution. I'd never used "over" or "partition" before. Works great! Thank you John and Gaurasvsa for your input.
if OBJECT_ID('tempdb..#t') is not null
    drop table #t

select *, ROW_NUMBER() over (partition by CLAIM_ID order by CLAIM_ID) as ROW_NUM
into #t
from
(
    select '2018-08-15 13:07:50.933' as CREATE_DATE, 1 as CLAIM_ID, NEWID() as REMIT_UUID
    union select '2018-08-15 13:07:50.933', 1, NEWID()
    union select '2017-12-31 10:00:00.000', 2, NEWID()
) x

select *
from #t
order by CLAIM_ID, ROW_NUM

select CREATE_DATE, MAX(CLAIM_ID), MAX(REMIT_UUID)
from #t
where ROW_NUM = 1
group by CREATE_DATE

Create View which removes multiple slices of data from table based on different criteria

The below table has PC asset information and I need to remove slices of data from it based on different criteria.
I need to create a View in SQL Server 2005 which returns my results.
I tried to accomplish my goals using temporary tables until I realized that I could not use temporary tables in a View.
I then tried to use a CTE until I realized that deleting data from a CTE would also delete data from the actual table.
I cannot delete data from the actual table. I cannot create another table in the database either.
The table has 160,000 records.
The table:
TABLE dsm_hardware_basic
(
[UUID] binary(16) -- Randomly generated 16 digit key that is unique for each record, only column with no duplicate rows.
[HostUUID] binary(16) -- Randomly generated 16 digit key, column has duplicate rows.
[Name] nvarchar(255) -- Column that contains hostnames of computer assets. Example of record: PCASSET001. Column has duplicate rows.
[LastAgentExecution] datetime -- The last time that the software agent that collects asset information ran on the PC.
[HostName] nvarchar(255) -- The fully qualified domain name of the PC. Example of record: PCASSET001.companydomain.com. Column has duplicate rows.
)
I will explain what I want to accomplish:
1) Read in all the information from the table dbo.dsm_hardware_basic. Let's call this dsm_hardware_basic_copy.
2) Query dbo.dsm_hardware_basic and remove data that fits the following criteria from dsm_hardware_basic_copy.
This basically removes the duplicate [HostUUID] rows with the oldest [LastAgentExecution] time:
SELECT dsm_hardware_basic.[HostUUID]
     , MIN(dsm_hardware_basic.[LastAgentExecution]) AS [LastAgentExecution]
FROM dsm_hardware_basic
WHERE dsm_hardware_basic.[HostUUID] <> ''
GROUP BY dsm_hardware_basic.[HostUUID]
HAVING COUNT(*) = 2 -- The tiny number of rows where this count is >2 will be left alone.
3) Additionally query dbo.dsm_hardware_basic and remove data that fits the following criteria from dsm_hardware_basic_copy.
This basically removes the duplicate [HostName] rows with the oldest [LastAgentExecution] time:
SELECT dsm_hardware_basic.[HostName]
     , MIN(dsm_hardware_basic.[LastAgentExecution]) AS [LastAgentExecution]
FROM dsm_hardware_basic
WHERE dsm_hardware_basic.[HostName] <> ''
GROUP BY dsm_hardware_basic.[HostName]
HAVING COUNT(*) > 1
I wasn't sure how to express this in the above select, but not only should the COUNT of [HostName] be > 1; [Name] should also equal everything in [HostName] before the first period. Example [Name]: PCASSET001. Example [HostName]: PCASSET001.companydomain.com. I know this sounds strange considering the kind of PC data we are talking about in these two columns, but it is something I actually need to contend with.
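A hedged sketch of that extra condition in T-SQL (appending a period guards against a [HostName] with no dot):
-- additional predicate for the [HostName] query's WHERE clause:
AND dsm_hardware_basic.[Name] =
    LEFT(dsm_hardware_basic.[HostName],
         CHARINDEX('.', dsm_hardware_basic.[HostName] + '.') - 1)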
4) Additionally query dbo.dsm_hardware_basic and remove data that fits the following criteria from dsm_hardware_basic_copy.
This basically removes the duplicate [Name] rows with the oldest [LastAgentExecution] time:
SELECT dsm_hardware_basic.[Name]
     , MIN(dsm_hardware_basic.[LastAgentExecution]) AS [LastAgentExecution]
FROM dsm_hardware_basic
WHERE dsm_hardware_basic.[Name] <> ''
GROUP BY dsm_hardware_basic.[Name]
HAVING COUNT(*) = 2 -- The tiny number of rows where this count is >2 will be left alone.
You've actually asked several different questions here, and I'm not sure I completely follow the logic of the query; however, constructing it should not be too difficult.
To start with, you can work with dsm_hardware_basic directly rather than a copy:
SELECT *
FROM dsm_hardware_basic
Now the part that "removes the duplicate [HostUUID] with the oldest [LastAgentExecution] time":
SELECT dsm_hardware_basic.*
FROM dsm_hardware_basic
INNER JOIN
(
    SELECT [UUID], ROW_NUMBER() OVER
        (PARTITION BY [HostUUID]
         ORDER BY [LastAgentExecution] DESC) AS host_UUID_rank
    FROM dsm_hardware_basic
    WHERE [HostUUID] <> ''
) AS duplicate_host_UUID_filtered
    ON dsm_hardware_basic.UUID = duplicate_host_UUID_filtered.UUID
    AND duplicate_host_UUID_filtered.host_UUID_rank = 1
What we've done is partition your table by HostUUID, sort each partition by newest LastAgentExecution, and keep only the top-ranked UUID via the JOIN, so every older duplicate drops out of the query.
We can now apply the same logic to your HostName:
SELECT dsm_hardware_basic.*
FROM dsm_hardware_basic
INNER JOIN
(
    SELECT [UUID], ROW_NUMBER() OVER
        (PARTITION BY [HostUUID]
         ORDER BY [LastAgentExecution] DESC) AS host_UUID_rank
    FROM dsm_hardware_basic
    WHERE [HostUUID] <> ''
) AS duplicate_host_UUID_filtered
    ON dsm_hardware_basic.UUID = duplicate_host_UUID_filtered.UUID
    AND duplicate_host_UUID_filtered.host_UUID_rank = 1
INNER JOIN
(
    SELECT [UUID], ROW_NUMBER() OVER
        (PARTITION BY [HostName]
         ORDER BY [LastAgentExecution] DESC) AS host_name_rank
    FROM dsm_hardware_basic
    WHERE [HostName] <> ''
) AS duplicate_HostName_filtered
    ON dsm_hardware_basic.UUID = duplicate_HostName_filtered.UUID
    AND duplicate_HostName_filtered.host_name_rank = 1
I'll leave the final part to you as an exercise; see the sketch below. Finally, once you're done debugging, just add CREATE VIEW to this.
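For reference, a sketch of that final part, following the same pattern for [Name] (alias names are mine; the name-matches-hostname-prefix condition from the question could also go in this subquery's WHERE clause):
INNER JOIN
(
    SELECT [UUID], ROW_NUMBER() OVER
        (PARTITION BY [Name]
         ORDER BY [LastAgentExecution] DESC) AS name_rank
    FROM dsm_hardware_basic
    WHERE [Name] <> ''
) AS duplicate_Name_filtered
    ON dsm_hardware_basic.UUID = duplicate_Name_filtered.UUID
    AND duplicate_Name_filtered.name_rank = 1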