I am using a standard SQL MERGE to update a regular target table based on a source external table that is a set of CSV files in a bucket. Here is a simplified input file:
$ gsutil cat gs://dolphin-dev-raw/demo/input/demo_20191125_20200505050505.tsv
"id" "PortfolioCode" "ValuationDate" "load_checksum"
"1" "CIMDI000TT" "2020-03-28" "checksum1"
The MERGE statement is:
MERGE xx_producer_conformed.demo T
USING xx_producer_raw.demo_raw S
ON
S.id = T.id
WHEN NOT MATCHED THEN
INSERT (id, PortfolioCode, ValuationDate, load_checksum, insert_time, file_name, extract_timestamp, wf_id)
VALUES (id, PortfolioCode, ValuationDate, load_checksum, CURRENT_TIMESTAMP(), _FILE_NAME, REGEXP_EXTRACT(_FILE_NAME, '.*_[0-9]{8}_([0-9]{14}).tsv'),'scheduled__2020-08-19T16:24:00+00:00')
WHEN MATCHED AND S.load_checksum != T.load_checksum THEN UPDATE SET
T.id = S.id, T.PortfolioCode = S.PortfolioCode, T.ValuationDate = S.ValuationDate, T.load_checksum = S.load_checksum, T.file_name = S._FILE_NAME, T.extract_timestamp = REGEXP_EXTRACT(_FILE_NAME, '.*_[0-9]{8}_([0-9]{14}).tsv'), T.wf_id = 'scheduled__2020-08-19T16:24:00+00:00'
If I wipe the target table and rerun the MERGE I get a row modified count of 1:
bq query --use_legacy_sql=false --location=asia-east2 "$(cat merge.sql | awk 'ORS=" "')"
Waiting on bqjob_r288f8d33_000001740b413532_1 ... (0s) Current status: DONE
Number of affected rows: 1
This successfully results in the target table updating:
$ bq query --format=csv --max_rows=10 --use_legacy_sql=false "select * from xx_producer_conformed.demo"
Waiting on bqjob_r7f6b6a46_000001740b5057a3_1 ... (0s) Current status: DONE
id,PortfolioCode,ValuationDate,load_checksum,insert_time,file_name,extract_timestamp,wf_id
1,CIMDI000TT,2020-03-28,checksum1,2020-08-20 09:44:20,gs://dolphin-dev-raw/demo/input/demo_20191125_20200505050505.tsv,20200505050505,scheduled__2020-08-19T16:24:00+00:00
If I run the MERGE again I get a row modified count of 0:
$ bq query --use_legacy_sql=false --location=asia-east2 "$(cat merge.sql | awk 'ORS=" "')"
Waiting on bqjob_r3de2f833_000001740b4161b3_1 ... (0s) Current status: DONE
Number of affected rows: 0
That results in no changes to the target table, so everything is working as expected.
The problem is that when I run the code on a more complex example, with many input files inserting into an empty target table, I end up with rows that share the same id, where count(id) is not equal to count(distinct id):
$ bq query --use_legacy_sql=false --max_rows=999999 --location=asia-east2 "select count(id) as total_records from xx_producer_conformed.xxx; select count(distinct id) as unique_records from xx_producer_conformed.xxx; "
Waiting on bqjob_r5df5bec8_000001740b7dfa50_1 ... (1s) Current status: DONE
select count(id) as total_records from xx_producer_conformed.xxx; -- at [1:1]
+---------------+
| total_records |
+---------------+
| 11582 |
+---------------+
select count(distinct id) as unique_records from xx_producer_conformed.xxx; -- at [1:78]
+----------------+
| unique_records |
+----------------+
| 5722 |
+----------------+
This surprises me, as my expectation was that the underlying logic would step through each line in each underlying file, insert on the first occurrence of an id, and then update on any subsequent occurrence of that id. So my expectation is that you cannot end up with more rows than there are unique ids in the input bucket.
If I then try to run the MERGE again it fails telling me that there is more than one row in the target table with the same id:
$ bq query --use_legacy_sql=false --location=asia-east2 "$(cat merge.sql | awk 'ORS=" "')"
Waiting on bqjob_r2fe783fc_000001740b8271aa_1 ... (0s) Current status: DONE
Error in query string: Error processing job 'xxxx-10843454-datamesh-
dev:bqjob_r2fe783fc_000001740b8271aa_1': UPDATE/MERGE must match at most one
source row for each target row
I was expecting that there would be no two rows with the same "id" when the MERGE statement does its inserts.
All the tables and queries used are generated from a file that lists the "business columns". So the simple demo example above is identical to the full-scale queries in terms of the logic and joins in the MERGE statement.
Why would the MERGE query above result in rows with duplicated "id" and how do I fix this?
The problem is very easily repeatable by wiping the target table and supplying a duplicated copy of a relatively large input file as the input:
AAAA_20200805_20200814200000.tsv
AAAA_clone_20200805_20200814200000.tsv
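With the cloned file in place, the raw external table already contains two source rows per id, which can be confirmed before running the MERGE. A sketch against the simplified demo_raw table (the full-scale raw table works the same way):
SELECT id, COUNT(*) AS copies, ARRAY_AGG(DISTINCT file_name) AS source_files
FROM (
  SELECT id, _FILE_NAME AS file_name
  FROM xx_producer_raw.demo_raw
) r
GROUP BY id
HAVING COUNT(*) > 1
ORDER BY copies DESC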
I believe that what is at the heart of this is parallelism. A single large MERGE over many files can spawn many worker threads running in parallel. It would be very slow for any two workers loading different files to immediately "see" each other's inserts; rather, I expect they run independently, writing into separate buffers without seeing each other's writes. When the buffers are finally combined, this leads to multiple inserts with the same id.
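In fact the same effect can be reproduced without any files at all: if the source of the MERGE contains two rows with the same id and the target is empty, both rows satisfy WHEN NOT MATCHED, because each source row is evaluated against the target as it existed before the statement started, so both are inserted. A minimal sketch with literal values (the literal types, and the assumption that the remaining target columns are nullable, are mine and not part of the real schema):
MERGE xx_producer_conformed.demo T
USING (
  SELECT '1' AS id, 'checksum1' AS load_checksum
  UNION ALL
  SELECT '1' AS id, 'checksum1' AS load_checksum
) S
ON S.id = T.id
WHEN NOT MATCHED THEN
  INSERT (id, load_checksum) VALUES (id, load_checksum)
Running this against an empty demo table leaves two rows with id '1', which matches the pattern seen at scale.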
To fix this I am using some CTEs to pick the latest record for any id, based on extract_timestamp, by using ROW_NUMBER() OVER (PARTITION BY id ORDER BY extract_timestamp DESC). We can then filter on row_num = 1 to pick the latest version of each record. The full query is:
MERGE xx_producer_conformed.demo T
USING (
WITH cteExtractTimestamp AS (
SELECT
id, PortfolioCode, ValuationDate, load_checksum
, _FILE_NAME
, REGEXP_EXTRACT(_FILE_NAME, '.*_[0-9]{8}_([0-9]{14}).tsv') AS extract_timestamp
FROM
xx_producer_raw.demo_raw
),
cteRanked AS (
SELECT
id, PortfolioCode, ValuationDate, load_checksum
, _FILE_NAME
, extract_timestamp
, ROW_NUMBER() OVER (PARTITION BY id ORDER BY extract_timestamp DESC) AS row_num
FROM
cteExtractTimestamp
)
SELECT
id, PortfolioCode, ValuationDate, load_checksum
, _FILE_NAME
, extract_timestamp
, row_num
, "{{ task_instance.xcom_pull(task_ids='get_run_id') }}" AS wf_id
FROM cteRanked
WHERE row_num = 1
) S
ON
S.id = T.id
WHEN NOT MATCHED THEN
INSERT (id, PortfolioCode, ValuationDate, load_checksum, insert_time, file_name, extract_timestamp, wf_id)
VALUES (id, PortfolioCode, ValuationDate, load_checksum, CURRENT_TIMESTAMP(), _FILE_NAME, extract_timestamp, wf_id)
WHEN MATCHED AND S.load_checksum != T.load_checksum THEN UPDATE SET
T.id = S.id, T.PortfolioCode = S.PortfolioCode, T.ValuationDate = S.ValuationDate, T.load_checksum = S.load_checksum, T.file_name = S._FILE_NAME, T.extract_timestamp = S.extract_timestamp, T.wf_id = S.wf_id
This means that cloning a file without changing the extract_timestamp in the filename will pick one of the two rows at random. In normal running we would expect subsequent extracts that contain updated data to arrive in a source file with a new extract_timestamp. The above query will then pick the newest record to merge into the target table.
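If a deterministic choice is preferred when two files share the same extract_timestamp, the ranking can be given a tie-breaker. For example (this second ORDER BY key is my assumption, not part of the generated query):
ROW_NUMBER() OVER (PARTITION BY id ORDER BY extract_timestamp DESC, _FILE_NAME DESC) AS row_num
With that change the clone whose file name sorts last always wins, rather than an arbitrary row.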
Related
I currently have 2 tables estimate_details and delivery_service.
estimate_details has a column called event that has events such as: checkout, buildOrder
delivery_service has a column called source that has events such as: makeBasket, buildPurchase
checkout in estimate_details is equivalent to makeBasket in delivery_service, and buildOrder is equivalent to buildPurchase.
estimate_details
+----+------------+-----+
| id | event      | ... |
+----+------------+-----+
| 1  | checkout   | ... |
| 2  | buildOrder | ... |
+----+------------+-----+
delivery_service
+----+---------------+--------------+-----+
| id | source        | date         | ... |
+----+---------------+--------------+-----+
| 1  | makeBasket    | '2022-10-01' | ... |
| 2  | buildPurchase | '2022-10-02' | ... |
| 1  | makeBasket    | '2022-10-20' | ... |
+----+---------------+--------------+-----+
I would like to be able to join the tables on the event and source columns where checkout = makeBasket and buildOrder = buildPurchase.
Also, if there are multiple records for a specific id and source in delivery_service, choose the latest one.
How would I be able to do this? I cannot UPDATE either table to have the same values as the other table.
I still want all the data from estimate_details, but would like the latest records from the delivery_service.
The Expected output in this situation would be:
+----+------------+--------------+-----+
| id | event      | Date         | ... |
+----+------------+--------------+-----+
| 1  | checkout   | '2022-10-20' | ... |
| 2  | buildOrder | '2022-10-02' | ... |
+----+------------+--------------+-----+
The best approach here is to use a CTE, which is like a subquery but more readable.
So first, in the CTE, you will use the delivery_service table to get the max date for each id and source. Then you handle the text values, manually replacing them so they match those in estimate_details:
WITH delivery_service_cte AS (
SELECT
id
, CASE
WHEN source = 'makeBasket' THEN 'checkout'
WHEN source = 'buildPurchase' THEN 'buildOrder'
END AS source
, MAX(date) AS date
FROM
delivery_service
GROUP BY
1, 2
)
SELECT
ed.* -- select whichever columns you want from here
, ds.id
, ds.source
, ds.date
FROM
estimate_details ed
LEFT JOIN
-- or JOIN (you didn't give enough info on what you are trying to achieve
-- in the output)
delivery_service_cte ds
ON ds.id = ed.id
AND ds.source = ed.event
I am retrieving the set of sales quotations that contain a given product within the bill of materials. I'm doing that in two steps: step 1, retrieve all DISTINCT quote numbers which contain a given product (by product number).
The second step, retrieve the full quote, with all products listed for each unique quote number.
So far, so good. Now the tough bit. Some rows are duplicates, some are not. Those that are duplicates (quote number & quote version & line number) might or might not have maintenance on them. I want to pick the row that has maintenance greater than 0. The duplicate rows I want to exclude are those that have a 0 maintenance. The problem is that some rows, which have no duplicates, have 0 maintenance, so I can't just filter on maintenance.
To make this exciting, the database holds quotes over 20+ years. And the data science guys have just admitted that the ETL process may have some bugs...
--- step 0
--- cleanup the workspace
SET CLIENT_ENCODING TO 'UTF8';
DROP TABLE IF EXISTS product_quotes;
--- step 1
--- get list of Product Quotes
CREATE TEMPORARY TABLE product_quotes AS (
SELECT DISTINCT master_quote_number
FROM w_quote_line_d
WHERE item_number IN ( << model numbers >> )
);
--- step 2
--- Now join on that list
SELECT
d.quote_line_number,
d.item_number,
d.item_description,
d.item_quantity,
d.unit_of_measure,
f.ref_list_price_amount,
f.quote_amount_entered,
f.negtd_discount,
--- need to calculate discount rate based on list price and negtd discount (%)
CASE
WHEN ref_list_price_amount > 0
THEN 100 - (ref_list_price_amount + negtd_discount) / ref_list_price_amount *100
ELSE 0
END AS discount_percent,
f.warranty_months,
f.master_quote_number,
f.quote_version_number,
f.maintenance_months,
f.territory_wid,
f.district_wid,
f.sales_rep_wid,
f.sales_organization_wid,
f.install_at_customer_wid,
f.ship_to_customer_wid,
f.bill_to_customer_wid,
f.sold_to_customer_wid,
d.net_value,
d.deal_score,
f.transaction_date,
f.reporting_date
FROM w_quote_line_d d
INNER JOIN product_quotes pq ON (pq.master_quote_number = d.master_quote_number)
INNER JOIN w_quote_f f ON
(f.quote_line_number = d.quote_line_number
AND f.master_quote_number = d.master_quote_number
AND f.quote_version_number = d.quote_version_number)
WHERE d.net_value >= 0 AND item_quantity > 0
ORDER BY f.master_quote_number, f.quote_version_number, d.quote_line_number
The logic to filter the duplicate rows is like this:
For each master_quote_number / version_number pair, check to see if there are duplicate line numbers. If so, pick the one with maintenance > 0.
Even in a CASE statement, I'm not sure how to write that.
Thoughts? The database is Postgres but any SQL logic should help.
I think you will want to use Window Functions. They are, in a word, awesome.
Here is a query that would "dedupe" based on your criteria:
select *
from (
select
* -- simplifying here to show the important parts
,row_number() over (
partition by f.master_quote_number, f.quote_version_number, d.quote_line_number
order by f.maintenance_months desc) as seqnum
from w_quote_line_d d
inner join product_quotes pq
on (pq.master_quote_number = d.master_quote_number)
inner join w_quote_f f
on (f.quote_line_number = d.quote_line_number
and f.master_quote_number = d.master_quote_number
and f.quote_version_number = d.quote_version_number)
) x
where seqnum = 1
The use of row_number() and the chosen partition by and order by criteria guarantee that only ONE row for each combination of quote_number/version_number/line_number will get the value of 1, and it will be the one with the highest value in maintenance_months (if your colleagues are right, there would only be one with a value > 0 anyway).
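Since the database is Postgres, DISTINCT ON is a compact alternative to the window function. A sketch using the real column names from the question's query:
select distinct on (f.master_quote_number, f.quote_version_number, d.quote_line_number)
    d.*, f.*
from w_quote_line_d d
inner join product_quotes pq
    on (pq.master_quote_number = d.master_quote_number)
inner join w_quote_f f
    on (f.quote_line_number = d.quote_line_number
        and f.master_quote_number = d.master_quote_number
        and f.quote_version_number = d.quote_version_number)
order by f.master_quote_number, f.quote_version_number, d.quote_line_number, f.maintenance_months desc;
Postgres keeps the first row of each (quote, version, line) group according to the ORDER BY, which here is the row with the largest maintenance_months.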
Can you do something like...
select
*
from
w_quote_line_d d
inner join
(
select
...
,max(maintenance) as maintenance
from
w_quote_line_d
group by
...
) d1
on
d1.id = d.id
and d1.maintenance = d.maintenance;
Am I understanding your problem correctly?
Edit: Forgot the group by!
I'm not sure, but maybe you could Group By all other columns and use MAX(Maintenance) to get only the greatest.
What do you think?
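A minimal sketch of that idea against the fact table, assuming maintenance_months is the column that differs between duplicated lines:
select master_quote_number,
       quote_version_number,
       quote_line_number,
       max(maintenance_months) as maintenance_months
from w_quote_f
group by master_quote_number, quote_version_number, quote_line_number;
This collapses each duplicated line to one row, but it only works if every other selected column is either in the GROUP BY or wrapped in an aggregate.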
I have 2 tables:
asset - with id_asset, name, ticker (60k rows)
quote_close - with id_asset, refdate, quote_close (22MM rows)
I want to filter on name and ticker and return:
id_asset
name
ticker
min(refdate) of the id_asset
max(refdate) of the id_asset
quote_close on max(refdate) of the id_asset
I wrote this query:
WITH tableAssetFiltered AS
(
SELECT
id_asset, ticker, name
FROM
asset
WHERE
ticker LIKE ('%VALE%') AND name LIKE ('%PUT%')
)
SELECT
ast.id_asset, ast.ticker, ast.name,
xx.quote_close as LastQuote, xx.MinDate,
xx.refdate as LastDate
FROM
tableAssetFiltered ast
LEFT JOIN
(SELECT
qc.id_asset, qc.refdate, qc.quote_close, tm.MinDate
FROM
quote_close qc
INNER JOIN
(SELECT
t.id_asset, max(t.refdate) as MaxDate, min(t.refdate) as MinDate
FROM
(SELECT
qc.id_asset, qc.refdate, qc.quote_close
FROM
quote_close qc
WHERE
qc.id_asset IN (SELECT id_asset
FROM tableAssetFiltered)
) t
GROUP BY
t.id_asset) tm ON qc.id_asset = tm.id_asset
AND qc.refdate = tm.MaxDate
) xx ON xx.id_asset = ast.id_asset
ORDER BY
ast.ticker
The results with different filter in name and ticker are:
With ticker like ('%VALE%') AND name like ('%PUT%') it took 00:02:28 and returns 491 rows
With name like ('%PUT%') it took 00:00:02 and returns 16697 rows
With ticker like ('%VALE%') it took 00:00:02 and returns 1102 rows
With no likes it took 00:00:03 and returns 51847 rows
What I can't understand is that the query
SELECT id_asset,ticker, name
FROM Viper.dbo.asset
WHERE ticker like ('%VALE%') AND name like ('%PUT%')
took 00:00:00 to run.
Why does the query over the smaller filtered set take more time to run? Any solution to make it faster?
The slowness could be caused by many things: hardware, network, caching, etc.
To make the query faster (see the sketch after the list):
1. Make sure that there is an index on ticker.
2. Run update statistics on the table.
3. Try to find a way to remove the '%' at the beginning of the string.
This is okay: 'VALE%'
This will slow down your query: '%VALE'
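For the first two points, something along these lines should work (a sketch in T-SQL, since the question queries Viper.dbo.asset; the index name is an assumption):
CREATE INDEX IX_asset_ticker ON dbo.asset (ticker);
UPDATE STATISTICS dbo.asset;
Note that a leading wildcard such as '%VALE%' still cannot seek on that index, which is why point 3 matters most.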
I'm having a frustrating issue with SQL Server. I need to create a view from a table containing details of files loaded through ETL. The table contains a file id (unique), filename, serverid (relating to the server it has been loaded onto).
The first 2 letters of the filename are a country code, i.e. US, UK, GB, DE - there are multiple files loaded per country. I want to get the record with the highest file id for each country. The below query does this, but it returns the highest record PER SERVER, so there may be multiple file ids - i.e. it would return the highest file id for that country on server1 and on server2 - I only want the highest record, full stop.
I've played with an equivalent query on MySQL and got it working by commenting out the last line (GROUP BY t.[server_id]), which seemed to work fine, but of course MSSQLSRV needs all non-aggregates in the SELECT to be placed in the GROUP BY statement.
So, how can I get the same result in SQL Server - i.e. get one result, with the highest file_id, without getting a duplicate row for a different server_id?
Hope I'm making myself clear.
SELECT MAX(t.[file_id]) AS FID
,LEFT(t.[full_file_name], 2) AS COUNTRYCODE
,t.[server_id]
FROM [tracking_files] t
WHERE t.server_id IS NOT NULL
AND t.[server_id] = (
SELECT TOP 1 [server_id]
FROM [tracking_files] md
WHERE md.[file_id] = t.file_id
)
GROUP BY LEFT(t.[full_file_name], 2)
,t.[server_id]
EDIT:
Here is the sample data I've been playing with in MySQL, along with the result I got (which is the desired result).
In SQL Server, as I can't comment out that last GROUP BY clause, we're seeing e.g. two file_ids for GB (one for server 1 and one for server 2).
If you are using SQL Server 2005 or later you can use ROW_NUMBER():
SELECT t.File_ID,
t.full_file_name,
t.CountryCode,
t.Server_ID
FROM ( SELECT t.[File_ID],
t.full_file_name,
CountryCode = LEFT(t.full_file_name, 2),
t.Server_ID,
RowNumber = ROW_NUMBER() OVER(PARTITION BY LEFT(t.full_file_name, 2) ORDER BY [File_ID] DESC)
FROM [tracking_files] t
) t
WHERE t.RowNumber = 1;
If you are using a previous version you will need to use a subquery to get the maximum file ID per country code, then join back to your main table:
SELECT t.[File_ID],
t.full_file_name,
CountryCode = LEFT(t.full_file_name, 2),
t.Server_ID
FROM [tracking_files] t
INNER JOIN
( SELECT MaxFileID = MAX([File_ID])
FROM [tracking_files] t
GROUP BY LEFT(t.full_file_name, 2)
) MaxT
ON MaxT.MaxFileID = t.[File_ID];
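Since the stated goal is a view, the ROW_NUMBER() form can be wrapped directly (a sketch; the view name is an assumption):
CREATE VIEW dbo.vw_LatestFilePerCountry
AS
SELECT t.File_ID,
       t.full_file_name,
       t.CountryCode,
       t.Server_ID
FROM ( SELECT t.[File_ID],
              t.full_file_name,
              CountryCode = LEFT(t.full_file_name, 2),
              t.Server_ID,
              RowNumber = ROW_NUMBER() OVER(PARTITION BY LEFT(t.full_file_name, 2) ORDER BY [File_ID] DESC)
       FROM [tracking_files] t
     ) t
WHERE t.RowNumber = 1;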
I have a table of patients which has the following columns: patient_id, obs_id, obs_date. Obs_id is the ID of a clinical observation (such as weight reading, blood pressure reading....etc), and obs_date is when that observation was taken. Each patient could have several readings on different dates...etc. Currently I have a query to get all patients that had obs_id = 1 and insert them into a temporary table (has two columns, patient_id, and flag which I set to 0 here):
insert into temp_table (select patient_id, 0 from patients_table
where obs_id = 1 group by patient_id having count(*) >= 1)
I also execute an update statement to set the flag to 1 for all patients that also had obs_id = 5:
UPDATE temp_table SET flag = 1 WHERE EXISTS (
SELECT 1 FROM (
SELECT patient_id FROM patients_table WHERE obs_id = 5 group by patient_id having count(*) >= 1
) v WHERE temp_table.patient_id = v.patient_id
)
Here's my question: How do I modify both queries (without combining them or removing the group by statement) such that I can answer the following question:
"get all patients who had obs_id = 5 after obs_id = 1". If I add a min(obs_date) or max(obs_date) to the select of each query and then add "AND v.obs_date > temp_table.obs_date" to the second one, is that correct??
The reason I can't remove the group by statement or combine the queries is that they are generated by a code generator (from a web app), and I'd like to make this modification without messing up the code generator or rewriting it.
Many thanks in advance,
The advantage of SQL is that it works with sets. You don't need to create temporary tables or get all procedural.
As you describe the problem (find all patients who have obs_id 5 after obs_id 1), I'd start with something like this
select distinct p1.patient_id
from patients_table p1, patients_table p2
where
p1.obs_id = 1 and
p2.obs_id = 5 and
p2.patient_id = p1.patient_id and
p2.obs_date > p1.obs_date
Of course, that doesn't help you deal with your code generator. Sometimes, tools that make things easier can also get in the way.
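For completeness, here is a sketch of the modification the asker describes, keeping the two generated queries separate; it assumes temp_table gains an obs_date column to hold the earliest obs_id = 1 date:
insert into temp_table (select patient_id, 0, min(obs_date) from patients_table
where obs_id = 1 group by patient_id having count(*) >= 1)

UPDATE temp_table SET flag = 1 WHERE EXISTS (
SELECT 1 FROM (
SELECT patient_id, max(obs_date) as obs_date FROM patients_table WHERE obs_id = 5 group by patient_id having count(*) >= 1
) v WHERE temp_table.patient_id = v.patient_id AND v.obs_date > temp_table.obs_date
)
The flag ends up as 1 only for patients whose latest obs_id = 5 reading is later than their earliest obs_id = 1 reading, which answers "obs_id = 5 after obs_id = 1".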