BigQuery Merge - Insert new rows if not matched - sql

I'm trying to write a Merge query in Google BigQuery (part of an ETL process).
I have Source (staging) and Target tables, and there are two ways to merge the data: the classic 'Upsert' MERGE, or inserting a new row when not all columns match.
This is an example of the first way (the classic 'Upsert') query:
MERGE DS.Target T
USING DS.Source S
ON T.Key=S.Key
WHEN NOT MATCHED THEN
INSERT ROW
WHEN MATCHED THEN
UPDATE SET Col1 = S.Col1, Col2 = S.Col2
That way, if the key exists it always updates the column values, even if they are the same. Also, this will work only if the key is not nullable.
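(As an aside, a WHEN MATCHED AND guard can in principle skip those no-op updates; this is just a sketch, and it would still need NULL handling for nullable columns:)
MERGE DS.Target T
USING DS.Source S
ON T.Key=S.Key
WHEN NOT MATCHED THEN
INSERT ROW
WHEN MATCHED AND (T.Col1 != S.Col1 OR T.Col2 != S.Col2) THEN
UPDATE SET Col1 = S.Col1, Col2 = S.Col2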
The other way is to insert a new row when the values don't match:
MERGE DS.Target T
USING DS.Source S
ON T.A = S.A and T.B = S.B and T.C = S.C
WHEN NOT MATCHED THEN
INSERT ROW
I prefer this way, BUT I found that it doesn't work when a column value is NULL, because NULL != NULL, so the condition is false whenever the values are NULL.
I can't find a proper way to write this query and handle the NULL comparison.
It's not possible to check for NULLs in the merge condition, e.g.:
ON ((T.A IS NULL and S.A IS NULL) or T.A = S.A)
WHEN NOT MATCHED THEN
INSERT ROW
Error message:
RIGHT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.
It's also not possible to reference the Target table in the WHEN NOT MATCHED search condition, e.g.:
ON T.A = S.A
WHEN NOT MATCHED AND
S.A IS NOT NULL AND T.A IS NOT NULL
THEN
INSERT ROW
What do you suggest? Also, supposing both ways were possible, which would be more cost-effective in BigQuery? I'd guess the performance would be the same, and I assume I can ignore the insertion cost. Thanks!

Can you use a "magic" number or id?
This works:
CREATE OR REPLACE TABLE temp.target AS
SELECT * FROM UNNEST(
[STRUCT(1 AS A, 2 AS B, 3 AS C, 5 AS d)
, (null, 1, 3, 500)
]);
CREATE OR REPLACE TABLE temp.source AS
SELECT * FROM UNNEST(
[STRUCT(1 AS A, 2 AS B, 3 AS C, 100 AS d)
, (1, 1, 1, 1000)
, (null, null, null, 10000)
, (null, 1, 3, 10000)
]);
MERGE temp.target T
USING temp.source S
ON IFNULL(T.A, -9999999) = IFNULL(S.A, -9999999)
AND IFNULL(T.B, -9999999) = IFNULL(S.B, -9999999)
AND IFNULL(T.C, -9999999) = IFNULL(S.C, -9999999)
WHEN NOT MATCHED THEN
INSERT ROW;
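The same trick works for STRING columns; as a sketch, assuming a sentinel like '<NULL>' (and a hypothetical name column) never appears in the real data:
MERGE temp.target T
USING temp.source S
ON IFNULL(T.name, '<NULL>') = IFNULL(S.name, '<NULL>')
WHEN NOT MATCHED THEN
INSERT ROW;
Whatever the column type, pick a sentinel that provably cannot collide with legitimate values.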

Related

Update json in postgres using multiple columns

I want to update a column of JSONB objects. So if I have this table,
I want to delete value1 and value2 from the rows that have a as 1. I thought this query would work:
UPDATE
test AS d
SET
b = b - s.b_value
FROM
(VALUES
(1, 'value1'),
(1, 'value2')
)
AS s(a, b_value)
WHERE
d.a = s.a
but it gives me this result, where value1 was not eliminated.
Is there a simple way to fix it? I want to make a query to delete this sort of stuff, and it would be a blessing if it could be done in only one query. I got the original idea from here, and here you can test the SQL query.
You can subtract a text[] array of keys from a jsonb value like so:
with s (a, b_value) as (
values (1, 'value1'), (1, 'value2')
), dels as (
select a, array_agg(b_value) as b_values
from s
group by a
)
update test
set b = b - dels.b_values
from dels
where dels.a = test.a;
db<>fiddle here
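For reference, a minimal setup to try this on (assuming a table shaped like the one in the question; names and values are made up):
create table test (a int, b jsonb);
insert into test values
(1, '{"value1": 10, "value2": 20, "value3": 30}'),
(2, '{"value1": 10}');
After the update, the row with a = 1 keeps only {"value3": 30}.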

Postgresql update column based on set of values from another table

Dummy data to illustrate my problem:
create table table1 (category_id int,unit varchar,is_valid bool);
insert into table1 (category_id, unit, is_valid)
VALUES (1, 'a', true), (2, 'z', true);
create table table2 (category_id int,unit varchar);
insert into table2 (category_id, unit)
values(1, 'a'),(1, 'b'),(1, 'c'),(2, 'd'),(2, 'e');
So the data looks like:
Table 1:
category_id | unit | is_valid
1           | a    | true
2           | z    | true
Table 2:
category_id | unit
1           | a
1           | b
1           | c
2           | d
2           | e
I want to update the is_valid column in Table 1, if the category_id/unit combination from Table 1 doesn't match any of the rows in Table 2. For example, the first row in Table 1 is valid, since (1, a) is in Table 2. However, the second row in Table 1 is not valid, since (2, z) is not in Table 2.
How can I update the column using postgresql? I tried a few different where clauses of the form
UPDATE table1 SET is_valid = false WHERE...
but I cannot get a WHERE clause that works how I want.
You can just set the value of is_valid to the result of a where exists (select ...) predicate. See Demo.
update table1 t1
set is_valid = exists (select null
from table2 t2
where (t2.category_id, t2.unit) = (t1.category_id, t1.unit)
);
NOTES:
Advantage: The query correctly sets the is_valid column regardless of the current value, and it is a very simple query.
Disadvantage: The query sets the value of is_valid for every row in the table, even those already correctly set.
You need to decide whether the disadvantage outweighs the advantage. If it does, use the same basic technique in a much more complicated query:
with to_valid (category_id, unit, is_valid) as
(select category_id
, unit
, exists (select null
from table2 t2
where (t2.category_id, t2.unit) = (t1.category_id, t1.unit)
)
from table1 t1
)
update table1 tu
set is_valid = to_valid.is_valid
from to_valid
where (tu.category_id, tu.unit) = (to_valid.category_id, to_valid.unit)
and tu.is_valid is distinct from to_valid.is_valid;
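With the question's sample data, either version should leave table1 looking like this (a quick check to run afterwards):
select * from table1 order by category_id;
category_id | unit | is_valid
1           | a    | true
2           | z    | false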

Save value in local variable HANA SQL Script

I'm trying to take value from a non-empty row and overwrite it in the subsequent rows until another non-empty row appears and then write that in the subsequent rows. Coming from ABAP Background, I'm not sure how to accomplish this in HANA SQL Script. Here's a picture to show what the data looks like.
Basically 'Doe, John' should be overwritten into all the empty rows until 'Doe, Jane' appears and then 'Doe, Jane' should be overwritten into empty rows until another name appears.
My idea is to store the non-empty row in a local variable, but I haven't had much success so far. Here's my code:
tempTab1 = SELECT
CASE WHEN EMPLOYEE <> ''
THEN lv_emp = EMPLOYEE
ELSE EMPLOYEE
END AS EMPLOYEE,
FROM :tempTab;
In general, rows in a dataset are unordered until you explicitly specify an ORDER BY clause in the SQL. If you observe some order, it may be a side effect and can vary. So first of all you have to explicitly create a row-number column (assume its name is RECORD).
Then you should go this way:
1. Select only rows with non-empty data in the column.
2. Use LEAD(RECORD) OVER (ORDER BY RECORD) to identify the next non-empty record number.
3. Join your source dataset to the dataset defined in step 2 on a BETWEEN-style condition for the RECORD field.
with a as (
select 1 as record, 'Val1' as field1 from dummy union
select 2 as record, '' as field1 from dummy union
select 3 as record, '' as field1 from dummy union
select 4 as record, 'Val2' as field1 from dummy union
select 5 as record, '' as field1 from dummy union
select 6 as record, '' as field1 from dummy union
select 7 as record, '' as field1 from dummy union
select 8 as record, 'Val3' as field1 from dummy
)
, fill_base as (
select field1, record, lead(record, 1, record) over(order by record asc) as next_record
from a
where field1 <> '' and field1 is not null
)
select
a.record
, case
when a.field1 = '' or a.field1 is null
then f.field1
else a.field1
end as field1
, a.field1 as field1_original
from a
left join fill_base as f
on a.record > f.record
and a.record < f.next_record
The performance in HANA may be bad in some cases, since it handles window functions poorly.
Here is another, more elegant solution with two nested window functions that does not force you to write multiple selects for each column: How to make LAG() ignore NULLS in SQL Server?
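As a hedged sketch of that nested-window idea, adapted to the example above (it reuses the CTE a from that query and assumes your HANA version supports running window aggregates): a running count of non-empty values assigns each empty row to the group of the last non-empty row, and MAX within the group propagates the value, since any non-empty string sorts above '':
select record,
max(field1) over (partition by grp) as field1_filled
from (
select record, field1,
count(case when field1 <> '' and field1 is not null then 1 end) over (order by record) as grp
from a
) t
order by record;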
You can use window aggregate function LAST_VALUE to achieve the imputation of missing values.
Sample Data
CREATE TABLE sample (id integer, sort integer, value varchar(10));
INSERT INTO sample VALUES (4711, 1, 'Hello');
INSERT INTO sample VALUES (4712, 2, null);
INSERT INTO sample VALUES (4713, 3, null);
INSERT INTO sample VALUES (4714, 4, 'World');
INSERT INTO sample VALUES (4715, 5, null);
INSERT INTO sample VALUES (4716, 6, '!');
Generate a new column with imputed values
SELECT base.*, LAST_VALUE(fill.value ORDER BY fill.sort) AS value_imputed
FROM sample base
LEFT JOIN sample fill ON fill.sort <= base.sort AND fill.value IS NOT NULL
GROUP BY base.id, base.sort, base.value
ORDER BY base.id, base.sort
Result (with the sample data above):
id   | sort | value | value_imputed
4711 | 1    | Hello | Hello
4712 | 2    | null  | Hello
4713 | 3    | null  | Hello
4714 | 4    | World | World
4715 | 5    | null  | World
4716 | 6    | !     | !
Note that sort could be anything determining the order (e.g. a timestamp).

If condition in SQL file vertica

I need to execute insertions for around 10 tables. Before inserting I have to check a condition, which is the same for each of the tables. Instead of putting that condition inside every insert query, I'd like to express it once as an if condition (a select query): if it is satisfied, execute the insert statements. Is there a way to write an if condition in a Vertica SQL file? If the condition is not satisfied, I don't want to execute any of the insert queries.
If the condition is, for example, that you only insert the data on a Sunday, try this:
a) a test table:
CREATE LOCAL TEMPORARY TABLE input(id,name)
ON COMMIT PRESERVE ROWS AS
SELECT 42,'Arthur Dent'
UNION ALL SELECT 43,'Ford Prefect'
UNION ALL SELECT 44,'Tricia McMillan'
KSAFE 0;
From this table, select with a WHERE condition that tests whether it's Sunday -- that's all, see here:
SELECT
*
FROM input
WHERE TRIM(TO_CHAR(CURRENT_DATE,'Day'))='Sunday'
;
id|name
42|Arthur Dent
43|Ford Prefect
44|Tricia McMillan
With a different value for the week day (I'm writing this on a Sunday...), you get this:
SELECT
*
FROM input
WHERE TRIM(TO_CHAR(CURRENT_DATE,'Day'))='Monday'
;
id|name
select succeeded; 0 rows fetched
I use this technique in SQL-generating-SQL: depending on circumstances, a query generates either a full script or an empty file, and that script is then called, implementing conditional SQL execution that way.
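A hedged sketch of that pattern using vsql meta-commands (the file and table names are hypothetical; \t suppresses headers and footers so the spooled file contains either the generated statement or nothing at all):
\t
\o /tmp/conditional_inserts.sql
SELECT 'INSERT INTO target SELECT * FROM staging;'
FROM dual
WHERE TRIM(TO_CHAR(CURRENT_DATE,'Day'))='Sunday';
\o
\t
\i /tmp/conditional_inserts.sql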
I know it is an old post, but I want to share how I recently solved my problem.
I need to insert into one table or another based on some condition. You first should have a field or value that will be your search condition.
create table tmp (
Col1 int null
,Col2 varchar(100) null
)
--Insert values
insert into tmp (Col1,Col2) Values
(1,'Text1')
,(2,'Text2')
--Insert into table001
insert into table001
select
t.field1
,t.field2
,......
from table1 t
inner join tmp t2
on t2.col1 = t.ColX
where 1 = case when t2.Col2 = 'Text1' then 1 else 0 end --Search condition: if the CASE yields 0 the predicate is false and nothing is inserted; otherwise the rows are inserted.
--Insert into table002
insert into table002
select
t.field1
,t.field2
,......
from table2 t
inner join tmp t2
on t2.col1 = t.ColX
where 1 = case when t2.Col2 = 'Text2' then 1 else 0 end --Search condition: if the CASE yields 0 the predicate is false and nothing is inserted; otherwise the rows are inserted.
Or you can use a UNION/UNION ALL based on this if it is the same working table.
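For instance, a sketch of that UNION ALL variant, assuming both branches load the same target table (names reused from the snippet above):
insert into table001
select t.field1, t.field2
from table1 t
inner join tmp t2 on t2.col1 = t.ColX
where t2.Col2 = 'Text1'
union all
select t.field1, t.field2
from table2 t
inner join tmp t2 on t2.col1 = t.ColX
where t2.Col2 = 'Text2';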
Regards!

SQL Server Merge - Output returning null deleted rows as inserted

I'm using MERGE to sync data in a table, and I'm seeing some behavior from SQL Server that looks wrong to me. When I OUTPUT the INSERTED.* values and a row was deleted, the MERGE command returns a row with all NULL columns for each row that was deleted.
For example, take this schema:
CREATE TABLE tbl
(
col1 INT NOT NULL,
col2 INT NOT NULL
);
I do an initial load of data, and all 4 rows are outputted as expected.
WITH data1 AS (
SELECT 1 [col1],1 [col2]
UNION ALL SELECT 2 [col1],2 [col2]
UNION ALL SELECT 3 [col1],3 [col2]
UNION ALL SELECT 4 [col1],4 [col2]
)
MERGE tbl t
USING data1 s
ON t.col1 = s.col1 AND t.col2 = s.col2
WHEN NOT MATCHED BY TARGET
THEN INSERT (col1,col2) VALUES (s.col1,s.col2)
WHEN NOT MATCHED BY SOURCE
THEN DELETE
OUTPUT INSERTED.*;
Now, say I remove 2 rows from the data I'm syncing with the table (in my CTE) and run the same MERGE: I see 2 rows of all-NULL columns returned.
WITH data1 as (
SELECT 1 [col1],1 [col2]
UNION ALL SELECT 2 [col1],2 [col2]
)
MERGE tbl t
USING data1 s
ON t.col1 = s.col1 AND t.col2 = s.col2
WHEN NOT MATCHED BY TARGET
THEN INSERT (col1,col2) VALUES (s.col1,s.col2)
WHEN NOT MATCHED BY SOURCE
THEN DELETE
OUTPUT INSERTED.*;
To me, this seems like wrong behavior because A) I didn't ask for any deleted rows and B) it makes it seem like I inserted these 2 NULL rows into my table, which I clearly did not. Can anyone shed some light on what's happening?
From the documentation:
output_clause - Returns a row for every row in target_table that is updated, inserted, or deleted, in no particular order. $action can be specified in the output clause. $action is a column of type nvarchar(10) that returns one of three values for each row: 'INSERT', 'UPDATE', or 'DELETE', according to the action that was performed on that row.
It seems that SQL Server outputs one row for every row that changed (by an insert or a delete). When I specify OUTPUT INSERTED.*, I'm really only selecting the inserted side, which is NULL for the 2 rows that were deleted. If I specify OUTPUT INSERTED.col1 [InsCol1], INSERTED.col2 [InsCol2], DELETED.col1 [DelCol1], DELETED.col2 [DelCol2], $action then I can see a better picture of what's happening.
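Concretely, here is the second MERGE with that fuller OUTPUT clause (same tables as above):
WITH data1 AS (
SELECT 1 [col1],1 [col2]
UNION ALL SELECT 2 [col1],2 [col2]
)
MERGE tbl t
USING data1 s
ON t.col1 = s.col1 AND t.col2 = s.col2
WHEN NOT MATCHED BY TARGET
THEN INSERT (col1,col2) VALUES (s.col1,s.col2)
WHEN NOT MATCHED BY SOURCE
THEN DELETE
OUTPUT $action [Action],
INSERTED.col1 [InsCol1], INSERTED.col2 [InsCol2],
DELETED.col1 [DelCol1], DELETED.col2 [DelCol2];
The two deleted rows now come back with Action = 'DELETE', NULLs in InsCol1/InsCol2, and the removed values in DelCol1/DelCol2, which confirms the all-NULL rows were never inserted data in the first place.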
Thanks to Laurence for your comment.