BigQuery MERGE statement with NESTED+REPEATED fields

I need to do a merge statement in BigQuery using a classic flat table, having as target a table with nested and repeated fields, and I'm having trouble understanding how this is supposed to work. Google's examples use direct values, so the syntax here is not really clear to me.
Using this example:
CREATE OR REPLACE TABLE
mydataset.DIM_PERSONA (
IdPersona STRING,
Status STRING,
Properties ARRAY<STRUCT<
Id STRING,
Value STRING,
_loadingDate TIMESTAMP,
_lastModifiedDate TIMESTAMP
>>,
_loadingDate TIMESTAMP NOT NULL,
_lastModifiedDate TIMESTAMP
);
INSERT INTO mydataset.DIM_PERSONA
values
('A', 'KO', [('FamilyMembers', '2', CURRENT_TIMESTAMP(), TIMESTAMP(NULL))], CURRENT_TIMESTAMP(), TIMESTAMP(NULL)),
('B', 'KO', [('FamilyMembers', '4', CURRENT_TIMESTAMP(), TIMESTAMP(NULL)),('Pets', '1', CURRENT_TIMESTAMP(), NULL)], CURRENT_TIMESTAMP(), TIMESTAMP(NULL))
;
CREATE OR REPLACE TABLE
mydataset.PERSONA (
IdPersona STRING,
Status STRING,
IdProperty STRING,
Value STRING
);
INSERT INTO mydataset.PERSONA
VALUES('A', 'OK','Pets','3'),('B', 'OK','FamilyMembers','5'),('C', 'OK','Pets','2')
The goal is to:
Update IdPersona='A', adding a new element in Properties and changing Status
Update IdPersona='B', updating the existing element in Properties
Insert IdPersona='C'
This INSERT works:
MERGE INTO mydataset.DIM_PERSONA TRG
USING (
SELECT
IdPersona,
Status,
ARRAY(
SELECT AS STRUCT
IdProperty,
Value,
CURRENT_TIMESTAMP(),
TIMESTAMP(NULL)
) Properties,
CURRENT_TIMESTAMP(),
TIMESTAMP(NULL)
FROM mydataset.PERSONA
) SRC ON TRG.IdPersona=SRC.IdPersona
WHEN NOT MATCHED THEN
INSERT VALUES (IdPersona, Status, Properties, CURRENT_TIMESTAMP(), TIMESTAMP(NULL))
But I would like to build the nested/repeated fields in the INSERT clause, because for the UPDATE I would also need (I think) to do a "SELECT AS STRUCT * REPLACE" by comparing the values of TRG with SRC.
This doesn't work:
MERGE INTO mydataset.DIM_PERSONA TRG
USING (
SELECT
*
FROM mydataset.PERSONA
) SRC ON TRG.IdPersona=SRC.IdPersona
WHEN NOT MATCHED THEN
INSERT VALUES (
IdPersona,
Status,
ARRAY(
SELECT AS STRUCT
IdProperty,
Value,
CURRENT_TIMESTAMP(),
TIMESTAMP(NULL)
),
CURRENT_TIMESTAMP(),
TIMESTAMP(NULL)
)
I get "Correlated Subquery is unsupported in INSERT clause."
Even if I used the first option, I don't see how to reference TRG.Properties in the UPDATE:
WHEN MATCHED THEN
UPDATE
SET Properties = ARRAY(
SELECT AS STRUCT p_SRC.*
REPLACE (IF(p_SRC.IdProperty=p_TRG.id AND p_SRC.Value<>p_TRG.Value,p_SRC.Value,p_TRG.Value) AS Value)
FROM SRC.Properties p_SRC, TRG.Properties p_TRG
)
Obviously this is wrong, though.
One way to solve this, as I see it, is to pre-join everything in the USING clause, therefore doing all the replacement there, but it feels very wrong for a merge statement.
Can anyone help me figure this out, please? :\

So, I wanted to share a possible solution, although I still hope there's another way.
As mentioned, I pre-compute what I need with a CTE and a FULL OUTER JOIN, therefore recreating the array of structs I need later on (tables will be relatively small so I can afford it).
MERGE INTO mydataset.DIM_PERSONA TRG
USING (
WITH NEW_PROPERTIES AS (
SELECT
COALESCE(idp,IdPersona) IdPersona,
ARRAY_AGG((
SELECT AS STRUCT
COALESCE(idpro,Id) IdProperty,
COALESCE(vl,Value) Value,
COALESCE(_loadingDate,CURRENT_TIMESTAMP) _loadingDate,
IF(idp=IdPersona,CURRENT_TIMESTAMP,TIMESTAMP(NULL)) _lastModifiedDate
)) Properties
FROM (
SELECT DIP.IdPersona, DIP.Status, DIP_PR.*, PER.IdPersona idp, PER.Status st, PER.IdProperty idpro, PER.Value vl
FROM `clean-yew-281811.mydataset.DIM_PERSONA` DIP
CROSS JOIN UNNEST(DIP.Properties) DIP_PR
FULL OUTER JOIN mydataset.PERSONA PER
ON DIP.IdPersona=PER.IdPersona
AND DIP_PR.Id=PER.IdProperty
)
GROUP BY IdPersona
)
SELECT
IdPersona,
'subquery to do here' Status,
NP.Properties
FROM (SELECT DISTINCT IdPersona FROM mydataset.PERSONA) PE
LEFT JOIN NEW_PROPERTIES NP USING (IdPersona)
) SRC ON TRG.IdPersona=SRC.IdPersona
WHEN NOT MATCHED THEN
INSERT VALUES (IdPersona, Status, Properties, CURRENT_TIMESTAMP(), TIMESTAMP(NULL))
WHEN MATCHED THEN
UPDATE
SET
TRG.Status = SRC.Status,
TRG.Properties = SRC.Properties,
TRG._lastModifiedDate = CURRENT_TIMESTAMP()
This works but I'm pretty much avoiding the syntax to update an array of structs, as what I'm doing is a rebuild and replace operation. Hopefully someone can suggest a better way.
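To make the intended result of the rebuild-and-replace approach easier to check, here is a minimal sketch of the same merge logic in plain Python rather than BigQuery SQL. It assumes the Properties array can be modelled as a dict keyed by property Id, and the function name merge_personas and the fixed timestamps are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timezone


def merge_personas(target, source_rows, now):
    """Merge flat (IdPersona, Status, IdProperty, Value) rows into nested records.

    `target` maps IdPersona -> record; each record holds a "Properties" dict
    keyed by property Id (an assumption standing in for ARRAY<STRUCT<...>>).
    Returns a new dict and leaves `target` untouched.
    """
    merged = {
        pid: {**rec, "Properties": {k: dict(v) for k, v in rec["Properties"].items()}}
        for pid, rec in target.items()
    }
    new_ids = set()
    for id_persona, status, id_property, value in source_rows:
        rec = merged.get(id_persona)
        if rec is None:
            # Equivalent of WHEN NOT MATCHED: create the persona.
            rec = merged[id_persona] = {
                "Status": status,
                "Properties": {},
                "_loadingDate": now,
                "_lastModifiedDate": None,
            }
            new_ids.add(id_persona)
        elif id_persona not in new_ids:
            # Equivalent of WHEN MATCHED: refresh the top-level columns.
            rec["Status"] = status
            rec["_lastModifiedDate"] = now
        prop = rec["Properties"].get(id_property)
        if prop is None:
            # New element in the repeated field.
            rec["Properties"][id_property] = {
                "Value": value, "_loadingDate": now, "_lastModifiedDate": None,
            }
        elif prop["Value"] != value:
            # Existing element whose value changed.
            prop["Value"] = value
            prop["_lastModifiedDate"] = now
    return merged


t0 = datetime(2020, 1, 1, tzinfo=timezone.utc)   # hypothetical initial load time
now = datetime(2020, 6, 1, tzinfo=timezone.utc)  # hypothetical merge time


def prop(value):
    return {"Value": value, "_loadingDate": t0, "_lastModifiedDate": None}


target = {
    "A": {"Status": "KO", "Properties": {"FamilyMembers": prop("2")},
          "_loadingDate": t0, "_lastModifiedDate": None},
    "B": {"Status": "KO", "Properties": {"FamilyMembers": prop("4"), "Pets": prop("1")},
          "_loadingDate": t0, "_lastModifiedDate": None},
}
source = [("A", "OK", "Pets", "3"), ("B", "OK", "FamilyMembers", "5"), ("C", "OK", "Pets", "2")]
result = merge_personas(target, source, now)
```

With the sample data from the question, A gains a Pets element, B keeps Pets and gets an updated FamilyMembers, and C is created. Comparing this against the rows produced by the MERGE is one way to validate the SQL.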

Although you did not provide your desired output, I was able to create a query based on the objectives you described, your code, and the sample data you provided.
Following the below goals:
Update IdPersona='A', adding a new element in Properties and changing Status
Update IdPersona='B', updating the existent element in Properties
Insert IdPersona='C'
Instead of doing a replace and rebuild operation, I used:
MERGE: in order to perform the updates and insert the new rows, such as IdPersona = "C".
INSERT: within MERGE it is not possible to use INSERT with WHEN MATCHED. Thus, in order to add a new Property when IdPersona="A", this statement was used after the MERGE operations.
CREATE TABLE: after using INSERT, the new Properties when IdPersona="A" are not aggregated, since we did not use WHEN MATCHED. So, the final table DIM_PERSONA is replaced in order to aggregate the results properly.
LEFT JOIN: in order to add the fields _loadingDate and _lastModifiedDate, which are not aggregated into the ARRAY<STRUCT<>>.
Below is the query with the proper comments:
#first step update current values and insert new IdPersonas
MERGE sample.DIM_PERSONA_test2 T
USING sample.PERSONA_test2 S
ON T.IdPersona = S.IdPersona
#update A but not insert
WHEN MATCHED AND T.IdPersona ="A" THEN
UPDATE SET STATUS = "OK"
#update B
WHEN MATCHED AND T.IdPersona ="B" THEN
UPDATE SET Properties = [( S.IdProperty, S.Value, TIMESTAMP(NULL), TIMESTAMP(NULL) )]
#insert what is not in the target table
WHEN NOT MATCHED THEN
INSERT(IdPersona, Status , Properties, _loadingDate, _lastModifiedDate ) VALUES (S.IdPersona, S.Status, [( IdProperty,Value, TIMESTAMP(NULL), TIMESTAMP(NULL))], CURRENT_TIMESTAMP(), TIMESTAMP(NULL));
#insert new values when IdPersona="A"
#you will see the result won't be aggregated properly
INSERT INTO sample.DIM_PERSONA_test2(IdPersona, Status , Properties, _loadingDate, _lastModifiedDate)
SELECT IdPersona, Status,[( IdProperty,Value, TIMESTAMP(NULL), TIMESTAMP(NULL))], CURRENT_TIMESTAMP(), TIMESTAMP(NULL) from sample.PERSONA_test2
where IdPersona = "A";
#replace the above table to recreate the ARRAY<STRUCT<>>
CREATE OR REPLACE TABLE sample.DIM_PERSONA_FINAL_test2 AS(
SELECT t1.*, t2._loadingDate,t2._lastModifiedDate
FROM( SELECT a.IdPersona,
a.Status,
ARRAY_AGG(STRUCT( Properties.Id as Id, Properties.Value as Value, Properties._loadingDate ,
Properties._lastModifiedDate AS _lastModifiedDate)) AS Properties
FROM sample.DIM_PERSONA_test2 a, UNNEST(Properties) as Properties
GROUP BY 1,2
ORDER BY a.IdPersona)t1 LEFT JOIN sample.DIM_PERSONA_test2 t2 USING(IdPersona)
)
Notice that when updating the ARRAY<STRUCT<>>, the values are wrapped within [()]. Lastly, note that the output contains two records with IdPersona="A": _loadingDate is required, so it cannot be NULL, and because of CURRENT_TIMESTAMP() there are two different values for this field, hence two different records.

Related

How to return ids of rows with conflicting values?

I am looking to insert or update values in an SQLite database (version > 3.35) while avoiding multiple queries. UPSERT along with RETURNING seems promising:
CREATE TABLE phonebook2(
name TEXT PRIMARY KEY,
phonenumber TEXT,
validDate DATE
);
INSERT INTO phonebook2(name,phonenumber,validDate)
VALUES('Alice','704-555-1212','2018-05-08')
ON CONFLICT(name) DO UPDATE SET
phonenumber=excluded.phonenumber,
validDate=excluded.validDate
WHERE excluded.validDate>phonebook2.validDate RETURNING name;
This helps me track names corresponding to inserted/modified rows. How to find rows where phonebook2 values conflict with values upserted in above statement, but no insert or update happened due to where clause?
The RETURNING clause can't be used to get non-affected rows.
What you can do is execute a SELECT statement before the UPSERT:
WITH cte(name, phonenumber, validDate) AS (VALUES
('Alice', '704-555-1212', '2018-05-08'),
('Bob','804-555-1212', '2018-05-09')
)
SELECT *
FROM phonebook2 p
WHERE EXISTS (
SELECT *
FROM cte c
WHERE c.name = p.name AND c.validDate <= p.validDate
);
In the CTE you may include as many tuples as you want.
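As a rough illustration of how this pre-check could be run from client code, here is a small sketch using Python's built-in sqlite3 module. The helper name find_skipped_conflicts and the sample rows already in phonebook2 are assumptions made for the example; the query itself is the one shown above, narrowed to return only the name column.

```python
import sqlite3


def find_skipped_conflicts(conn, rows):
    """Return names of existing rows that an upsert of `rows` would leave untouched.

    `rows` must be a non-empty list of (name, phonenumber, validDate) tuples.
    """
    placeholders = ", ".join(["(?, ?, ?)"] * len(rows))
    sql = f"""
        WITH cte(name, phonenumber, validDate) AS (VALUES {placeholders})
        SELECT p.name
        FROM phonebook2 p
        WHERE EXISTS (
            SELECT 1
            FROM cte c
            WHERE c.name = p.name AND c.validDate <= p.validDate
        )
        ORDER BY p.name
    """
    params = [value for row in rows for value in row]
    return [name for (name,) in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE phonebook2(name TEXT PRIMARY KEY, phonenumber TEXT, validDate DATE)"
)
# Hypothetical existing data: Alice is already up to date, Bob is stale.
conn.executemany(
    "INSERT INTO phonebook2 VALUES (?, ?, ?)",
    [("Alice", "704-555-0000", "2018-05-08"), ("Bob", "804-555-0000", "2018-01-01")],
)
incoming = [
    ("Alice", "704-555-1212", "2018-05-08"),
    ("Bob", "804-555-1212", "2018-05-09"),
]
skipped = find_skipped_conflicts(conn, incoming)
```

Here skipped contains only Alice, because her incoming validDate is not newer than the stored one, while Bob would be updated by the upsert. Running the check and the upsert in the same transaction keeps the two results consistent with each other.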

Snowflake: unable to reference column in target table inside a case predicate for notMatchedClause in a MERGE command

It seems like I am able to reference a column of the target table in a case predicate for a matchedClause in a MERGE command, but am unable to do so in a notMatchedClause.
For example I create two tables and insert some values to them as below.
create table test_tab_a (
name string,
something string
);
create table test_tab_b (
name string,
something string
);
insert into test_tab_a values ('a', 'b');
insert into test_tab_a values ('c', 'z');
insert into test_tab_b values ('a', 'c');
insert into test_tab_b values ('c', 'z');
Then run a merge command as below and works just fine.
merge into public.test_tab_a as target
using (
select * from public.test_tab_b
) src
on target.name = src.name
when matched and target.SOMETHING = src.something then delete;
However when I run a command using a not matched clause, I get an invalid identifier error.
merge into public.test_tab_a as target
using (
select * from public.test_tab_b
) src
on target.name = src.name
when not matched and src.SOMETHING != target.something then insert values (name, something);
Why is the case_predicate evaluated differently depending on the type of clause?
Interesting find. I get the same thing and do not see anything in the documentation mentioning that it is not available, so I would recommend submitting a support case with Snowflake.
As a workaround, you could add logic to your subselect by joining to test_tab_a, like this:
merge into test_tab_a as a
using (
select sub_b.name, sub_b.something
from test_tab_b sub_b
inner join test_tab_a sub_a on sub_b.name = sub_a.name
where sub_b.SOMETHING != sub_a.something
) b
on a.name = b.name
when not matched then insert values (name, something);
Adding some more detail: I also checked other RDBMSs and haven't seen specific documentation on whether they support this behavior either (specifically WHEN NOT MATCHED plus an AND condition referencing the target table). Is this query pattern coming from another RDBMS?
Are there other pieces of this Merge that have been left off for simplicity? It seems like an Insert/Left Join is more useful than a Merge in this case.
insert into public.test_tab_a
select b.name,b.something from public.test_tab_b b
left join public.test_tab_a a
on a.name = b.name
where a.something != b.something
I will raise a Snowflake feature request for this use-case.
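A likely explanation (my assumption, not something taken from the Snowflake documentation) is that a not-matched source row has no target row at all, so there is nothing for a target column to refer to. The sketch below shows the same situation with a LEFT JOIN in SQLite via Python's sqlite3 module; the extra source row 'd' is hypothetical and added only so that an unmatched row exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test_tab_a (name TEXT, something TEXT);
    CREATE TABLE test_tab_b (name TEXT, something TEXT);
    INSERT INTO test_tab_a VALUES ('a', 'b'), ('c', 'z');
    -- 'd' is a hypothetical extra row with no counterpart in test_tab_a
    INSERT INTO test_tab_b VALUES ('a', 'c'), ('c', 'z'), ('d', 'y');
""")

# Source rows without a matching target row: the analogue of WHEN NOT MATCHED.
unmatched = conn.execute("""
    SELECT src.name,
           target.something,
           src.something != target.something AS differs
    FROM test_tab_b AS src
    LEFT JOIN test_tab_a AS target ON target.name = src.name
    WHERE target.name IS NULL
""").fetchall()
```

For the unmatched row, both the target column and the comparison come back as NULL (None in Python), so a predicate such as src.something != target.something could never be true for it even if the engine accepted it.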

PostgreSQL can't UPSERT with a "WITH"

I want to upsert a value with a WITH, like this:
WITH counted as (
SELECT votant, count(*) as nbvotes
FROM votes
WHERE votant = '123456'
GROUP BY votant
)
INSERT INTO badges(id, badge, conditions, niveau, date_obtention)
VALUES('123456', 'category', c.nbvotes, 1, current_timestamp)
ON CONFLICT (id, badge)
DO UPDATE badges b
SET b.conditions = c.nbvotes
FROM counted c
WHERE b.id = c.votant AND b.badge = 'category'
The console tells me I have an error on "badges" just after "DO UPDATE"
I really don't understand what goes wrong here. If anybody could give me a hand, it would be great :)
As documented in the manual, the badges b after the DO UPDATE part is wrong, and unnecessary if you think about it. The target table is already defined by the INSERT part.
But you also don't need a FROM or join to the original value.
So just use:
...
ON CONFLICT (id, badge)
DO UPDATE
SET conditions = '{"a":"loooool"}';
If you need to access the original values, you can use the excluded record to refer to it, e.g.
SET conditions = EXCLUDED.conditions
which in your case would refer to the rows provided in the VALUES clause ('{"a":"lol"}' in your example).
And target columns of an UPDATE cannot be table-qualified. So just SET conditions = ...
If you want to use the result of the CTE as the source of the INSERT, you need to use an INSERT ... SELECT. You can't use a FROM clause in the DO UPDATE part of an INSERT.
WITH counted as (
SELECT votant, count(*) as nbvotes
FROM votes
WHERE votant = '123456'
GROUP BY votant
)
INSERT INTO badges(id, badge, conditions, niveau, date_obtention)
SELECT '123456', 'category', c.nbvotes, 1, current_timestamp
FROM counted c
ON CONFLICT (id, badge)
DO UPDATE
SET conditions = excluded.conditions
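For anyone who wants to try the INSERT ... SELECT ... ON CONFLICT pattern without a PostgreSQL server, here is a small sketch of the analogous statement in SQLite, run through Python's sqlite3 module. This is a port, not the PostgreSQL statement itself: the table definitions are assumptions, and SQLite needs an explicit WHERE clause before ON CONFLICT when the source is a SELECT.

```python
import sqlite3

UPSERT_BADGE = """
    WITH counted AS (
        SELECT votant, count(*) AS nbvotes
        FROM votes
        WHERE votant = :votant
        GROUP BY votant
    )
    INSERT INTO badges(id, badge, conditions, niveau, date_obtention)
    SELECT c.votant, 'category', c.nbvotes, 1, CURRENT_TIMESTAMP
    FROM counted c
    WHERE true  -- SQLite-specific: keeps ON from being parsed as a join condition
    ON CONFLICT (id, badge)
    DO UPDATE SET conditions = excluded.conditions
"""

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the question does not show the table definitions.
conn.executescript("""
    CREATE TABLE votes (votant TEXT);
    CREATE TABLE badges (
        id TEXT,
        badge TEXT,
        conditions INTEGER,
        niveau INTEGER,
        date_obtention TEXT,
        PRIMARY KEY (id, badge)
    );
    INSERT INTO votes VALUES ('123456'), ('123456'), ('123456');
""")

# First run inserts the badge row with the current vote count.
conn.execute(UPSERT_BADGE, {"votant": "123456"})
first = conn.execute("SELECT conditions FROM badges").fetchall()

# One more vote, then the same statement updates the existing row.
conn.execute("INSERT INTO votes VALUES ('123456')")
conn.execute(UPSERT_BADGE, {"votant": "123456"})
second = conn.execute("SELECT id, badge, conditions FROM badges").fetchall()
```

The second run hits the (id, badge) conflict and overwrites conditions with excluded.conditions, so the table still holds a single row.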

Insert into Postgres table only if record exists

I am writing some sql in Postgres to update an audit table. My sql will update the table being audited based on some criteria and then select that updated record to update information in an audit table. This is what I have so far:
DO $$
DECLARE
jsonValue json;
revId int;
template RECORD;
BEGIN
jsonValue = '...Some JSON...'
UPDATE projectTemplate set json = jsonValue where type='InstallationProject' AND account_id IS NULL;
template := (SELECT pt FROM ProjectTemplate pt WHERE pt.type='InstallationProject' AND pt.account_id IS NULL);
IF EXISTS (template) THEN
(
revId := nextval('hibernate_sequence');
insert into revisionentity (id, timestamp) values(revId, extract(epoch from CURRENT_TIMESTAMP));
insert into projectTemplate_aud (rev, revtype, id, name, type, category, validfrom, json, account_id)
VALUES (revId, 1, template.id, template.name, template.type, template.category, template.validfrom, jsonValue, template.account_id);
)
END $$;
My understanding is that template will be undefined if there is nothing in the table that matches that query (and there isn't currently). I want to make it so my query will not attempt to update the audit table if template doesn't exist.
What can I do to update this sql to match what I am trying to do?
You cannot use EXISTS like that, it expects a subquery expression. Plus some other issues with your code.
This single SQL DML statement with data-modifying CTEs should replace your DO command properly. And faster, too:
WITH upd AS (
UPDATE ProjectTemplate
SET json = '...Some JSON...'
WHERE type = 'InstallationProject'
AND account_id IS NULL
RETURNING *
)
, ins AS (
INSERT INTO revisionentity (id, timestamp)
SELECT nextval('hibernate_sequence'), extract(epoch FROM CURRENT_TIMESTAMP)
WHERE EXISTS (SELECT FROM upd) -- minimal valid EXISTS expression!
RETURNING id
)
INSERT INTO ProjectTemplate_aud
(rev , revtype, id, name, type, category, validfrom, json, account_id)
SELECT i.id, 1 , u.id, u.name, u.type, u.category, u.validfrom, u.json, u.account_id
FROM upd u, ins i;
This inserts a single row into revisionentity if the UPDATE found any rows.
It inserts as many rows into projectTemplate_aud as rows have been updated.
About data-modifying CTEs:
Insert data in 3 tables at a time using Postgres
Aside: I see a mix of CaMeL-case, some underscores, or just lowercased names. Consider legal, lower-case names exclusively (and avoid basic type names as column names). Most importantly, though, be consistent. Related:
Are PostgreSQL column names case-sensitive?
Misnamed field in subquery leads to join
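If the same "audit only when something was updated" rule has to be applied from client code, or on an engine without data-modifying CTEs, a different technique is needed: separate statements inside one transaction. The sketch below does that with Python's sqlite3 module. The reduced schema, the function name update_with_audit, and the use of an auto-assigned revisionentity id in place of hibernate_sequence are all assumptions for illustration.

```python
import sqlite3
import time

MATCH = "type = 'InstallationProject' AND account_id IS NULL"


def update_with_audit(conn, json_value):
    """Update matching templates; write audit rows only if any row was updated."""
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(
            f"UPDATE projecttemplate SET json = ? WHERE {MATCH}", (json_value,)
        )
        if cur.rowcount == 0:
            return 0  # nothing updated: skip revisionentity and the audit table
        rev_id = conn.execute(
            "INSERT INTO revisionentity (timestamp) VALUES (?)", (int(time.time()),)
        ).lastrowid
        conn.execute(
            f"""
            INSERT INTO projecttemplate_aud (rev, revtype, id, name, type, json, account_id)
            SELECT ?, 1, id, name, type, json, account_id
            FROM projecttemplate
            WHERE {MATCH}
            """,
            (rev_id,),
        )
        return cur.rowcount


conn = sqlite3.connect(":memory:")
# Hypothetical, reduced schema (category and validfrom are left out for brevity).
conn.executescript("""
    CREATE TABLE projecttemplate (
        id INTEGER PRIMARY KEY, name TEXT, type TEXT, json TEXT, account_id INTEGER);
    CREATE TABLE revisionentity (id INTEGER PRIMARY KEY, timestamp INTEGER);
    CREATE TABLE projecttemplate_aud (
        rev INTEGER, revtype INTEGER, id INTEGER, name TEXT, type TEXT,
        json TEXT, account_id INTEGER);
""")

no_match = update_with_audit(conn, '{"v": 1}')  # table is still empty
conn.execute(
    "INSERT INTO projecttemplate (name, type, json, account_id) "
    "VALUES ('default', 'InstallationProject', '{}', NULL)"
)
one_match = update_with_audit(conn, '{"v": 2}')
audit_rows = conn.execute("SELECT rev, revtype, json FROM projecttemplate_aud").fetchall()
revisions = conn.execute("SELECT count(*) FROM revisionentity").fetchone()[0]
```

The first call finds nothing and writes no revision; the second call updates one template and produces exactly one revision and one audit row. In PostgreSQL the single-statement CTE version above remains the more compact option.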

Multiple row insert or select if exists

CREATE TABLE object (
object_id serial,
object_attribute_1 integer,
object_attribute_2 VARCHAR(255)
)
-- primary key object_id
-- btree index on object_attribute_1, object_attribute_2
Here is what I currently have:
SELECT * FROM object
WHERE (object_attribute_1=100 AND object_attribute_2='Some String') OR
(object_attribute_1=200 AND object_attribute_2='Some other String') OR
(..another row..) OR
(..another row..)
When the query returns, I check for what is missing (thus, does not exist in the database).
Then I will make a multiple-row insert:
INSERT INTO object (object_attribute_1, object_attribute_2)
VALUES (info, info), (info, info),(info, info)
Then I will select what I just inserted
SELECT ... WHERE (condition) OR (condition) OR ...
And at last, I will merge the two selects on the client side.
Is there a way to combine these 3 queries into one single query, where I provide all the data, INSERT the records that do not already exist, and then do a SELECT at the end?
Your suspicion was well founded. Do it all in a single statement using a data-modifying CTE (Postgres 9.1+):
WITH list(object_attribute_1, object_attribute_2) AS (
VALUES
(100, 'Some String')
, (200, 'Some other String')
, .....
)
, ins AS (
INSERT INTO object (object_attribute_1, object_attribute_2)
SELECT l.*
FROM list l
LEFT JOIN object o1 USING (object_attribute_1, object_attribute_2)
WHERE o1.object_attribute_1 IS NULL
RETURNING *
)
SELECT * FROM ins -- newly inserted rows
UNION ALL -- append pre-existing rows
SELECT o.*
FROM list l
JOIN object o USING (object_attribute_1, object_attribute_2);
Note, there is a tiny time frame for a race condition. So this might break if many clients try it at the same time. If you are working under heavy concurrent load, consider this related answer, in particular the part on locking or serializable transaction isolation:
Postgresql batch insert or ignore
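To experiment with the insert-missing-then-select idea locally, here is a sketch using SQLite through Python's sqlite3 module. SQLite has no data-modifying CTEs, so this version swaps in two statements inside one transaction instead of the single PostgreSQL statement; the helper name get_or_create and the simplified table definition are assumptions.

```python
import sqlite3


def get_or_create(conn, pairs):
    """Insert the (attribute_1, attribute_2) pairs that are missing, then return
    all rows matching `pairs` (pre-existing and newly inserted).

    `pairs` must be a non-empty list of 2-tuples.
    """
    values = ", ".join(["(?, ?)"] * len(pairs))
    params = [value for pair in pairs for value in pair]
    list_cte = f"WITH list(object_attribute_1, object_attribute_2) AS (VALUES {values})"
    with conn:  # both statements run in one transaction
        # Anti-join: keep only the listed pairs that have no row in object yet.
        conn.execute(
            f"""
            {list_cte}
            INSERT INTO object (object_attribute_1, object_attribute_2)
            SELECT l.object_attribute_1, l.object_attribute_2
            FROM list l
            LEFT JOIN object o USING (object_attribute_1, object_attribute_2)
            WHERE o.object_id IS NULL
            """,
            params,
        )
        return conn.execute(
            f"""
            {list_cte}
            SELECT o.object_id, o.object_attribute_1, o.object_attribute_2
            FROM list l
            JOIN object o USING (object_attribute_1, object_attribute_2)
            ORDER BY o.object_id
            """,
            params,
        ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE object ("
    "object_id INTEGER PRIMARY KEY, "
    "object_attribute_1 INTEGER, "
    "object_attribute_2 TEXT)"
)
conn.execute("INSERT INTO object (object_attribute_1, object_attribute_2) VALUES (100, 'Some String')")

wanted = [(100, "Some String"), (200, "Some other String")]
rows = get_or_create(conn, wanted)        # inserts only the second pair
rows_again = get_or_create(conn, wanted)  # inserts nothing the second time
total = conn.execute("SELECT count(*) FROM object").fetchone()[0]
```

Calling the helper twice returns the same two rows and leaves the table at two rows, which is the idempotent behaviour the single PostgreSQL statement aims for. The concurrency caveat above still applies to any multi-client setup.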