BigQuery: Concatenate two arrays and keep distinct values within MERGE statement - google-bigquery

I am working on a MERGE process that updates an array field with new data, but only if a value isn't already present in the array.
target table
+----+-------------+
| id | arr_col     |
+----+-------------+
| a  | [1,2,3]     |
| b  | [0]         |
+----+-------------+
source table
+----+-------------+
| id | arr_col     |
+----+-------------+
| a  | [3,4,5]     |
| b  | [0,0]       |
+----+-------------+
target table post-merge
+----+-------------+
| id | arr_col     |
+----+-------------+
| a  | [1,2,3,4,5] |
| b  | [0]         |
+----+-------------+
I was trying to use the SQL from this answer in my MERGE statement:
merge into target t
using source
on target.id = source.id
when matched then
  update set target.arr_col = array(
    select distinct x
    from unnest(array_concat(target.arr_col, source.arr_col)) x
  )
but BigQuery shows me the following error:
Correlated Subquery is unsupported in UPDATE clause.
Is there any other way to update this array field via MERGE? The target and source tables can be quite large, and the process would run daily, so I would like incremental updates rather than recreating the entire table with new data every time.

Below is for BigQuery Standard SQL
merge into target
using (
  select id,
    array(
      select distinct x
      from unnest(source.arr_col || target.arr_col) as x
      order by x
    ) as arr_col
  from source
  join target
  using(id)
) source
on target.id = source.id
when matched then
  update set target.arr_col = source.arr_col;

Wanted to expand on Mikhail Berlyant's answer because my actual application differed a little bit from the OP's: I also needed data to be inserted when the merge condition was not met.
merge into target
using (
  select id,
    array(
      select distinct x
      from unnest(
        /*
          concat didn't work without a case-when statement for
          new data (i.e. target.id is null)
        */
        case when target.id is not null then source.arr_col || target.arr_col
        else source.arr_col
        end
      ) as x
      order by x
    ) as arr_col
  from source
  left join target /* to be able to account for brand new data in source */
  using(id)
) source
on target.id = source.id
when matched then
  update set target.arr_col = source.arr_col
when not matched then
  insert row
;

Related

PostgreSQL add new not null column and fill with ids from insert statement

I've got 2 tables.
CREATE TABLE content (
id bigserial NOT NULL,
name text
);
CREATE TABLE data (
id bigserial NOT NULL,
...
);
The tables are already filled with a lot of data.
Now I want to add a new column content_id (NOT NULL) to the data table.
It should be a foreign key to the content table.
Is it possible to automatically create an entry in the content table and set its id as content_id in the data table?
For example
content
| id | name |
| 1 | abc |
| 2 | cde |
data
| id |... |
| 1 |... |
| 2 |... |
| 3 |... |
Now I need an update statement that creates 3 (in this example) content entries and add the ids to the data table to get this result:
content
| id | name |
| 1 | abc |
| 2 | cde |
| 3 | ... |
| 4 | ... |
| 5 | ... |
data
| id |... | content_id |
| 1 |... | 3 |
| 2 |... | 4 |
| 3 |... | 5 |
demo:db<>fiddle
According to the answers presented here: How can I add a column that doesn't allow nulls in a Postgresql database?, there are several ways of adding a new NOT NULL column and filling it directly.
Basically there are 3 steps. Choose the best fit (with or without a transaction, setting a default value first and removing it afterwards, leaving out the NOT NULL constraint first and adding it afterwards, ...).
Step 1: Add the new column (without the NOT NULL constraint, because the values for the new column are not available at this point):
ALTER TABLE data ADD COLUMN content_id integer;
Step 2: Insert the data into both tables in one statement:
WITH inserted AS (   -- 1
    INSERT INTO content
    SELECT
        generate_series(
            (SELECT MAX(id) + 1 FROM content),
            (SELECT MAX(id) FROM content) + (SELECT COUNT(*) FROM data)
        ),
        'dummy text'
    RETURNING id
), matched AS (      -- 2
    SELECT
        d.id AS data_id,
        i.id AS content_id
    FROM (
        SELECT
            id,
            row_number() OVER ()
        FROM data
    ) d
    JOIN (
        SELECT
            id,
            row_number() OVER ()
        FROM inserted
    ) i ON i.row_number = d.row_number
)                    -- 3
UPDATE data d
SET content_id = s.content_id
FROM (
    SELECT * FROM matched
) s
WHERE d.id = s.data_id;
Executing several statements one after another, using the results of the previous one, can be achieved with WITH clauses (CTEs):
Insert data into the content table: this generates an integer series starting at MAX(id) + 1 of the current content ids and has as many records as the data table. Afterwards the new ids are returned.
Now we need to match the current records of the data table with the new ids. For both sides we use the row_number() window function to generate a consecutive row count for each record. Because the insert result and the data table have the same number of records, this can be used as the join criterion. So we can match the id column of the data table with the new content ids.
This matched data can be used in the final update of the new content_id column.
Step 3: Add the NOT NULL constraint
ALTER TABLE data ALTER COLUMN content_id SET NOT NULL;
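The question also asks for content_id to be an actual foreign key to the content table. As a hedged final touch (assuming content.id is, or is made, the primary key of content; the constraint name is illustrative), the constraint can be added once the column is populated:
ALTER TABLE data
    ADD CONSTRAINT data_content_id_fkey
    FOREIGN KEY (content_id) REFERENCES content (id);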

Hive: merge or tag multiple rows based on neighboring rows

I have the following table and want to merge multiple rows based on neighboring rows.
INPUT
EXPECTED OUTPUT
The logic is that since "abc" is connected to "abcd" in the first row and "abcd" is connected to "abcde" in the second row and so on, "abc", "abcd", "abcde", "abcdef" are all connected and put in one array. The same applies to the rest of the rows. The number of connected neighboring rows is arbitrary.
The question is how to do that using a Hive script without any UDF. Do I have to use Spark for this type of operation? Thanks very much.
One idea I had is to tag rows first as
How to do that using Hive script only?
This is an example of a CONNECT BY query, which is not supported in Hive or Spark, unlike DB2 or Oracle, et al.
You can simulate such a query with Spark Scala, but it is far from handy. Putting a tag in means the question is less relevant then, imo.
Here is a work-around using Hive script to get the intermediate table.
drop table if exists step1;
create table step1 STORED as orc as
with src as
(
  select split(u.tmp, ",")[0] as node_1, split(u.tmp, ",")[1] as node_2
  from
  (
    select stack(7,
      "abc,abcd",
      "abcd,abcde",
      "abcde,abcdef",
      "bcd,bcde",
      "bcde,bcdef",
      "cdef,cdefg",
      "def,defg"
    ) as tmp
  ) u
)
select node_1,
       node_2,
       if(node_2 = lead(node_1, 1) over (order by node_1), 1, 0) as tag,
       row_number() over (order by node_1) as row_num
from src;
drop table if exists step2;
create table step2 STORED as orc as
SELECT tag, row_number() over (ORDER BY tag) as row_num
FROM (
  SELECT cast(v.tag as int) as tag
  FROM (
    SELECT
      split(regexp_replace(repeat(concat(cast(key as string), ","), end_idx - start_idx), ",$", ""), ",") as tags  -- repeat the row number by the number of rows
    FROM (
      SELECT COALESCE(lag(row_num, 1) over (ORDER BY row_num), 0) as start_idx,
             row_num as end_idx,
             row_number() over (ORDER BY row_num) as key
      FROM step1
      WHERE tag = 0
    ) a
  ) b
  LATERAL VIEW explode(tags) v as tag
) c;
drop table if exists step3;
create table step3 STORED as orc as
SELECT a.node_1, a.node_2, b.tag
FROM step1 a
JOIN step2 b
  ON a.row_num = b.row_num;
The final table looks like
select * from step3;
+---------------+---------------+------------+
| step3.node_1 | step3.node_2 | step3.tag |
+---------------+---------------+------------+
| abc | abcd | 1 |
| abcd | abcde | 1 |
| abcde | abcdef | 1 |
| bcd | bcde | 2 |
| bcde | bcdef | 2 |
| cdef | cdefg | 3 |
| def | defg | 4 |
+---------------+---------------+------------+
The third column can be used to collect node pairs.
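As a hedged follow-up sketch (the step4 table name and final layout are illustrative, not part of the original answer), the pairs can be rolled up into one array per group with collect_set:
drop table if exists step4;
create table step4 STORED as orc as
select tag,
       collect_set(node) as nodes   -- distinct nodes of one connected group
from (
  select tag, node_1 as node from step3
  union all
  select tag, node_2 as node from step3
) t
group by tag;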

SQL Postgres Invalidate Rows that reference invalid Id's

I am trying to create a stored procedure that will invalidate rows containing references to ids that do not exist in another table. The catch is that the rows to be invalidated contain groupings of these ids stored as a comma-separated string. Let's take a look at the tables:
table_a               table_b
+----+------+         +---------+-------+
| id | name |         | ids     | valid |
+----+------+         +---------+-------+
| 1  | a    |         | 1,2,3   | T     |
| 2  | b    |         | 4,3,8   | T     |
| 3  | c    |         | 5,2,5,4 | T     |
| 4  | d    |         | 7       | T     |
| 5  | e    |         | 6,8     | T     |
| 6  | f    |         | 9,7,2   | T     |
| 7  | g    |         +---------+-------+
| 8  | h    |
+----+------+
Above you can see that table_b contains groupings of ids from table_a and, as you can imagine, table_a.id is an integer while table_b.ids is text. The goal is to look at each table_b.ids value and, if it contains an id that does not exist in table_a.id, set its validity to false.
I have not worked with any SQL in quite some time and I have never worked with PostgreSQL, which is why I am having such difficulty. The closest query I could come up with is not working, but is along the lines of:
CREATE FUNCTION cleanup_records() AS $func$
BEGIN
  UPDATE table_b
  SET valid = FALSE
  WHERE COUNT(
    SELECT regexp_split_to_table(table_b.ids)
    EXCEPT SELECT id FROM table_a
  ) > 0;
END;
$func$ LANGUAGE PLPGSQL;
The general idea is that I am trying to turn each row of table_b.ids into a table and then using the EXCEPT operator against table_a to see if it has any ids that are invalid. The error I receive is:
ERROR: syntax error at or near "SELECT"
LINE 1: ...able_b SET valid = FALSE WHERE COUNT(SELECT reg...
which is not very helpful, as it just indicates that I do not have the correct syntax. Is this query viable? If so, can you show me where I may have gone wrong; if not, is there an easier or even more complicated way to achieve this?
Sample data:
CREATE TABLE table_b
(ids text, valid boolean);
INSERT INTO table_b
(ids, valid)
VALUES
('1,2,3' , 'T'),
('4,3,8' , 'T'),
('5,2,5,4' , 'T'),
('7' , 'T'),
('6,8' , 'T'),
('9,7,2' , 'T');
CREATE TABLE table_a
(id integer, name text);
INSERT INTO table_a
(id, name)
VALUES
(1,'a'),
(2,'b'),
(3,'c'),
(4,'d'),
(5,'e'),
(6,'f'),
(7,'g'),
(8,'h');
UPDATE table_b
SET valid = FALSE
WHERE EXISTS (
  SELECT regexp_split_to_table(table_b.ids, ',')::integer
  EXCEPT SELECT id FROM table_a
);
You can use 'exists' to check for the existence of a row. The previous syntax was incorrect as count can't be used that way.
groupings of these id's stored as a comma separated string
Don't do that. It's really bad database design, and is why you're having problems. See:
Is using multiple foreign keys separated by commas wrong, and if so, why?
PostgreSQL list of integers separated by comma or integer array for performance?
Also, there's a more efficient way to do your query than that shown by vkp. If you do it that way, you're splitting the string for every ID you're testing. There is no need to do that. Instead, join on a table of expanded ID lists.
Something like:
UPDATE table_b
SET valid = 'f'
FROM table_b b
CROSS JOIN regexp_split_to_table(b.ids, ',') b_ids(id)
LEFT JOIN table_a a ON (a.id = b_ids.id::integer)
WHERE table_b.ids = b.ids
AND a.id IS NULL
AND table_b.valid = 't';
You need to join on table_b even though it's the update target because you can't make a lateral function reference to the update target table directly.
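As a hedged illustration of the normalized design the linked questions recommend (the junction table and column names here are made up, and it assumes table_a.id is the primary key and table_b gets its own key column):
-- one row per (group, member) pair instead of a comma-separated string
CREATE TABLE table_b_members (
    table_b_id integer NOT NULL,   -- key of the group row in table_b (assumed column)
    table_a_id integer NOT NULL REFERENCES table_a (id)
);
With that layout the validity check becomes a plain anti-join, and the foreign key prevents dangling ids from being stored in the first place.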

update a table from another table and add new values

How would I go about updating a table by using another table so it puts in the new data, and if it doesn't match on an id, it adds the new id and the data with it? My original table is much bigger than the new table that will update it, and the new table has a few ids that aren't in the old table but need to be added.
For example, I have:
Table being updated:
+-------------------+
|  Original Table   |
+-------------------+
|  ID  |  Initials  |
|------+------------|
|  1   |  ABC       |
|  2   |  DEF       |
|  3   |  GHI       |
and...
The table I'm pulling data from to update the other table:
+-------------------+
|     New Table     |
+-------------------+
|  ID  |  Initials  |
|------+------------|
|  1   |  XZY       |
|  2   |  QRS       |
|  3   |  GHI       |
|  4   |  ABC       |
Then I want my original table's matching values to be updated by the new table if they have changed, and any new ID rows added if they aren't in the original table, so in this example it would end up looking like the New Table:
+-------------------+
|  Original Table   |
+-------------------+
|  ID  |  Initials  |
|------+------------|
|  1   |  XZY       |
|  2   |  QRS       |
|  3   |  GHI       |
|  4   |  ABC       |
You can use a MERGE statement to put this UPSERT operation in one statement, but there are issues with MERGE, so I would split it into two statements, UPDATE and INSERT:
UPDATE
UPDATE O
SET O.Initials = N.Initials
FROM Original_Table O INNER JOIN New_Table N
ON O.ID = N.ID
INSERT
INSERT INTO Original_Table (ID, Initials)
SELECT ID, Initials
FROM New_Table
WHERE NOT EXISTS ( SELECT 1
                   FROM Original_Table
                   WHERE Original_Table.ID = New_Table.ID)
Important Note
For the reason why I suggested avoiding the MERGE statement, read the article Use Caution with SQL Server's MERGE Statement by Aaron Bertrand.
You need to use the MERGE statement for this:
MERGE original_table AS Target
USING updated_table AS Source
ON Target.id = Source.id
WHEN MATCHED THEN UPDATE SET Target.Initials = Source.Initials
WHEN NOT MATCHED THEN INSERT(id, Initials) VALUES(Source.id, Source.Initials);
You have not specified what happens in case the values in the original table are not found in the updated one. But, just in case, you can add this to remove them from the original table:
WHEN NOT MATCHED BY SOURCE
THEN DELETE
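For reference, a hedged sketch of the full statement with all three branches combined, using the question's table names (only keep the DELETE branch if removing rows that are absent from the source is really intended):
MERGE Original_Table AS Target
USING New_Table AS Source
    ON Target.ID = Source.ID
WHEN MATCHED THEN
    UPDATE SET Target.Initials = Source.Initials
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ID, Initials) VALUES (Source.ID, Source.Initials)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;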
If you can, use a loop in PHP to go through all the rows and copy them one by one to the other table.
Another option:
DECLARE @COUT INT
DECLARE @id INT
DECLARE @ini VARCHAR(20)
SET @COUT = (SELECT COUNT(*) FROM New_Table)
WHILE (1 = 1)
BEGIN
    IF @COUT = 0
        BREAK;
    SET @COUT = @COUT - 1
    -- pick the row at the current (0-based) position
    SELECT @id = t.id, @ini = t.Initials
    FROM (
        SELECT id, Initials
        FROM New_Table
        ORDER BY id
        OFFSET @COUT ROWS FETCH NEXT 1 ROWS ONLY
    ) t;
    IF (SELECT COUNT(*) FROM Original_Table WHERE id = @id) > 0
        UPDATE Original_Table SET Initials = @ini WHERE id = @id;
    ELSE
        INSERT INTO Original_Table (id, Initials) VALUES (@id, @ini);
END
GO

DB2 large update from another table

I have a table with 600 000+ rows called asset. The customer has added a new column and would like it populated with a value from another table:
ASSET                             TEMP
| id   | ... | newcol |           | id   | condition |
------------------------          --------------------
| 0001 | ... |   -    |           | 0001 |     3     |
If I try to update it all at once, it times out or claims there is a deadlock:
update asset set newcol = (
select condition from temp where asset.id = temp.id
) where newcol is null;
The way I got around it was by only doing 100 rows at a time:
update (select id, newcol from asset where newcol is null
fetch first 100 rows only) a1
set a1.newcol = (select condition from temp a2 where a1.id = a2.id);
At the moment I am making good use of the copy/paste utility, but I'd like to know of a more elegant way to do it (as well as a faster way).
I have tried putting it in a PL/SQL loop but I can't seem to get it to work with DB2 as a standalone script.
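A hedged sketch of one way to loop the 100-row batches server-side, assuming DB2 for LUW 9.7+ (anonymous compound SQL blocks); run it with a non-default statement terminator such as @, and adjust or drop the intermediate COMMIT depending on your environment:
BEGIN
  DECLARE rows_done INTEGER DEFAULT 1;
  WHILE rows_done > 0 DO
    UPDATE (SELECT id, newcol FROM asset WHERE newcol IS NULL
            FETCH FIRST 100 ROWS ONLY) a1
       SET a1.newcol = (SELECT condition FROM temp a2 WHERE a1.id = a2.id);
    GET DIAGNOSTICS rows_done = ROW_COUNT;
    COMMIT;
  END WHILE;
END@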