How to delete records in BigQuery based on values in an array? - google-bigquery

In Google BigQuery, I would like to delete a subset of records, based on the value of a specific column. It's a query that I need to run repeatedly and that I would like to run automatically.
The problem is that this specific column is of the form STRUCT<column_1 ARRAY (STRING), column_2 ARRAY (STRING), ... >, and I don't know how to use such a column in the where-clause when using the delete-command.
Here is basically what I am trying to do (this code does not work):
DELETE
FROM dataset.table t
LEFT JOIN UNNEST(t.category.column_1) AS type
WHERE t.partition_date = '2020-07-22'
AND type = 'some_value'
The error that I'm getting is: Syntax error: Expected end of input but got keyword LEFT at [3:1]
If I replace the DELETE with SELECT *, it does work:
SELECT *
FROM dataset.table t
LEFT JOIN UNNEST(t.category.column_1) AS type
WHERE t.partition_date = '2020-07-22'
AND type = 'some_value'
Does somebody know how to use such a column to delete a subset of records?
EDIT:
Here is some code to create a reproducible example with some silly data (fill in your own dataset and table name in all queries):
Suppose you want to delete all rows where category.type contains the value 'food'.
1 - create a table:
CREATE TABLE <DATASET>.<TABLE_NAME>
(
article STRING,
category STRUCT<
color STRING,
type ARRAY<STRING>
>
);
2 - Insert data into the new table:
INSERT <DATASET>.<TABLE_NAME>
SELECT "apple" AS article, STRUCT('red' AS color, ['fruit','food'] as type) AS category
UNION ALL
SELECT "cabbage" AS article, STRUCT('blue' AS color, ['vegetable', 'food'] as type) AS category
UNION ALL
SELECT "book" AS article, STRUCT('red' AS color, ['object'] as type) AS category
UNION ALL
SELECT "dog" AS article, STRUCT('green' AS color, ['animal', 'pet'] as type) AS category;
3 - Show that select works (return all rows where category.type contains the value 'food'; these are the rows I want to delete):
SELECT *
FROM <DATASET>.<TABLE_NAME>
LEFT JOIN UNNEST(category.type) type
WHERE type = 'food'
Initial Result
4 - My attempt at deleting rows where category.type contains 'food' does not work:
DELETE
FROM <DATASET>.<TABLE_NAME>
LEFT JOIN UNNEST(category.type) type
WHERE type = 'food'
Syntax error: Unexpected keyword LEFT at [3:1]
Desired Result

This is the code I used to delete the desired records (the records where category.type contains the value 'food'.)
DELETE
FROM <DATASET>.<TABLE_NAME> t1
WHERE EXISTS(SELECT 1 FROM UNNEST(t1.category.type) t2 WHERE t2 = 'food')
The embarrasing thing is that I've seen these kind of answers on similar questions (for example on update-queries). But I come from Oracle-SQL and I think that there you are required to connect your subquery with your main query in the WHERE-statement of the subquery (ie. connect t1 with t2), so I didn't understand these answers. That's why I posted this question.
However, I learned that BigQuery automatically understands how to connect table t1 and 'table' t2; you don't have to explicitly connect them.
Now it is possible to still do this (perhaps even recommended?):
DELETE
FROM <DATASET>.<TABLE_NAME> t1
WHERE EXISTS (SELECT 1 FROM <DATASET>.<TABLE_NAME> t2 LEFT JOIN UNNEST(t2.category.type) AS type WHERE type = 'food' AND t1.article=t2.article)
but a second difficulty for me was that my ID in my actual data is somehow hidden in an array>struct-construction, so I got stuck connecting t1 & t2. Fortunately this is not always an absolute necessity.

Since you did not provide any sample data I am going to explain using some dummy data. In case you add your sample data, I can update the answer.
Firstly,according to your description, you have only a STRUCT not an Array[Struct <col_1, col_2>].For this reason, you do not need to use UNNEST to access the values within the data. Below is an example how to access particular data within a STRUCT.
WITH data AS (
SELECT 1 AS id, STRUCT("Alex" AS name, 30 AS age, "NYC" AS city) AS info UNION ALL
SELECT 1 AS id, STRUCT("Leo" AS name, 18 AS age, "Sydney" AS city) AS info UNION ALL
SELECT 1 AS id, STRUCT("Robert" AS name, 25 AS age, "Paris" AS city) AS info UNION ALL
SELECT 1 AS id, STRUCT("Mary" AS name, 28 AS age, "London" AS city) AS info UNION ALL
SELECT 1 AS id, STRUCT("Ralph" AS name, 45 AS age, "London" AS city) AS info
)
SELECT * FROM data
WHERE info.city = "London"
Notice that the STRUCT is named info and the data we accessed is city and used it in the WHERE clause.
Now, in order to delete the rows that contains an specific value within the STRUCT , in your case I assume it would be your_struct.column_1, you can use DELETE or MERGE and DELETE. I have saved the above data in a table to execute the below examples, which have the same output,
First method: DELETE
DELETE FROM `project.dataset.table`
WHERE info.city = "Sydney"
Second method: MERGE and DELETE
MERGE `project.dataset.table` a
USING (SELECT * from `project.dataset.table` WHERE info.city ="London") b
ON a.info.city =b.info.city
WHEN matched and b.id=1 then
Delete
And the output for both queries,
Row id info.name info.age info.city
1 1 Alex 30 NYC
2 1 Robert 25 Paris
3 1 Ralph 45 London
4 1 Mary 28 London
As you can see the row where info.city = "Sydney" was deleted in both cases.
It is important to point out that your data is excluded from your source table. Therefore, you should be careful.
Note: Since you want to run this process everyday, you could use Schedule Query within BigQuery Console, appending or overwriting the results after each run. Also, it is a good practice not deleting data from your source table. Thus, consider creating a new table from your source table without the rows you do not desire.

Related

TypeORM & Postgres: Count only unique distinct values from multiple columns

I have various SQL queries, which return me unique / distinct value from DB, (or count them),
like:
SELECT buyer as counterparty
FROM public.order
UNION
SELECT seller as counterparty
FROM public.order
or
SELECT COUNT(*)
FROM (
SELECT DISTINCT p
FROM public.order
CROSS JOIN LATERAL (VALUES(seller),(buyer)) AS C(p)
) AS internalQuery
Example structure of my table:
id buyer seller
0 A B
1 B A
2 B D
3 D A
4 A D
Desired result:
3 or A,B,D
I'd like to rewrite them with TypORM query builder, but I can't figure out, how to replace CROSS JOIN LATERAL (VALUES(seller),(buyer)) AS C(p) or UNION in my case. TypeORM is pretty poor with examples and doc coverage in this case.
Does there any option with that?
I have seen various methods like .getCount and .distinct(true) which could help me and easily find the solution for one column.
So I understood, that if I want to find the exact number, instead of doc results, I should use .getCount instead of .getMany
But I can't understand, how to select (and unite) values from multiple columns via typeORM to receive distinct values from multiple columns.
I am working with PostgrSQL, so when I am trying:
const query = repository.createQueryBuilder('order')
.distinctOn(['buyer', 'seller'])
.limit(100)
.getMany()
I receive docs with each distinct value in each field, so instead of 3 I get 6 values (3 distinct by column1, and 3 by column2)

Query for a json property with the highest (max) date as value in a jsonb column

I have a postgres table with a jsonb column that gets inserted/updated as soon as a status is reached. I would like to query the latest status with it's date.
Given I have the following jsonb status column.
{
"pending": "2018-01-12T12:34:41.785945+00:00",
"started":"2018-01-10T15:52:41.785945+00:00",
"processed":"2018-01-18T12:52:41.785945+00:00"
}
What would be the query to get?
"processed":"2018-01-18T12:52:41.785945+00:00"...
Basically select the property with the max date, Is this possible at all? And if so how would look the query?
EDIT: the Status JSON can have different properties at different times. So it not always the 3 'pending','started','processed'. There could only be pending or in future others.... The question is the latest status with it's date.
Let's create the table with couple of sample data.
DROP TABLE IF EXISTS test;
CREATE TABLE test(
id integer,
status json
);
INSERT INTO test VALUES (1, '{
"pending": "2018-01-12T12:34:41.785945+00:00",
"started":"2018-01-10T15:52:41.785945+00:00",
"processed":"2018-01-18T12:52:41.785945+00:00"
}');
INSERT INTO test VALUES (2, '{
"pending": "2018-01-12T12:31:41.785945+00:00",
"started":"2018-01-20T15:55:41.785945+00:00",
"processed":"2018-01-10T12:20:41.785945+00:00"
}');
I managed to get the max date property by extracting values from json to a separate table, and applying typical max function and case statement to get corresponding property of interest. This probably defeats the purpose of using json type (maybe try json_type(json) instead?)
--- unpack values and put them into different columns
WITH t1 AS(
SELECT id, status->>'pending' AS pending,
status->>'started' AS started,
status->>'processed' AS processed
FROM test),
--- find maximum time corresponding to each_id
t2 AS (
SELECT id, GREATEST(t1.pending, t1.processed, t1.started) AS latest
FROM t1
)
--- join to get matching column that gave latest time
SELECT t1.id, CASE WHEN t1.pending = t2.latest THEN 'pending'
WHEN t1.started = t2.latest THEN 'started'
WHEN t1.processed = t2.latest THEN 'processed'
END AS max_status,
t2.latest AS max_time
FROM t1
JOIN t2
ON t1.id = t2.id
;
You can carry along other columns of interest for each id. This results
id | max_status | max_time
----+------------+----------------------------------
1 | processed | 2018-01-18T12:52:41.785945+00:00
2 | started | 2018-01-20T15:55:41.785945+00:00
(2 rows)
EDIT::
After your edit, I made some changes to take care of arbitrary presence/absence of particular keys in your json field. Let's use json_each(json) to unpack key, value for each id to different columns
SELECT id, (json_each(status)).*
FROM test
WHERE id=1;
id | key | value
----+-----------+------------------------------------
1 | pending | "2018-01-12T12:34:41.785945+00:00"
1 | started | "2018-01-10T15:52:41.785945+00:00"
1 | processed | "2018-01-18T12:52:41.785945+00:00"
(3 rows)
Not quite what we want but almost there. First thing to notice is the column names key, value which we can use to transform data. Second, values are json objects and therefore we have to take care of double quotations after casting to text. trim handles this nicely. Rest is to partition your data by id, get maximum time, and filter rows to get corresponding status.
WITH t1 AS(
SELECT id, key as status, trim(both '"' from value::text) as time_of
FROM test, json_each(status)
),
t2 as(
SELECT id, status,
to_timestamp(time_of, 'YYYY-MM-DD"T"HH24:MI:SS') as time_of,
MAX(to_timestamp(time_of, 'YYYY-MM-DD"T"HH24:MI:SS'))
OVER(PARTITION by id) AS max_time
FROM t1)
SELECT id, status, max_time
FROM t2
WHERE time_of = max_time;

SQL unpivot & insert

Sorry for the lack of info -- SQL Server 2008.
I'm struggling to get a couple of column values from table A into a new row in table B for each row in A where a column isn't null.
Table A's structure is as:
UserID | ClientUserID | ClientSessionID | [and a load of other irrelevant columns)
Table B:
UserID | Name | Value
I want to create rows in table B for each non-null ClientUserID or ClientSessionID in A - using the column name as B's "Name", and column value as "B's Value".
I'm struggling to write my "unpivot" statement - just getting the syntax correct! I'm trying to follow along with some samples but can't
Here's my SQL query so far - any further help would be appreciated (just getting this SELECT is frustrating me, let alone doing the insert!)
SELECT UserID, ClientUserID, ClientSessionID FROM websiteuser WHERE ClientSessionID IS NOT null
This gives me the rows that I need to perform actions upon -- but I just can't get the syntax correct for UNPIVOTing this data and turning it into my insert.
You can unpivot records in this fashion by using UNION to get each new row:
INSERT INTO TableB (UserID, Name, Value)
SELECT UserID, 'ClientUserID' AS Name, ClientUserID AS Value
FROM TableA
WHERE ClientUserID IS NOT NULL
UNION ALL
SELECT UserID, 'ClientSessionID' AS Name, ClientSessionID AS Value
FROM TableA
WHERE ClientSessionID IS NOT NULL
I am using UNION ALL in this case as UNION implies a DISTINCT operation across the entire set, which should normally be unnecessary when pivoting unique records.
If your ClientUserID and ClientSessionID columns are not the same datatype, you may have to cast one or both to the same.

How to get one common value from Database using UNION

2 records in above image are from Db, in above table Constraint are (SID and LINE_ITEM_ID),
SID and LINE_ITEM_ID both column are used to find a unique record.
My issues :
I am looking for a query it should fetch the recored from DB depending on conditions
if i search for PART_NUMBER = 'PAU43-IMB-P6'
1. it should fetch one record from DB if search for PART_NUMBER = 'PAU43-IMB-P6', no mater to which SID that item belong to if there is only one recored either under SID =1 or SID = 2.
2. it should fetch one record which is under SID = 2 only, from DB on search for PART_NUMBER = 'PAU43-IMB-P6', if there are 2 items one in SID=1 and other in SID=2.
i am looking for a query which will search for a given part_number depending on Both SID 1 and 2, and it should return value under SID =2 and it can return value under SID=1 only if the there are no records under SID=2 (query has to withstand a load of Million record search).
Thank you
Select *
from Table
where SID||LINE_ITEM_ID = (
select Max(SID)||Max(LINE_ITEM_ID)
from table
where PART_NUMBER = 'PAU43-IMB-P6'
);
If I understand correctly, for each considered LINE_ITEM_ID you want to return only the one with the largest value for SID. This is a common requirement and, as with most things in SQL, can be written in many different ways; the best performing will depend on many factors, not least of which is the SQL product you are using.
Here's one possible approach:
SELECT DISTINCT * -- use a column list
FROM YourTable AS T1
INNER JOIN (
SELECT T2.LINE_ITEM_ID,
MAX(T2.SID) AS max_SID
FROM YourTable AS T2
GROUP
BY T2.LINE_ITEM_ID
) AS DT1 (LINE_ITEM_ID, max_SID)
ON T1.LINE_ITEM_ID = DT1.LINE_ITEM_ID
AND T1.SID = DT1.max_SID;
That said, I don't recall seeing one that relies on the UNION relational operator. You could easily rewrite the above using the INTERSECT relational operator but it would be more verbose.
Well in my case it worked something like this:
select LINE_ITEM_ID,SID,price_1,part_number from (
(select LINE_ITEM_ID,SID,price_1,part_number from Table where SID = 2)
UNION
(select LINE_ITEM_ID,SID,price_1,part_number from Table SID = 1 and line_item_id NOT IN (select LINE_ITEM_ID,SID,price_1,part_number from Table SID = 2)))
This query solved my issue..........

Oracle / SQL - Count number of occurrences of values in a single column

Okay, I probably could have come up with a better title, but wasn't sure how to word it so let me explain.
Say I have a table with the column 'CODE'. Each record in my table will have either 'A', 'B', or 'C' as it's value in the 'CODE' column. What I would like is to get a count of how many 'A's, 'B's, and 'C's I have.
I know I could accomplish this with 3 different queries, but I'm wondering if there is a way to do it with just 1.
Use:
SELECT t.code,
COUNT(*) AS numInstances
FROM YOUR_TABLE t
GROUP BY t.code
The output will resemble:
code numInstances
--------------------
A 3
B 5
C 1
If a code exists that has not been used, it will not show up. You'd need to LEFT JOIN to the table containing the list of codes in order to see those that don't have any references.