BigQuery long-running query - ARRAY_AGG

We are executing the query below to identify the latest record per key against a table that is over 1.4 TB in size, but the query takes more than 20 minutes to complete. Checking the execution details shows that the compute step is consuming most of the time:
SELECT
  row.col1,
  row.col2,
  row.col3...
FROM (
  SELECT
    ARRAY_AGG(t ORDER BY t.timestamp DESC LIMIT 1)[OFFSET(0)] AS row
  FROM `mytable` t
  GROUP BY col1, col2, col3, col4
)
What could be the reason behind this?
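Not from the thread, but useful while diagnosing: the same "latest row per key" result can be written with ROW_NUMBER and QUALIFY, which skips materializing a whole-row STRUCT per group and may be worth benchmarking against the ARRAY_AGG form. A sketch, assuming the column names from the query above:
#standardSQL
SELECT *
FROM `mytable` t
WHERE TRUE -- BigQuery requires a WHERE/GROUP BY/HAVING clause alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY col1, col2, col3, col4
  ORDER BY t.timestamp DESC
) = 1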

Related

How do you reference the previous row in a WHERE clause?

I am trying to figure out how to get rid of results that occur close together. For example, the rows have a creation timestamp (source_time), and I want to remove results that occur within 10 seconds of each other.
I thought lag() might do it, but I can't use that in the where clause.
select *
from table
where source_time - previous(source_time) >= 10 second
Very rough code, but I am not sure how to reference the previous source time. I have converted them to timestamps and used timestamp_diff(source_time, x, second) >= 10, but I am not sure how to make x the previous value.
Hopefully this is clear.
You can do this with subqueries.
delete from my_table t1
where t1.id in (
  select t2.id
  from (
    select
      id,
      source_time - lag(source_time) over (order by source_time) as time_diff
    from my_table
  ) t2
  where t2.time_diff < interval '10 seconds'
)
Keep in mind this can potentially leave large gaps in your records. For example, if you get a row every 9 seconds for an hour, you'll delete all but the first record in that hour, since every row is within 10 seconds of the one before it.
You might instead bucket source_time into 10-second windows and delete anything with a row_number > 1 within each window.
delete from my_table t1
where t1.id in (
  select t2.id
  from (
    select
      id,
      source_time,
      row_number() over (
        -- align each row to the start of its 10-second bucket
        partition by source_time - make_interval(secs => floor(extract(second from source_time))::int % 10)
        order by source_time asc
      ) rownum
    from my_table
  ) t2
  where rownum > 1
)
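The statements above are PostgreSQL-flavored (make_interval); since the question's timestamp_diff suggests BigQuery, here is a minimal sketch of the lag-based delete in BigQuery SQL (the table name and id column are assumed):
DELETE FROM `my_table`
WHERE id IN (
  SELECT id
  FROM (
    SELECT
      id,
      -- seconds between this row and the previous one
      TIMESTAMP_DIFF(
        source_time,
        LAG(source_time) OVER (ORDER BY source_time),
        SECOND
      ) AS diff_seconds
    FROM `my_table`
  )
  WHERE diff_seconds < 10
)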

Oracle SQL query to GROUP BY, ORDER BY and delete the oldest records per ID

I want to write an Oracle SQL query that keeps the three latest records per MACHINE_ID, ordered by TIMESTAMP, and deletes the rest.
I want to know how efficiently I can do that. I hope you understand my question!
Below is the table, for example. All the records with USERFILE = 0 can be filtered out in the SQL query.
Result after grouping by MACHINE_ID and sorting by TIMESTAMP desc:
After keeping the three latest records per MACHINE_ID and deleting the oldest records, this should be the final result.
One method is:
delete from t
where t.timestamp not in (
  select t2.timestamp
  from t t2
  where t2.machine_id = t.machine_id
  order by t2.timestamp desc
  fetch first 3 rows only
);
For performance, you want an index on (machine_id, timestamp desc).
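For reference, a sketch of that index (the index name is made up):
CREATE INDEX t_machine_ts_idx ON t (machine_id, timestamp DESC);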
You can number the rows per machine and then delete all rows with a number greater than 3. Ideally we could simply delete from a query, but I'm getting ORA-01732: data manipulation operation not legal on this view when trying this in Oracle 19c.
Hence we need two steps:
find the rows
delete the rows
The statement below uses rowid to access the rows again quickly:
delete from mytable
where rowid in
(
  select rid
  from
  (
    select
      rowid as rid,
      row_number() over (partition by machine_id order by timestamp desc) as rn
    from mytable
  )
  where rn > 3
);

How to set an updating row's field with the value of the closest row by date?

I have a huge table with 2m+ rows.
The structure is like this:
ThingName (STRING),
Date (DATE),
Value (INT64)
Sometimes Value is null, and I need to fix it by setting it to the non-null Value of the row closest to it by Date for the same ThingName...
And I am totally not an SQL guy.
I tried to describe my task with this query (simplified a lot by only using previous dates, though actually I need to check future dates too):
update my_tbl as SDP
set SDP.Value = (
  select SDPI.Value
  from my_tbl as SDPI
  where SDPI.Date < SDP.Date
    and SDP.ThingName = SDPI.ThingName
    and SDPI.Value is not null
  order by SDPI.Date desc
  limit 1
)
where SDP.Value is null;
There I try to set the updating row's Value to one that I select from the same table, for the same ThingName, and with limit 1 I keep only a single result.
But the query editor tells me this:
Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
Actually, I am not sure at all that my task can be solved with just a query.
So, can anyone help me? If this is impossible, tell me so; if it is possible, tell me which SQL constructs may help me.
Below is for BigQuery Standard SQL
In many (if not most) cases you don't want to update your table (as it incurs the extra cost and limitations associated with DML statements) but rather can adjust 'missing' values in-query, as in the example below:
#standardSQL
SELECT
  ThingName,
  date,
  IFNULL(value,
    LAST_VALUE(value IGNORE NULLS)
      OVER (PARTITION BY ThingName ORDER BY date)
  ) AS value
FROM `project.dataset.my_tbl`
If for some reason you actually need to update the table, the statement above will not help, as DML's UPDATE does not allow the use of analytic functions, so you need another approach, for example the one below:
#standardSQL
SELECT
  t1.ThingName,
  t1.date,
  ARRAY_AGG(t2.Value IGNORE NULLS ORDER BY t2.date DESC LIMIT 1)[OFFSET(0)] AS value
FROM `project.dataset.my_tbl` AS t1
LEFT JOIN `project.dataset.my_tbl` AS t2
  ON t2.ThingName = t1.ThingName
  AND t2.date <= t1.date
GROUP BY t1.ThingName, t1.date, t1.value
And now you can use it to update your table, as in the example below:
#standardSQL
UPDATE `project.dataset.my_tbl` t
SET value = new_value
FROM (
  SELECT
    TO_JSON_STRING(t1) AS id,
    ARRAY_AGG(t2.Value IGNORE NULLS ORDER BY t2.date DESC LIMIT 1)[OFFSET(0)] AS new_value
  FROM `project.dataset.my_tbl` AS t1
  LEFT JOIN `project.dataset.my_tbl` AS t2
    ON t2.ThingName = t1.ThingName
    AND t2.date <= t1.date
  GROUP BY id
)
WHERE TO_JSON_STRING(t) = id
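A note on the design: TO_JSON_STRING(t1) serves as a synthetic row key here because the table has no declared unique key; if your real table has an id column, joining and matching on that instead should be cheaper than serializing every row to JSON.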
In BigQuery, updates are rather rare. The logic you seem to want is:
select t.*,
  coalesce(value,
    -- BigQuery's LAG does not support IGNORE NULLS, so use LAST_VALUE instead
    last_value(value ignore nulls) over (partition by thingname order by date)
  ) as value
from my_tbl t;
I don't really see a reason to save this back in the table.
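One gap in both answers: the question also asks about future dates, for rows that have no earlier non-null Value. A minimal sketch extending the in-query approach in that direction, assuming the same schema (this prefers the latest earlier value and only falls back to the next later one; picking whichever is strictly closer in time would need an explicit distance comparison):
#standardSQL
SELECT
  ThingName,
  date,
  COALESCE(
    value,
    -- last non-null value at or before this date
    LAST_VALUE(value IGNORE NULLS)
      OVER (PARTITION BY ThingName ORDER BY date),
    -- otherwise the first non-null value after it
    FIRST_VALUE(value IGNORE NULLS)
      OVER (PARTITION BY ThingName ORDER BY date
            ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
  ) AS value
FROM `project.dataset.my_tbl`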

BigQuery - removing duplicate records sometimes takes long

We implemented the following ETL process in the cloud: run a query in our local database hourly => save the result as CSV and load it into Cloud Storage => load the file from Cloud Storage into a BigQuery table => remove duplicate records using the following query.
SELECT
  * EXCEPT (row_number)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) AS row_number
  FROM rawData.stock_movement
)
WHERE row_number = 1
Since 8 am this morning (local time in Berlin) the process of removing duplicate records has been taking much longer than it usually does, even though the amount of data is not much different than usual: removing duplicates usually takes about 10 s, whereas this morning it sometimes took half an hour.
Is the performance of removing duplicate records not stable?
It could be that you have many duplicate values for a particular id, so computing the row numbers takes a long time. If you want to check whether this is the case, you can try:
#standardSQL
SELECT id, COUNT(*) AS id_count
FROM rawData.stock_movement
GROUP BY id
ORDER BY id_count DESC LIMIT 5;
With that said, it may be faster to remove duplicates with this query instead:
#standardSQL
SELECT latest_row.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
  FROM rawData.stock_movement AS t
  GROUP BY t.id
);
Here is an example:
#standardSQL
WITH T AS (
  SELECT 1 AS id, 'foo' AS x, TIMESTAMP '2017-04-01' AS timestamp UNION ALL
  SELECT 2, 'bar', TIMESTAMP '2017-04-02' UNION ALL
  SELECT 1, 'baz', TIMESTAMP '2017-04-03')
SELECT latest_row.*
FROM (
  -- query the sample data defined in the CTE above, not the real table
  SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
  FROM T AS t
  GROUP BY t.id
);
The reason that this may be faster is that BigQuery will only keep the row with the largest timestamp in memory at any given point in time.

Oracle SQL: Slicing big results into chunks

I have a large table (too large to query at one time). I need an efficient strategy to "slice" the results into chunks, allowing incremental updates and avoiding timeouts.
I guess there is a smarter solution than
SELECT
  tbl1.ID,
  tbl2.*
FROM
  (SELECT * FROM FOOUSER.TABLE1 ORDER BY ID) tbl1
JOIN
  FOOUSER.TABLE2 tbl2 ON tbl1.ID = tbl2.ID2
WHERE
  tbl1.ID > :LASTMAXVALUE
  AND ROWNUM <= 1000
ORDER BY
  tbl1.ID;
... with :LASTMAXVALUE being the maximum value of ID from the previous chunk, and ROWNUM <= 1000 giving chunks of 1000 rows.
Thanks in advance.
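A possible improvement, since ROWNUM is assigned before the outer ORDER BY and therefore does not reliably pick the 1000 lowest remaining IDs: on Oracle 12c+ the row limit can be applied after the sort with FETCH FIRST. A sketch, assuming ID is indexed and unique:
SELECT
  tbl1.ID,
  tbl2.*
FROM FOOUSER.TABLE1 tbl1
JOIN FOOUSER.TABLE2 tbl2
  ON tbl1.ID = tbl2.ID2
WHERE tbl1.ID > :LASTMAXVALUE
ORDER BY tbl1.ID
FETCH FIRST 1000 ROWS ONLY;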