We implemented the following ETL process in the cloud: run a query in our local database hourly => save the result as CSV and load it into Cloud Storage => load the file from Cloud Storage into a BigQuery table => remove duplicate records using the following query.
SELECT
  * EXCEPT (row_number)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) row_number
  FROM rawData.stock_movement
)
WHERE row_number = 1
Since 8 am this morning (local time in Berlin), the process of removing duplicate records has been taking much longer than usual, even though the amount of data is not much different from usual: it normally takes about 10 seconds to remove duplicates, whereas this morning it sometimes takes half an hour.
Is the performance of removing duplicate records not stable?
It could be that you have many duplicate values for a particular id, so computing row numbers takes a long time. If you want to check whether this is the case, you can try:
#standardSQL
SELECT id, COUNT(*) AS id_count
FROM rawData.stock_movement
GROUP BY id
ORDER BY id_count DESC LIMIT 5;
With that said, it may be faster to remove duplicates with this query instead:
#standardSQL
SELECT latest_row.*
FROM (
SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
FROM rawData.stock_movement AS t
GROUP BY t.id
);
Here is an example:
#standardSQL
WITH T AS (
SELECT 1 AS id, 'foo' AS x, TIMESTAMP '2017-04-01' AS timestamp UNION ALL
SELECT 2, 'bar', TIMESTAMP '2017-04-02' UNION ALL
SELECT 1, 'baz', TIMESTAMP '2017-04-03')
SELECT latest_row.*
FROM (
SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
FROM T AS t
GROUP BY t.id
);
The reason that this may be faster is that BigQuery will only keep the row with the largest timestamp in memory at any given point in time.
Related
I am optimizing a query in BigQuery that shows non-repeated data. Currently it looks like this, and it works.
select *
from (
  select
    ROW_NUMBER() OVER (PARTITION BY id) as num,
    id,
    created_at,
    operator_id,
    description
  from NAME_TABLE
  where created_at >= '2018-01-01'
)
where num = 1
I wanted to ask whether it is possible to do a GROUP BY over all the columns (it cannot be done in a simple way, since created_at cannot be grouped) and keep the first created_at that appears for each id.
P.S.: a DISTINCT does not work, since there are more than 80 million records (growing by 2 million per day) and it returns repeated data.
Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY created_at LIMIT 1)[OFFSET(0)]
FROM `project.dataset.NAME_TABLE` t
WHERE created_at >='2018-01-01'
GROUP BY id
Instead of processing / returning all columns, you can specify the exact list you need, as in the example below
#standardSQL
SELECT AS VALUE ARRAY_AGG(STRUCT(id,created_at,operator_id,description) ORDER BY created_at LIMIT 1)[OFFSET(0)]
FROM `project.dataset.NAME_TABLE`
WHERE created_at >='2018-01-01'
GROUP BY id
Your query should be fine. But you can do this without a subquery:
select array_agg(nt order by created_at desc limit 1)[ordinal(1)].*
from name_table nt
where created_at >='2018-01-01'
group by id
Suppose I have the following record in BQ:
id name age timestamp
1 "tom" 20 2019-01-01
I then perform two "updates" on this record by using the streaming API to 'append' additional data -- https://cloud.google.com/bigquery/streaming-data-into-bigquery. This is mainly to get around the update quota that BQ enforces (and ours is a high-write application).
I then append two edits to the table, one update that just modifies the name, and then one update that just modifies the age. Here are the three records after the updates:
id name age timestamp
1 "tom" 20 2019-01-01
1 "Tom" null 2019-02-01
1 null 21 2019-03-03
I then want to query this record to get the most "up-to-date" information. Here is how I have started:
SELECT id, **name**, **age**,max(timestamp)
FROM table
GROUP BY id
-- 1,"Tom",21,2019-03-03
How would I get the correct name and age here? Note that there could be thousands of updates to a record, so I don't want to have to write 1000 case statements, if at all possible.
For various other reasons, I usually won't have all row data at one time; I will only have the RowID + FieldName + FieldValue.
I suppose plan B here is to do a query to get the current data and then add my changes to insert the new row, but I'm hoping there's a way to do this in one go without having to do two queries.
Below is for BigQuery Standard SQL
#standardSQL
SELECT id,
ARRAY_AGG(name IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] name,
ARRAY_AGG(age IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] age,
MAX(ts) ts
FROM `project.dataset.table`
GROUP BY id
You can test and play with the above using the sample data from your question, as in the example below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, "tom" name, 20 age, DATE '2019-01-01' ts UNION ALL
SELECT 1, "Tom", NULL, '2019-02-01' UNION ALL
SELECT 1, NULL, 21, '2019-03-03'
)
SELECT id,
ARRAY_AGG(name IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] name,
ARRAY_AGG(age IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] age,
MAX(ts) ts
FROM `project.dataset.table`
GROUP BY id
with result
Row id name age ts
1 1 Tom 21 2019-03-03
This is a classic use case for analytic functions in Standard SQL.
Here is how you can achieve your results:
select id, name, age from (
select id, name, age, ts, rank() over (partition by id order by ts desc) rnk
from `yourdataset.yourtable`
)
where rnk = 1
This will sub-group your records based on id and pick the one with the most recent ts (indicating the record most recently added for a given id).
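As a quick sanity check, here is a sketch of that query run against the sample rows from the question (the inline WITH data is only for illustration):
#standardSQL
WITH `yourdataset.yourtable` AS (
  SELECT 1 AS id, "tom" AS name, 20 AS age, DATE '2019-01-01' AS ts UNION ALL
  SELECT 1, "Tom", NULL, '2019-02-01' UNION ALL
  SELECT 1, NULL, 21, '2019-03-03'
)
SELECT id, name, age
FROM (
  SELECT id, name, age, ts, RANK() OVER (PARTITION BY id ORDER BY ts DESC) rnk
  FROM `yourdataset.yourtable`
)
WHERE rnk = 1
-- returns 1, NULL, 21: the values come from the single most recent row per id, NULLs included
Note that this keeps whatever values appear on the most recent row, including NULLs; if you need the latest non-NULL value per column, use the ARRAY_AGG(... IGNORE NULLS) approach above.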
I am trying to use this query to get daily costs for each campaign, but I got the cost for only one campaign. Every campaign has the same "_sdc_sequence", but for each day there are many "_sdc_sequence" values. How can I get the cost for the latest version daily per campaign and select some variables like day, cost, impressions and campaign? Right now I get every variable in my database.
"_sdc_sequence" is a unix epoch attached to the record during replication and determine the order of all the versions of a row.
I attached a picture of the table. I need to select only the last sequence (max _sdc_sequence).
#standardSQL
SELECT row.* FROM (
  SELECT ARRAY_AGG(t ORDER BY _sdc_sequence DESC LIMIT 1)[OFFSET(0)] row
  FROM `adxxxxx_xxxxxxxx` t
  GROUP BY day
)
thanks
#standardSQL
SELECT row.* FROM (
  SELECT ARRAY_AGG(t ORDER BY _sdc_sequence DESC LIMIT 1)[OFFSET(0)] row
  FROM `adxxxxx_xxxxxxxx` t
  GROUP BY day, campaign
)
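If you only need a few specific columns rather than the whole row, you can aggregate a STRUCT of just those fields, as in the earlier STRUCT example. Below is a sketch that assumes the columns mentioned in the question (day, cost, impressions, campaign) are the actual column names:
#standardSQL
SELECT row.* FROM (
  -- column names inside STRUCT(...) are assumed from the question; adjust them to your schema
  SELECT ARRAY_AGG(STRUCT(day, cost, impressions, campaign) ORDER BY _sdc_sequence DESC LIMIT 1)[OFFSET(0)] row
  FROM `adxxxxx_xxxxxxxx`
  GROUP BY day, campaign
)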
In order to improve performance, I need a SQL query that implements the following requirement.
There is a table with the following columns:
id timestamp value
How can I get the minimum timestamp (e.g. :t1) such that the count of earlier rows is > 100000?
That is, the result of the following SQL, count(*), will be > 100000:
select count(*) from table where timestamp < :t1
My understanding of your question is: Find the earliest timestamp in the table for which there are at least 100,000 earlier rows.
There are probably many ways to do it; the main difficulty is trying to come up with an efficient one.
I think an analytic-function approach is most likely to work well. The most obvious choice is to use COUNT:
select min(timestamp) from (
select timestamp, count(*) over (order by timestamp rows between unbounded preceding and 1 preceding) earlier_rows
from table
)
where earlier_rows >= 100000
But I suspect using RANK or something similar will be faster:
select min(timestamp) from (
select timestamp, rank() over (order by timestamp) time_rank
from table
)
where time_rank > 100000
I'm not sure off the top of my head, but these may give slightly different results if there are duplicate timestamps.
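For a concrete illustration, here is a sketch with made-up rows, two of which share a timestamp:
select
  timestamp,
  count(*) over (order by timestamp rows between unbounded preceding and 1 preceding) earlier_rows,
  rank() over (order by timestamp) time_rank
from (
  select timestamp '2021-01-01 00:00:00' as timestamp union all
  select timestamp '2021-01-01 00:00:00' union all
  select timestamp '2021-01-02 00:00:00'
) t
order by timestamp
-- the two tied rows get earlier_rows = 0 and 1 (in arbitrary order) but both get time_rank = 1;
-- the third row gets earlier_rows = 2 and time_rank = 3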
This will give you the min and max value and the count:
select
  count(*),
  min(t.timestamp),
  max(t.timestamp)
from table t
where ( select count(*) from table t2 where t2.timestamp < :t1 ) > 100000
I have a table with 200,000 rows and two columns: name and date. The dates and names may have repeated values. I would like to get the first 300 unique names for the dates sorted in ascending order, and have this run fast, as my table may grow to a million rows.
I am using PostgreSQL 9.
SELECT name, date
FROM
(
SELECT DISTINCT ON (name) name, date
FROM table
ORDER BY name, date
) AS id_date
ORDER BY date
LIMIT 300;
The last query from @jachguate will miss names that have two entries on the same date, but this one doesn't.
The query takes about 100 ms on a non-optimized PostgreSQL 9.1 with about 100,000 entries, so it may not scale to millions of entries.
An upgrade to PostgreSQL 9.2 may help, as according to the release notes there are many performance improvements.
Use a CTE:
with unique_date_name as (
select date, name, count(*) rcount
from table
group by date, name
having count(*) = 1
)
select name, date
from unique_date_name
order by date limit 300;
Edit
From the comments, this results in poor performance, so try this one instead:
select date, name, count(*) rcount
from table
group by date, name
having count(*) = 1
order by date limit 300;
or, transforming the original query into a nested subquery in FROM instead of a CTE:
select name, date
from (
select date, name, count(*) rcount
from table
group by date, name
having count(*) = 1
) unique_date_name
order by date limit 300;
Unfortunately I don't have a PostgreSQL instance at hand to check whether it works, but the optimizer should do a better job.
An index on (date, name) is a must for optimal performance.
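For example, a sketch of the index DDL (your_table is a placeholder; the question just calls the table "table"):
-- composite index matching the GROUP BY (date, name) and the ORDER BY date
CREATE INDEX your_table_date_name_idx ON your_table (date, name);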