Delete repeated data in BigQuery - SQL

I am optimizing a query in BigQuery that returns non-repeated data. Currently it looks like this, and it works:
select * from (
  select
    ROW_NUMBER() OVER (PARTITION BY id) as num,
    id,
    created_at,
    operator_id,
    description
  from NAME_TABLE
  where created_at >= '2018-01-01'
) where num = 1
I wanted to ask if it is possible to do a GROUP BY over all the columns (it cannot be done in a simple way, since created_at cannot be grouped directly) and keep the first created_at value that appears for each id.
PS: a DISTINCT does not work, since there are more than 80 million records (increasing by 2 million per day) and it returns repeated data.

Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY created_at LIMIT 1)[OFFSET(0)]
FROM `project.dataset.NAME_TABLE` t
WHERE created_at >='2018-01-01'
GROUP BY id
Instead of processing / returning all columns, you can specify the exact list you need, as in the example below. SELECT AS VALUE makes the query return the fields of the aggregated struct as regular top-level columns:
#standardSQL
SELECT AS VALUE ARRAY_AGG(STRUCT(id, created_at, operator_id, description) ORDER BY created_at LIMIT 1)[OFFSET(0)]
FROM `project.dataset.NAME_TABLE`
WHERE created_at >='2018-01-01'
GROUP BY id

Your query should be fine. But you can do this without a subquery:
select array_agg(nt order by created_at desc limit 1)[ordinal(1)].*
from name_table nt
where created_at >='2018-01-01'
group by id

Related

Select newest entry for each user without using group by (postgres)

I have a table myTable with four columns:
id UUID,
user_id UUID ,
text VARCHAR ,
date TIMESTAMP
(id is the primary key and user_id is not unique in this table)
I want to retrieve the user_ids ordered by their newest entry, which I am currently doing with this query:
SELECT user_id FROM myTable GROUP BY user_id ORDER BY MAX(date) DESC
The problem is that GROUP BY takes a long time. Is there a faster way to accomplish this? I tried using a window function with PARTITION BY as described in Retrieving the last record in each group - MySQL, but it didn't really speed things up. I've also made sure that user_id is indexed.
My postgres version is 10.4
Edit: The query above that I'm currently using is functionally correct, the problem is that it's slow.
Your query seems like a relevant approach for your requirement:
select user_id
from mytable
group by user_id
order by max(date) desc
I would recommend an index on (user_id, date desc) to speed things up. It needs to be a single index on both columns.
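A minimal sketch of how that index could be created (the index name is illustrative):
create index mytable_user_id_date_idx on mytable (user_id, date desc);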
You could also give distinct on a try, which might or might not give you better performance:
select user_id
from (
    select distinct on (user_id) user_id, date
    from mytable
    order by user_id, date desc
) t
order by date desc
Start with an index on user_id, date desc. That might help.
You can also try filtering -- once you have such an index:
select t.user_id
from myTable t
where t.date = (select max(t2.date)
                from myTable t2
                where t2.user_id = t.user_id
               )
order by t.date desc
However, you might find that the order by ends up taking almost as much time as the group by.
This version will definitely use the index for the subquery:
select user_id
from (select distinct on (user_id) user_id, date
      from myTable t
      order by user_id, date desc
     ) t
order by date desc;
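To check whether the planner actually picks the index, you can prefix the query with explain; a quick sketch (the exact plan shape depends on your data and settings):
explain
select user_id
from (select distinct on (user_id) user_id, date
      from myTable t
      order by user_id, date desc
     ) t
order by date desc;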

Hive joining columns with milliseconds

I have a table with columns id, create_time and code.
The create_time column is of type string, holding a timestamp value in the format yyyy-MM-dd HH:mm:ss.SSSSSS.
Now my requirement is to find the latest code (most recent create_time) for each id. If the create_time column had no milliseconds part, I could do
select id, create_time, code from (
    select id, max(unix_timestamp(create_time, "yyyy-MM-dd HH:mm:ss")) over (partition by id) as latest_time
    from table
) a
join table b on a.latest_time = b.create_time
As the unix time functions consider only seconds, not milliseconds, I am not able to proceed with them.
Please help
Why would you try to convert at all? Since you are only looking for the latest timestamp, I would just do:
select b.id, b.create_time, b.code from (
    select id, max(create_time) over (partition by id) as latest_time
    from table
) a
join table b on a.latest_time = b.create_time
The ones without milliseconds will be treated as if they had "000000" instead.
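A quick sanity check of why plain string comparison is safe here (hypothetical values; assumes a Hive version that allows select without from): the format is fixed-width, so lexicographic order matches chronological order:
select '2019-01-01 10:00:00.123456' > '2019-01-01 10:00:00' as is_later; -- returns true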
You do not need a join for this.
If you need all records with max(create_time), use rank() or dense_rank(). Rank will assign 1 to all records with the latest create_time if there are many records with the same time.
If you need only one record per id even if there are many records with create_time = max(create_time), then use row_number() instead of rank():
select id, create_time, code
from
(
    select id, create_time, code,
           rank() over (partition by id order by create_time desc) rn
    from table
) s
where rn = 1;
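For completeness, a sketch of the row_number() variant mentioned above, keeping the question's placeholder table name:
select id, create_time, code
from
(
    select id, create_time, code,
           row_number() over (partition by id order by create_time desc) rn
    from table
) s
where rn = 1;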

Using the append model to do partial row updates in BigQuery

Suppose I have the following record in BQ:
id name age timestamp
1 "tom" 20 2019-01-01
I then perform two "updates" on this record by using the streaming API to 'append' additional data -- https://cloud.google.com/bigquery/streaming-data-into-bigquery. This is mainly to get around the update quota that BQ enforces (and ours is a high-write application).
I then append two edits to the table, one update that just modifies the name, and then one update that just modifies the age. Here are the three records after the updates:
id name age timestamp
1 "tom" 20 2019-01-01
1 "Tom" null 2019-02-01
1 null 21 2019-03-03
I then want to query this record to get the most "up-to-date" information. Here is how I have started:
SELECT id, **name**, **age**,max(timestamp)
FROM table
GROUP BY id
-- 1,"Tom",21,2019-03-03
How would I get the correct name and age here? Note that there could be thousands of updates to a record, so I don't want to have to write 1000 case statements, if at all possible.
For various other reasons, I usually won't have all row data at one time, I will only have the RowID + FieldName + FieldValue.
I suppose plan B here is to do a query to get the current data and then add my changes to insert the new row, but I'm hoping there's a way to do this in one go without having to do two queries.
Below is for BigQuery Standard SQL
#standardSQL
SELECT id,
  ARRAY_AGG(name IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] name,
  ARRAY_AGG(age IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] age,
  MAX(ts) ts
FROM `project.dataset.table`
GROUP BY id
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, "tom" name, 20 age, DATE '2019-01-01' ts UNION ALL
  SELECT 1, "Tom", NULL, '2019-02-01' UNION ALL
  SELECT 1, NULL, 21, '2019-03-03'
)
SELECT id,
  ARRAY_AGG(name IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] name,
  ARRAY_AGG(age IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] age,
  MAX(ts) ts
FROM `project.dataset.table`
GROUP BY id
with result
Row id name age ts
1 1 Tom 21 2019-03-03
This is a classic case for analytic functions in Standard SQL.
Here is how you can achieve your results:
select id, name, age from (
  select id, name, age, ts, rank() over (partition by id order by ts desc) rnk
  from `yourdataset.yourtable`
)
where rnk = 1
This will sub-group your records based on id and pick the one with the most recent ts (the record most recently added for a given id).

How can I get the last version of each row in SQL?

I am trying to use this query to get the daily costs for each campaign, but I get the cost for only one campaign. Every campaign has the same "_sdc_sequence", but for each day there are many "_sdc_sequence" values. How can I get the cost for the last version per day and per campaign, and select just some variables such as day, cost, impressions and campaign? Because right now I get every variable in my database.
"_sdc_sequence" is a unix epoch attached to the record during replication and determine the order of all the versions of a row.
I attached a picture of the table. I need to select only the last sequence (max _sdc_sequence).
#standardSQL
SELECT row.* FROM (
  SELECT ARRAY_AGG(t ORDER BY _sdc_sequence DESC LIMIT 1)[OFFSET(0)] row
  FROM `adxxxxx_xxxxxxxx` t
  GROUP BY day
)
thanks
The only change needed is to group by campaign as well as day:
#standardSQL
SELECT row.* FROM (
  SELECT ARRAY_AGG(t ORDER BY _sdc_sequence DESC LIMIT 1)[OFFSET(0)] row
  FROM `adxxxxx_xxxxxxxx` t
  GROUP BY day, campaign
)

BigQuery - removing duplicate records sometimes taking long

We implemented the following ETL process in the cloud: run a query in our local database hourly => save the result as CSV and load it into cloud storage => load the file from cloud storage into a BigQuery table => remove duplicate records using the following query.
SELECT
  * EXCEPT (row_number)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) row_number
  FROM rawData.stock_movement
)
WHERE row_number = 1
Since 8 am this morning (local time in Berlin) the process of removing duplicate records has been taking much longer than it usually does, even though the amount of data is not much different from usual: removing duplicates usually takes about 10 s, whereas this morning it sometimes took half an hour.
Is the performance of removing duplicate records not stable?
It could be that you have many duplicate values for a particular id, so computing row numbers takes a long time. If you want to check for whether this is the case, you can try:
#standardSQL
SELECT id, COUNT(*) AS id_count
FROM rawData.stock_movement
GROUP BY id
ORDER BY id_count DESC LIMIT 5;
With that said, it may be faster to remove duplicates with this query instead:
#standardSQL
SELECT latest_row.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
  FROM rawData.stock_movement AS t
  GROUP BY t.id
);
Here is an example:
#standardSQL
WITH T AS (
  SELECT 1 AS id, 'foo' AS x, TIMESTAMP '2017-04-01' AS timestamp UNION ALL
  SELECT 2, 'bar', TIMESTAMP '2017-04-02' UNION ALL
  SELECT 1, 'baz', TIMESTAMP '2017-04-03')
SELECT latest_row.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
  FROM T AS t
  GROUP BY t.id
);
The reason that this may be faster is that BigQuery will only keep the row with the largest timestamp in memory at any given point in time.