Using the append model to do partial row updates in BigQuery

Suppose I have the following record in BQ:
id  name   age  timestamp
1   "tom"  20   2019-01-01
I then perform two "updates" on this record by using the streaming API to append additional rows (https://cloud.google.com/bigquery/streaming-data-into-bigquery). This is mainly to get around the update quota that BQ enforces, since ours is a high-write application.
The two appended edits are one that modifies only the name and one that modifies only the age. Here are the three records after the updates:
id  name   age   timestamp
1   "tom"  20    2019-01-01
1   "Tom"  null  2019-02-01
1   null   21    2019-03-03
I then want to query this record to get the most "up-to-date" information. Here is how I have started:
SELECT id, name, age, max(timestamp) -- name and age are the columns in question
FROM table
GROUP BY id
-- 1,"Tom",21,2019-03-03
How would I get the correct name and age here? Note that there could be thousands of updates to a record, so I don't want to have to write 1000 case statements, if at all possible.
For various other reasons, I usually won't have all row data at one time, I will only have the RowID + FieldName + FieldValue.
I suppose plan B here is to do a query to get the current data and then add my changes to insert the new row, but I'm hoping there's a way to do this in one go without having to do two queries.

Below is for BigQuery Standard SQL
#standardSQL
SELECT id,
ARRAY_AGG(name IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] name,
ARRAY_AGG(age IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] age,
MAX(ts) ts
FROM `project.dataset.table`
GROUP BY id
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, "tom" name, 20 age, DATE '2019-01-01' ts UNION ALL
SELECT 1, "Tom", NULL, '2019-02-01' UNION ALL
SELECT 1, NULL, 21, '2019-03-03'
)
SELECT id,
ARRAY_AGG(name IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] name,
ARRAY_AGG(age IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] age,
MAX(ts) ts
FROM `project.dataset.table`
GROUP BY id
with result
Row id name age ts
1 1 Tom 21 2019-03-03
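To make the semantics of the ARRAY_AGG(... IGNORE NULLS ORDER BY ts DESC LIMIT 1) query concrete, here is a minimal Python sketch of the same fold: for each id, every column keeps its newest non-null value. The function name latest_non_null and the dict-based row layout are assumptions made for illustration only; this is not how BigQuery evaluates the query.

```python
from typing import Any, Dict, Iterable


def latest_non_null(rows: Iterable[Dict[str, Any]],
                    key: str = "id",
                    ts: str = "ts") -> Dict[Any, Dict[str, Any]]:
    """Fold append-only rows into one record per key.

    For every column, the newest non-null value wins, which mirrors
    ARRAY_AGG(col IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)].
    """
    merged: Dict[Any, Dict[str, Any]] = {}
    # Process oldest rows first so that later rows overwrite earlier ones.
    for row in sorted(rows, key=lambda r: r[ts]):
        current = merged.setdefault(row[key], {})
        for column, value in row.items():
            if value is not None:  # a NULL in an update means "unchanged"
                current[column] = value
    return merged


# Sample data from the question (ISO date strings sort chronologically).
rows = [
    {"id": 1, "name": "tom", "age": 20, "ts": "2019-01-01"},
    {"id": 1, "name": "Tom", "age": None, "ts": "2019-02-01"},
    {"id": 1, "name": None, "age": 21, "ts": "2019-03-03"},
]
result = latest_non_null(rows)
print(result[1])
```

One consequence of this design is that a field can never be set back to NULL by an appended row, because NULL is reserved to mean "not touched by this update".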

This is a classic case of application of analytic functions in Standard SQL.
Here is how you can achieve your results:
select id, name, age from (
select id, name, age, ts, rank() over (partition by id order by ts desc) rnk
from `yourdataset.yourtable`
)
where rnk = 1
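This groups your records by id and picks the one with the most recent ts (that is, the record most recently added for a given id). Note that it returns that latest row as stored, so any column the last update left NULL comes back as NULL.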
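As a quick check of what the rank() approach returns on the question's sample data, the sketch below runs the same query shape locally. It uses Python's built-in sqlite3 module as a stand-in for BigQuery (an assumption for illustration; it needs an SQLite build with window functions, version 3.25 or later), and the table name t is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT, age INTEGER, ts TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (1, "tom", 20, "2019-01-01"),
        (1, "Tom", None, "2019-02-01"),
        (1, None, 21, "2019-03-03"),
    ],
)

# Keep only the most recent row per id, exactly as it was appended.
latest_rows = conn.execute(
    """
    SELECT id, name, age FROM (
        SELECT id, name, age, ts,
               RANK() OVER (PARTITION BY id ORDER BY ts DESC) AS rnk
        FROM t
    )
    WHERE rnk = 1
    """
).fetchall()
print(latest_rows)
```

On this data the latest row carries a NULL name, so picking whole rows suits full-row appends, while partial updates need a per-column merge.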

Related

Delete repeated data in Bigquery

I am optimizing a query in Bigquery that shows non-repeated data, currently it is like this and it works.
select * from (select
ROW_NUMBER() OVER (PARTITION BY id) as num,
id,
created_at,
operator_id,
description
from NAME_TABLE
where created_at >='2018-01-01') where num=1
I wanted to ask whether it is possible to do a GROUP BY over all the columns (it cannot be done in a simple way, since created_at cannot be grouped) and keep the first created_at value that appears for each id.
PS: DISTINCT does not work, since there are more than 80 million records (growing by 2 million per day) and it still returns repeated data.
Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY created_at LIMIT 1)[OFFSET(0)]
FROM `project.dataset.NAME_TABLE` t
WHERE created_at >='2018-01-01'
GROUP BY id
Instead of processing and returning all columns, you can specify the exact list you need, as in the example below:
#standardSQL
SELECT AS VALUE ARRAY_AGG(STRUCT(id,created_at,operator_id,description) ORDER BY created_at LIMIT 1)[OFFSET(0)]
FROM `project.dataset.NAME_TABLE`
WHERE created_at >='2018-01-01'
GROUP BY id
Your query should be fine. But you can do this without a subquery:
select array_agg(nt order by created_at desc limit 1)[ordinal(1)].*
from name_table nt
where created_at >='2018-01-01'
group by id
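The "keep one row per id" idea can be tried locally with the sketch below. It uses Python's sqlite3 module rather than BigQuery (an assumption for illustration; window functions need SQLite 3.25 or later), and the sample rows are made up. An explicit ORDER BY created_at inside the window makes "first" well defined, which a bare PARTITION BY id does not guarantee.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE name_table "
    "(id INTEGER, created_at TEXT, operator_id INTEGER, description TEXT)"
)
# Hypothetical sample rows: id 1 is repeated, id 3 is older than the cutoff.
conn.executemany(
    "INSERT INTO name_table VALUES (?, ?, ?, ?)",
    [
        (1, "2018-01-05", 10, "first"),
        (1, "2018-02-01", 11, "repeat"),
        (2, "2018-03-01", 12, "only"),
        (3, "2017-12-31", 13, "too old"),
    ],
)

# Number the rows of each id by created_at and keep the earliest one.
deduped = conn.execute(
    """
    SELECT id, created_at, operator_id, description FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY created_at) AS num
        FROM name_table
        WHERE created_at >= '2018-01-01'
    )
    WHERE num = 1
    ORDER BY id
    """
).fetchall()
print(deduped)
```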

Nested partitioning and ranking in google big query

Below is what the data looks like:
I want to sort this data on different levels to achieve the final output.
Level 1:
Whenever there are duplicate values for name, I want to get the least ranking for each distinct (id, name,last_name, gender) tuple.
Level 1 Result:
Level 2:
In level 2, I want to get the least ranking for each gender category for a particular name.
Level 2 Result:
Final output:
For each name, if the 'male' and 'female' ranks are the same, return whichever occurs first in the table. If they differ, return the record with the lowest rank.
Final result expected-
Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY ranking, id LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY name
I do suspect that you can just partition by name:
select *
from (
select
t.*,
row_number() over(partition by name order by ranking, id) rn
from mytable t
) t
where rn = 1
The second sort criterion, on id, breaks the tie.
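Since the question's sample data did not survive, here is a small sketch with made-up rows showing how ordering by ranking and then id behaves. It runs on Python's sqlite3 module instead of BigQuery (an assumption; SQLite 3.25 or later is needed for window functions), and the table contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER, name TEXT, gender TEXT, ranking INTEGER)"
)
# Hypothetical rows: 'alex' has distinct rankings, 'sam' has a tie.
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        (1, "alex", "male", 3),
        (2, "alex", "female", 1),
        (3, "sam", "male", 2),
        (4, "sam", "female", 2),
    ],
)

# Lowest ranking wins per name; on equal ranking the smaller id wins.
best_per_name = conn.execute(
    """
    SELECT id, name, gender, ranking FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY name ORDER BY ranking, id) AS rn
        FROM mytable t
    )
    WHERE rn = 1
    ORDER BY name
    """
).fetchall()
print(best_per_name)
```

This treats a smaller id as "occurs first in the table", which is an assumption about how the rows were inserted.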

Test whether MIN would work over ROW_NUMBER

Situation:
I have three columns:
id
date
tx_id
The primary id column is tx_id and is unique in the table. Each tx_id is tied to an id and it has a record date. I would like to test whether or not the tx_id is incremental.
Objective:
I need to extract the first tx_id for each id, but I want to avoid using ROW_NUMBER
i.e
select id, date, tx_id, row_number() over(partition by id order by date asc) as First_transaction_id from table
and simply use
select id, date, MIN(tx_id) as First_transaction_id from table
So, given that I have more than 50 million ids, how can I make sure that using MIN(tx_id) will yield the earliest transaction for each id?
How can I add a flag column to segment those that don't satisfy the condition?
"How can I make sure, given that I have more than 50 million ids, that using MIN(tx_id) will yield the earliest transaction for each id?"
Simply do the comparison:
You can get the exceptions with logic like this:
select t.*
from (select t.*,
min(tx_id) over (partition by id) as min_tx_id,
rank() over (partition by id order by date) as seqnum
from t
) t
where tx_id = min_tx_id and seqnum > 1;
Note: this uses rank(). It seems possible that there could be two transactions for an id on the same date.
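The comparison above can also be turned into the flag column the question asks for. The sketch below is one possible way, run on Python's sqlite3 module as a local stand-in (an assumption; window functions need SQLite 3.25 or later). The table name tx, the flag name min_is_not_earliest and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tx (id INTEGER, date TEXT, tx_id INTEGER)")
# Hypothetical rows: for id 2 the smallest tx_id is NOT on the earliest date.
conn.executemany(
    "INSERT INTO tx VALUES (?, ?, ?)",
    [
        (1, "2020-01-01", 100),
        (1, "2020-01-02", 101),
        (2, "2020-01-05", 200),
        (2, "2020-01-03", 201),
    ],
)

# One row per id: the row holding MIN(tx_id), flagged with 1 when its date
# is not the earliest date for that id (RANK tolerates same-date ties).
flagged = conn.execute(
    """
    SELECT id, tx_id, date,
           CASE WHEN seqnum > 1 THEN 1 ELSE 0 END AS min_is_not_earliest
    FROM (
        SELECT id, tx_id, date,
               MIN(tx_id) OVER (PARTITION BY id) AS min_tx_id,
               RANK() OVER (PARTITION BY id ORDER BY date) AS seqnum
        FROM tx
    )
    WHERE tx_id = min_tx_id
    ORDER BY id
    """
).fetchall()
print(flagged)
```

If no id comes back flagged, MIN(tx_id) and the earliest-date row agree for the whole table.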
Use a correlated subquery:
select t.* from table_name t
where t.date= ( select min(date) from table_name
t1 where t1.id=t.id)

Retrieve records against most recent state/attribute value

We have a denormalized structure in Redshift. The plan is to keep appending records and, when retrieving, to consider only the most recent attributes for each user.
Following is the table:
user_id state created_at
1 A 15-10-2015 02:00:00 AM
2 A 15-10-2015 02:00:01 AM
3 A 15-10-2015 02:00:02 AM
1 B 15-10-2015 02:00:03 AM
4 A 15-10-2015 02:00:04 AM
5 B 15-10-2015 02:00:05 AM
And required result set is:
user_id state created_at
2 A 15-10-2015 02:00:01 AM
3 A 15-10-2015 02:00:02 AM
4 A 15-10-2015 02:00:04 AM
I have the query which retrieve the said result:
select user_id, first_value AS state
from (
select user_id, first_value(state) OVER (
PARTITION BY user_id
ORDER BY created_at desc
ROWS between UNBOUNDED PRECEDING and CURRENT ROW)
from customer_properties
order by created_at) t
where first_value = 'A'
Is this the best way to retrieve or can the query be improved?
The best query depends on various details: selectivity of the query predicate, cardinalities, data distribution. If state = 'A' is a selective condition (few rows qualify), this query should be substantially faster:
SELECT c.user_id, c.state
FROM customer_properties c
LEFT JOIN customer_properties c1 ON c1.user_id = c.user_id
AND c1.created_at > c.created_at
WHERE c.state = 'A'
AND c1.user_id IS NULL;
Provided, there is an index on (state) (or even (state, user_id, created_at)) and another one on (user_id, created_at).
There are various ways to make sure a later version of the row does not exist:
Select rows which are not present in other table
If 'A' is a common value in state, this more generic query will be faster:
SELECT user_id, state
FROM (
SELECT user_id, state
, row_number() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
FROM customer_properties
) t
WHERE t.rn = 1
AND t.state = 'A';
I removed NULLS LAST, assuming that created_at is defined NOT NULL. Also, I don't think Redshift has it:
PostgreSQL sort by datetime asc, null first?
Both queries should work with the limited functionality of Redshift. With modern Postgres, there are better options:
Select first row in each GROUP BY group?
Optimize GROUP BY query to retrieve latest record per user
Your original would return all rows per user_id, if the latest row matches. You would have to fold duplicates, needless work ...
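To check that the two query shapes above agree on the question's sample data, the sketch below runs both locally. It uses Python's sqlite3 module, not Redshift (an assumption for illustration; window functions need SQLite 3.25 or later), and the timestamps are rewritten in ISO format so that string comparison matches chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_properties "
    "(user_id INTEGER, state TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO customer_properties VALUES (?, ?, ?)",
    [
        (1, "A", "2015-10-15 02:00:00"),
        (2, "A", "2015-10-15 02:00:01"),
        (3, "A", "2015-10-15 02:00:02"),
        (1, "B", "2015-10-15 02:00:03"),
        (4, "A", "2015-10-15 02:00:04"),
        (5, "B", "2015-10-15 02:00:05"),
    ],
)

# Variant 1: anti-join, keep 'A' rows for which no later row exists.
anti_join = conn.execute(
    """
    SELECT c.user_id, c.state
    FROM customer_properties c
    LEFT JOIN customer_properties c1
           ON c1.user_id = c.user_id AND c1.created_at > c.created_at
    WHERE c.state = 'A' AND c1.user_id IS NULL
    ORDER BY c.user_id
    """
).fetchall()

# Variant 2: number rows per user, newest first, then filter on state.
windowed = conn.execute(
    """
    SELECT user_id, state FROM (
        SELECT user_id, state,
               ROW_NUMBER() OVER (PARTITION BY user_id
                                  ORDER BY created_at DESC) AS rn
        FROM customer_properties
    )
    WHERE rn = 1 AND state = 'A'
    ORDER BY user_id
    """
).fetchall()
print(anti_join, windowed)
```

User 1 drops out of both results because its latest state is B, which matches the required result set in the question.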

Row with the highest ID

You have three fields: ID, Date and Total. Your table contains multiple rows for the same day, which is valid data; however, for reporting purposes you need to show only one row per day. The row with the highest ID per day should be returned, and the rest should be hidden from users (not returned).
To better picture the question, below is sample data and sample output:
ID, Date, Total
1, 2011-12-22, 50
2, 2011-12-22, 150
The correct result is:
2, 2011-12-22, 150
The correct output is a single row for the date 2011-12-22, and this row was chosen because it has the highest ID (2 > 1).
Assuming that you have a database that supports window functions, and that the date column is indeed just date (and not datetime), then something like:
SELECT
* --TODO - Pick columns
FROM
(
SELECT ID,[Date],Total,ROW_NUMBER() OVER (PARTITION BY [Date] ORDER BY ID desc) rn
FROM [Table]
) t
WHERE
rn = 1
Should produce one row per day - and the selected row for any given day is that with the highest ID value.
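A small local run of the ROW_NUMBER approach, using the question's two sample rows plus one extra made-up row for a second day. It uses Python's sqlite3 module as a stand-in database (an assumption; window functions need SQLite 3.25 or later), and the table name daily_totals is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_totals (ID INTEGER, Date TEXT, Total INTEGER)")
conn.executemany(
    "INSERT INTO daily_totals VALUES (?, ?, ?)",
    [
        (1, "2011-12-22", 50),
        (2, "2011-12-22", 150),
        (3, "2011-12-23", 75),  # hypothetical extra day
    ],
)

# Number the rows of each day from highest ID down and keep the first.
one_per_day = conn.execute(
    """
    SELECT ID, Date, Total FROM (
        SELECT ID, Date, Total,
               ROW_NUMBER() OVER (PARTITION BY Date ORDER BY ID DESC) AS rn
        FROM daily_totals
    )
    WHERE rn = 1
    ORDER BY Date
    """
).fetchall()
print(one_per_day)
```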
SELECT *
FROM table
WHERE ID IN ( SELECT MAX(ID)
FROM table
GROUP BY Date )
This will work.
SELECT *
FROM tableName a
INNER JOIN
(
SELECT `DATE`, MAX(ID) maxID
FROM tableName
GROUP BY `DATE`
) b ON a.id = b.MaxID AND
a.`date` = b.`date`
Probably
SELECT * FROM your_table ORDER BY ID DESC LIMIT 1
Select MAX(ID),Data,Total from foo
for MySQL
Another simple way is
SELECT TOP 1 * FROM YourTable ORDER BY ID DESC
And, I think this is the most simple way!
SELECT * FROM TABLE_SUM S WHERE S.ID =
(
SELECT MAX(ID) FROM TABLE_SUM
WHERE CDATE = S.CDATE
GROUP BY CDATE
)