Get latest status update for every user in the database

I have a status_updates table that contains one row for each status update of each user:
id     nickname     status   timestamp
-----------------------------------------------
14638  lovely_john  offline  2020-07-14 08:37:18
14640  big_papa     online   2020-07-14 08:57:10
When a user's status changes, a new row is added.
How do I select the single latest row (according to the timestamp) for each user in one query? So if I have 100 users, I should get 100 rows, each with that user's latest status change.
Thanks!

This is best handled by DISTINCT ON (a PostgreSQL extension):
select distinct on (nickname) *
from status_updates
order by nickname, timestamp desc;

Please use the query below. It relies on ROW_NUMBER():
select id, nickname, status, timestamp
from (
    select id, nickname, status, timestamp,
           row_number() over (partition by nickname order by timestamp desc) as rnk
    from status_updates
) qry
where rnk = 1;
This returns the latest record for each user.
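To try the ROW_NUMBER() approach without a database server, here is a minimal sketch that runs the same query against an in-memory SQLite database from Python. It assumes a SQLite build with window-function support (3.25 or later); the 14639 row is made up for illustration so that one user has more than one update.

```python
import sqlite3

# Hypothetical in-memory copy of the status_updates table from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE status_updates "
    "(id INTEGER, nickname TEXT, status TEXT, timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO status_updates VALUES (?, ?, ?, ?)",
    [
        (14638, "lovely_john", "offline", "2020-07-14 08:37:18"),
        (14639, "big_papa", "offline", "2020-07-14 08:40:00"),  # made-up row
        (14640, "big_papa", "online", "2020-07-14 08:57:10"),
    ],
)

# Number each user's rows from newest to oldest and keep only row 1.
latest = conn.execute(
    """
    SELECT id, nickname, status, timestamp
    FROM (
        SELECT id, nickname, status, timestamp,
               ROW_NUMBER() OVER (
                   PARTITION BY nickname ORDER BY timestamp DESC
               ) AS rnk
        FROM status_updates
    ) AS qry
    WHERE rnk = 1
    ORDER BY nickname
    """
).fetchall()

for row in latest:
    print(row)
```

Each nickname appears exactly once in the output, carrying its most recent status.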

Related

dense_rank in sql partition by id and session id but ordered by timestamp

I have a table as follows:
User ID  Session ID    Timestamp
100      7e938c4437a0  1:30:30
100      7e938c4437a0  1:30:33
100      c1fcfd8b1a25  2:40:00
100      7b5e86d91103  3:20:00
200      bda6c8743671  2:20:00
200      bda6c8743671  2:25:00
200      aac5d66421a0  3:10:00
200      aac5d66421a0  3:11:00
I am trying to rank each session_id per user_id, sequenced (ordered) by timestamp. I want something like the second table.
I am doing the following, but it does not order by timestamp:
dense_rank() over (partition by user_id order by session_id) as visit_number
It outputs the rows in the wrong order, and when I add timestamp to the ORDER BY it behaves like row_number().
Below is the result I am really looking for:
User ID  Session ID    Timestamp  Rank
100      7e938c4437a0  1:30:30    1
100      7e938c4437a0  1:30:33    1
100      c1fcfd8b1a25  2:40:00    2
100      7b5e86d91103  3:20:00    3
200      bda6c8743671  2:20:00    1
200      bda6c8743671  2:25:00    1
200      aac5d66421a0  3:10:00    2
200      aac5d66421a0  3:11:00    2

If you want to dense-rank by the hour component of the timestamp, you can extract the hour. This should give the results you specify. In standard SQL, this looks like:
dense_rank() over (partition by user_id order by extract(hour from timestamp)) as visit_number
Of course, date/time functions are highly database-dependent, so your database might have a different function for extracting the hour.

I wanted to do something similar, and since I found an answer I thought I would post it here. This is what I have learned you can do:
SELECT user_id, session_id, session_timestamp,
       -- Rank the records by min_dt, which is the same for every row in a (user_id, session_id) group
       DENSE_RANK() OVER (PARTITION BY tbl.user_id ORDER BY tbl.min_dt) AS rank
FROM (
    SELECT user_id, session_id, session_timestamp,
           -- Take the MIN (or MAX) session_timestamp per group. This keeps the ordering by timestamp while still grouping by user_id and session_id.
           MIN(session_timestamp) OVER (PARTITION BY user_id, session_id) AS min_dt
    FROM sessions
) tbl
ORDER BY user_id, rank, session_timestamp
This produces the results asked for.
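The MIN-per-session trick can be checked against the sample data from the question. The sketch below is an illustration only: it assumes a table named sessions with the column names used in the answer, stores the times as zero-padded text so that string ordering matches time ordering, renames the output column to visit_number, and runs on an in-memory SQLite database (window functions need SQLite 3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions "
    "(user_id INTEGER, session_id TEXT, session_timestamp TEXT)"
)
# Sample rows from the question; times are zero-padded so text sorts correctly.
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?, ?)",
    [
        (100, "7e938c4437a0", "01:30:30"),
        (100, "7e938c4437a0", "01:30:33"),
        (100, "c1fcfd8b1a25", "02:40:00"),
        (100, "7b5e86d91103", "03:20:00"),
        (200, "bda6c8743671", "02:20:00"),
        (200, "bda6c8743671", "02:25:00"),
        (200, "aac5d66421a0", "03:10:00"),
        (200, "aac5d66421a0", "03:11:00"),
    ],
)

rows = conn.execute(
    """
    SELECT user_id, session_id, session_timestamp,
           DENSE_RANK() OVER (
               PARTITION BY tbl.user_id ORDER BY tbl.min_dt
           ) AS visit_number
    FROM (
        SELECT user_id, session_id, session_timestamp,
               -- Every row of a session shares the session's earliest time.
               MIN(session_timestamp) OVER (
                   PARTITION BY user_id, session_id
               ) AS min_dt
        FROM sessions
    ) AS tbl
    ORDER BY user_id, visit_number, session_timestamp
    """
).fetchall()

for row in rows:
    print(row)
```

Because all rows of a session share the same min_dt, DENSE_RANK gives them the same rank, while sessions are still numbered in order of their first timestamp.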

Last click attribution/greatest n per user in SQL

I would like to select the last campaign a user clicked in my dataset and return a table with the name of the last clicked campaign and date for each anonymous id.
This is what I have written
select anon,
source,
medium,
campaign,
max(ts) as ts
from attribution
group by 1,2,3,4
This code seems to return the last click date, but when a user has clicked on two campaigns it returns a row for each campaign, each with that campaign's latest date in the ts column.
ts in this scenario refers to the timestamp.

You could use row_number():
select *
from (
    select
        anon,
        source,
        medium,
        campaign,
        ts,
        row_number() over(partition by anon order by ts desc) rn
    from attribution
) t
where rn = 1
This assumes that anon is the column that holds the username. If that's not the case, change it to the relevant column in the OVER(PARTITION BY ...) clause.
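The difference between the two queries is easy to see on a few rows. The following sketch uses made-up attribution data in an in-memory SQLite database (assumed to support window functions, i.e. 3.25 or later): grouping by every descriptive column yields one row per user and campaign, while row_number() keeps a single row per user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attribution "
    "(anon TEXT, source TEXT, medium TEXT, campaign TEXT, ts TEXT)"
)
# Made-up clicks: user u1 clicked two campaigns, user u2 clicked one.
conn.executemany(
    "INSERT INTO attribution VALUES (?, ?, ?, ?, ?)",
    [
        ("u1", "google", "cpc", "spring", "2021-01-01"),
        ("u1", "facebook", "social", "summer", "2021-02-01"),
        ("u2", "google", "cpc", "spring", "2021-01-15"),
    ],
)

# The question's approach: one group per (anon, source, medium, campaign).
grouped = conn.execute(
    """
    SELECT anon, source, medium, campaign, MAX(ts) AS ts
    FROM attribution
    GROUP BY anon, source, medium, campaign
    ORDER BY anon, ts
    """
).fetchall()

# The row_number() approach: one row per anon, the most recent click.
last_click = conn.execute(
    """
    SELECT anon, source, medium, campaign, ts
    FROM (
        SELECT anon, source, medium, campaign, ts,
               ROW_NUMBER() OVER (PARTITION BY anon ORDER BY ts DESC) AS rn
        FROM attribution
    ) AS t
    WHERE rn = 1
    ORDER BY anon
    """
).fetchall()

print(len(grouped), "rows from GROUP BY")
for row in last_click:
    print(row)
```

With this data the GROUP BY query still returns both of u1's campaigns, whereas the row_number() query returns only the most recent one.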

Hive joining columns with milliseconds

I have a table with the columns id, create_time, code.
The create_time column is of type string and holds a timestamp value in the format yyyy-MM-dd HH:mm:ss.SSSSSS
My requirement is to find the latest code (most recent create_time) for each id. If the create_time column had no milliseconds part, I could do
select id,create_time,code from(
select id,max(unix_timestamp(create_time,"yyyy-MM-dd HH:mm:ss")) over (partition by id) as latest_time from table)a
join table b on a.latest_time=b.create_time
As the Unix time functions consider only seconds, not milliseconds, I am not able to proceed with them.
Please help.

Why would you try to convert at all? Since you are only looking for the latest timestamp, I would just compare the strings directly:
select b.id, b.create_time, b.code
from (
    select id, max(create_time) as latest_time
    from table
    group by id
) a
join table b on a.id = b.id and a.latest_time = b.create_time
Values without a milliseconds part will be treated as if they had "000000" instead.

You do not need a join for this.
If you need all records with max(create_time), use rank() or dense_rank(). Rank assigns 1 to every record with the latest create_time, so several records are returned if they share the same time.
If you need only one record per id even if there are many records with create_time=max(create_time), use row_number() instead of rank():
select id, create_time, code
from
(
    select id, create_time, code,
           rank() over(partition by id order by create_time desc) rn
    from table
) s
where rn = 1;
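The rank() versus row_number() distinction only matters when timestamps tie. Here is a small sketch of that case, using made-up rows in a hypothetical table called codes on an in-memory SQLite database (3.25 or later for window functions). The extra "code" tie-breaker in the row_number() ordering is an assumption added to make the single chosen row deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE codes (id INTEGER, create_time TEXT, code TEXT)")
# Made-up rows: id 1 has two records sharing the latest create_time;
# id 2 mixes a value without a fractional part and one with it.
conn.executemany(
    "INSERT INTO codes VALUES (?, ?, ?)",
    [
        (1, "2020-01-01 10:00:00.000001", "A"),
        (1, "2020-01-01 10:00:00.000002", "B"),
        (1, "2020-01-01 10:00:00.000002", "C"),
        (2, "2020-01-02 09:00:00", "D"),
        (2, "2020-01-02 09:00:00.500000", "E"),
    ],
)

# rank(): every record tied for the latest create_time gets rn = 1.
with_ties = conn.execute(
    """
    SELECT id, code
    FROM (
        SELECT id, code,
               RANK() OVER (PARTITION BY id ORDER BY create_time DESC) AS rn
        FROM codes
    ) AS s
    WHERE rn = 1
    ORDER BY id, code
    """
).fetchall()

# row_number(): exactly one record per id; "code" breaks ties (assumption).
one_per_id = conn.execute(
    """
    SELECT id, code
    FROM (
        SELECT id, code,
               ROW_NUMBER() OVER (
                   PARTITION BY id ORDER BY create_time DESC, code
               ) AS rn
        FROM codes
    ) AS s
    WHERE rn = 1
    ORDER BY id
    """
).fetchall()

print(with_ties)
print(one_per_id)
```

The timestamps are compared as plain strings here, which works because the yyyy-MM-dd HH:mm:ss.SSSSSS format sorts lexicographically in chronological order.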

Using the append model to do partial row updates in BigQuery

Suppose I have the following record in BQ:
id  name   age  timestamp
1   "tom"  20   2019-01-01
I then perform two "updates" on this record by using the streaming API to 'append' additional data -- https://cloud.google.com/bigquery/streaming-data-into-bigquery. This is mainly to get around the update quota that BQ enforces (and it is a high-write application we have).
I then append two edits to the table, one update that just modifies the name, and then one update that just modifies the age. Here are the three records after the updates:
id  name   age   timestamp
1   "tom"  20    2019-01-01
1   "Tom"  null  2019-02-01
1   null   21    2019-03-03
I then want to query this record to get the most "up-to-date" information. Here is how I have started:
SELECT id, **name**, **age**,max(timestamp)
FROM table
GROUP BY id
-- 1,"Tom",21,2019-03-03
How would I get the correct name and age here? Note that there could be thousands of updates to a record, so I don't want to have to write 1000 case statements, if at all possible.
For various other reasons, I usually won't have all row data at one time, I will only have the RowID + FieldName + FieldValue.
I suppose plan B here is to do a query to get the current data and then add my changes to insert the new row, but I'm hoping there's a way to do this in one go without having to do two queries.

Below is for BigQuery Standard SQL
#standardSQL
SELECT id,
ARRAY_AGG(name IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] name,
ARRAY_AGG(age IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] age,
MAX(ts) ts
FROM `project.dataset.table`
GROUP BY id
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, "tom" name, 20 age, DATE '2019-01-01' ts UNION ALL
SELECT 1, "Tom", NULL, '2019-02-01' UNION ALL
SELECT 1, NULL, 21, '2019-03-03'
)
SELECT id,
ARRAY_AGG(name IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] name,
ARRAY_AGG(age IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] age,
MAX(ts) ts
FROM `project.dataset.table`
GROUP BY id
with the result:
Row  id  name  age  ts
1    1   Tom   21   2019-03-03
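For readers without a BigQuery project at hand, the "latest non-null value per column" idea can be sketched in SQLite from Python. This is not the ARRAY_AGG query itself: SQLite has no ARRAY_AGG ... IGNORE NULLS, so the sketch swaps in one correlated subquery per column, and the table name updates is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE updates (id INTEGER, name TEXT, age INTEGER, ts TEXT)")
# The three appended records from the question.
conn.executemany(
    "INSERT INTO updates VALUES (?, ?, ?, ?)",
    [
        (1, "tom", 20, "2019-01-01"),
        (1, "Tom", None, "2019-02-01"),
        (1, None, 21, "2019-03-03"),
    ],
)

# For each column, pick the newest row in which that column is not NULL.
merged = conn.execute(
    """
    SELECT u.id,
           (SELECT i.name FROM updates AS i
             WHERE i.id = u.id AND i.name IS NOT NULL
             ORDER BY i.ts DESC LIMIT 1) AS name,
           (SELECT i.age FROM updates AS i
             WHERE i.id = u.id AND i.age IS NOT NULL
             ORDER BY i.ts DESC LIMIT 1) AS age,
           MAX(u.ts) AS ts
    FROM updates AS u
    GROUP BY u.id
    """
).fetchall()

print(merged)
```

Each column is resolved independently, so a partial update that leaves a field NULL never overwrites an older non-null value.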

This is a classic case of application of analytic functions in Standard SQL.
Here is how you can achieve your results:
select id, name, age from (
select id, name, age, ts, rank() over (partition by id order by ts desc) rnk
from `yourdataset.yourtable`
)
where rnk = 1
This groups your records by id and picks the one with the most recent ts (the record most recently added for a given id). Note that it returns that row as it is, so any column that is NULL in the latest row stays NULL.

Get the latest version of every row in SQL

I need to grab the latest version of every row so that I don't get duplicate data. "_sdc_sequence" is a Unix epoch attached to each record during replication; it determines the order of all the versions of a row.
I would like to get cost and impressions for each campaign for every day.
I have tried to use INNER JOIN, but I could not get the data. When I tried to use "account" and "clientname" as the attributes (every row has the same clientname and account), I got zero for cost and impressions. Maybe the attributes are wrong.
SELECT DISTINCT day, cost, impressions, campaign
FROM `adxxxxx_xxxxxxxx` account
INNER JOIN (
SELECT
MAX(_sdc_sequence) AS seq,
campaignid
FROM `adxxxxx_xxxxxxxx`
GROUP BY campaignid) clientname
ON account.campaignid = clientname.campaignid
AND account._sdc_sequence = clientname.seq
ORDER by day
Is there another way to do this, or how can I fix it?
Thank you.

#standardSQL
SELECT row.* FROM (
SELECT ARRAY_AGG(t ORDER BY _sdc_sequence DESC LIMIT 1)[OFFSET(0)] row
FROM `adxxxxx_xxxxxxxx` t
GROUP BY campaignid
)
