BigQuery - Get most recent data for each individual user - sql

I wonder if anyone here can help with a BigQuery piece I am working on.
This will need to pull the most recent gplus/currents activity for each individual user in the domain.
I have tried the following query, but this pulls every activity for every user:
SELECT
  TIMESTAMP_MICROS(time_usec) AS date,
  email,
  event_type,
  event_name
FROM
  `bqadminreporting.adminlogtracking.activity`
WHERE
  record_type LIKE 'gplus'
ORDER BY
  email ASC;
I have tried using DISTINCT, but I still get multiple entries for the same user. Ideally, I need to do this looking back over 90 days (so between today and 90 days ago, get the most recent activity for each user, if that makes sense).
EDIT:
Example data and expected output.
Fields: there are over 500 fields; I have listed only the relevant ones.
+-------------------------------+---------+----------+
| Field name                    | Type    | Mode     |
+-------------------------------+---------+----------+
| time_usec                     | INTEGER | NULLABLE |
| email                         | STRING  | NULLABLE |
| event_type                    | STRING  | NULLABLE |
| event_name                    | STRING  | NULLABLE |
| record_type                   | STRING  | NULLABLE |
| gplus                         | RECORD  | NULLABLE |
| gplus.log_event_resource_name | STRING  | NULLABLE |
| gplus.attachment_type         | STRING  | NULLABLE |
| gplus.plusone_context         | STRING  | NULLABLE |
| gplus.post_permalink          | STRING  | NULLABLE |
| gplus.post_resource_name      | STRING  | NULLABLE |
| gplus.comment_resource_name   | STRING  | NULLABLE |
| gplus.post_visibility         | STRING  | NULLABLE |
| gplus.user_type               | STRING  | NULLABLE |
| gplus.post_author_name        | STRING  | NULLABLE |
+-------------------------------+---------+----------+
Output from my query: This is the output I get when running my query above.
+-----+--------------------------------+------------------+----------------+----------------+
| Row | date                           | email            | event_type     | event_name     |
+-----+--------------------------------+------------------+----------------+----------------+
| 1   | 2020-01-30 07:10:19.088 UTC    | user1@domain.com | post_change    | create_post    |
| 2   | 2020-03-03 08:47:25.086485 UTC | user1@domain.com | comment_change | create_comment |
| 3   | 2020-03-23 09:10:09.522 UTC    | user1@domain.com | post_change    | create_post    |
| 4   | 2020-03-23 09:49:00.337 UTC    | user1@domain.com | plusone_change | remove_plusone |
| 5   | 2020-03-23 09:48:10.461 UTC    | user1@domain.com | plusone_change | add_plusone    |
| 6   | 2020-01-30 10:04:29.757005 UTC | user1@domain.com | comment_change | create_comment |
| 7   | 2020-03-28 08:52:50.711359 UTC | user2@domain.com | comment_change | create_comment |
| 8   | 2020-11-08 10:08:09.161325 UTC | user2@domain.com | comment_change | create_comment |
| 9   | 2020-04-21 15:28:10.022683 UTC | user3@domain.com | comment_change | create_comment |
| 10  | 2020-03-28 09:37:28.738863 UTC | user4@domain.com | comment_change | create_comment |
+-----+--------------------------------+------------------+----------------+----------------+
Desired result: Only 1 row of data per user, showing only the most recent event.
+-----+--------------------------------+------------------+----------------+----------------+
| Row | date                           | email            | event_type     | event_name     |
+-----+--------------------------------+------------------+----------------+----------------+
| 1   | 2020-03-23 09:49:00.337 UTC    | user1@domain.com | plusone_change | remove_plusone |
| 2   | 2020-11-08 10:08:09.161325 UTC | user2@domain.com | comment_change | create_comment |
| 3   | 2020-04-21 15:28:10.022683 UTC | user3@domain.com | comment_change | create_comment |
| 4   | 2020-03-28 09:37:28.738863 UTC | user4@domain.com | comment_change | create_comment |
+-----+--------------------------------+------------------+----------------+----------------+

Use array_agg:
select
  email,
  array_agg(STRUCT(TIMESTAMP_MICROS(time_usec) as date, event_type, event_name)
            ORDER BY time_usec desc LIMIT 1)[OFFSET(0)].*
from `bqadminreporting.adminlogtracking.activity`
where
  record_type LIKE 'gplus'
  and time_usec > unix_micros(timestamp_sub(current_timestamp(), interval 90 day))
group by email
order by email
Test example:
with mytable as (
  select timestamp '2020-01-30 07:10:19.088 UTC' as date, 'user1@domain.com' as email, 'post_change' as event_type, 'create_post' as event_name union all
  select timestamp '2020-03-03 08:47:25.086485 UTC', 'user1@domain.com', 'comment_change', 'create_comment' union all
  select timestamp '2020-03-23 09:10:09.522 UTC', 'user1@domain.com', 'post_change', 'create_post' union all
  select timestamp '2020-03-23 09:49:00.337 UTC', 'user1@domain.com', 'plusone_change', 'remove_plusone' union all
  select timestamp '2020-03-23 09:48:10.461 UTC', 'user1@domain.com', 'plusone_change', 'add_plusone' union all
  select timestamp '2020-01-30 10:04:29.757005 UTC', 'user1@domain.com', 'comment_change', 'create_comment' union all
  select timestamp '2020-03-28 08:52:50.711359 UTC', 'user2@domain.com', 'comment_change', 'create_comment' union all
  select timestamp '2020-11-08 10:08:09.161325 UTC', 'user2@domain.com', 'comment_change', 'create_comment' union all
  select timestamp '2020-04-21 15:28:10.022683 UTC', 'user3@domain.com', 'comment_change', 'create_comment' union all
  select timestamp '2020-03-28 09:37:28.738863 UTC', 'user4@domain.com', 'comment_change', 'create_comment'
)
select
  email,
  array_agg(STRUCT(date, event_type, event_name) ORDER BY date desc LIMIT 1)[OFFSET(0)].*
from mytable
group by email

If you want all columns from the most recent row, you can use this BigQuery syntax:
select array_agg(t order by date desc limit 1)[ordinal(1)].*
from mytable t
group by t.email;
If you want specific columns, then Sergey's solution might be simpler.
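An equivalent way to get the newest row per user is ROW_NUMBER() with BigQuery's QUALIFY clause. This is a sketch against the same table and 90-day window as above, not tied to either answer:
select
  timestamp_micros(time_usec) as date,
  email,
  event_type,
  event_name
from `bqadminreporting.adminlogtracking.activity`
where record_type = 'gplus'
  and time_usec > unix_micros(timestamp_sub(current_timestamp(), interval 90 day))
-- keep only the most recent event per email
qualify row_number() over (partition by email order by time_usec desc) = 1
order by email;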

An alternative way to solve your problem is to join each row back against its user's maximum date:
select t.*
from (
  select email, max(date) as max_dt
  from mytable
  group by email
) m
join mytable t
  on t.email = m.email
  and t.date = m.max_dt


How to concat two fields and use the result in WHERE clause?

I have to get the oldest record for each external id, based on the date and time information.
Data
Id | External Id | Date                | Time
1  | 1000        | 2020-08-18 00:00:00 | 02:30:22
2  | 1000        | 2020-08-12 00:00:00 | 12:45:51
3  | 1556        | 2020-08-17 00:00:00 | 10:09:01
4  | 1919        | 2020-08-14 00:00:00 | 18:19:18
5  | 1919        | 2020-08-14 00:00:00 | 11:45:21
6  | 1919        | 2020-08-14 00:00:00 | 15:54:15
Expected result
Id | External Id | Date                | Time
2  | 1000        | 2020-08-12 00:00:00 | 12:45:51
3  | 1556        | 2020-08-17 00:00:00 | 10:09:01
5  | 1919        | 2020-08-14 00:00:00 | 11:45:21
I'm currently doing this:
SELECT *
FROM RUN AS T1
WHERE CONCAT(T1.DATE, T1.TIME) = (
  SELECT MIN(CONCAT(T2.DATE, T2.TIME))
  FROM RUN AS T2
  WHERE T2.EXTERNAL_ID = T1.EXTERNAL_ID
)
Is this a correct way to do it?
Thank you, regards
Update 1 : Data type
DATE column is datetime
TIME column is varchar
You can use a window function such as DENSE_RANK():
SELECT ID, External_ID, Date, Time
FROM
(
  SELECT DENSE_RANK() OVER (PARTITION BY External_ID ORDER BY Date, Time) AS dr,
         r.*
  FROM run r
) AS q
WHERE dr = 1
Demo
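Since DATE is a datetime and TIME is a varchar, ORDER BY Date, Time only sorts correctly while the time strings stay zero-padded HH:MM:SS. Here is a hedged sketch that builds a real temporal value instead (assuming MySQL 8+, where TIMESTAMP(expr1, expr2) adds a time value to a date):
SELECT ID, External_ID, Date, Time
FROM
(
  SELECT ROW_NUMBER() OVER (PARTITION BY External_ID
                            ORDER BY TIMESTAMP(Date, Time)) AS rn, -- real datetime, not string order
         r.*
  FROM run r
) AS q
WHERE rn = 1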

How to check record by record if one date is between two other dates from a second table with multiple date ranges?

I am facing the following problem. I searched for hours for a similar question, but can't find an answer.
Question:
How to check if there is a range that contains a given date using SQL?
This is more of a general question as stated in the subject, but below you can find a little context.
I want to:
calculate if there was an active subscription for a specific user on a given date.
Below I attach the sample tables. I want to use this later for calculations of retention/churn/reactivations etc.
The tables are in BigQuery, so this is a Standard SQL question.
Given:
Table 1: User_id and a date I want to check if there was an active subscription at this date
Table 2: Subscription transactions with date of transaction and expiry date
Desired output:
Table 3: Table 1 with a "check" column indicating whether any record in the second table has a range that contains that Table 1 date
Table 1: User_date
|---------|------------|
| User_id | Date       |
|---------|------------|
| 1       | 2020-10-31 |
| 1       | 2020-11-30 |
| 2       | 2020-10-31 |
| 2       | 2020-11-30 |
| 3       | 2020-10-31 |
|---------|------------|
Table 2: Subscription_transactions
|------------------|--------------------|---------|
| Transaction_date | Transaction_expiry | User_id |
|------------------|--------------------|---------|
| 2020-10-01       | 2020-10-28         | 1       |
| 2020-10-29       | 2020-11-15         | 1       |
| 2020-10-15       | 2020-11-15         | 2       |
| 2020-09-29       | 2020-10-15         | 3       |
|------------------|--------------------|---------|
Table 3: Desired Output
|---------|------------|-----------|
| User_id | Date       | Is_active |
|---------|------------|-----------|
| 1       | 2020-10-31 | TRUE      |
| 1       | 2020-11-30 | FALSE     |
| 2       | 2020-10-31 | TRUE      |
| 2       | 2020-11-30 | FALSE     |
| 3       | 2020-10-31 | FALSE     |
|---------|------------|-----------|
Does this do what you want?
select ud.*,
       exists (select 1
               from Subscription_transactions st
               where ud.user_id = st.user_id and
                     ud.date between st.Transaction_date and st.Transaction_expiry
              ) as is_active
from user_date ud;
Below is for BigQuery Standard SQL:
with user_date as (
  select 1 user_id, '2020-10-31' date union all
  select 1 user_id, '2020-11-30' date union all
  select 2 user_id, '2020-10-31' date union all
  select 2 user_id, '2020-11-30' date union all
  select 3 user_id, '2020-10-31' date
),
Subscription_transactions as (
  select '2020-10-01' Transaction_date, '2020-10-28' Transaction_expiry, 1 User_id union all
  select '2020-10-29', '2020-11-15', 1 union all
  select '2020-10-15', '2020-11-15', 2 union all
  select '2020-09-29', '2020-10-15', 3
)
SELECT ud.*
  , CASE WHEN st.user_id is NULL then FALSE else TRUE end as is_active
from user_date ud
-- note: if a user's subscription ranges overlapped, this join could emit
-- duplicate rows per (user_id, date); the EXISTS form above avoids that
left join Subscription_transactions st
  on ud.user_id = st.user_id
  and ud.date between st.Transaction_date and st.Transaction_expiry

How to add date rows for messages query?

I got a Messages table.
id | sender_id | message | date
1  | 1         | Cya     | 10/10/2020
2  | 2         | Bye     | 10/10/2020
3  | 1         | Heya    | 10/11/2020
I want to insert date rows and a type column based on the date, so it looks like this.
id | sender_id | message | date       | type
1  | null      | null    | 10/10/2020 | date
1  | 1         | Cya     | 10/10/2020 | message
2  | 2         | Bye     | 10/10/2020 | message
2  | null      | null    | 10/11/2020 | date
3  | 1         | Heya    | 10/11/2020 | message
3  | null      | null    | 10/11/2020 | date
When ordering by date and type, the first and the last rows are date rows, and there is a date row between every two messages with different dates, carrying the later date's value.
I have no idea how to tackle this one. Please tell me if you have any ideas on how to approach this.
This is quite complicated, because you want the new rows to contain the next date but the previous max id (if it exists) and also 1 row at the end.
So you can use UNION ALL for 3 separate cases:
select id, sender_id, message, date, type
from (
  select id, sender_id, message, date, 'message' as type, 2 sort
  from Messages
  union all
  select lag(max(id), 1, min(id)) over (order by date), null, null, date, 'date', 1
  from Messages
  group by date
  union all
  select * from (
    select id, null, null, date, 'date', 3
    from Messages
    order by date desc, id desc limit 1
  )
)
order by date, sort, id
Note that this will work only if your dates are in the format YYYY-MM-DD which is comparable and the only valid date format for SQLite.
See the demo.
Results:
> id | sender_id | message | date | type
> -: | :-------- | :------ | :--------- | :------
> 1 | null | null | 2020-10-10 | date
> 1 | 1 | Cya | 2020-10-10 | message
> 2 | 2 | Bye | 2020-10-10 | message
> 2 | null | null | 2020-10-11 | date
> 3 | 1 | Heya | 2020-10-11 | message
> 3 | null | null | 2020-10-11 | date
Hmmm . . . I think you want union all:
select id, sender_id, message, date, 'message' as type
from t
union all
select id, null, null, date, 'date'
from t
order by id;
EDIT:
Based on your comment:
select id, sender_id, message, date, 'message' as type
from t
union all
select min(id), null, null, date, 'date'
from t
group by date
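To get an interleaved layout from this version, an ORDER BY can be appended. This is a sketch; note the desired output additionally carries the previous date's max id on each date row and ends with a trailing date row, which the first answer handles with lag() and a third UNION ALL branch:
select id, sender_id, message, date, 'message' as type
from t
union all
select min(id), null, null, date, 'date'
from t
group by date
-- 'date' sorts before 'message', so each date header precedes its messages
order by date, type, id;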

How to take the smallest date of a group?

I have a dataset which looks like that:
| id | status   | open_date | name |
| 8  | active   | 2019-3-2  | blab |
| 8  | active   | 2019-3-8  | blub |
| 8  | inactive | 2019-3-9  | hans |
| 8  | active   | 2019-3-10 | ana  |
| 9  | active   | 2019-3-4  | mars |
I want to achieve the following:
| id | status   | open_date | name | status_change_date |
| 8  | active   | 2019-3-2  | blab | 2019-3-2           |
| 8  | active   | 2019-3-8  | blub | 2019-3-2           |
| 8  | inactive | 2019-3-9  | hans | 2019-3-9           |
| 8  | active   | 2019-3-10 | ana  | 2019-3-10          |
| 9  | active   | 2019-3-4  | mars | 2019-3-4           |
For each id, I would like to calculate when the status last changed.
I already tried a group by, but the problem is I only want to group rows with ACTIVE and INACTIVE statuses that are next to each other. If there is an INACTIVE between two ACTIVEs, I would like to start a new group for the new ACTIVE.
Does someone have an idea how to solve that?
Here is a pure SQL solution that uses window functions. It works by generating a partition that contains consecutive records with the same id and status: the difference between a row number over id alone and a row number over id and status stays constant within each consecutive run, so it can serve as a group key.
SELECT
  id,
  status,
  open_date,
  name,
  MIN(open_date) OVER(PARTITION BY id, rn1 - rn2 ORDER BY open_date) status_change_date
FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER(PARTITION BY id ORDER BY open_date) rn1,
    ROW_NUMBER() OVER(PARTITION BY id, status ORDER BY open_date) rn2
  FROM mytable t
) x
ORDER BY id, open_date
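To see why rn1 - rn2 isolates consecutive runs, here are the intermediate values for the sample data (worked out by hand, so treat them as illustrative):
| id | status   | open_date  | rn1 | rn2 | rn1 - rn2 |
| -- | -------- | ---------- | --- | --- | --------- |
| 8  | active   | 2019-03-02 | 1   | 1   | 0         |
| 8  | active   | 2019-03-08 | 2   | 2   | 0         |
| 8  | inactive | 2019-03-09 | 3   | 1   | 2         |
| 8  | active   | 2019-03-10 | 4   | 3   | 1         |
| 9  | active   | 2019-03-04 | 1   | 1   | 0         |
Each distinct (id, rn1 - rn2) pair marks one run of equal status, and MIN(open_date) within that partition is the date the status last changed.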
Demo on DB Fiddle:
| id | status | open_date | name | status_change_date |
| --- | -------- | ---------- | ---- | ------------------ |
| 8 | active | 2019-03-02 | blab | 2019-03-02 |
| 8 | active | 2019-03-08 | blub | 2019-03-02 |
| 8 | inactive | 2019-03-09 | hans | 2019-03-09 |
| 8 | active | 2019-03-10 | ana | 2019-03-10 |
| 9 | active | 2019-03-04 | mars | 2019-03-04 |
That's an answer to "How to take the smallest date of a group?" in JavaScript:
let minDate = null;
dataset.forEach(x => {
  // keep the earliest date seen so far
  if (minDate === null || x.date < minDate) minDate = x.date;
});
console.log(minDate);
You can try this:
var movies = [
{title: 'The Godfather', rating: 9.2, release: '24 March 1972'},
{title: 'The Godfather: Part II', rating: 9.0, release: '20 December 1972'},
{title: 'The Shawshank Redemption', rating: 9.3, release: '14 October 1994'},
];
movies.sort(function(a, b) {
var dateA = new Date(a.release), dateB = new Date(b.release);
return dateA - dateB;
});
This sort works because JS lets you do arithmetic on Date objects, which are automatically converted to numeric representations first.
In SQL, use the MIN function. Given an Order table with columns Id, OrderDate, OrderNumber, CustomerId and TotalAmount:
SELECT MIN(OrderDate)
FROM [Order]
WHERE YEAR(OrderDate) = 2013
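Since the question asks for the smallest date of a group, a grouped variant may be closer to what is needed (a sketch; using CustomerId as the grouping key is an assumption):
SELECT CustomerId, MIN(OrderDate) AS first_order_date -- earliest order per customer
FROM [Order]
GROUP BY CustomerId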

Arranging the data on the basis of column value

I have a table which has the below structure.
+-----------------------+--------------+--------+
| timeStamp             | value        | type   |
+-----------------------+--------------+--------+
| '2010-01-14 00:00:00' | '11787.3743' | 'mean' |
| '2018-04-03 14:19:21' | '9.9908'     | 'std'  |
| '2018-04-03 14:19:21' | '11787.3743' | 'min'  |
+-----------------------+--------------+--------+
Now I want to write a select query that fetches the data on the basis of type:
+-----------------------+--------------+--------------+----------+
| timeStamp             | mean_type    | min_type     | std_type |
+-----------------------+--------------+--------------+----------+
| '2010-01-14 00:00:00' | '11787.3743' |              |          |
| '2018-04-03 14:19:21' |              |              | '9.9908' |
| '2018-04-03 14:19:21' |              | '11787.3743' |          |
+-----------------------+--------------+--------------+----------+
Please help me write a query that does this in a Postgres DB. I also want to get the data at 10-minute intervals only.
Use CASE ... WHEN ...:
with my_table(timestamp, value, type) as (
  values
    ('2010-01-14 00:00:00', 11787.3743, 'mean'),
    ('2018-04-03 14:19:21', 9.9908, 'std'),
    ('2018-04-03 14:19:21', 11787.3743, 'min')
)
select
  timestamp,
  case type when 'mean' then value end as mean_type,
  case type when 'min' then value end as min_type,
  case type when 'std' then value end as std_type
from my_table;
timestamp | mean_type | min_type | std_type
---------------------+------------+------------+----------
2010-01-14 00:00:00 | 11787.3743 | |
2018-04-03 14:19:21 | | | 9.9908
2018-04-03 14:19:21 | | 11787.3743 |
(3 rows)
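The pivot above does not cover the 10-minute-interval part of the question. Here is a sketch for that, reusing the my_table CTE from the answer (assumes PostgreSQL; each row is bucketed by flooring its epoch to a 600-second boundary, and MAX collapses the pivoted rows, which assumes at most one value per type per bucket):
select
  -- floor the epoch to a 600-second (10-minute) boundary
  to_timestamp(floor(extract(epoch from timestamp::timestamp) / 600) * 600) as bucket_start,
  max(case type when 'mean' then value end) as mean_type,
  max(case type when 'min' then value end) as min_type,
  max(case type when 'std' then value end) as std_type
from my_table
group by 1
order by 1;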