Sort group members and aggregate with navigation function - google-bigquery

I'm trying to group the table below, on columns id and due_month:
id  created_at  due_month  status
1   2021-02-05  2021-02    paused
1   2021-01-31  2021-02    normal
1   2021-01-15  2021-01    normal
2   2021-03-18  2021-03    normal
2   2021-03-07  2021-03    paused
2   2021-03-31  2021-08    normal
Then, within each group, I want to sort the members by created_at in ascending order and pick the status of the last item, i.e. the one with the latest created_at date (assume that created_at never repeats across records with the same id).
The output should look like this:
id  due_month  status
1   2021-01    normal
1   2021-02    paused
2   2021-03    normal
2   2021-08    normal
I tried a query like this, but it didn't work (syntax error):
SELECT
`id`,
`due_month`,
LAST_VALUE(`status`) OVER (ORDER BY `created_at`) AS `status`
FROM `some_table`
GROUP BY
`id`,
`due_month`
;
I also know it's possible to join something like MAX(`created_at`) AS latest back to the original table and then filter with WHERE created_at = latest to get what's needed, but that doesn't look very efficient.
Any better ideas for expressing this kind of logic in BigQuery?

Consider below approach
select id, due_month, status
from your_table
where true
qualify 1 = row_number() over win
window win as (partition by id, due_month order by created_at desc)
Applied to the sample data in your question, this returns the expected output listed in the question.
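For anyone who wants to check the logic without a BigQuery project, here is a minimal sketch using Python's built-in sqlite3 module. It assumes the bundled SQLite is 3.25 or newer (needed for window functions). SQLite has no QUALIFY clause, so the row_number() filter is moved into an outer WHERE; the table name some_table is taken from the question, and the helper name latest_status_per_group is made up for illustration.

```python
import sqlite3

# Sample rows from the question: (id, created_at, due_month, status).
ROWS = [
    (1, "2021-02-05", "2021-02", "paused"),
    (1, "2021-01-31", "2021-02", "normal"),
    (1, "2021-01-15", "2021-01", "normal"),
    (2, "2021-03-18", "2021-03", "normal"),
    (2, "2021-03-07", "2021-03", "paused"),
    (2, "2021-03-31", "2021-08", "normal"),
]

# Same idea as the QUALIFY query: number the rows of each (id, due_month)
# group from newest to oldest and keep only row 1.
QUERY = """
SELECT id, due_month, status
FROM (
    SELECT id, due_month, status,
           ROW_NUMBER() OVER (
               PARTITION BY id, due_month ORDER BY created_at DESC
           ) AS rn
    FROM some_table
)
WHERE rn = 1
ORDER BY id, due_month
"""


def latest_status_per_group(rows):
    """Load the rows into an in-memory table and run the top-1-per-group query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE some_table "
        "(id INTEGER, created_at TEXT, due_month TEXT, status TEXT)"
    )
    conn.executemany("INSERT INTO some_table VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = latest_status_per_group(ROWS)
for row in result:
    print(row)
```

The subquery form is only a stand-in for engines without QUALIFY; in BigQuery the QUALIFY version above is shorter and does the same filtering.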


redshift cumulative count records via SQL

I've been struggling to find an answer to this question. I think this question is similar to what I'm looking for, but when I tried it, it didn't work.
Because there's no new unique user_id added between 02-20 and 02-27, the cumulative count will be the same. Then for 02-27, there is a unique user_id which hasn't appeared on any previous dates (6)
Here's my input
date user_id
2020-02-20 1
2020-02-20 2
2020-02-20 3
2020-02-20 4
2020-02-20 4
2020-02-20 5
2020-02-21 1
2020-02-22 2
2020-02-23 3
2020-02-24 4
2020-02-25 4
2020-02-27 6
Output table:
date daily_cumulative_count
2020-02-20 5
2020-02-21 5
2020-02-22 5
2020-02-23 5
2020-02-24 5
2020-02-25 5
2020-02-27 6
This is what I tried, and the result is not quite what I want:
select
stat_date,count(DISTINCT user_id),
sum(count(DISTINCT user_id)) over (order by stat_date rows unbounded preceding) as cumulative_signups
from data_engineer_interview
group by stat_date
order by stat_date
it returns this instead;
date,count,cumulative_sum
2022-02-20,5,5
2022-02-21,1,6
2022-02-22,1,7
2022-02-23,1,8
2022-02-24,1,9
2022-02-25,1,10
2022-02-27,1,11
This task could be done by comparing each row with all previous rows to see if there is a match on user_id. Since you are using Redshift, I'll assume your table could be very large, so attacking the problem this way will bog down in some form of loop join.
You want to think about the problem differently to avoid this looping issue. If you derive a dataset with id and first_date_of_id you can then just do a cumulative sum sorted by date. Like this
select user_id, min("date") as first_date,
count(user_id) over (order by first_date rows unbounded preceding) as date_out
from data_engineer_interview
group by user_id
order by date_out;
This is untested and won't produce the full list of dates that you have in your example output but rather only the dates where new ids show up. If this is an issue it is simple to add in the additional dates with no count change.
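To make the "first date per id, then cumulative sum" idea concrete, here is a small sketch run through Python's sqlite3 module (assuming SQLite 3.25+ for window functions) rather than Redshift. It also adds back the dates on which no new user appears by left joining the first-seen dates onto the list of distinct dates. The column name stat_date follows the asker's query; the CTE names first_seen and new_per_day are made up for this example.

```python
import sqlite3

# Input rows from the question: (stat_date, user_id).
ROWS = [
    ("2020-02-20", 1), ("2020-02-20", 2), ("2020-02-20", 3),
    ("2020-02-20", 4), ("2020-02-20", 4), ("2020-02-20", 5),
    ("2020-02-21", 1), ("2020-02-22", 2), ("2020-02-23", 3),
    ("2020-02-24", 4), ("2020-02-25", 4), ("2020-02-27", 6),
]

QUERY = """
WITH first_seen AS (
    -- one row per user: the first date the user shows up
    SELECT user_id, MIN(stat_date) AS first_date
    FROM data_engineer_interview
    GROUP BY user_id
),
new_per_day AS (
    -- every date in the data, with the number of users first seen that day
    SELECT d.stat_date, COUNT(f.user_id) AS new_users
    FROM (SELECT DISTINCT stat_date FROM data_engineer_interview) d
    LEFT JOIN first_seen f ON f.first_date = d.stat_date
    GROUP BY d.stat_date
)
SELECT stat_date,
       SUM(new_users) OVER (
           ORDER BY stat_date ROWS UNBOUNDED PRECEDING
       ) AS daily_cumulative_count
FROM new_per_day
ORDER BY stat_date
"""


def cumulative_unique_users(rows):
    """Return (date, cumulative count of distinct users seen so far)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data_engineer_interview (stat_date TEXT, user_id INTEGER)"
    )
    conn.executemany("INSERT INTO data_engineer_interview VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = cumulative_unique_users(ROWS)
for row in result:
    print(row)
```

Each user is counted exactly once, on the first date it appears, so the running sum never double counts returning users.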
We can do this via a correlated subquery followed by aggregation:
WITH cte AS (
SELECT
date,
CASE WHEN EXISTS (
SELECT 1
FROM data_engineer_interview d2
WHERE d2.date < d1.date AND
d2.user_id = d1.user_id
) THEN 0 ELSE 1 END AS flag
FROM (SELECT DISTINCT date, user_id FROM data_engineer_interview) d1
)
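-- SUM(flag) per date counts that day's new users; the window sum accumulates them
SELECT date,
SUM(SUM(flag)) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING) AS daily_cumulative_count
FROM cte
GROUP BY date
ORDER BY date;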

Counting subscriber numbers given events on SQL

I have a dataset on mysql in the following format, showing the history of events given some client IDs:
Base Data
Text of the dataset (subscriber_table):
user_id type created_at
A past_due 2021-03-27 10:15:56
A reactivate 2021-02-06 10:21:35
A past_due 2021-01-27 10:30:41
A new 2020-10-28 18:53:07
A cancel 2020-07-22 9:48:54
A reactivate 2020-07-22 9:48:53
A cancel 2020-07-15 2:53:05
A new 2020-06-20 20:24:18
B reactivate 2020-06-14 10:57:50
B past_due 2020-06-14 10:33:21
B new 2020-06-11 10:21:24
date_table:
full_date
2020-05-01
2020-06-01
2020-07-01
2020-08-01
2020-09-01
2020-10-01
2020-11-01
2020-12-01
2021-01-01
2021-02-01
2021-03-01
I have been struggling to come up with a query that counts subscribers for a range of months. Some of those months are not necessarily present in the event table, either because the client is still subscribed or because they cancelled and later resubscribed. The output I am looking for is this:
Output
date subscriber_count
2020-05-01 0
2020-06-01 2
2020-07-01 2
2020-08-01 1
2020-09-01 1
2020-10-01 2
2020-11-01 2
2020-12-01 2
2021-01-01 2
2021-02-01 2
2021-03-01 2
Reactivation and Past Due events do not change the subscription status of the client; only the Cancel and New events do. If the client cancels in a month, they should still be counted as active for that month.
My initial approach was to get the latest entry given a month per subscriber ID and then join them to the premade date table, but when I have months missing I am unsure on how to fill them with the correct status. Maybe a lag function?
with last_record_per_month as (
select
date_trunc('month', created_at)::date order by created_at) as month_year ,
user_id ,
type,
created_at as created_at
from
subscriber_table
where
user_id in ('A', 'B')
order by
created_at desc
), final as (
select
month_year,
created_at,
type
from
last_record_per_month lrpm
right join (
select
date_trunc('month', full_date)::date as month_year
from
date_table
where
full_date between '2020-05-01' and '2021-03-31'
group by
1
order by
1
) dd
on lrpm.created_at = dd.month_year
and num = 1
order by
month_year
)
select
*
from
final
I do have a premade base table with every single date in many years to use as a joining table
Any help with this is GREATLY appreciated
Thanks!
The approach here is to have the subscriber rows with new connections as base and map them to the cancelled rows using a self join. Then have the date tables as base and aggregate them based on the number of users to get the result.
SELECT full_date, COUNT(DISTINCT user_id) FROM date_tbl
LEFT JOIN(
SELECT new.user_id,new.type,new.created_at created_at_new,
IFNULL(cancel.created_at,CURRENT_DATE) created_at_cancel
FROM subscriber new
LEFT JOIN subscriber cancel
ON new.user_id=cancel.user_id
AND new.type='new' AND cancel.type='cancel'
AND new.created_at<= cancel.created_at
WHERE new.type IN('new'))s
ON DATE_FORMAT(s.created_at_new, '%Y-%m')<=DATE_FORMAT(full_date, '%Y-%m')
AND DATE_FORMAT(s.created_at_cancel, '%Y-%m')>=DATE_FORMAT(full_date, '%Y-%m')
GROUP BY 1
Let me break down the query:
First, we self join the subscriber table on user_id, keeping 'new' rows on the left side and 'cancel' rows on the right side: new.type='new' AND cancel.type='cancel'
A new event should always precede its cancel event, hence the condition new.created_at<= cancel.created_at
Since we only care about the 'new' rows in the base table, the WHERE clause keeps only those: new.type IN('new'). The subquery then yields one row per pairing of a 'new' event with a later 'cancel' event, or with the current date when there is no later cancel.
We then left join this subquery to the date table such that the year and month of created_at_new is less than or equal to that of full_date, DATE_FORMAT(s.created_at_new, '%Y-%m')<=DATE_FORMAT(full_date, '%Y-%m'), and the year and month of created_at_cancel is greater than or equal to that of full_date.
Lastly, we aggregate by full_date and count the distinct users.
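One thing to keep in mind: the self join pairs a 'new' row with every later 'cancel' row of the same user, which works for the sample data but could bridge a gap if a user subscribes, cancels, resubscribes, and cancels again in a much later month. As a hedged variation (not the query above), the sketch below pairs each 'new' event only with the next cancel via a correlated MIN() subquery. It runs on Python's sqlite3 module, so substr() stands in for MySQL's DATE_FORMAT, the timestamps are zero-padded so that string comparison orders them correctly, and the CTE name spans is made up for illustration.

```python
import sqlite3

# Events from the question (hours zero-padded): (user_id, type, created_at).
EVENTS = [
    ("A", "past_due", "2021-03-27 10:15:56"),
    ("A", "reactivate", "2021-02-06 10:21:35"),
    ("A", "past_due", "2021-01-27 10:30:41"),
    ("A", "new", "2020-10-28 18:53:07"),
    ("A", "cancel", "2020-07-22 09:48:54"),
    ("A", "reactivate", "2020-07-22 09:48:53"),
    ("A", "cancel", "2020-07-15 02:53:05"),
    ("A", "new", "2020-06-20 20:24:18"),
    ("B", "reactivate", "2020-06-14 10:57:50"),
    ("B", "past_due", "2020-06-14 10:33:21"),
    ("B", "new", "2020-06-11 10:21:24"),
]
DATES = [
    "2020-05-01", "2020-06-01", "2020-07-01", "2020-08-01", "2020-09-01",
    "2020-10-01", "2020-11-01", "2020-12-01", "2021-01-01", "2021-02-01",
    "2021-03-01",
]

QUERY = """
WITH spans AS (
    -- one subscription span per 'new' event, ending at the NEXT cancel
    SELECT n.user_id,
           substr(n.created_at, 1, 7) AS start_month,
           (SELECT substr(MIN(c.created_at), 1, 7)
            FROM subscriber_table c
            WHERE c.user_id = n.user_id
              AND c.type = 'cancel'
              AND c.created_at >= n.created_at) AS end_month  -- NULL = still active
    FROM subscriber_table n
    WHERE n.type = 'new'
)
SELECT d.full_date, COUNT(DISTINCT s.user_id) AS subscriber_count
FROM date_table d
LEFT JOIN spans s
  ON s.start_month <= substr(d.full_date, 1, 7)
 AND (s.end_month IS NULL OR s.end_month >= substr(d.full_date, 1, 7))
GROUP BY d.full_date
ORDER BY d.full_date
"""


def monthly_subscriber_counts(events, dates):
    """Count users whose subscription span covers each month in the date table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subscriber_table (user_id TEXT, type TEXT, created_at TEXT)"
    )
    conn.execute("CREATE TABLE date_table (full_date TEXT)")
    conn.executemany("INSERT INTO subscriber_table VALUES (?, ?, ?)", events)
    conn.executemany("INSERT INTO date_table VALUES (?)", [(d,) for d in dates])
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = monthly_subscriber_counts(EVENTS, DATES)
for row in result:
    print(row)
```

On the sample data this gives the same counts as the output requested in the question; the difference only shows up for users with more than one subscribe/cancel cycle.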

Multiple scenario in where clause

I have the following data:
Invoice | Status | StatusDate
1111111 BackOrd null
1111111 Delivd 2020-01-01
2222222 BackOrd null
3333333 Delivd 2020-02-29
In the data above, invoice 1111111 was on BackOrd at one time and has now been delivered, 2222222 is currently on BackOrd, and 3333333 was never on BackOrd and was delivered. 2222222 and 3333333 are easy, but 1111111 is vexing me because I only want to show its current status of Delivd.
I've tried
where case when StatusDate is null then 'BackOrd' else 'Delivd' end = Status
and various iterations; however, invoice 1111111 brings back both rows, since (sure enough) it was in both statuses at one time. I feel like this shouldn't be that hard, and maybe I haven't had enough coffee, but something isn't making sense to me.
You want the latest row per invoice, so this is a top-1-per-group problem. You can use window functions:
select *
from (
select t.*,
row_number() over(partition by invoice order by statusdate desc) rn
from mytable t
) t
where rn = 1
This works because SQL Server puts null values last when using a descending sort.
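Since the default position of nulls in a descending sort differs between engines (PostgreSQL and Oracle, for example, put them first), a more portable variant is to make the null ordering explicit with a CASE expression. Below is a minimal sketch of that variant, run through Python's sqlite3 module (assuming SQLite 3.25+ for window functions); the table name mytable comes from the query above and the helper name current_status is made up for illustration.

```python
import sqlite3

# Rows from the question: (invoice, status, statusdate); None becomes SQL NULL.
ROWS = [
    (1111111, "BackOrd", None),
    (1111111, "Delivd", "2020-01-01"),
    (2222222, "BackOrd", None),
    (3333333, "Delivd", "2020-02-29"),
]

# The CASE term sorts dated rows (0) ahead of undated rows (1), so the result
# does not depend on where the engine places NULLs by default.
QUERY = """
SELECT invoice, status, statusdate
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (
               PARTITION BY invoice
               ORDER BY CASE WHEN statusdate IS NULL THEN 1 ELSE 0 END,
                        statusdate DESC
           ) AS rn
    FROM mytable t
)
WHERE rn = 1
ORDER BY invoice
"""


def current_status(rows):
    """Return one row per invoice: the latest dated row, else the undated one."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mytable (invoice INTEGER, status TEXT, statusdate TEXT)"
    )
    conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = current_status(ROWS)
for row in result:
    print(row)
```

Invoice 1111111 comes back once with Delivd, while 2222222 keeps its only row (BackOrd with a null date).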

Calculating time with datetime by groups

I have two tables Tickets and Tasks. When ticket is registered then it appears in Tickets table and every action that is made with the ticket is saved in the Tasks table. Tickets table includes information like who created the ticket, start and end dates (if it is closed) etc. Tasks table looks like this:
ID Ticket_ID Task_type_ID Task_type Group_ID Submit_Date
1 120 1 Opened 3 2016-12-09 11:10:22.000
2 120 2 Assign 4 2016-12-09 12:10:22.000
3 120 3 Paused 4 2016-12-09 12:30:22.000
4 120 4 Unpause 4 2016-12-10 10:30:22.000
5 120 2 Assign 6 2016-12-12 10:30:22.000
6 120 2 Assign 7 2016-12-12 15:30:22.000
7 120 5 Modify NULL 2016-12-13 15:30:22.000
8 120 6 Closed NULL 2016-12-13 16:30:22.000
I would like to calculate how long each group took to complete its task. The start time is when the ticket was assigned to a given group, and the end time is when that group completes its task (by assigning it elsewhere or closing it). The paused time (Task_type_ID 3 to 4) should not be included. Also, when a ticket is assigned to another group, the new group ID appears in the previous task/row. If the ticket goes through multiple groups, the query should calculate how long the ticket was in the hands of every group.
I know it is complicated but maybe someone has an idea that I can start to build from.
This is quite a sophisticated gaps-and-islands problem.
Here is one approach at it:
select distinct
ticket_id,
group_id,
sum(sum(datediff(minute, submit_date, lead_submit_date)))
over(partition by ticket_id, group_id) elapsed_minutes
from (
select
t.*,
row_number() over(partition by ticket_id order by submit_date) rn1,
row_number() over(partition by ticket_id, group_id order by submit_date) rn2,
lead(submit_date) over(partition by ticket_id order by submit_date) lead_submit_date
from mytable t
) t
where task_type <> 'Paused' and group_id is not null
group by ticket_id, group_id, rn1 - rn2
In the subquery, we assign row numbers to records within two different partitions (by tickets vs by ticket and group), and recover the date of the next record with lead().
We can then use the difference between the row numbers to build groups of "adjacent" records (where the ticket stays in the same group), while not taking into account periods when the ticket was paused. Aggregation comes into play here.
The final step is to compute the overall time spent in each group : this handles the case when a ticket is assigned to the same group more than once during its lifecycle (although that's not showing in your sample data, the description of the question makes it sound like that may happen). We could do this with another level of aggregation but I went for a window sum and distinct, which avoids adding one more level of nesting to the query.
Executing the subquery independently might help understanding the logic better (see the below db fiddle).
For your sample data, the query yields:
ticket_id | group_id | elapsed_minutes
--------: | -------: | --------------:
120 | 3 | 60
120 | 4 | 2900
120 | 6 | 300
120 | 7 | 1440
I actually think this is pretty simple. Just use lead() to get the next submit time value and aggregate by the ticket and group ignoring pauses:
select ticket_id, group_id, sum(dur_sec)
from (select t.*,
datediff(second, submit_date, lead(submit_date) over (partition by ticket_id order by submit_date)) as dur_sec
from mytable t
) t
where task_type <> 'Paused' and group_id is not null
group by ticket_id, group_id;
Here is a db<>fiddle (with thanks to GMB for creating the original fiddle).
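With the fiddle not to hand, the lead() approach can be checked against the sample rows with a small sketch on Python's sqlite3 module (assuming SQLite 3.25+ for window functions). SQLite has no datediff(), so the difference of strftime('%s', ...) values stands in for it; the table name tasks and the helper name seconds_per_group are assumptions made for this example.

```python
import sqlite3

# Sample rows from the question:
# (id, ticket_id, task_type_id, task_type, group_id, submit_date).
ROWS = [
    (1, 120, 1, "Opened", 3, "2016-12-09 11:10:22"),
    (2, 120, 2, "Assign", 4, "2016-12-09 12:10:22"),
    (3, 120, 3, "Paused", 4, "2016-12-09 12:30:22"),
    (4, 120, 4, "Unpause", 4, "2016-12-10 10:30:22"),
    (5, 120, 2, "Assign", 6, "2016-12-12 10:30:22"),
    (6, 120, 2, "Assign", 7, "2016-12-12 15:30:22"),
    (7, 120, 5, "Modify", None, "2016-12-13 15:30:22"),
    (8, 120, 6, "Closed", None, "2016-12-13 16:30:22"),
]

# Each row's duration runs until the next event of the same ticket; paused
# rows and rows without a group are dropped before summing per group.
QUERY = """
SELECT ticket_id, group_id, SUM(dur_sec) AS total_sec
FROM (
    SELECT ticket_id, group_id, task_type,
           CAST(strftime('%s', LEAD(submit_date) OVER (
                    PARTITION BY ticket_id ORDER BY submit_date)) AS INTEGER)
           - CAST(strftime('%s', submit_date) AS INTEGER) AS dur_sec
    FROM tasks
)
WHERE task_type <> 'Paused' AND group_id IS NOT NULL
GROUP BY ticket_id, group_id
ORDER BY ticket_id, group_id
"""


def seconds_per_group(rows):
    """Return (ticket_id, group_id, seconds the ticket spent with that group)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id INTEGER, ticket_id INTEGER, task_type_id INTEGER, "
        "task_type TEXT, group_id INTEGER, submit_date TEXT)"
    )
    conn.executemany("INSERT INTO tasks VALUES (?, ?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = seconds_per_group(ROWS)
for ticket_id, group_id, total_sec in result:
    print(ticket_id, group_id, total_sec // 60, "minutes")
```

Converted to minutes, the totals agree with the 60 / 2900 / 300 / 1440 figures in the first answer.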

SQL Calculating time from last transaction for each ID

Hello I'm stuck trying to calculate the difference in time between each transaction for each ID.
The data looks like
Customer_ID | Transaction_Time
1 00:30
1 00:35
1 00:37
1 00:38
2 00:20
2 00:21
2 00:23
I'm trying to get the result to look something like
Customer_ID | Time_diff
1 5
1 2
1 1
2 1
2 2
I would really appreciate any help.
Thanks
Most databases support the LAG() function. However, the date/time functions can depend on the database. Here is an example for SQL Server:
select t.*
from (select t.*,
datediff(second,
lag(transaction_time) over (partition by customer_id order by transaction_time),
transaction_time
) as diff
from t
) t
where diff is not null;
The logic would be similar in most databases, although the function for calculating the time difference varies.
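As an illustration of how the same LAG() logic carries over to another engine, here is a minimal sketch on Python's sqlite3 module (assuming SQLite 3.25+ for window functions). It assumes the sample times are HH:MM values and that the wanted difference is in minutes, which the question does not state explicitly; strftime('%s', ...) replaces datediff(), and the helper name time_diffs is made up for illustration.

```python
import sqlite3

# Sample rows from the question: (customer_id, transaction_time as HH:MM).
ROWS = [
    (1, "00:30"), (1, "00:35"), (1, "00:37"), (1, "00:38"),
    (2, "00:20"), (2, "00:21"), (2, "00:23"),
]

# LAG() fetches the previous transaction time of the same customer; the first
# transaction of each customer has no predecessor and is filtered out.
QUERY = """
SELECT customer_id, diff_min
FROM (
    SELECT customer_id, transaction_time,
           (CAST(strftime('%s', transaction_time) AS INTEGER)
            - CAST(strftime('%s', LAG(transaction_time) OVER (
                       PARTITION BY customer_id
                       ORDER BY transaction_time)) AS INTEGER)) / 60 AS diff_min
    FROM t
)
WHERE diff_min IS NOT NULL
ORDER BY customer_id, transaction_time
"""


def time_diffs(rows):
    """Return (customer_id, minutes since that customer's previous transaction)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (customer_id INTEGER, transaction_time TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = time_diffs(ROWS)
for row in result:
    print(row)
```

If the times are really minutes and seconds, dropping the division by 60 and adjusting the parsing would be the only change needed.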