I have two tables, Tickets and Tasks. When a ticket is registered, it appears in the Tickets table, and every action made on the ticket is saved in the Tasks table. The Tickets table includes information like who created the ticket, start and end dates (if it is closed), etc. The Tasks table looks like this:
ID | Ticket_ID | Task_type_ID | Task_type | Group_ID | Submit_Date
-- | --------- | ------------ | --------- | -------- | -----------------------
1  | 120       | 1            | Opened    | 3        | 2016-12-09 11:10:22.000
2  | 120       | 2            | Assign    | 4        | 2016-12-09 12:10:22.000
3  | 120       | 3            | Paused    | 4        | 2016-12-09 12:30:22.000
4  | 120       | 4            | Unpause   | 4        | 2016-12-10 10:30:22.000
5  | 120       | 2            | Assign    | 6        | 2016-12-12 10:30:22.000
6  | 120       | 2            | Assign    | 7        | 2016-12-12 15:30:22.000
7  | 120       | 5            | Modify    | NULL     | 2016-12-13 15:30:22.000
8  | 120       | 6            | Closed    | NULL     | 2016-12-13 16:30:22.000
I would like to calculate how long each group took to complete its task. The start time is when the ticket was assigned to a certain group, and the end time is when that group completes its task (by assigning it elsewhere or closing it). It should not include the paused time (Task_type_ID 3 to 4). Also, when a ticket is assigned to another group, the new group ID appears in the previous task/row. If the ticket goes through multiple groups, it should calculate how long the ticket was in the hands of every group.
I know it is complicated, but maybe someone has an idea that I can start to build from.
This is quite a sophisticated gaps-and-islands problem.
Here is one approach:
select distinct
    ticket_id,
    group_id,
    sum(sum(datediff(minute, submit_date, lead_submit_date)))
        over(partition by ticket_id, group_id) minutes_elapsed
from (
    select
        t.*,
        row_number() over(partition by ticket_id order by submit_date) rn1,
        row_number() over(partition by ticket_id, group_id order by submit_date) rn2,
        lead(submit_date) over(partition by ticket_id order by submit_date) lead_submit_date
    from mytable t
) t
where task_type <> 'Paused' and group_id is not null
group by ticket_id, group_id, rn1 - rn2
In the subquery, we assign row numbers to records within two different partitions (by ticket vs. by ticket and group), and recover the date of the next record with lead().
We can then use the difference between the row numbers to build groups of "adjacent" records (where the ticket stays in the same group), while leaving out the periods when the ticket was paused. Aggregation comes into play here.
The final step is to compute the overall time spent in each group: this handles the case when a ticket is assigned to the same group more than once during its lifecycle (although that doesn't show in your sample data, the description of the question makes it sound like it may happen). We could do this with another level of aggregation, but I went for a window sum and distinct, which avoids adding one more level of nesting to the query; the nested alternative is sketched below.
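For comparison, here is a sketch of the equivalent version that uses another level of aggregation instead of distinct and a window sum (same logic, one more level of nesting; the outer query sums the per-island minutes):
select ticket_id, group_id, sum(island_minutes) minutes_elapsed
from (
    select ticket_id, group_id,
        sum(datediff(minute, submit_date, lead_submit_date)) island_minutes
    from (
        select
            t.*,
            row_number() over(partition by ticket_id order by submit_date) rn1,
            row_number() over(partition by ticket_id, group_id order by submit_date) rn2,
            lead(submit_date) over(partition by ticket_id order by submit_date) lead_submit_date
        from mytable t
    ) t
    where task_type <> 'Paused' and group_id is not null
    group by ticket_id, group_id, rn1 - rn2
) islands
group by ticket_id, group_id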
Executing the subquery independently might help in understanding the logic (see the db fiddle below).
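For reference, here is the subquery on its own:
select
    t.*,
    row_number() over(partition by ticket_id order by submit_date) rn1,
    row_number() over(partition by ticket_id, group_id order by submit_date) rn2,
    lead(submit_date) over(partition by ticket_id order by submit_date) lead_submit_date
from mytable t
For the sample rows above, rn1 runs from 1 to 8, rn2 restarts within each (ticket, group) partition, and rn1 - rn2 takes the values 0, 1, 1, 1, 4, 5, 6, 6: each stretch of consecutive rows that stays with the same group gets its own value.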
For your sample data, the query yields:
ticket_id | group_id | minutes_elapsed
--------: | -------: | --------------:
120 | 3 | 60
120 | 4 | 2900
120 | 6 | 300
120 | 7 | 1440
I actually think this is pretty simple. Just use lead() to get the next submit time value, and aggregate by the ticket and group, ignoring pauses:
select ticket_id, group_id, sum(dur_sec) as total_dur_sec
from (select t.*,
             datediff(second, submit_date, lead(submit_date) over (partition by ticket_id order by submit_date)) as dur_sec
      from mytable t
     ) t
where task_type <> 'Paused' and group_id is not null
group by ticket_id, group_id;
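For the sample data, this returns the same durations as above, only in seconds: 3600, 174000, 18000, and 86400 for groups 3, 4, 6, and 7 respectively.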
Here is a db<>fiddle (with thanks to GMB for creating the original fiddle).
I've been struggling to find an answer to this question. I think this question is similar to what I'm looking for, but when I tried it, it didn't work.
Because no new unique user_id is added between 02-20 and 02-27, the cumulative count stays the same. Then on 02-27 there is a user_id (6) which hasn't appeared on any previous date.
Here's my input:
date       | user_id
---------- | -------
2020-02-20 | 1
2020-02-20 | 2
2020-02-20 | 3
2020-02-20 | 4
2020-02-20 | 4
2020-02-20 | 5
2020-02-21 | 1
2020-02-22 | 2
2020-02-23 | 3
2020-02-24 | 4
2020-02-25 | 4
2020-02-27 | 6
Output table:
date       | daily_cumulative_count
---------- | ----------------------
2020-02-20 | 5
2020-02-21 | 5
2020-02-22 | 5
2020-02-23 | 5
2020-02-24 | 5
2020-02-25 | 5
2020-02-27 | 6
This is what I tried, and the result is not quite what I want:
select
    stat_date,
    count(DISTINCT user_id),
    sum(count(DISTINCT user_id)) over (order by stat_date rows unbounded preceding) as cumulative_signups
from data_engineer_interview
group by stat_date
order by stat_date
It returns this instead:
date       | count | cumulative_sum
---------- | ----- | --------------
2020-02-20 | 5     | 5
2020-02-21 | 1     | 6
2020-02-22 | 1     | 7
2020-02-23 | 1     | 8
2020-02-24 | 1     | 9
2020-02-25 | 1     | 10
2020-02-27 | 1     | 11
The problem with this task is that it could be done by comparing each row with all previous rows to see if there is a match in user_id. Since you are using Redshift, I'll assume that your data table could be very large, so attacking the problem this way will bog down in some form of a loop join.
You want to think about the problem differently to avoid this looping issue. If you derive a dataset with id and first_date_of_id, you can then just do a cumulative sum sorted by date, like this:
select user_id, min("date") as first_date,
count(user_id) over (order by first_date rows unbounded preceding) as date_out
from data_engineer_interview
group by user_id
order by date_out;
This is untested and won't produce the full list of dates that you have in your example output, only the dates where new ids show up. If that is an issue, it is simple to add in the additional dates with no count change; a sketch follows.
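For example, here is one minimal, untested way to do that fill-in, assuming the same table and columns ("date", user_id) as in the question: count the new users per first date, then run the cumulative sum over all distinct dates.
select d."date",
       sum(coalesce(n.new_users, 0)) over (order by d."date" rows unbounded preceding) as daily_cumulative_count
from (select distinct "date" from data_engineer_interview) d
left join (select first_date, count(*) as new_users
           from (select user_id, min("date") as first_date
                 from data_engineer_interview
                 group by user_id) u
           group by first_date) n
  on n.first_date = d."date"
order by d."date";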
We can do this via a correlated subquery, followed by aggregation and a cumulative sum:
WITH cte AS (
    SELECT
        date,
        CASE WHEN EXISTS (
            SELECT 1
            FROM data_engineer_interview d2
            WHERE d2.date < d1.date AND
                  d2.user_id = d1.user_id
        ) THEN 0 ELSE 1 END AS flag
    FROM (SELECT DISTINCT date, user_id FROM data_engineer_interview) d1
)
SELECT date, SUM(SUM(flag)) OVER (ORDER BY date) AS daily_cumulative_count
FROM cte
GROUP BY date
ORDER BY date;
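For the sample data, flag is 1 for the five users first seen on 2020-02-20 and for user 6 on 2020-02-27, so the cumulative sums come out as 5, 5, 5, 5, 5, 5, 6, matching the desired output.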
I am working on an accounting system in which there is a way to revert transactions which were made by mistake.
There are processes which run on invoices and generate transactions. One process can generate multiple transactions for an invoice, and multiple processes can be run on an invoice.
The schema looks as follows:
Transactions
=========================================================
Id | InvoiceId | InvoiceProcessType | Amount | CreatedOn
-- | --------- | ------------------ | ------ | ---------
1  | 1         | 23                 | 10.00  | Today
2  | 1         | 23                 | 13.00  | Today
3  | 1         | 23                 | 17.00  | Yesterday
4  | 1         | 23                 | 32.00  | Yesterday
Now, 1 and 2 happened together, and 3 and 4 happened together. I want to revert the latter (3, 4). What would be a possible solution to group them?
One possible solution is to add a column ProcessCount which is incremented on every process. The new schema would look as follows:
Transactions
========================================================================
Id | InvoiceId | InvoiceProcessType | Amount | CreatedOn | ProcessCount
-- | --------- | ------------------ | ------ | --------- | ------------
1  | 1         | 23                 | 10.00  | Today     | 1
2  | 1         | 23                 | 13.00  | Today     | 1
3  | 1         | 23                 | 17.00  | Yesterday | 2
4  | 1         | 23                 | 32.00  | Yesterday | 2
Is there any other way I can implement this?
TIA
If you are basing the batching on an arbitrary time frame between the createdon date/time values, then you can use lag() and a cumulative sum. For instance, if two rows belong to the same batch when they are within an hour of each other, then:
select t.*,
       sum(case when prev_createdon > dateadd(hour, -1, createdon) then 0 else 1 end)
           over (partition by invoiceid order by createdon, id) as processcount
from (select t.*,
             lag(createdon) over (partition by invoiceid order by createdon, id) as prev_createdon
      from transactions t
     ) t;
That said, it would seem that your processing needs to be enhanced. Each time the code runs, a row should be inserted into some table (say processes). The id generated from that insertion should be used when inserting into transactions. That way, you can keep the information about when -- and who and what and so on -- inserted particular transactions.
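A minimal sketch of that idea, assuming SQL Server to match the dateadd() syntax above (the table and column names are illustrative, not from your schema):
create table processes (
    id      int identity(1, 1) primary key,
    run_on  datetime not null default getdate(),
    run_by  varchar(50) not null
    -- plus whatever else describes the run: process type, parameters, etc.
);
-- each transaction then references the process run that created it
alter table transactions add processid int references processes(id);
Reverting a batch then becomes a delete (or a compensating entry) filtered on a single processid.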
You can use dense_rank() to identify the batches as follows:
select t.*,
dense_rank() over (partition by InvoiceId
order by CreatedOn desc) as ProcessCount
from your_table t
You can then revert (delete) as per your requirement. Note that with the descending order, the most recent batch gets rank 1, which is convenient for reverting the latest process. There is no need to explicitly maintain the ProcessCount column; it can be derived as per the above query.
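For instance, here is a sketch of reverting the most recent batch of invoice 1, assuming a DBMS such as SQL Server that allows deleting through a CTE (your_table stands in for the real table name, as above):
with ranked as (
    select t.*,
           dense_rank() over (partition by InvoiceId
                              order by CreatedOn desc) as ProcessCount
    from your_table t
    where InvoiceId = 1  -- the invoice being reverted
)
delete from ranked
where ProcessCount = 1;  -- rank 1 = the most recent batch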
I have a table of certain failure and restore events for users. I would like to find the time difference between the failure and restore events of every user.
Also, for some failure events there are multiple restore events; I want to find the time diff only between consecutive ones.
And there might be multiple failures before restores. I only want the difference between a consecutive failure and restore (ignoring all leading and trailing extra events for that user).
sr | imei | time                             | event_type
-- | ---- | -------------------------------- | ----------------
1  | 1    | 2020-01-01 14:28:06.269000+00:00 | failure
2  | 1    | 2020-01-01 14:28:29.910000+00:00 | failure_restored
3  | 5    | 2020-01-01 15:24:52.714000+00:00 | failure
4  | 5    | 2020-01-01 15:29:59.045000+00:00 | failure_restored
5  | 6    | 2020-01-01 21:21:32.715000+00:00 | failure_restored
6  | 7    | 2020-01-01 21:48:43.798000+00:00 | failure_restored
7  | 9    | 2020-01-01 22:18:34.112000+00:00 | failure_restored
8  | 9    | 2020-01-01 22:20:16.165000+00:00 | failure
9  | 9    | 2020-01-01 22:25:29.648000+00:00 | failure_restored
I want to find the time diff between failure/failure_restored events for every imei.
For example: between rows 1 and 2, between rows 3 and 4, and between rows 8 and 9; rows 5, 6, and 7 should be ignored.
I could achieve this with pandas and custom-made functions, but I need help doing it in SQL.
You don't specify what to do if there are two "failure"s in a row. If you always know that the next row is a restore, then:
SELECT imei, TIMESTAMP_DIFF(next_time, time, MILLISECOND) as time_diff_in_milliseconds
FROM (SELECT t.*,
LEAD(time) OVER (PARTITION BY imei ORDER BY time) as next_time
FROM t
)
WHERE event_type = 'failure';
If not, then check the next event type as well:
SELECT imei, TIMESTAMP_DIFF(next_time, time, MILLISECOND) as time_diff_in_milliseconds
FROM (SELECT t.*,
LEAD(time) OVER (PARTITION BY imei ORDER BY time) as next_time,
LEAD(event_type) OVER (PARTITION BY imei ORDER BY time) as next_event_type
FROM t
)
WHERE event_type = 'failure' and next_event_type = 'failure_restored';
Below is for BigQuery Standard SQL
#standardSQL
SELECT imei, TIMESTAMP_DIFF(restore_time, time, MILLISECOND) time_diff_in_milliseconds
FROM (
SELECT *, LEAD(IF(event_type = 'failure_restored', time, NULL)) OVER(PARTITION BY imei ORDER BY time) restore_time
FROM `project.dataset.table`
)
WHERE event_type = 'failure'
Applied to the sample data from your question, the result is:
Row | imei | time_diff_in_milliseconds
--- | ---- | -------------------------
1   | 1    | 23641
2   | 5    | 306331
3   | 9    | 313483
Following on from a previous question, I have a table called orders with information regarding the time an order was placed and who made that order.
order_timestamp user_id
-------------------- ---------
1-JUN-20 02.56.12 123
3-JUN-20 12.01.01 533
23-JUN-20 08.42.18 123
12-JUN-20 02.53.59 238
19-JUN-20 02.33.72 34
I would like to calculate a daily rolling count of the number of days a user made an order in the past 10 days.
For example, in the last 10 days from the 20th of June, user 34 made an order on 5 of those days. Then in the last 10 days from the 21st of June, user 34 made an order on 6 of those days.
In the end, the table should look like this:
date user_id no_of_days
----------- --------- ------------
20-JUN-20 34 5
20-JUN-20 123 10
20-JUN-20 533 2
20-JUN-20 238 3
21-JUN-20 34 6
21-JUN-20 123 10
How would the query be written for this kind of analysis?
Please let me know if my question is unclear or more info is required.
Thanks in advance.
You can use window functions for this. Start by getting one row per user per day, and then use a rolling sum:
select day, user_id,
       count(*) over (partition by user_id
                      order by day
                      range between interval '10' day preceding and current row) as no_of_days
from (select distinct trunc(order_timestamp) as day, user_id
      from t
     ) t
Assuming that a user places one order a day maximum, you can use window functions as follows:
select
t.*,
count(*) over(partition by user_id order by trunc(order_timestamp) range 10 preceding) no_of_days
from mytable t
Otherwise, you can get the distinct orders per day first:
select
order_day,
user_id,
count(*) over(partition by user_id order by order_day range 10 preceding) no_of_days
from (select distinct trunc(order_timestamp) order_day, user_id from mytable) t
The question seems to be quite difficult, so I wonder if I could get some advice here. I am trying to solve this with SQLite 3. I have data in this format:
customer | purchase date
-------- | -------------
1        | date 1
1        | date 2
1        | date 3
2        | date 4
2        | date 5
2        | date 6
2        | date 7
The number of times a customer repeats is random.
I just want to find whether customer 1's 1st and 2nd purchase dates fall within a specific time period, then repeat that for the other customers; only the 1st and 2nd dates need to be considered.
Any help would be appreciated!
We can try using ROW_NUMBER here:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY customer ORDER BY "purchase date") rn
FROM yourTable
)
SELECT
customer,
CAST(MAX(CASE WHEN rn = 2 THEN julianday("purchase date") END) -
MAX(CASE WHEN rn = 1 THEN julianday("purchase date") END) AS INTEGER) AS diff_in_days
FROM cte
GROUP BY
customer;
The idea here is to aggregate by customer and then take the date difference between the second and first purchases. ROW_NUMBER is used to find these first and second purchases for each customer.
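If you then need to check whether that gap falls within a specific period, say 30 days (a hypothetical threshold, since the question doesn't name one), you can wrap the same query:
WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY customer ORDER BY "purchase date") rn
    FROM yourTable
),
diffs AS (
    SELECT
        customer,
        CAST(MAX(CASE WHEN rn = 2 THEN julianday("purchase date") END) -
             MAX(CASE WHEN rn = 1 THEN julianday("purchase date") END) AS INTEGER) AS diff_in_days
    FROM cte
    GROUP BY customer
)
SELECT customer, diff_in_days,
       diff_in_days BETWEEN 0 AND 30 AS within_period  -- 1 if within the period, 0 if not, NULL if there is no 2nd purchase
FROM diffs;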