I have a Hive table with some data and i would like to split it in to 15 minutes intervals et return the total call duration for every interval
Hive Table example :
ID Start End Total Duration
1 1502296261 1502325061 28800
My output should be shown as :
ID Interval Duration
1 2017-08-09 18:30:00 839
1 2017-08-09 18:45:00 900
1 2017-08-09 19:00:00 900
...
1 2017-08-10 02:15:00 900
1 2017-08-10 02:30:00 61
What is the best solution to do that in a efficient way ?
Thanks.
This is the basic solution.
The displayed timestamp (Interval) depends on your system timezone.
with t as (select stack(1,1,1502296261,1502325061) as (`ID`,`Start`,`End`))
select t.`ID` as `ID`
,from_unixtime((t.`Start` div (15*60) + pe.pos)*(15*60)) as `Interval`
, case
when pe.pos = t.`End` div (15*60) - t.`Start` div (15*60)
then t.`End`
else (t.`Start` div (15*60) + pe.pos + 1)*(15*60)
end
- case
when pe.pos = 0
then t.`Start`
else (t.`Start` div (15*60) + pe.pos)*(15*60)
end as `Duration`
from t
lateral view
posexplode(split(space(int(t.`End` div (15*60) - t.`Start` div (15*60))),' ')) pe
;
+----+---------------------+----------+
| id | interval | duration |
+----+---------------------+----------+
| 1 | 2017-08-09 09:30:00 | 839 |
| 1 | 2017-08-09 09:45:00 | 900 |
| 1 | 2017-08-09 10:00:00 | 900 |
| 1 | 2017-08-09 10:15:00 | 900 |
| 1 | 2017-08-09 10:30:00 | 900 |
| 1 | 2017-08-09 10:45:00 | 900 |
| 1 | 2017-08-09 11:00:00 | 900 |
| 1 | 2017-08-09 11:15:00 | 900 |
| 1 | 2017-08-09 11:30:00 | 900 |
| 1 | 2017-08-09 11:45:00 | 900 |
| 1 | 2017-08-09 12:00:00 | 900 |
| 1 | 2017-08-09 12:15:00 | 900 |
| 1 | 2017-08-09 12:30:00 | 900 |
| 1 | 2017-08-09 12:45:00 | 900 |
| 1 | 2017-08-09 13:00:00 | 900 |
| 1 | 2017-08-09 13:15:00 | 900 |
| 1 | 2017-08-09 13:30:00 | 900 |
| 1 | 2017-08-09 13:45:00 | 900 |
| 1 | 2017-08-09 14:00:00 | 900 |
| 1 | 2017-08-09 14:15:00 | 900 |
| 1 | 2017-08-09 14:30:00 | 900 |
| 1 | 2017-08-09 14:45:00 | 900 |
| 1 | 2017-08-09 15:00:00 | 900 |
| 1 | 2017-08-09 15:15:00 | 900 |
| 1 | 2017-08-09 15:30:00 | 900 |
| 1 | 2017-08-09 15:45:00 | 900 |
| 1 | 2017-08-09 16:00:00 | 900 |
| 1 | 2017-08-09 16:15:00 | 900 |
| 1 | 2017-08-09 16:30:00 | 900 |
| 1 | 2017-08-09 16:45:00 | 900 |
| 1 | 2017-08-09 17:00:00 | 900 |
| 1 | 2017-08-09 17:15:00 | 900 |
| 1 | 2017-08-09 17:30:00 | 61 |
+----+---------------------+----------+
Related
Purpose: I work in Hospitality Industry. I want to understand at what time the Restaurant is full and what time it is less busy. I have the opening and closing times, I want to split it 30 minute interval period.
I would really appreciate if you could ease help me.
Thanking you in advance
Table
Check# Open CloseTime
25484 17:34 18:06
25488 18:04 21:22
Output
Check# Open Close Duration
25484 17:34 18:00 0:25
25484 18:00 18:30 0:30
25488 18:08 18:30 0:21
25488 18:30 19:00 0:30
25488 19:00 19:30 0:30
25488 19:30 20:00 0:30
25488 20:00 20:30 0:30
25488 20:30 21:00 0:30
25488 21:00 21:30 0:30
I am new to SQL. I am good at Excel, but due to its limitations i want to use SQL. I just know the basics in SQL.
I have tried on the google, but could not find solution to it. All i can see use of Date Keywords, but not the Field name in the code, hence i am unable to use them.
Could you try this, it works in MySQL 8.0:
WITH RECURSIVE times AS (
SELECT time '0:00' AS `Open`, time '0:30' as `Close`
UNION ALL
SELECT addtime(`Open`, '0:30'), addtime(`Close`, '0:30')
FROM times
WHERE `Open` < time '23:30'
)
SELECT c.`Check`,
greatest(t.`Open`, c.`Open`) `Open`,
least(t.`Close`, c.`CloseTime`) `Close`,
timediff(least(t.`Close`, c.`CloseTime`), greatest(t.`Open`, c.`Open`)) `Duration`
FROM times t
JOIN checks c ON (c.`Open` < t.`Close` AND c.`CloseTime` > t.`Open`);
| Check | Open | Close | Duration |
| ----- | -------- | -------- | -------- |
| 25484 | 17:34:00 | 18:00:00 | 00:26:00 |
| 25484 | 18:00:00 | 18:06:00 | 00:06:00 |
| 25488 | 18:04:00 | 18:30:00 | 00:26:00 |
| 25488 | 18:30:00 | 19:00:00 | 00:30:00 |
| 25488 | 19:00:00 | 19:30:00 | 00:30:00 |
| 25488 | 19:30:00 | 20:00:00 | 00:30:00 |
| 25488 | 20:00:00 | 20:30:00 | 00:30:00 |
| 25488 | 20:30:00 | 21:00:00 | 00:30:00 |
| 25488 | 21:00:00 | 21:22:00 | 00:22:00 |
->Fiddle
This works for SQL Server 2019:
WITH times([Open], [Close]) AS (
SELECT cast({t'00:00:00'} as time) as "Open",
cast({t'00:30:00'} as time) as "Close"
UNION ALL
SELECT dateadd(minute, 30, [Open]), dateadd(minute, 30, [Close])
FROM times
WHERE [Open] < cast({t'23:30:00'} as time)
)
SELECT c.[Check],
iif(t.[Open] > c.[Open], t.[Open], c.[Open]) as [Open],
iif(t.[Close] < c.[CloseTime], t.[Close], c.[CloseTime]) as [Close],
datediff(minute,
iif(t.[Open] > c.[Open], t.[Open], c.[Open]),
iif(t.[Close] < c.[CloseTime], t.[Close], c.[CloseTime])) Duration
FROM times t
JOIN checks c ON (c.[Open] < t.[Close] AND c.[CloseTime] > t.[Open]);
Check | Open | Close | Duration
25484 | 17:34:00.0000000 | 18:00:00.0000000 | 26
25484 | 18:00:00.0000000 | 18:06:00.0000000 | 6
25488 | 18:04:00.0000000 | 18:30:00.0000000 | 26
25488 | 18:30:00.0000000 | 19:00:00.0000000 | 30
25488 | 19:00:00.0000000 | 19:30:00.0000000 | 30
25488 | 19:30:00.0000000 | 20:00:00.0000000 | 30
25488 | 20:00:00.0000000 | 20:30:00.0000000 | 30
25488 | 20:30:00.0000000 | 21:00:00.0000000 | 30
25488 | 21:00:00.0000000 | 21:22:00.0000000 | 22
->Fiddle
I' have searched a lot through the site trying to find a solution to my problem and I have found similar problems but I haven't managed to find a solution that works in my case.
I have a tickets table like this (which has a lot more data than this):
TICKET:
+---------+--------------+------------+------------+
| ticketid| report_date | impact | open |
+---------+--------------+------------+------------+
| 1 | 29/01/2019 | 1 | true |
| 2 | 29/01/2019 | 2 | true |
| 3 | 30/01/2019 | 4 | true |
| 4 | 27/01/2019 | 1 | true |
| 5 | 29/01/2019 | 1 | true |
| 6 | 30/01/2019 | 2 | true |
+---------+--------------+------------+------------+
There is another table that holds the possible values for the impact column in the table above:
IMPACT:
+---------+
| impact |
+---------+
| 1 |
| 2 |
| 3 |
| 4 |
+---------+
My objective is to extract a result set from the ticket table where I group by the impact, report_date and open flag and count the number of tickets in each group. Therefore, for the example above, I would like to extract the following result set.
+--------------+------------+------------+-----------+
| report_date | impact | open | tkt_count |
+--------------+------------+------------+-----------+
| 27/01/2019 | 1 | true | 1 |
| 27/01/2019 | 1 | false | 0 |
| 27/01/2019 | 2 | true | 0 |
| 27/01/2019 | 2 | false | 0 |
| 27/01/2019 | 3 | true | 0 |
| 27/01/2019 | 3 | false | 0 |
| 27/01/2019 | 4 | true | 0 |
| 27/01/2019 | 4 | false | 0 |
| 29/01/2019 | 1 | true | 2 |
| 29/01/2019 | 1 | false | 0 |
| 29/01/2019 | 2 | true | 1 |
| 29/01/2019 | 2 | false | 0 |
| 29/01/2019 | 3 | true | 0 |
| 29/01/2019 | 3 | false | 0 |
| 29/01/2019 | 4 | true | 0 |
| 29/01/2019 | 4 | false | 0 |
| 30/01/2019 | 1 | true | 0 |
| 30/01/2019 | 1 | false | 0 |
| 30/01/2019 | 2 | true | 1 |
| 30/01/2019 | 2 | false | 0 |
| 30/01/2019 | 3 | true | 0 |
| 30/01/2019 | 3 | false | 0 |
| 30/01/2019 | 4 | true | 1 |
| 30/01/2019 | 4 | false | 0 |
+--------------+------------+------------+-----------+
It seems simple enough, but the problem is with the "zero" rows.
For the example that I showed here, there are no tickets with impact 3 or tickets with the open flag flase for the range of dates given. And I cannot come up with a query that will show me all the counts, even if there are no rows for some values.
Can anyone help me?
Thanks in advance.
To solve this type of problem, one way to proceed is to generate a intermediate resultset that contains all records for which a value needs to be computed, and then LEFT JOIN it with the original data, using aggregation.
SELECT
dt.report_date,
i.impact,
op.[open],
COUNT(t.report_date) tkt_count
FROM
(SELECT DISTINCT report_date FROM ticket) dt
CROSS JOIN impact i
CROSS JOIN (SELECT 'true' [open] UNION ALL SELECT 'false') op
LEFT JOIN ticket t
ON t.report_date = dt.report_date
AND t.impact = i.impact
AND t.[open] = op.[open]
GROUP BY
dt.report_date,
i.impact,
op.[open]
This query generates the intermediate resultset as follows :
report_date : all distinct dates in the original data (report_date)
impact : contains of table impact
open : fixed list containing true or false (could also have been built from distinct values in the original data, but value false was not available is your sample data)
You can choose to change the above rules, the logic should remain the same. For example if there are gaps in the report_date, another widely used option is to create a calendar table.
Demo on DB Fiddle:
report_date | impact | open | tkt_count
:------------------ | -----: | :---- | --------:
27/01/2019 00:00:00 | 1 | false | 0
27/01/2019 00:00:00 | 1 | true | 1
27/01/2019 00:00:00 | 2 | false | 0
27/01/2019 00:00:00 | 2 | true | 0
27/01/2019 00:00:00 | 3 | false | 0
27/01/2019 00:00:00 | 3 | true | 0
27/01/2019 00:00:00 | 4 | false | 0
27/01/2019 00:00:00 | 4 | true | 0
29/01/2019 00:00:00 | 1 | false | 0
29/01/2019 00:00:00 | 1 | true | 2
29/01/2019 00:00:00 | 2 | false | 0
29/01/2019 00:00:00 | 2 | true | 1
29/01/2019 00:00:00 | 3 | false | 0
29/01/2019 00:00:00 | 3 | true | 0
29/01/2019 00:00:00 | 4 | false | 0
29/01/2019 00:00:00 | 4 | true | 0
30/01/2019 00:00:00 | 1 | false | 0
30/01/2019 00:00:00 | 1 | true | 0
30/01/2019 00:00:00 | 2 | false | 0
30/01/2019 00:00:00 | 2 | true | 1
30/01/2019 00:00:00 | 3 | false | 0
30/01/2019 00:00:00 | 3 | true | 0
30/01/2019 00:00:00 | 4 | false | 0
30/01/2019 00:00:00 | 4 | true | 1
I queried against a start and end calendar table by day and cross joined all available impact/open combos and finally bought in the ticket data, counting the non-null matches.
DECLARE #Impact TABLE(Impact INT)
INSERT #Impact VALUES(1),(2),(3),(4)
DECLARE #Tickets TABLE(report_date DATETIME, Impact INT, IsOpen BIT)
INSERT #Tickets VALUES
('01/29/2019',1,1),('01/29/2019',2,1),('01/30/2019',3,1),('01/27/2019',4,1),('01/29/2019',5,1),('01/30/2019',6,1)
DECLARE #StartDate DATETIME='01/01/2019'
DECLARE #EndDate DATETIME='02/01/2019'
;WITH AllDates AS
(
SELECT Date = #StartDate
UNION ALL
SELECT Date= DATEADD(DAY, 1, Date) FROM AllDates WHERE DATEADD(DAY, 1,Date) <= #EndDate
)
,AllImpacts AS
(
SELECT DISTINCT Impact,IsOpen = 1 FROM #Impact
UNION
SELECT DISTINCT Impact,IsOpen = 0 FROM #Impact
),
AllData AS
(
SELECT D.Date,A.impact,A.IsOpen
FROM AllDates D
CROSS APPLY AllImpacts A
)
SELECT
A.Date,A.Impact,A.IsOpen,
GroupCount = COUNT(T.Impact)
FROM
AllData A
LEFT OUTER JOIN #Tickets T ON T.report_date=A.Date AND T.Impact=A.Impact AND T.IsOpen = A.IsOpen
GROUP BY
A.Date,A.Impact,A.IsOpen
ORDER BY
A.Date,A.Impact,A.IsOpen
OPTION (MAXRECURSION 0);
GO
Date | Impact | IsOpen | GroupCount
:------------------ | -----: | -----: | ---------:
01/01/2019 00:00:00 | 1 | 0 | 0
01/01/2019 00:00:00 | 1 | 1 | 0
01/01/2019 00:00:00 | 2 | 0 | 0
01/01/2019 00:00:00 | 2 | 1 | 0
01/01/2019 00:00:00 | 3 | 0 | 0
01/01/2019 00:00:00 | 3 | 1 | 0
01/01/2019 00:00:00 | 4 | 0 | 0
01/01/2019 00:00:00 | 4 | 1 | 0
02/01/2019 00:00:00 | 1 | 0 | 0
02/01/2019 00:00:00 | 1 | 1 | 0
02/01/2019 00:00:00 | 2 | 0 | 0
02/01/2019 00:00:00 | 2 | 1 | 0
02/01/2019 00:00:00 | 3 | 0 | 0
02/01/2019 00:00:00 | 3 | 1 | 0
02/01/2019 00:00:00 | 4 | 0 | 0
02/01/2019 00:00:00 | 4 | 1 | 0
03/01/2019 00:00:00 | 1 | 0 | 0
03/01/2019 00:00:00 | 1 | 1 | 0
03/01/2019 00:00:00 | 2 | 0 | 0
03/01/2019 00:00:00 | 2 | 1 | 0
03/01/2019 00:00:00 | 3 | 0 | 0
03/01/2019 00:00:00 | 3 | 1 | 0
03/01/2019 00:00:00 | 4 | 0 | 0
03/01/2019 00:00:00 | 4 | 1 | 0
04/01/2019 00:00:00 | 1 | 0 | 0
04/01/2019 00:00:00 | 1 | 1 | 0
04/01/2019 00:00:00 | 2 | 0 | 0
04/01/2019 00:00:00 | 2 | 1 | 0
04/01/2019 00:00:00 | 3 | 0 | 0
04/01/2019 00:00:00 | 3 | 1 | 0
04/01/2019 00:00:00 | 4 | 0 | 0
04/01/2019 00:00:00 | 4 | 1 | 0
05/01/2019 00:00:00 | 1 | 0 | 0
05/01/2019 00:00:00 | 1 | 1 | 0
05/01/2019 00:00:00 | 2 | 0 | 0
05/01/2019 00:00:00 | 2 | 1 | 0
05/01/2019 00:00:00 | 3 | 0 | 0
05/01/2019 00:00:00 | 3 | 1 | 0
05/01/2019 00:00:00 | 4 | 0 | 0
05/01/2019 00:00:00 | 4 | 1 | 0
06/01/2019 00:00:00 | 1 | 0 | 0
06/01/2019 00:00:00 | 1 | 1 | 0
06/01/2019 00:00:00 | 2 | 0 | 0
06/01/2019 00:00:00 | 2 | 1 | 0
06/01/2019 00:00:00 | 3 | 0 | 0
06/01/2019 00:00:00 | 3 | 1 | 0
06/01/2019 00:00:00 | 4 | 0 | 0
06/01/2019 00:00:00 | 4 | 1 | 0
07/01/2019 00:00:00 | 1 | 0 | 0
07/01/2019 00:00:00 | 1 | 1 | 0
07/01/2019 00:00:00 | 2 | 0 | 0
07/01/2019 00:00:00 | 2 | 1 | 0
07/01/2019 00:00:00 | 3 | 0 | 0
07/01/2019 00:00:00 | 3 | 1 | 0
07/01/2019 00:00:00 | 4 | 0 | 0
07/01/2019 00:00:00 | 4 | 1 | 0
08/01/2019 00:00:00 | 1 | 0 | 0
08/01/2019 00:00:00 | 1 | 1 | 0
08/01/2019 00:00:00 | 2 | 0 | 0
08/01/2019 00:00:00 | 2 | 1 | 0
08/01/2019 00:00:00 | 3 | 0 | 0
08/01/2019 00:00:00 | 3 | 1 | 0
08/01/2019 00:00:00 | 4 | 0 | 0
08/01/2019 00:00:00 | 4 | 1 | 0
09/01/2019 00:00:00 | 1 | 0 | 0
09/01/2019 00:00:00 | 1 | 1 | 0
09/01/2019 00:00:00 | 2 | 0 | 0
09/01/2019 00:00:00 | 2 | 1 | 0
09/01/2019 00:00:00 | 3 | 0 | 0
09/01/2019 00:00:00 | 3 | 1 | 0
09/01/2019 00:00:00 | 4 | 0 | 0
09/01/2019 00:00:00 | 4 | 1 | 0
10/01/2019 00:00:00 | 1 | 0 | 0
10/01/2019 00:00:00 | 1 | 1 | 0
10/01/2019 00:00:00 | 2 | 0 | 0
10/01/2019 00:00:00 | 2 | 1 | 0
10/01/2019 00:00:00 | 3 | 0 | 0
10/01/2019 00:00:00 | 3 | 1 | 0
10/01/2019 00:00:00 | 4 | 0 | 0
10/01/2019 00:00:00 | 4 | 1 | 0
11/01/2019 00:00:00 | 1 | 0 | 0
11/01/2019 00:00:00 | 1 | 1 | 0
11/01/2019 00:00:00 | 2 | 0 | 0
11/01/2019 00:00:00 | 2 | 1 | 0
11/01/2019 00:00:00 | 3 | 0 | 0
11/01/2019 00:00:00 | 3 | 1 | 0
11/01/2019 00:00:00 | 4 | 0 | 0
11/01/2019 00:00:00 | 4 | 1 | 0
12/01/2019 00:00:00 | 1 | 0 | 0
12/01/2019 00:00:00 | 1 | 1 | 0
12/01/2019 00:00:00 | 2 | 0 | 0
12/01/2019 00:00:00 | 2 | 1 | 0
12/01/2019 00:00:00 | 3 | 0 | 0
12/01/2019 00:00:00 | 3 | 1 | 0
12/01/2019 00:00:00 | 4 | 0 | 0
12/01/2019 00:00:00 | 4 | 1 | 0
13/01/2019 00:00:00 | 1 | 0 | 0
13/01/2019 00:00:00 | 1 | 1 | 0
13/01/2019 00:00:00 | 2 | 0 | 0
13/01/2019 00:00:00 | 2 | 1 | 0
13/01/2019 00:00:00 | 3 | 0 | 0
13/01/2019 00:00:00 | 3 | 1 | 0
13/01/2019 00:00:00 | 4 | 0 | 0
13/01/2019 00:00:00 | 4 | 1 | 0
14/01/2019 00:00:00 | 1 | 0 | 0
14/01/2019 00:00:00 | 1 | 1 | 0
14/01/2019 00:00:00 | 2 | 0 | 0
14/01/2019 00:00:00 | 2 | 1 | 0
14/01/2019 00:00:00 | 3 | 0 | 0
14/01/2019 00:00:00 | 3 | 1 | 0
14/01/2019 00:00:00 | 4 | 0 | 0
14/01/2019 00:00:00 | 4 | 1 | 0
15/01/2019 00:00:00 | 1 | 0 | 0
15/01/2019 00:00:00 | 1 | 1 | 0
15/01/2019 00:00:00 | 2 | 0 | 0
15/01/2019 00:00:00 | 2 | 1 | 0
15/01/2019 00:00:00 | 3 | 0 | 0
15/01/2019 00:00:00 | 3 | 1 | 0
15/01/2019 00:00:00 | 4 | 0 | 0
15/01/2019 00:00:00 | 4 | 1 | 0
16/01/2019 00:00:00 | 1 | 0 | 0
16/01/2019 00:00:00 | 1 | 1 | 0
16/01/2019 00:00:00 | 2 | 0 | 0
16/01/2019 00:00:00 | 2 | 1 | 0
16/01/2019 00:00:00 | 3 | 0 | 0
16/01/2019 00:00:00 | 3 | 1 | 0
16/01/2019 00:00:00 | 4 | 0 | 0
16/01/2019 00:00:00 | 4 | 1 | 0
17/01/2019 00:00:00 | 1 | 0 | 0
17/01/2019 00:00:00 | 1 | 1 | 0
17/01/2019 00:00:00 | 2 | 0 | 0
17/01/2019 00:00:00 | 2 | 1 | 0
17/01/2019 00:00:00 | 3 | 0 | 0
17/01/2019 00:00:00 | 3 | 1 | 0
17/01/2019 00:00:00 | 4 | 0 | 0
17/01/2019 00:00:00 | 4 | 1 | 0
18/01/2019 00:00:00 | 1 | 0 | 0
18/01/2019 00:00:00 | 1 | 1 | 0
18/01/2019 00:00:00 | 2 | 0 | 0
18/01/2019 00:00:00 | 2 | 1 | 0
18/01/2019 00:00:00 | 3 | 0 | 0
18/01/2019 00:00:00 | 3 | 1 | 0
18/01/2019 00:00:00 | 4 | 0 | 0
18/01/2019 00:00:00 | 4 | 1 | 0
19/01/2019 00:00:00 | 1 | 0 | 0
19/01/2019 00:00:00 | 1 | 1 | 0
19/01/2019 00:00:00 | 2 | 0 | 0
19/01/2019 00:00:00 | 2 | 1 | 0
19/01/2019 00:00:00 | 3 | 0 | 0
19/01/2019 00:00:00 | 3 | 1 | 0
19/01/2019 00:00:00 | 4 | 0 | 0
19/01/2019 00:00:00 | 4 | 1 | 0
20/01/2019 00:00:00 | 1 | 0 | 0
20/01/2019 00:00:00 | 1 | 1 | 0
20/01/2019 00:00:00 | 2 | 0 | 0
20/01/2019 00:00:00 | 2 | 1 | 0
20/01/2019 00:00:00 | 3 | 0 | 0
20/01/2019 00:00:00 | 3 | 1 | 0
20/01/2019 00:00:00 | 4 | 0 | 0
20/01/2019 00:00:00 | 4 | 1 | 0
21/01/2019 00:00:00 | 1 | 0 | 0
21/01/2019 00:00:00 | 1 | 1 | 0
21/01/2019 00:00:00 | 2 | 0 | 0
21/01/2019 00:00:00 | 2 | 1 | 0
21/01/2019 00:00:00 | 3 | 0 | 0
21/01/2019 00:00:00 | 3 | 1 | 0
21/01/2019 00:00:00 | 4 | 0 | 0
21/01/2019 00:00:00 | 4 | 1 | 0
22/01/2019 00:00:00 | 1 | 0 | 0
22/01/2019 00:00:00 | 1 | 1 | 0
22/01/2019 00:00:00 | 2 | 0 | 0
22/01/2019 00:00:00 | 2 | 1 | 0
22/01/2019 00:00:00 | 3 | 0 | 0
22/01/2019 00:00:00 | 3 | 1 | 0
22/01/2019 00:00:00 | 4 | 0 | 0
22/01/2019 00:00:00 | 4 | 1 | 0
23/01/2019 00:00:00 | 1 | 0 | 0
23/01/2019 00:00:00 | 1 | 1 | 0
23/01/2019 00:00:00 | 2 | 0 | 0
23/01/2019 00:00:00 | 2 | 1 | 0
23/01/2019 00:00:00 | 3 | 0 | 0
23/01/2019 00:00:00 | 3 | 1 | 0
23/01/2019 00:00:00 | 4 | 0 | 0
23/01/2019 00:00:00 | 4 | 1 | 0
24/01/2019 00:00:00 | 1 | 0 | 0
24/01/2019 00:00:00 | 1 | 1 | 0
24/01/2019 00:00:00 | 2 | 0 | 0
24/01/2019 00:00:00 | 2 | 1 | 0
24/01/2019 00:00:00 | 3 | 0 | 0
24/01/2019 00:00:00 | 3 | 1 | 0
24/01/2019 00:00:00 | 4 | 0 | 0
24/01/2019 00:00:00 | 4 | 1 | 0
25/01/2019 00:00:00 | 1 | 0 | 0
25/01/2019 00:00:00 | 1 | 1 | 0
25/01/2019 00:00:00 | 2 | 0 | 0
25/01/2019 00:00:00 | 2 | 1 | 0
25/01/2019 00:00:00 | 3 | 0 | 0
25/01/2019 00:00:00 | 3 | 1 | 0
25/01/2019 00:00:00 | 4 | 0 | 0
25/01/2019 00:00:00 | 4 | 1 | 0
26/01/2019 00:00:00 | 1 | 0 | 0
26/01/2019 00:00:00 | 1 | 1 | 0
26/01/2019 00:00:00 | 2 | 0 | 0
26/01/2019 00:00:00 | 2 | 1 | 0
26/01/2019 00:00:00 | 3 | 0 | 0
26/01/2019 00:00:00 | 3 | 1 | 0
26/01/2019 00:00:00 | 4 | 0 | 0
26/01/2019 00:00:00 | 4 | 1 | 0
27/01/2019 00:00:00 | 1 | 0 | 0
27/01/2019 00:00:00 | 1 | 1 | 0
27/01/2019 00:00:00 | 2 | 0 | 0
27/01/2019 00:00:00 | 2 | 1 | 0
27/01/2019 00:00:00 | 3 | 0 | 0
27/01/2019 00:00:00 | 3 | 1 | 0
27/01/2019 00:00:00 | 4 | 0 | 0
27/01/2019 00:00:00 | 4 | 1 | 1
28/01/2019 00:00:00 | 1 | 0 | 0
28/01/2019 00:00:00 | 1 | 1 | 0
28/01/2019 00:00:00 | 2 | 0 | 0
28/01/2019 00:00:00 | 2 | 1 | 0
28/01/2019 00:00:00 | 3 | 0 | 0
28/01/2019 00:00:00 | 3 | 1 | 0
28/01/2019 00:00:00 | 4 | 0 | 0
28/01/2019 00:00:00 | 4 | 1 | 0
29/01/2019 00:00:00 | 1 | 0 | 0
29/01/2019 00:00:00 | 1 | 1 | 1
29/01/2019 00:00:00 | 2 | 0 | 0
29/01/2019 00:00:00 | 2 | 1 | 1
29/01/2019 00:00:00 | 3 | 0 | 0
29/01/2019 00:00:00 | 3 | 1 | 0
29/01/2019 00:00:00 | 4 | 0 | 0
29/01/2019 00:00:00 | 4 | 1 | 0
30/01/2019 00:00:00 | 1 | 0 | 0
30/01/2019 00:00:00 | 1 | 1 | 0
30/01/2019 00:00:00 | 2 | 0 | 0
30/01/2019 00:00:00 | 2 | 1 | 0
30/01/2019 00:00:00 | 3 | 0 | 0
30/01/2019 00:00:00 | 3 | 1 | 1
30/01/2019 00:00:00 | 4 | 0 | 0
30/01/2019 00:00:00 | 4 | 1 | 0
31/01/2019 00:00:00 | 1 | 0 | 0
31/01/2019 00:00:00 | 1 | 1 | 0
31/01/2019 00:00:00 | 2 | 0 | 0
31/01/2019 00:00:00 | 2 | 1 | 0
31/01/2019 00:00:00 | 3 | 0 | 0
31/01/2019 00:00:00 | 3 | 1 | 0
31/01/2019 00:00:00 | 4 | 0 | 0
31/01/2019 00:00:00 | 4 | 1 | 0
01/02/2019 00:00:00 | 1 | 0 | 0
01/02/2019 00:00:00 | 1 | 1 | 0
01/02/2019 00:00:00 | 2 | 0 | 0
01/02/2019 00:00:00 | 2 | 1 | 0
01/02/2019 00:00:00 | 3 | 0 | 0
01/02/2019 00:00:00 | 3 | 1 | 0
01/02/2019 00:00:00 | 4 | 0 | 0
01/02/2019 00:00:00 | 4 | 1 | 0
db<>fiddle here
Problem
I am having trouble figuring out how to create a query that can tell if any userentry is preceded by 7 days without any activity (secondsPlayed == 0) and if so, then indicate it with the value of 1, otherwise 0.
This also means that if the user has less than 7 entries, the value will be 0 across all entries.
Input table:
+------------------------------+-------------------------+---------------+
| userid | estimationDate | secondsPlayed |
+------------------------------+-------------------------+---------------+
| a | 2016-07-14 00:00:00 UTC | 192.5 |
| a | 2016-07-15 00:00:00 UTC | 357.3 |
| a | 2016-07-16 00:00:00 UTC | 0 |
| a | 2016-07-17 00:00:00 UTC | 0 |
| a | 2016-07-18 00:00:00 UTC | 0 |
| a | 2016-07-19 00:00:00 UTC | 0 |
| a | 2016-07-20 00:00:00 UTC | 0 |
| a | 2016-07-21 00:00:00 UTC | 0 |
| a | 2016-07-22 00:00:00 UTC | 0 |
| a | 2016-07-23 00:00:00 UTC | 0 |
| a | 2016-07-24 00:00:00 UTC | 0 |
| ---------------------------- | ---------------------- | ---- |
| b | 2016-07-02 00:00:00 UTC | 31.2 |
| b | 2016-07-03 00:00:00 UTC | 42.1 |
| b | 2016-07-04 00:00:00 UTC | 41.9 |
| b | 2016-07-05 00:00:00 UTC | 43.2 |
| b | 2016-07-06 00:00:00 UTC | 91.5 |
| b | 2016-07-07 00:00:00 UTC | 0 |
| b | 2016-07-08 00:00:00 UTC | 0 |
| b | 2016-07-09 00:00:00 UTC | 239.1 |
| b | 2016-07-10 00:00:00 UTC | 0 |
+------------------------------+-------------------------+---------------+
The intended output table should look like this:
Output table:
+------------------------------+-------------------------+---------------+----------+
| userid | estimationDate | secondsPlayed | inactive |
+------------------------------+-------------------------+---------------+----------+
| a | 2016-07-14 00:00:00 UTC | 192.5 | 0 |
| a | 2016-07-15 00:00:00 UTC | 357.3 | 0 |
| a | 2016-07-16 00:00:00 UTC | 0 | 0 |
| a | 2016-07-17 00:00:00 UTC | 0 | 0 |
| a | 2016-07-18 00:00:00 UTC | 0 | 0 |
| a | 2016-07-19 00:00:00 UTC | 0 | 0 |
| a | 2016-07-20 00:00:00 UTC | 0 | 0 |
| a | 2016-07-21 00:00:00 UTC | 0 | 0 |
| a | 2016-07-22 00:00:00 UTC | 0 | 1 |
| a | 2016-07-23 00:00:00 UTC | 0 | 1 |
| a | 2016-07-24 00:00:00 UTC | 0 | 1 |
| ---------------------------- | ----------------------- | ----- | ----- |
| b | 2016-07-02 00:00:00 UTC | 31.2 | 0 |
| b | 2016-07-03 00:00:00 UTC | 42.1 | 0 |
| b | 2016-07-04 00:00:00 UTC | 41.9 | 0 |
| b | 2016-07-05 00:00:00 UTC | 43.2 | 0 |
| b | 2016-07-06 00:00:00 UTC | 91.5 | 0 |
| b | 2016-07-07 00:00:00 UTC | 0 | 0 |
| b | 2016-07-08 00:00:00 UTC | 0 | 0 |
| b | 2016-07-09 00:00:00 UTC | 239.1 | 0 |
| b | 2016-07-10 00:00:00 UTC | 0 | 0 |
+------------------------------+-------------------------+---------------+----------+
Thoughts
At first I was thinking about using the Lag function with a 7 offset, but this would obviously not relate to any of the subjects in between.
I was also thinking about creating a rolling window/average for a period of 7 days and evaluating if this is above 0. However this might be a bit above my skill level.
Any chance anyone has a good solution to this problem.
Assuming that you have data every day (which seems like a reasonable assumption), you can sum a window function:
select t.*,
(case when sum(secondsplayed) over (partition by userid
order by estimationdate
rows between 6 preceding and current row
) = 0 and
row_number() over (partition by userid order by estimationdate) >= 7
then 1
else 0
end) as inactive
from t;
In addition to no holes in the dates, this also assumes that secondsplayed is never negative. (Negative values can easily be incorporated into the logic, but that seems unnecessary.)
In my experience this type of input tables do not consist of inactivity entries and usually look like this (only activity entries are present here)
Input table:
+------------------------------+-------------------------+---------------+
| userid | estimationDate | secondsPlayed |
+------------------------------+-------------------------+---------------+
| a | 2016-07-14 00:00:00 UTC | 192.5 |
| a | 2016-07-15 00:00:00 UTC | 357.3 |
| ---------------------------- | ---------------------- | ---- |
| b | 2016-07-02 00:00:00 UTC | 31.2 |
| b | 2016-07-03 00:00:00 UTC | 42.1 |
| b | 2016-07-04 00:00:00 UTC | 41.9 |
| b | 2016-07-05 00:00:00 UTC | 43.2 |
| b | 2016-07-06 00:00:00 UTC | 91.5 |
| b | 2016-07-09 00:00:00 UTC | 239.1 |
+------------------------------+-------------------------+---------------+
So, below is for BigQuery Standard SQL and input as above
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'a' userid, TIMESTAMP '2016-07-14 00:00:00 UTC' estimationDate, 192.5 secondsPlayed UNION ALL
SELECT 'a', '2016-07-15 00:00:00 UTC', 357.3 UNION ALL
SELECT 'b', '2016-07-02 00:00:00 UTC', 31.2 UNION ALL
SELECT 'b', '2016-07-03 00:00:00 UTC', 42.1 UNION ALL
SELECT 'b', '2016-07-04 00:00:00 UTC', 41.9 UNION ALL
SELECT 'b', '2016-07-05 00:00:00 UTC', 43.2 UNION ALL
SELECT 'b', '2016-07-06 00:00:00 UTC', 91.5 UNION ALL
SELECT 'b', '2016-07-09 00:00:00 UTC', 239.1
), time_frame AS (
SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY('2016-07-02', '2016-07-24')) day
)
SELECT
users.userid,
day,
IFNULL(secondsPlayed, 0) secondsPlayed,
CAST(1 - SIGN(SUM(IFNULL(secondsPlayed, 0))
OVER(
PARTITION BY users.userid
ORDER BY UNIX_DATE(day)
RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
)) AS INT64) AS inactive
FROM time_frame tf
CROSS JOIN (SELECT DISTINCT userid FROM `project.dataset.table`) users
LEFT JOIN `project.dataset.table` t
ON day = DATE(estimationDate) AND users.userid = t.userid
ORDER BY userid, day
with result
Row userid day secondsPlayed inactive
...
13 a 2016-07-14 192.5 0
14 a 2016-07-15 357.3 0
15 a 2016-07-15 357.3 0
16 a 2016-07-16 0.0 0
17 a 2016-07-17 0.0 0
18 a 2016-07-18 0.0 0
19 a 2016-07-19 0.0 0
20 a 2016-07-20 0.0 0
21 a 2016-07-21 0.0 0
22 a 2016-07-22 0.0 1
23 a 2016-07-23 0.0 1
24 a 2016-07-24 0.0 1
25 b 2016-07-02 31.2 0
26 b 2016-07-03 42.1 0
27 b 2016-07-04 41.9 0
28 b 2016-07-05 43.2 0
29 b 2016-07-06 91.5 0
30 b 2016-07-07 0.0 0
31 b 2016-07-08 0.0 0
32 b 2016-07-09 239.1 0
33 b 2016-07-10 0.0 0
...
I need a query to group an aggregate in one table by date ranges in another table.
Table 1
weeknumber | weekyear | weekstart | weekend
------------+----------+------------+------------
18 | 2016 | 2016-02-01 | 2016-02-08
19 | 2016 | 2016-02-08 | 2016-02-15
20 | 2016 | 2016-02-15 | 2016-02-22
21 | 2016 | 2016-02-22 | 2016-02-29
22 | 2016 | 2016-02-29 | 2016-03-07
23 | 2016 | 2016-03-07 | 2016-03-14
24 | 2016 | 2016-03-14 | 2016-03-21
25 | 2016 | 2016-03-21 | 2016-03-28
26 | 2016 | 2016-03-28 | 2016-04-04
27 | 2016 | 2016-04-04 | 2016-04-11
28 | 2016 | 2016-04-11 | 2016-04-18
29 | 2016 | 2016-04-18 | 2016-04-25
30 | 2016 | 2016-04-25 | 2016-05-02
31 | 2016 | 2016-05-02 | 2016-05-09
32 | 2016 | 2016-05-09 | 2016-05-16
33 | 2016 | 2016-05-16 | 2016-05-23
34 | 2016 | 2016-05-23 | 2016-05-30
35 | 2016 | 2016-05-30 | 2016-06-06
36 | 2016 | 2016-06-06 | 2016-06-13
37 | 2016 | 2016-06-13 | 2016-06-20
38 | 2016 | 2016-06-20 | 2016-06-27
39 | 2016 | 2016-06-27 | 2016-07-04
40 | 2016 | 2016-07-04 | 2016-07-11
41 | 2016 | 2016-07-11 | 2016-07-18
42 | 2016 | 2016-07-18 | 2016-07-25
43 | 2016 | 2016-07-25 | 2016-08-01
44 | 2016 | 2016-08-01 | 2016-08-08
45 | 2016 | 2016-08-08 | 2016-08-15
46 | 2016 | 2016-08-15 | 2016-08-22
47 | 2016 | 2016-08-22 | 2016-08-29
48 | 2016 | 2016-08-29 | 2016-09-05
49 | 2016 | 2016-09-05 | 2016-09-12
Table 2
accountid | rdate | fee1 | fee2 | fee3 | fee4
-----------+------------+------+------+------+------
481164 | 2015-12-22 | 8 | 1 | 5 | 1
481164 | 2002-12-22 | 1 | 0 | 0 | 0
481166 | 2015-12-22 | 1 | 0 | 0 | 0
481166 | 2016-10-20 | 14 | 0 | 0 | 0
481166 | 2016-10-02 | 5 | 0 | 0 | 0
481166 | 2016-01-06 | 18 | 4 | 0 | 5
482136 | 2016-07-04 | 18 | 0 | 0 | 0
481164 | 2016-07-04 | 2 | 3 | 4 | 5
481164 | 2016-06-28 | 34 | 0 | 0 | 0
481166 | 2016-07-20 | 50 | 0 | 0 | 69
481166 | 2016-07-13 | 16 | 0 | 0 | 5
481166 | 2016-09-15 | 8 | 0 | 0 | 2
481166 | 2016-10-03 | 8 | 0 | 0 | 0
I need to aggregate fee1+fee2+fee3+fee4 for rdates in each date range(weekstart,weekend) in table 1 and then group by accountid. Something like this:
accountid | weekstart | weekend | SUM
-----------+------------+------------+------
481164 | 2016-02-01 | 2016-02-08 | 69
481166 | 2016-02-01 | 2016-02-08 | 44
481164 | 2016-02-08 | 2016-02-15 | 22
481166 | 2016-02-08 | 2016-02-15 | 12
select accountid, weekstart, weekend,
sum(fee1 + fee2 + fee3 + fee4) as total_fee
from table2
inner join table1 on table2.rdate >= table1.weekstart and table2.rdate < table1.weekend
group by accountid, weekstart, weekend;
Just a thing:
weeknumber | weekyear | weekstart | weekend
------------+----------+------------+------------
18 | 2016 | 2016-02-01 | 2016-02-08
19 | 2016 | 2016-02-08 | 2016-02-15
weekend for week 18 should be 2016-02-07, because 2016-02-08 is weekstart for week 19.
weeknumber | weekyear | weekstart | weekend
------------+----------+------------+------------
18 | 2016 | 2016-02-01 | 2016-02-07
19 | 2016 | 2016-02-08 | 2016-02-14
Check it here: http://rextester.com/NCBO56250
I am searching the most efficient way to make a relatively complicated query in a relatively large table.
The concept is that:
I have a table that holds records of phases that can run parallel to each other
The amount of records exceeds the 5 millions (and increases)
The time period starts about 5 years ago
Due to performance reasons, this select could be applied on the last 3 months period of time with 300.000 records (only if it is not physically possible to do it for the whole table)
Oracle version: 11g
The data sample seems as following
Table Phases (ID, START_TS, END_TS, PRIO)
1 10:00:00 10:20:10 10
2 10:05:00 10:10:00 11
3 10:05:20 10:15:00 9
4 10:16:00 10:25:00 8
5 10:24:00 10:45:15 1
6 10:26:00 10:30:00 10
7 10:27:00 10:35:00 15
8 10:34:00 10:50:00 5
9 10:50:00 10:55:00 20
10 10:55:00 11:00:00 15
Above you can see how the information is currently stored (of course there are several other columns with irrelevant information).
There are two requirements (or problems to be solved)
If we sum the duration of all the phases, the result is MUCH more than an hour that the above data represent. (There could be holes between the phases, so taking the first start_ts and the last end_ts would not be sufficient).
The data should be displayed in a form that it would be visible which phases run parallel with which and which phase had the highest priority at each time, as shown in the expected view below
Here it is easy to distinct the highest priority phase at each time (HIGHEST_PRIO), and adding their duration would result the actual total duration.
View V_Parallel_Phases (ID, START_TS, END_TS, PRIO, HIGHEST_PRIO)
-> Optional Columns: Part_of_ID / Runs_Parallel
1 10:00:00 10:05:20 10 True (--> Part_1 / False)
1 10:05:20 10:15:00 10 False (--> Part_2 / True)
2 10:05:00 10:10:00 11 False (--> Part_1 / True)
3 10:05:20 10:15:00 9 True (--> Part_1 / True)
1 10:15:00 10:16:00 10 True (--> Part_3 / True)
1 10:16:00 10:20:10 10 False (--> Part_4 / True)
4 10:16:00 10:24:00 8 True (--> Part_1 / True)
4 10:24:00 10:25:00 8 False (--> Part_2 / True)
5 10:24:00 10:45:15 1 True (--> Part_1 / True)
6 10:26:00 10:30:00 10 False (--> Part_1 / True)
7 10:27:00 10:35:00 15 False (--> Part_1 / True)
8 10:34:00 10:45:15 5 False (--> Part_1 / True)
8 10:45:15 10:50:00 5 True (--> Part_2 / True)
9 10:50:00 10:55:00 20 True (--> Part_2 / False)
10 10:55:00 11:00:00 15 True (--> Part_2 / False)
Unfortunately I am not aware of an efficient way to make this query. The current solution was to make the above calculations programmatically in the tool that generates a large report but it was a total failure. From the 30 seconds that were needed before this calculations, now it needs over 10 minutes without taking event into consideration the priorities of the phases..
Then I thought of translating this code into sql in either: a) a view b) a materialized view c) a table that I would fill with a procedure once in a while (depending on the required duration).
PS: I am aware that oracle has some analytical functions that can handle complicated queries but I am not aware of which could actually help me in the current problem.
Thank you in advance!
This is an incomplete answer, but I need to know if this approach is viable before going on. I believe it is possible to do completely in SQL, but I am not sure how the performance will be.
First find out all points in time where there is a transition:
CREATE VIEW Events AS
SELECT START_TS AS TS
FROM Phases
UNION
SELECT END_TS AS TS
FROM Phases
;
Then create (start, end) tuples from those points in time:
CREATE VIEW Segments AS
SELECT START.TS AS START_TS,
MIN(END.TS) AS END_TS
FROM Events AS START
JOIN Events AS END
WHERE START.TS < END.TS
;
From here on, doing the rest should be fairly straight forward. Here is a query that lists the segments and all the phases that are active in the given segment:
SELECT *
FROM Segments
JOIN Phases
WHERE Segments.START_TS BETWEEN Phases.START_TS AND Phases.END_TS
AND Segments.END_TS BETWEEN Phases.START_TS AND Phases.END_TS
ORDER BY Segments.START_TS
;
The rest can be done with subselects and some aggregates.
| START_TS | END_TS | ID | START_TS | END_TS | PRIO |
|----------|----------|----|----------|----------|------|
| 10:00:00 | 10:05:00 | 1 | 10:00:00 | 10:20:10 | 10 |
| 10:05:00 | 10:05:20 | 1 | 10:00:00 | 10:20:10 | 10 |
| 10:05:00 | 10:05:20 | 2 | 10:05:00 | 10:10:00 | 11 |
| 10:05:20 | 10:10:00 | 1 | 10:00:00 | 10:20:10 | 10 |
| 10:05:20 | 10:10:00 | 2 | 10:05:00 | 10:10:00 | 11 |
| 10:05:20 | 10:10:00 | 3 | 10:05:20 | 10:15:00 | 9 |
| 10:10:00 | 10:15:00 | 1 | 10:00:00 | 10:20:10 | 10 |
| 10:10:00 | 10:15:00 | 3 | 10:05:20 | 10:15:00 | 9 |
| 10:15:00 | 10:16:00 | 1 | 10:00:00 | 10:20:10 | 10 |
| 10:16:00 | 10:20:10 | 1 | 10:00:00 | 10:20:10 | 10 |
| 10:16:00 | 10:20:10 | 4 | 10:16:00 | 10:25:00 | 8 |
| 10:20:10 | 10:24:00 | 4 | 10:16:00 | 10:25:00 | 8 |
| 10:24:00 | 10:25:00 | 4 | 10:16:00 | 10:25:00 | 8 |
| 10:24:00 | 10:25:00 | 5 | 10:24:00 | 10:45:15 | 1 |
| 10:25:00 | 10:26:00 | 5 | 10:24:00 | 10:45:15 | 1 |
| 10:26:00 | 10:27:00 | 5 | 10:24:00 | 10:45:15 | 1 |
| 10:26:00 | 10:27:00 | 6 | 10:26:00 | 10:30:00 | 10 |
| 10:27:00 | 10:30:00 | 5 | 10:24:00 | 10:45:15 | 1 |
| 10:27:00 | 10:30:00 | 6 | 10:26:00 | 10:30:00 | 10 |
| 10:27:00 | 10:30:00 | 7 | 10:27:00 | 10:35:00 | 15 |
| 10:30:00 | 10:34:00 | 5 | 10:24:00 | 10:45:15 | 1 |
| 10:30:00 | 10:34:00 | 7 | 10:27:00 | 10:35:00 | 15 |
| 10:34:00 | 10:35:00 | 8 | 10:34:00 | 10:50:00 | 5 |
| 10:34:00 | 10:35:00 | 5 | 10:24:00 | 10:45:15 | 1 |
| 10:34:00 | 10:35:00 | 7 | 10:27:00 | 10:35:00 | 15 |
| 10:35:00 | 10:45:15 | 5 | 10:24:00 | 10:45:15 | 1 |
| 10:35:00 | 10:45:15 | 8 | 10:34:00 | 10:50:00 | 5 |
| 10:45:15 | 10:50:00 | 8 | 10:34:00 | 10:50:00 | 5 |
| 10:50:00 | 10:55:00 | 9 | 10:50:00 | 10:55:00 | 20 |
| 10:55:00 | 11:00:00 | 10 | 10:55:00 | 11:00:00 | 15 |
There is a SQL fiddle demonstrating the whole thing here:
http://sqlfiddle.com/#!9/d801b/2