How to use BigQuery Analytic Functions to calculate time between timestamped rows?

I have a data set that represents analytics events like:
Row timestamp account_id type
1 2018-11-14 21:05:40 UTC abc start
2 2018-11-14 21:05:40 UTC xyz another_type
3 2018-11-26 22:01:19 UTC xyz start
4 2018-11-26 22:01:23 UTC abc start
5 2018-11-26 22:01:29 UTC xyz some_other_type
11 2018-11-26 22:13:58 UTC xyz start
...
With some number of account_ids. I need to find the average time between start records per account_id.
I'm trying to use analytic functions as described here. My end goal would be a table like:
Row account_id avg_time_between_events_mins
1 xyz 53
2 abc 47
3 pqr 65
...
my best attempt--based on this post--looks like this:
WITH
events AS (
SELECT
COUNTIF(type = 'start' AND account_id='abc') OVER (ORDER BY timestamp) as diff,
timestamp
FROM
`myproject.dataset.events`
WHERE
account_id='abc')
SELECT
min(timestamp) AS start_time,
max(timestamp) AS next_start_time,
ABS(timestamp_diff(min(timestamp), max(timestamp), MINUTE)) AS minutes_between
FROM
events
GROUP BY
diff
This calculates the time between each start event and the last non-start event prior to the next start event for a specific account_id.
I tried to use PARTITION and a WINDOW FRAME CLAUSE like this:
WITH
events AS (
SELECT
COUNT(*) OVER (PARTITION BY account_id ORDER BY timestamp ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) as diff,
timestamp
FROM
`myproject.dataset.events`
WHERE
type = 'start')
SELECT
min(timestamp) AS start_time,
max(timestamp) AS next_start_time,
ABS(timestamp_diff(min(timestamp), max(timestamp), MINUTE)) AS minutes_between
FROM
events
GROUP BY
diff
But I got a nonsense result table. Can anyone walk me through how I would write and reason about a query like this?

You don't really need analytic functions for this:
select account_id,
timestamp_diff(max(timestamp), min(timestamp), MINUTE) / nullif(count(*) - 1, 0) as avg_time_between_events_mins
from `myproject.dataset.events`
where type = 'start'
group by account_id;
This is the timestamp of the most recent start minus the oldest one, divided by one less than the number of starts. That is the average time between starts; the nullif avoids division by zero when an account has only a single start.
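The division works because the consecutive gaps telescope: their sum is simply the last timestamp minus the first. A minimal Python sketch (using three of the sample timestamps from the question) illustrating why (max − min) / (n − 1) equals the mean of the consecutive differences:

```python
from datetime import datetime

# Start-event timestamps for one account, already sorted (from the sample data).
starts = [
    datetime(2018, 11, 14, 21, 5),
    datetime(2018, 11, 26, 22, 1),
    datetime(2018, 11, 26, 22, 13),
]

# Mean of the consecutive gaps, in minutes.
gaps = [(b - a).total_seconds() / 60 for a, b in zip(starts, starts[1:])]
mean_gap = sum(gaps) / len(gaps)

# Same value without looking at the intermediate rows: the gaps
# telescope, so (max - min) / (n - 1) is the mean gap.
shortcut = (starts[-1] - starts[0]).total_seconds() / 60 / (len(starts) - 1)

assert mean_gap == shortcut
```

This is why the aggregate query gives the same answer as pairing each start with the next one via an analytic function.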


Extract previous row calculated value for use in current row calculations - Postgres

I have a requirement where I need to carry the calculated value of the previous row into the calculation for the current row.
The following is a sample of how the data currently looks :-
ID  Date        Days
1   2022-01-15  30
2   2022-02-18  30
3   2022-03-15  90
4   2022-05-15  30
The following is the output What I am expecting :-
ID  Date        Days  CalVal
1   2022-01-15  30    2022-02-14
2   2022-02-18  30    2022-03-16
3   2022-03-15  90    2022-06-14
4   2022-05-15  30    2022-07-14
The value of CalVal for the first row is Date + Days
From the second row onwards it should take the CalVal value of the previous row and add it with the current row Days
Essentially, what I am looking for is a means to access the previous row's calculated value for use in the current row.
Is there any way we can achieve the above via Postgres SQL? I have been tinkering with window functions and even recursive CTEs but have had no luck :(
Would appreciate any direction!
Thanks in advance!
select
id,
date,
coalesce(
days - (lag(days, 1) over (order by date, days))
, days) as days,
first_date + cast(days as integer) as newdate
from
(
select
-- get a running sum of days
id,
first_date,
date,
sum(days) over (order by date, days) as days
from
(
select
-- get the first date
id,
(select min(date) from table1) as first_date,
date,
days
from
table1
) A
) B
This query gets the exact output you described. I'm not at all ready to say it is the best solution, but the strategy employed is essentially to create a running total of the "days" column. This means we can just add this running total to the first date, and that will always be the next date in the desired sequence. One finesse: to put the original "days" back into the result, we calculate the current running total less the previous running total to arrive at the original amount.
assuming that table name is table1
select
id,
date,
days,
first_value(date) over (order by id) +
(sum(days) over (order by id rows between unbounded preceding and current row))
*interval '1 day' calval
from table1;
We just add the cumulative sum of days to the first date in the table. It's not exactly the approach you described (we don't need the date from the previous row, just the cumulative sum of days).
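The cumulative-sum idea is easy to check outside the database. A small Python sketch, using the dates and days from the sample table:

```python
from datetime import date, timedelta
from itertools import accumulate

# Sample data from the question.
days = [30, 30, 90, 30]
first_date = date(2022, 1, 15)

# CalVal for each row = first date + running sum of days, which is
# equivalent to "previous row's CalVal + current row's days".
calval = [first_date + timedelta(days=s) for s in accumulate(days)]
# calval -> 2022-02-14, 2022-03-16, 2022-06-14, 2022-07-14
```

The running sums are 30, 60, 150, 180, so adding each to the first date reproduces the expected CalVal column exactly.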
Solution with recursion
with recursive prev_row as (
select id, date, days, date + days * interval '1 day' calval
from table1
where id = 1
union all
select t.id, t.date, t.days, p.calval + t.days * interval '1 day' calval
from prev_row p
join table1 t on t.id = p.id + 1
)
select *
from prev_row

SQL Server : count distinct every 30 minutes or more

We have an activity database that records user interaction to a website, storing a log that includes values such as Time1, session_id and customer_id e.g.
2022-05-12 08:00:00|11|1
2022-05-12 08:20:00|11|1
2022-05-12 08:30:01|11|1
2022-05-12 08:14:00|22|2
2022-05-12 08:18:00|22|2
2022-05-12 08:16:00|33|1
2022-05-12 08:50:00|33|1
I need to have two separate queries:
Query #1: I need to count sessions whose logs span 30 minutes or more, grouped on a daily basis (a session is counted each day it qualifies).
For example: Initially count=0
For session_id = 11, it starts at 08:00 and the last time with the same session_id is 08:30 -- count=1
For session_id = 22, it starts at 08:14 and the last time with the same session is 08:18 -- still count=1 since it was less than 30 min
I tried this query, but it didn't work
select
count(session_id)
from
table1
where
#datetime between Time1 and dateadd(minute, 30, Time1);
Expected result:
Query #2: it's an extension of the above query where I need the unique customers on daily basis whose sessions were 30 min or more.
For example: from the above table I will have two unique customers on May 12th
Expected result
For the Time1 column, the input is in timestamp format; when I show it in the output I will group it on a daily basis.
This is a two-level aggregation (GROUP BY) problem. You need to start with a subquery to get the first and last timestamp of each session.
SELECT MIN(Time1) start_time,
MAX(Time1) end_time,
session_id, customer_id
FROM table1
GROUP BY session_id, customer_id
Next you need to use the subquery like this:
SELECT COUNT(session_id),
COUNT(DISTINCT customer_id),
CAST(start_time AS DATE)
FROM (
SELECT MIN(Time1) start_time,
MAX(Time1) end_time,
session_id, customer_id
FROM table1
GROUP BY session_id, customer_id
) a
WHERE DATEDIFF(MINUTE, start_time, end_time) >= 30
GROUP BY CAST(start_time AS DATE);
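The same two-level logic (collapse to one row per session, then filter and count per day) can be sketched in Python over the sample log. Note this sketch compares exact durations in seconds, whereas the SQL DATEDIFF(MINUTE, ...) counts minute boundaries, so edge cases can differ slightly:

```python
from collections import defaultdict
from datetime import datetime

# Sample rows from the question: (Time1, session_id, customer_id).
rows = [
    ("2022-05-12 08:00:00", 11, 1),
    ("2022-05-12 08:20:00", 11, 1),
    ("2022-05-12 08:30:01", 11, 1),
    ("2022-05-12 08:14:00", 22, 2),
    ("2022-05-12 08:18:00", 22, 2),
    ("2022-05-12 08:16:00", 33, 1),
    ("2022-05-12 08:50:00", 33, 1),
]

# Level 1: first/last timestamp per (session, customer).
spans = {}
for t, sid, cid in rows:
    ts = datetime.fromisoformat(t)
    lo, hi = spans.get((sid, cid), (ts, ts))
    spans[(sid, cid)] = (min(lo, ts), max(hi, ts))

# Level 2: per day, count sessions lasting >= 30 minutes
# and collect the distinct customers behind them.
sessions_per_day = defaultdict(int)
customers_per_day = defaultdict(set)
for (sid, cid), (lo, hi) in spans.items():
    if (hi - lo).total_seconds() >= 30 * 60:
        sessions_per_day[lo.date()] += 1
        customers_per_day[lo.date()].add(cid)
```

With this data, sessions 11 and 33 qualify (30 min 1 s and 34 min), session 22 does not (4 min), giving two qualifying sessions but only one distinct customer on 2022-05-12.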

Get all entries where the following entry is not +1 minute

I have a table which stores all my executed jobs. Now I want to know if all the jobs were executed properly (every minute). Each entry has a created_at timestamp.
My question is how I can select all entries which were not executed 1 minute after the previous entry. This feels like a very complex query. So far I just have all the entries ordered by created_at.
SELECT *
FROM jobs
WHERE created_at IS NOT NULL
ORDER By created_at
created_at is a timestamp. Something like 2020-02-02 10:00:00.
Table Structure:
id job_name created_at
-----------------------------------
1 ABC 2020-02-02 10:00:00
2 ABC 2020-02-02 10:01:00
3 ABC 2020-02-02 10:02:00
4 ABC 2020-02-02 10:04:00
5 ABC 2020-02-02 10:07:00
The result I want is:
Now I want to get all dates where the job didn't get executed. So, at 10:03:00, 10:05:00 and 10:06:00 the job didn't get executed!
Do you guys have any idea? I guess it's a recursive query. This query needs to be written in Postgres.
WITH Table_with_next AS (
SELECT
id
,job_name
,created_at
,LEAD(created_at) OVER (PARTITION BY job_name ORDER BY created_at) as next_created_at
FROM jobs
)
SELECT
job_name
,generate_series(created_at + interval '1 min'
,next_created_at - interval '1 min'
,interval '1 min') as time_not_run
FROM Table_with_next
WHERE next_created_at-created_at > interval '1 min'
I used a CTE that contains the LEAD analytical function to get the next run timestamp. I then filtered the rows that have more than 1 minute between the runs and for those rows I generated 1 minute intervals between the run timestamp and the next run timestamp.
You can play around with it here: http://sqlfiddle.com/#!17/b237e/11
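The LEAD-plus-generate_series idea translates directly into procedural terms: pair each run with the next one and emit every whole minute strictly between them. A Python sketch over the sample created_at values:

```python
from datetime import datetime, timedelta

# Sample created_at values from the question, already sorted.
runs = [
    datetime(2020, 2, 2, 10, 0),
    datetime(2020, 2, 2, 10, 1),
    datetime(2020, 2, 2, 10, 2),
    datetime(2020, 2, 2, 10, 4),
    datetime(2020, 2, 2, 10, 7),
]

missing = []
one_min = timedelta(minutes=1)
for cur, nxt in zip(runs, runs[1:]):  # each row paired with its LEAD
    t = cur + one_min
    while t < nxt:  # generate_series(cur + 1 min, nxt - 1 min, 1 min)
        missing.append(t)
        t += one_min

# missing -> 10:03, 10:05, 10:06
```

Pairs with a gap of exactly one minute contribute nothing, matching the WHERE filter in the SQL version.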
I am guessing you want one job run per calendar minute. That way, you are immune from 59 versus 61 second lags.
You don't need lead() for this. Just generate the time series and join or use not exists:
select gs.job_name, gs.dt
from (select job_name,
generate_series(min(date_trunc('minute', created_at)),
max(created_at),
interval '1 minute'
) as dt
from tests
group by job_name
) gs
where not exists (select 1
from tests t
where t.job_name = gs.job_name and
date_trunc('minute', t.created_at) = gs.dt
);

How do i give the condition to group by time period?

I need to get the count of records using PostgreSQL from time 7:00:00 am till next day 6:59:59 am and the count resets again from 7:00am to 6:59:59 am.
Where I am using backend as java (Spring boot).
The columns in my table are
id (primary_id)
createdon (timestamp)
name
department
createdby
How do I give the condition for shift wise?
You'd need to pick a slice based on the current time-of-day (I am assuming this to be some kind of counter which will be auto-refreshed in some application).
One way to do that is using time ranges:
SELECT COUNT(*)
FROM mytable
WHERE createdon <@ (
SELECT CASE
WHEN current_time < '07:00'::time THEN
tsrange(CURRENT_DATE - '1d'::interval + '07:00'::time, CURRENT_DATE + '07:00'::time, '[)')
ELSE
tsrange(CURRENT_DATE + '07:00'::time, CURRENT_DATE + '1d'::interval + '07:00'::time, '[)')
END
)
;
Example with data: https://rextester.com/LGIJ9639
As I understand the question, you need to have a separate group for values in each 24-hour period that starts at 07:00:00.
SELECT
(
date_trunc('day', (createdon - '7h'::interval))
+ '7h'::interval
) AS date_bucket,
count(id) AS count
FROM lorem
GROUP BY date_bucket
ORDER BY date_bucket
This uses the date and time functions and the GROUP BY clause:
Shift the timestamp value back 7 hours ((createdon - '7h'::interval)), so the distinction can be made by a change of date (at 00:00:00). Then,
Truncate the value to the date (date_trunc('day', …)), so that all values in a bucket are flattened to a single value (the date at midnight). Then,
Add 7 hours again to the value (… + '7h'::interval), so that it represents the starting time of the bucket. Then,
Group by that value (GROUP BY date_bucket).
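The bucket computation (shift back 7 hours, truncate to the date, shift forward again) can be verified with a small Python sketch; the timestamps here are invented examples:

```python
from datetime import datetime, timedelta

SHIFT = timedelta(hours=7)

def bucket_start(created_on: datetime) -> datetime:
    """Start (07:00) of the 24-hour bucket containing created_on."""
    shifted = created_on - SHIFT  # moves the 07:00 boundary to midnight
    midnight = shifted.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + SHIFT       # shift back so the label reads 07:00

# 06:59 belongs to the previous day's bucket; 07:00 starts a new one.
assert bucket_start(datetime(2019, 3, 7, 6, 59)) == datetime(2019, 3, 6, 7, 0)
assert bucket_start(datetime(2019, 3, 7, 7, 0)) == datetime(2019, 3, 7, 7, 0)
```

This mirrors date_trunc('day', createdon - '7h'::interval) + '7h'::interval: subtracting first, truncating, then adding back keeps everything from 07:00:00 to 06:59:59 in one bucket.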
A more complete example, with schema and data:
DROP TABLE IF EXISTS lorem;
CREATE TABLE lorem (
id serial PRIMARY KEY,
createdon timestamp not null
);
INSERT INTO lorem (createdon) (
SELECT
generate_series(
CURRENT_TIMESTAMP - '36h'::interval,
CURRENT_TIMESTAMP + '36h'::interval,
'45m'::interval)
);
Now the query:
SELECT
(
date_trunc('day', (createdon - '7h'::interval))
+ '7h'::interval
) AS date_bucket,
count(id) AS count
FROM lorem
GROUP BY date_bucket
ORDER BY date_bucket
;
produces this result:
date_bucket | count
---------------------+-------
2019-03-06 07:00:00 | 17
2019-03-07 07:00:00 | 32
2019-03-08 07:00:00 | 32
2019-03-09 07:00:00 | 16
(4 rows)
You can use aggregation -- by subtracting 7 hours:
select (createdon - interval '7 hour')::date as dy, count(*)
from t
group by dy
order by dy;

Grouping Timestamps based on the interval between them

I have a table in Hive (SQL) with a bunch of timestamps that need to be grouped in order to create separate sessions based on the time difference between the timestamps.
Example:
Consider the following timestamps(Given in HH:MM for simplicity):
9.00
9.10
9.20
9.40
9.43
10.30
10.45
11.25
12.30
12.33
and so on..
So now, all timestamps that fall within 30 mins of the next timestamp come under the same session,
i.e. 9.00,9.10,9.20,9.40,9.43 form 1 session.
But since the difference between 9.43 and 10.30 is more than 30 mins, the time stamp 10.30 falls under a different session. Again, 10.30 and 10.45 fall under one session.
After we have created these sessions, we have to obtain the minimum timestamp for that session and the max timestamp.
I tried to subtract each timestamp from its LEAD and place a flag if the difference is greater than 30 mins, but I'm having difficulty with this.
Any suggestion from you guys would be greatly appreciated. Please let me know if the question isn't clear enough.
Expected Output for this sample data:
Session_start Session_end
9.00 9.43
10.30 10.45
11.25 11.25 (same because the next time is not within 30 mins)
12.30 12.33
Hope this helps.
So it's not MySQL but Hive. I don't know Hive, but if it supports LAG, as you say, try this PostgreSQL query. You will probably have to change the time difference calculation, that's usually different from one dbms to another.
select min(thetime) as start_time, max(thetime) as end_time
from
(
select thetime, count(gap) over (order by thetime rows between unbounded preceding and current row) as groupid
from
(
select thetime, case when thetime - lag(thetime) over (order by thetime) > interval '30 minutes' then 1 end as gap
from mytable
) times
) groups
group by groupid
order by min(thetime);
The query finds gaps, then uses a running total of gap counts to build group IDs, and the rest is aggregation.
SQL fiddle: http://www.sqlfiddle.com/#!17/8bc4a/6.
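The same gaps-and-islands technique can be expressed in Python: flag a row when the gap from the previous timestamp exceeds 30 minutes, take a running total of the flags as the session id, then aggregate per session. Times are the sample values from the question, placed on an arbitrary date:

```python
from datetime import datetime, timedelta

times = ["9:00", "9:10", "9:20", "9:40", "9:43",
         "10:30", "10:45", "11:25", "12:30", "12:33"]
ts = [datetime(2000, 1, 1, *map(int, t.split(":"))) for t in times]

# Gap flag + running total = session id (the "groupid" of the SQL query).
session_ids = []
sid = 0
for i, t in enumerate(ts):
    if i > 0 and t - ts[i - 1] > timedelta(minutes=30):
        sid += 1  # gap found: start a new session
    session_ids.append(sid)

# Aggregate: min/max timestamp per session.
sessions = {}
for t, s in zip(ts, session_ids):
    lo, hi = sessions.get(s, (t, t))
    sessions[s] = (min(lo, t), max(hi, t))
```

This reproduces the expected output: four sessions, with 11:25 forming a one-row session whose start and end are equal.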
With older versions of MySQL lacking the LAG and LEAD functions, getting the previous or next record is some work already. Here is how:
select
thetime,
(select max(thetime) from mytable afore where afore.thetime < mytable.thetime) as afore_time,
(select min(thetime) from mytable after where after.thetime > mytable.thetime) as after_time
from mytable;
Based on this we can build the whole query where we are looking for gaps (i.e. the time difference to the previous or next record is more than 30 minutes = 1800 seconds).
select
startrec.thetime as start_time,
(
select min(endrec.thetime)
from
(
select
thetime,
coalesce(time_to_sec(timediff((select min(thetime) from mytable after where after.thetime > mytable.thetime), thetime)), 1801) > 1800 as gap
from mytable
) endrec
where gap
and endrec.thetime >= startrec.thetime
) as end_time
from
(
select
thetime,
coalesce(time_to_sec(timediff(thetime, (select max(thetime) from mytable afore where afore.thetime < mytable.thetime))), 1801) > 1800 as gap
from mytable
) startrec
where gap;
SQL fiddle: http://www.sqlfiddle.com/#!2/d307b/20.
Try this:
SELECT MIN(session_time_tmp) session_start, MAX(session_time_tmp) session_end FROM
(
SELECT IF((TIME_TO_SEC(TIMEDIFF(your_time_field, COALESCE(#previousValue, your_time_field))) / 60) > 30 ,
#sessionCount := #sessionCount + 1, #sessionCount ) sessCount,
( #previousValue := your_time_field ) session_time_tmp FROM
(
SELECT your_time_field, #previousValue:= NULL, #sessionCount := 1 FROM yourtable ORDER BY your_time_field
) a
) b
GROUP BY sessCount
Just replace yourtable and your_time_field
Try this:
SELECT DATE_FORMAT(MIN(STR_TO_DATE(B.column1, '%H.%i')), '%H.%i') AS Session_start,
DATE_FORMAT(MAX(STR_TO_DATE(B.column1, '%H.%i')), '%H.%i') AS Session_end
FROM tableA A
LEFT JOIN ( SELECT A.column1, diff, IF(#diff:=diff < 30, #id, #id:=#id+1) AS rnk
FROM (SELECT B.column1, TIME_TO_SEC(TIMEDIFF(STR_TO_DATE(B.column1, '%H.%i'), STR_TO_DATE(A.column1, '%H.%i'))) / 60 AS diff
FROM tableA A
INNER JOIN tableA B ON STR_TO_DATE(A.column1, '%H.%i') < STR_TO_DATE(B.column1, '%H.%i')
GROUP BY STR_TO_DATE(A.column1, '%H.%i')
) AS A, (SELECT #diff:=0, #id:= 1) AS B
) AS B ON A.column1 = B.column1
GROUP BY IFNULL(B.rnk, 1);
Check the SQL FIDDLE DEMO
OUTPUT
| SESSION_START | SESSION_END |
|---------------|-------------|
| 9.00 | 9.43 |
| 10.30 | 10.45 |
| 11.25 | 11.25 |
| 12.30 | 12.33 |