Get all entries where the following entry is not +1 minute - sql

I have a table that stores all my executed jobs. Now I want to check whether all the jobs were executed properly (every minute). Each entry has a created_at timestamp.
My question is: how can I select all entries that were not executed 1 minute after the previous entry? This feels like a very complex query. So far I just have all the entries ordered by created_at.
SELECT *
FROM jobs
WHERE created_at IS NOT NULL
ORDER BY created_at;
created_at is a timestamp. Something like 2020-02-02 10:00:00.
Table structure:
id  job_name  created_at
-----------------------------------
1   ABC       2020-02-02 10:00:00
2   ABC       2020-02-02 10:01:00
3   ABC       2020-02-02 10:02:00
4   ABC       2020-02-02 10:04:00
5   ABC       2020-02-02 10:07:00
The result I want:
I want to get all times at which the job was not executed. In the sample above, the job was not executed at 10:03:00, 10:05:00 and 10:06:00.
Does anyone have an idea? I guess it's a recursive query. The query needs to run on PostgreSQL.

WITH Table_with_next AS (
SELECT
id
,job_name
,created_at
,LEAD(created_at) OVER (PARTITION BY job_name ORDER BY created_at) as next_created_at
FROM jobs
)
SELECT
job_name
,generate_series(created_at + interval '1 min'
,next_created_at - interval '1 min'
,interval '1 min') as time_not_run
FROM Table_with_next
WHERE next_created_at-created_at > interval '1 min'
I used a CTE that contains the LEAD analytical function to get the next run timestamp. I then filtered the rows that have more than 1 minute between the runs and for those rows I generated 1 minute intervals between the run timestamp and the next run timestamp.
You can play around with it here: http://sqlfiddle.com/#!17/b237e/11
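The same pairing logic can be sketched outside the database. This is a minimal Python illustration, not the answer's SQL: it assumes a single job (no PARTITION BY job_name), and the function name missing_runs and its step parameter are hypothetical. Pairing each run with the next one plays the role of LEAD(), and the inner loop plays the role of generate_series().

```python
from datetime import datetime, timedelta

def missing_runs(run_times, step=timedelta(minutes=1)):
    """Return the expected run times that lie between consecutive runs."""
    runs = sorted(run_times)
    missing = []
    # zip(runs, runs[1:]) pairs every run with its successor, like LEAD().
    for current, nxt in zip(runs, runs[1:]):
        t = current + step
        # Emit current + step, current + 2*step, ... up to nxt - step,
        # the same bounds the generate_series() call uses.
        while t <= nxt - step:
            missing.append(t)
            t += step
    return missing

# Sample data from the question (created_at values of job ABC).
runs = [datetime(2020, 2, 2, 10, m) for m in (0, 1, 2, 4, 7)]
gaps = missing_runs(runs)
print([t.strftime("%H:%M") for t in gaps])  # ['10:03', '10:05', '10:06']
```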

I am guessing you want one job run per calendar minute. That way, you are immune from 59 versus 61 second lags.
You don't need lead() for this. Just generate the time series and join or use not exists:
select gs.job_name, gs.dt
from (select job_name,
             generate_series(min(date_trunc('minute', created_at)),
                             max(created_at),
                             interval '1 minute'
                            ) as dt
      from jobs
      group by job_name
     ) gs
where not exists (select 1
                  from jobs t
                  where t.job_name = gs.job_name and
                        date_trunc('minute', t.created_at) = gs.dt
                 );
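To see why truncating to the calendar minute tolerates 59 versus 61 second lags, here is a small Python sketch of the same idea for a single job. The function name missing_minutes and the sample timestamps are made up for illustration; the set lookup stands in for the NOT EXISTS check.

```python
from datetime import datetime, timedelta

def missing_minutes(run_times):
    """Return calendar minutes between the first and last run that have no run."""
    # Truncate each run to the start of its minute, like date_trunc('minute', ...).
    seen = {t.replace(second=0, microsecond=0) for t in run_times}
    if not seen:
        return []
    minute, last = min(seen), max(seen)
    missing = []
    while minute <= last:
        if minute not in seen:  # plays the role of NOT EXISTS
            missing.append(minute)
        minute += timedelta(minutes=1)
    return missing

# Hypothetical runs: 61 and 59 second lags still hit every minute,
# but nothing ran during minute 10:03.
runs = [
    datetime(2020, 2, 2, 10, 0, 0),
    datetime(2020, 2, 2, 10, 1, 1),
    datetime(2020, 2, 2, 10, 2, 0),
    datetime(2020, 2, 2, 10, 4, 0),
]
print(missing_minutes(runs))  # [datetime.datetime(2020, 2, 2, 10, 3)]
```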

Related

SQL Server : count distinct every 30 minutes or more

We have an activity database that records user interaction to a website, storing a log that includes values such as Time1, session_id and customer_id e.g.
2022-05-12 08:00:00|11|1
2022-05-12 08:20:00|11|1
2022-05-12 08:30:01|11|1
2022-05-12 08:14:00|22|2
2022-05-12 08:18:00|22|2
2022-05-12 08:16:00|33|1
2022-05-12 08:50:00|33|1
I need to have two separate queries:
Query #1: I need to count sessions multiple times if they have a log of 30 minutes or more grouping them on sessions on daily basis.
For example: Initially count=0
For session_id = 11, it starts at 08:00 and the last time with the same session_id is 08:30 -- count=1
For session_id = 22, it starts at 08:14 and the last time with the same session_id is 08:18 -- the count is still 1, since the session lasted less than 30 minutes
I tried this query, but it didn't work
select
    count(session_id)
from
    table1
where
    @datetime between Time1 and dateadd(minute, 30, Time1);
Expected result:
Query #2: it's an extension of the above query where I need the unique customers on daily basis whose sessions were 30 min or more.
For example: from the above table I will have two unique customers on May 8th
Expected result
The Time1 column is stored as a timestamp; in the output I will group it on a daily basis.
This is a two-level aggregation (GROUP BY) problem. You need to start with a subquery to get the first and last timestamp of each session.
SELECT MIN(Time1) start_time,
MAX(Time1) end_time,
session_id, customer_id
FROM table1
GROUP BY session_id, customer_id
Next you need to use the subquery like this:
SELECT COUNT(session_id),
COUNT(DISTINCT customer_id),
CAST(start_time AS DATE)
FROM (
SELECT MIN(Time1) start_time,
MAX(Time1) end_time,
session_id, customer_id
FROM table1
GROUP BY session_id, customer_id
) a
WHERE DATEDIFF(MINUTE, start_time, end_time) >= 30
GROUP BY CAST(start_time AS DATE);
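The two aggregation levels can also be traced in plain Python on the sample rows from the question. This is only a sketch: the function name daily_long_sessions is hypothetical, and it compares exact elapsed time, whereas DATEDIFF(MINUTE, ...) in SQL Server counts minute boundaries crossed, so results can differ for sessions that are a few seconds short of 30 minutes.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# (Time1, session_id, customer_id), copied from the question.
rows = [
    ("2022-05-12 08:00:00", 11, 1),
    ("2022-05-12 08:20:00", 11, 1),
    ("2022-05-12 08:30:01", 11, 1),
    ("2022-05-12 08:14:00", 22, 2),
    ("2022-05-12 08:18:00", 22, 2),
    ("2022-05-12 08:16:00", 33, 1),
    ("2022-05-12 08:50:00", 33, 1),
]

def daily_long_sessions(rows, min_length=timedelta(minutes=30)):
    # Level 1: first and last timestamp per (session_id, customer_id).
    bounds = {}
    for ts, session_id, customer_id in rows:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        start, end = bounds.get((session_id, customer_id), (t, t))
        bounds[(session_id, customer_id)] = (min(start, t), max(end, t))
    # Level 2: keep long sessions and aggregate per start date.
    per_day = defaultdict(lambda: {"sessions": 0, "customers": set()})
    for (session_id, customer_id), (start, end) in bounds.items():
        if end - start >= min_length:
            per_day[start.date()]["sessions"] += 1
            per_day[start.date()]["customers"].add(customer_id)
    # day -> (number of long sessions, number of distinct customers)
    return {d: (v["sessions"], len(v["customers"])) for d, v in per_day.items()}

result = daily_long_sessions(rows)
print(result)  # {datetime.date(2022, 5, 12): (2, 1)}
```

On this sample, sessions 11 and 33 qualify and both belong to customer 1, which is why the session count and the distinct customer count differ.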

Getting counts for overlapping time periods

I have a table data in PostgreSQL with this structure:
created_at   customer_email   status
2020-12-31   xxx@gmail.com    opened
...
2020-12-24   yyy@gmail.com    delivered
2020-12-24   xxx@gmail.com    opened
...
2020-12-17   zzz@gmail.com    opened
2020-12-10   xxx@gmail.com    opened
2020-12-03   hhh@gmail.com    enqueued
2020-11-27   xxx@gmail.com    opened
...
2020-11-20   rrr@gmail.com    opened
2020-11-13   ttt@gmail.com    opened
There are many rows for each day.
Basically I need 2021-W01 for this week with the count of unique emails with status "opened" within the last 90 days. Likewise for every week before that.
Desired output:
period active
2021-W01 1539
2020-W53 1480
2020-W52 1630
2020-W51 1820
2020-W50 1910
2020-W49 1890
2020-W48 2000
How can I do that?
Window functions would come to mind. Alas, those don't allow DISTINCT aggregations.
Instead, get distinct counts from a LATERAL subquery:
WITH weekly_dist AS (
SELECT DISTINCT date_trunc('week', created_at) AS wk, customer_email
FROM tbl
WHERE status = 'opened'
)
SELECT to_char(t.wk, 'YYYY"-W"IW') AS period, ct.active
FROM (
SELECT generate_series(date_trunc('week', min(created_at) + interval '1 week')
, date_trunc('week', now()::timestamp)
, interval '1 week') AS wk
FROM tbl
) t
LEFT JOIN LATERAL (
SELECT count(DISTINCT customer_email) AS active
FROM weekly_dist d
WHERE d.wk >= t.wk - interval '91 days'
AND d.wk < t.wk
) ct ON true;
db<>fiddle here
I operate with timestamp, not timestamptz, which might make a difference in corner cases.
The CTE weekly_dist reduces the set to distinct "opened" emails. This step is strictly optional, but increases performance significantly if there can be more than a few duplicates per week.
The derived table t generates a timestamp for the beginning of each week, from the earliest entry in the table up to "now". This way I make sure no week is skipped, even if there are no rows for it. See:
PostgreSQL: running count of rows for a query 'by minute'
Generating time series between two dates in PostgreSQL
But I do skip the first week since I count active emails before each start of the week.
Then LEFT JOIN LATERAL to a subquery computing the distinct count for the 90-day time-range. To be precise, I deduct 91 days, and exclude the start of the current week. This happens to fall in line with the weekly pre-aggregated data from the CTE. Be wary of that if you shift bounds.
Finally, to_char(t.wk, 'YYYY"-W"IW') is a compact expression to get your desired format for week numbers. Details in the manual here.
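For readers who want to trace the windowing by hand, here is a rough Python equivalent of the steps described above. It is a sketch under assumptions: the names week_start and active_per_week, the today parameter, and the sample rows are hypothetical, created_at is treated as a plain date, and the period label uses the ISO year from isocalendar().

```python
from datetime import date, timedelta

def week_start(d):
    # Monday of the week containing d, like date_trunc('week', ...).
    return d - timedelta(days=d.weekday())

def active_per_week(events, today, window_days=91):
    """events: list of (created_at, customer_email, status) tuples."""
    # Pre-aggregate to distinct (week, email) pairs for 'opened' rows (the CTE).
    weekly = {(week_start(d), email) for d, email, status in events
              if status == "opened"}
    # Skip the first week, because each count looks backwards from the week start.
    wk = week_start(min(d for d, _, _ in events)) + timedelta(weeks=1)
    result = {}
    while wk <= week_start(today):
        lower = wk - timedelta(days=window_days)
        # Distinct emails seen in [wk - 91 days, wk), like the LATERAL subquery.
        emails = {email for w, email in weekly if lower <= w < wk}
        iso = wk.isocalendar()
        result[f"{iso[0]}-W{iso[1]:02d}"] = len(emails)
        wk += timedelta(weeks=1)
    return result

# Hypothetical rows shaped like the question's sample.
events = [
    (date(2020, 11, 27), "xxx@gmail.com", "opened"),
    (date(2020, 12, 10), "xxx@gmail.com", "opened"),
    (date(2020, 12, 17), "zzz@gmail.com", "opened"),
    (date(2020, 12, 24), "yyy@gmail.com", "delivered"),
    (date(2020, 12, 31), "xxx@gmail.com", "opened"),
]
counts = active_per_week(events, today=date(2021, 1, 4))
print(counts["2021-W01"])  # 2
```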
You can combine the date_part() function with a group by like this:
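SELECT
    -- 'isoyear' pairs with the ISO week number returned by 'week'
    DATE_PART('isoyear', created_at)::varchar || '-W' || DATE_PART('week', created_at)::varchar AS period,
    SUM(CASE WHEN status = 'opened' THEN 1 ELSE 0 END) AS opened
FROM
    your_table
GROUP BY 1
-- created_at itself is not grouped, so order by an aggregate of it
ORDER BY MIN(created_at) DESC;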

How to use BigQuery Analytic Functions to calculate time between timestamped rows?

I have a data set that represents analytics events like:
Row timestamp account_id type
1 2018-11-14 21:05:40 UTC abc start
2 2018-11-14 21:05:40 UTC xyz another_type
3 2018-11-26 22:01:19 UTC xyz start
4 2018-11-26 22:01:23 UTC abc start
5 2018-11-26 22:01:29 UTC xyz some_other_type
11 2018-11-26 22:13:58 UTC xyz start
...
With some number of account_ids. I need to find the average time between start records per account_id.
I'm trying to use analytic functions as described here. My end goal would be a table like:
Row account_id avg_time_between_events_mins
1 xyz 53
2 abc 47
3 pqr 65
...
My best attempt, based on this post, looks like this:
WITH
events AS (
SELECT
COUNTIF(type = 'start' AND account_id='abc') OVER (ORDER BY timestamp) as diff,
timestamp
FROM
`myproject.dataset.events`
WHERE
account_id='abc')
SELECT
min(timestamp) AS start_time,
max(timestamp) AS next_start_time,
ABS(timestamp_diff(min(timestamp), max(timestamp), MINUTE)) AS minutes_between
FROM
events
GROUP BY
diff
This calculates the time between each start event and the last non-start event prior to the next start event for a specific account_id.
I tried to use PARTITION and a WINDOW FRAME CLAUSE like this:
WITH
events AS (
SELECT
COUNT(*) OVER (PARTITION BY account_id ORDER BY timestamp ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) as diff,
timestamp
FROM
`myproject.dataset.events`
WHERE
type = 'start')
SELECT
min(timestamp) AS start_time,
max(timestamp) AS next_start_time,
ABS(timestamp_diff(min(timestamp), max(timestamp), MINUTE)) AS minutes_between
FROM
events
GROUP BY
diff
But I got a nonsense result table. Can anyone walk me through how I would write and reason about a query like this?
You don't really need analytic functions for this:
select account_id,
       timestamp_diff(max(timestamp), min(timestamp), MINUTE) / nullif(count(*) - 1, 0) as avg_time_between_events_mins
from `myproject.dataset.events`
where type = 'start'
group by account_id;
This is the timestamp of the most recent minus the oldest, divided by one less than the number of starts. That is the average between the starts.
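The shortcut works because the consecutive gaps telescope: their sum is just the last timestamp minus the first. A small Python check, using two start times from the question plus one made-up timestamp, with a hypothetical function name avg_minutes_between and fractional minutes rather than whole ones:

```python
from datetime import datetime

def avg_minutes_between(timestamps):
    """Average gap in minutes, computed as (max - min) / (count - 1)."""
    if len(timestamps) < 2:
        return None  # mirrors NULLIF(count(*) - 1, 0): there is no gap to average
    span = max(timestamps) - min(timestamps)
    return span.total_seconds() / 60 / (len(timestamps) - 1)

starts = [
    datetime(2018, 11, 26, 22, 1, 19),
    datetime(2018, 11, 26, 22, 13, 58),
    datetime(2018, 11, 26, 22, 40, 0),  # made-up third start event
]

# The long way: average the gaps between consecutive start events.
ordered = sorted(starts)
gaps = [(b - a).total_seconds() / 60 for a, b in zip(ordered, ordered[1:])]
long_way = sum(gaps) / len(gaps)

shortcut = avg_minutes_between(starts)
print(abs(shortcut - long_way) < 1e-9)  # True
```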

How do I give the condition to group by time period?

I need to get the count of records using PostgreSQL from time 7:00:00 am till next day 6:59:59 am and the count resets again from 7:00am to 6:59:59 am.
My backend is Java (Spring Boot).
The columns in my table are
id (primary_id)
createdon (timestamp)
name
department
createdby
How do I give the condition for shift wise?
You'd need to pick a slice based on the current time-of-day (I am assuming this to be some kind of counter which will be auto-refreshed in some application).
One way to do that is using time ranges:
SELECT COUNT(*)
FROM mytable
WHERE createdon <@ (
    SELECT CASE
        WHEN current_time < '07:00'::time THEN
            tsrange(CURRENT_DATE - '1d'::interval + '07:00'::time, CURRENT_DATE + '07:00'::time, '[)')
        ELSE
            tsrange(CURRENT_DATE + '07:00'::time, CURRENT_DATE + '1d'::interval + '07:00'::time, '[)')
    END
)
;
Example with data: https://rextester.com/LGIJ9639
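The CASE expression picks the 24-hour window that contains "now". The same decision, sketched in Python with a hypothetical helper current_shift, may make the two branches easier to follow; the returned pair is a half-open range like the '[)' bounds of tsrange.

```python
from datetime import datetime, time, timedelta

def current_shift(now, shift_start=time(7, 0)):
    """Return (start, end) of the shift containing now; end is exclusive."""
    start = datetime.combine(now.date(), shift_start)
    if now.time() < shift_start:
        # Before 07:00 we are still in the shift that began yesterday.
        start -= timedelta(days=1)
    return start, start + timedelta(days=1)

early = current_shift(datetime(2019, 3, 7, 6, 59, 59))
late = current_shift(datetime(2019, 3, 7, 7, 0, 0))
print(early[0], late[0])  # 2019-03-06 07:00:00 2019-03-07 07:00:00
```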
As I understand the question, you need to have a separate group for values in each 24-hour period that starts at 07:00:00.
SELECT
(
date_trunc('day', (createdon - '7h'::interval))
+ '7h'::interval
) AS date_bucket,
count(id) AS count
FROM lorem
GROUP BY date_bucket
ORDER BY date_bucket
This uses the date and time functions and the GROUP BY clause:
Shift the timestamp value back 7 hours ((createdon - '7h'::interval)), so the distinction can be made by a change of date (at 00:00:00). Then,
Truncate the value to the date (date_trunc('day', …)), so that all values in a bucket are flattened to a single value (the date at midnight). Then,
Add 7 hours again to the value (… + '7h'::interval), so that it represents the starting time of the bucket. Then,
Group by that value (GROUP BY date_bucket).
A more complete example, with schema and data:
DROP TABLE IF EXISTS lorem;
CREATE TABLE lorem (
id serial PRIMARY KEY,
createdon timestamp not null
);
INSERT INTO lorem (createdon) (
SELECT
generate_series(
CURRENT_TIMESTAMP - '36h'::interval,
CURRENT_TIMESTAMP + '36h'::interval,
'45m'::interval)
);
Now the query:
SELECT
(
date_trunc('day', (createdon - '7h'::interval))
+ '7h'::interval
) AS date_bucket,
count(id) AS count
FROM lorem
GROUP BY date_bucket
ORDER BY date_bucket
;
produces this result:
date_bucket | count
---------------------+-------
2019-03-06 07:00:00 | 17
2019-03-07 07:00:00 | 32
2019-03-08 07:00:00 | 32
2019-03-09 07:00:00 | 16
(4 rows)
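The shift, truncate, shift-back steps of this answer translate directly to Python. A minimal sketch with a hypothetical function shift_bucket and made-up timestamps; Counter stands in for GROUP BY with count.

```python
from collections import Counter
from datetime import datetime, timedelta

SHIFT = timedelta(hours=7)

def shift_bucket(ts):
    shifted = ts - SHIFT  # move 07:00 onto midnight
    midnight = shifted.replace(hour=0, minute=0, second=0, microsecond=0)  # date_trunc('day', ...)
    return midnight + SHIFT  # label the bucket by its 07:00 start

# Made-up timestamps around the 07:00 boundary.
stamps = [
    datetime(2019, 3, 7, 6, 59, 59),  # belongs to the bucket starting 2019-03-06 07:00
    datetime(2019, 3, 7, 7, 0, 0),    # first instant of the next bucket
    datetime(2019, 3, 8, 6, 59, 59),  # last second of that same bucket
]
counts = Counter(shift_bucket(t) for t in stamps)
for bucket in sorted(counts):
    print(bucket, counts[bucket])
```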
You can use aggregation -- by subtracting 7 hours:
select (createdon - interval '7 hour')::date as dy, count(*)
from t
group by dy
order by dy;

Grouping Timestamps based on the interval between them

I have a table in Hive (SQL) with a bunch of timestamps that need to be grouped in order to create separate sessions based on the time difference between the timestamps.
Example:
Consider the following timestamps(Given in HH:MM for simplicity):
9.00
9.10
9.20
9.40
9.43
10.30
10.45
11.25
12.30
12.33
and so on..
So now, all timestamps that fall within 30 mins of the next timestamp come under the same session,
i.e. 9.00,9.10,9.20,9.40,9.43 form 1 session.
But since the difference between 9.43 and 10.30 is more than 30 mins, the time stamp 10.30 falls under a different session. Again, 10.30 and 10.45 fall under one session.
After we have created these sessions, we have to obtain the minimum timestamp for that session and the max timestamp.
I tried to subtract the current timestamp with its LEAD and place a flag if it is greater than 30 mins, but I'm having difficulty with this.
Any suggestion from you guys would be greatly appreciated. Please let me know if the question isn't clear enough.
Expected Output for this sample data:
Session_start Session_end
9.00 9.43
10.30 10.45
11.25 11.25 (same because the next time is not within 30 mins)
12.30 12.33
Hope this helps.
So it's not MySQL but Hive. I don't know Hive, but if it supports LAG, as you say, try this PostgreSQL query. You will probably have to change the time difference calculation, that's usually different from one dbms to another.
select min(thetime) as start_time, max(thetime) as end_time
from
(
  -- order by thetime makes the running count of gaps deterministic
  select thetime, count(gap) over (order by thetime rows between unbounded preceding and current row) as groupid
  from
  (
    select thetime, case when thetime - lag(thetime) over (order by thetime) > interval '30 minutes' then 1 end as gap
    from mytable
  ) times
) groups
group by groupid
order by min(thetime);
The query finds gaps, then uses a running total of gap counts to build group IDs, and the rest is aggregation.
SQL fiddle: http://www.sqlfiddle.com/#!17/8bc4a/6.
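The gap-then-running-total idea can be followed step by step in Python on the question's sample times. This is a sketch, not Hive code: the function name sessions is hypothetical, and the times are attached to an arbitrary date so they can be subtracted.

```python
from datetime import datetime, timedelta

def sessions(times, max_gap=timedelta(minutes=30)):
    """Group sorted timestamps into (start, end) sessions split at gaps > max_gap."""
    result = []
    for t in sorted(times):
        if result and t - result[-1][1] <= max_gap:
            result[-1][1] = t      # no gap: extend the current session
        else:
            result.append([t, t])  # gap (or first row): start a new session
    return [(start, end) for start, end in result]

# Sample times from the question, placed on an arbitrary day.
sample = [(9, 0), (9, 10), (9, 20), (9, 40), (9, 43),
          (10, 30), (10, 45), (11, 25), (12, 30), (12, 33)]
times = [datetime(2020, 1, 1, h, m) for h, m in sample]

found = sessions(times)
for start, end in found:
    print(start.strftime("%H.%M"), end.strftime("%H.%M"))
```

Starting a new session whenever the difference to the previous row exceeds the limit is the same decision the gap flag encodes; the running count in the SQL only turns those flags into group IDs.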
With MySQL lacking LAG and LEAD functions, getting the previous or next record is some work already. Here is how:
select
thetime,
(select max(thetime) from mytable afore where afore.thetime < mytable.thetime) as afore_time,
(select min(thetime) from mytable after where after.thetime > mytable.thetime) as after_time
from mytable;
Based on this we can build the whole query where we are looking for gaps (i.e. the time difference to the previous or next record is more than 30 minutes = 1800 seconds).
select
startrec.thetime as start_time,
(
select min(endrec.thetime)
from
(
select
thetime,
coalesce(time_to_sec(timediff((select min(thetime) from mytable after where after.thetime > mytable.thetime), thetime)), 1801) > 1800 as gap
from mytable
) endrec
where gap
and endrec.thetime >= startrec.thetime
) as end_time
from
(
select
thetime,
coalesce(time_to_sec(timediff(thetime, (select max(thetime) from mytable afore where afore.thetime < mytable.thetime))), 1801) > 1800 as gap
from mytable
) startrec
where gap;
SQL fiddle: http://www.sqlfiddle.com/#!2/d307b/20.
Try this..
SELECT MIN(session_time_tmp) session_start, MAX(session_time_tmp) session_end FROM
(
    SELECT IF((TIME_TO_SEC(TIMEDIFF(your_time_field, COALESCE(@previousValue, your_time_field))) / 60) > 30 ,
              @sessionCount := @sessionCount + 1, @sessionCount ) sessCount,
           ( @previousValue := your_time_field ) session_time_tmp FROM
    (
        SELECT your_time_field, @previousValue:= NULL, @sessionCount := 1 FROM yourtable ORDER BY your_time_field
    ) a
) b
GROUP BY sessCount
Just replace yourtable and your_time_field
Try this:
SELECT DATE_FORMAT(MIN(STR_TO_DATE(B.column1, '%H.%i')), '%H.%i') AS Session_start,
       DATE_FORMAT(MAX(STR_TO_DATE(B.column1, '%H.%i')), '%H.%i') AS Session_end
FROM tableA A
LEFT JOIN ( SELECT A.column1, diff, IF(@diff:=diff < 30, @id, @id:=@id+1) AS rnk
            FROM (SELECT B.column1, TIME_TO_SEC(TIMEDIFF(STR_TO_DATE(B.column1, '%H.%i'), STR_TO_DATE(A.column1, '%H.%i'))) / 60 AS diff
                  FROM tableA A
                  INNER JOIN tableA B ON STR_TO_DATE(A.column1, '%H.%i') < STR_TO_DATE(B.column1, '%H.%i')
                  GROUP BY STR_TO_DATE(A.column1, '%H.%i')
                 ) AS A, (SELECT @diff:=0, @id:= 1) AS B
          ) AS B ON A.column1 = B.column1
GROUP BY IFNULL(B.rnk, 1);
Check the SQL FIDDLE DEMO
OUTPUT
| SESSION_START | SESSION_END |
|---------------|-------------|
| 9.00 | 9.43 |
| 10.30 | 10.45 |
| 11.25 | 11.25 |
| 12.30 | 12.33 |