Calculating gaps between time in PostgreSQL - sql

In my PostgreSQL database I have the following schema:
CREATE TABLE programs (
id integer,
description text
);
CREATE TABLE public.messages (
id integer,
program_id integer,
text text,
created_at timestamp with time zone
);
INSERT INTO programs VALUES(1, 'Test program');
INSERT INTO messages VALUES(1,1, 'Test message 1', now() - interval '7 days');
INSERT INTO messages VALUES(2,1, 'Test message 2', now() - interval '4 days');
INSERT INTO messages VALUES(3,1, 'Test message 3', now() - interval '1 days');
I want to calculate gaps between created_at in messages table. It should work this way:
Calculate gap between created_at of first and second message.
Calculate gap between created_at of second and third message.
Calculate average gap based on those values.
Is there any way of doing such a thing in PostgreSQL?
https://www.db-fiddle.com/f/gvxijmp8u6wr6mYcSoAeVV/0

Using LAG and windowed AVG to get both difference and average gap:
WITH cte AS (
SELECT *,
created_at-LAG(created_at) OVER(PARTITION BY program_id ORDER BY created_at) gap
FROM messages
)
SELECT *, AVG(gap) OVER(PARTITION BY program_id) AS avg_gap
FROM cte;

If you want the average time between messages, there is no need to compute the successive differences. Simply look at the oldest and newest messages and divide the span by the number of gaps:
select program_id,
(max(created_at) - min(created_at)) / nullif(count(*) - 1, 0)
from messages
group by program_id;
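The shortcut works because the successive gaps telescope: their sum is just the newest timestamp minus the oldest, and there are count(*) - 1 of them. A minimal sketch to check this, assuming the messages table and sample rows from the question; the aliases avg_of_gaps and span_over_count are made up for illustration:

```sql
-- Compare the LAG-based average with the (max - min) / (n - 1) shortcut.
-- Assumes the messages table from the question.
SELECT program_id,
       AVG(gap) AS avg_of_gaps,            -- AVG ignores the NULL gap of the first row
       (MAX(created_at) - MIN(created_at))
           / NULLIF(COUNT(*) - 1, 0) AS span_over_count
FROM (
    SELECT program_id,
           created_at,
           created_at - LAG(created_at) OVER (PARTITION BY program_id
                                              ORDER BY created_at) AS gap
    FROM messages
) AS g
GROUP BY program_id;
```

Both columns should agree for every program_id, up to interval rounding. A program with a single message yields NULL in both columns.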

Related

How do I get the matching id for every record?

My table is called platform_statuses, here is its schema:
CREATE TABLE platform_statuses (
id SERIAL,
account INTEGER REFERENCES users (id),
time TIMESTAMP DEFAULT NOW(),
value_in_cents INTEGER NOT NULL,
status INTEGER NOT NULL CHECK (
0 < status
AND status < 4
) DEFAULT 2,
-- 1: Active, 2: Suspended, 3: Market is closed
open_trades INTEGER NOT NULL,
PRIMARY KEY (id)
);
And this is my query, I would like to also get the matching id for the returned records.
SELECT
max(timediff),
time :: date,
account
FROM
(
SELECT
id,
time,
account,
abs(time - date_trunc('day', time + '12 hours')) as timediff
FROM
platform_statuses
) AS subquery
GROUP BY
account,
time :: date
Also note that the abs function you see in the query is a custom one I got off this answer. Here is its definition:
CREATE FUNCTION abs(interval) RETURNS interval AS $$
SELECT
CASE
WHEN ($1 < interval '0') THEN - $1
else $1
END;
$$ LANGUAGE sql immutable;
I understand that, for each account and day, you want the record with the greatest timediff. If so, you can use distinct on() directly on your existing subquery:
select distinct on (account, time::date)
id,
time,
account,
abs(time - date_trunc('day', time + '12 hours')) as timediff
from platform_statuses
order by account, time::date, timediff desc
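The same "one row per account and day" selection can also be written with ROW_NUMBER(), which may be easier to port since DISTINCT ON is PostgreSQL-specific. This is only a sketch: it assumes the platform_statuses table and the custom abs(interval) function from the question, and the names ranked and rn are made up for illustration:

```sql
-- Keep, per account and calendar day, the row with the greatest timediff.
-- Assumes the custom abs(interval) function defined in the question.
SELECT id, time, account, timediff
FROM (
    SELECT id,
           time,
           account,
           abs(time - date_trunc('day', time + interval '12 hours')) AS timediff,
           ROW_NUMBER() OVER (
               PARTITION BY account, time::date
               ORDER BY abs(time - date_trunc('day', time + interval '12 hours')) DESC
           ) AS rn
    FROM platform_statuses
) AS ranked
WHERE rn = 1;
```

If two rows tie for the greatest timediff, both variants pick one of them arbitrarily; add id to the ORDER BY if a deterministic choice matters.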
Why not just:
select * from platform_statuses
where abs(time - date_trunc('day', time + '12 hours')) =
(
select max(abs(time - date_trunc('day', time + '12 hours')))
from platform_statuses
)

list of all users who watched at least a movie every week in this month

create table active_users(
user_id numeric,
movie_streamed date
);
insert into active_users values (1,'2020-01-2'::date);
insert into active_users values (1,'2020-01-9'::date);
insert into active_users values (1,'2020-01-16'::date);
insert into active_users values (1,'2020-01-23'::date);
insert into active_users values (1,'2020-01-30'::date);
insert into active_users values (2,'2020-01-14'::date);
insert into active_users values (2,'2020-01-16'::date);
Hi all,
I am looking for a query that returns the users who watched at least one movie every week in this month (the month of the test data). Every record has the user_id and the date on which that person watched a movie. I want a generic answer that does not assume every month has 4 weeks, because some months span 5 weeks.
You can use generate_series(1,5) to count from 1 up to 5, since up to 5 different weeks may exist in a month, even incomplete ones, as you already mentioned.
The trick is to compare the distinct count of the beginning dates of each week within the current month:
SELECT u.user_id
FROM active_users u
JOIN generate_series( 1, 5 ) g
ON date_trunc('week', movie_streamed)
= date_trunc('week', current_date) + interval '7' day * (g-1)
GROUP BY u.user_id
HAVING COUNT(DISTINCT date_trunc('week', movie_streamed)) =
(
SELECT COUNT(*)
FROM generate_series( 1, 5 ) g
WHERE to_char(current_date,'yyyymm')
= to_char(date_trunc('week', current_date)
+ interval '7' day * (g-1),'yyyymm')
);
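A variant that does not depend on current_date is to enumerate the days of a fixed month and count the distinct week starts they fall into. The sketch below assumes January 2020 (the month of the test data), treats weeks as starting on Monday as date_trunc('week', ...) does, and counts partial weeks at the month boundaries as weeks; the name month_weeks is made up for illustration:

```sql
-- Users who streamed at least one movie in every week touched by January 2020.
WITH month_weeks AS (
    -- One row per distinct Monday-based week that contains a day of the month.
    SELECT DISTINCT date_trunc('week', d) AS week_start
    FROM generate_series(timestamp '2020-01-01',
                         timestamp '2020-01-31',
                         interval '1 day') AS d
)
SELECT u.user_id
FROM active_users u
WHERE u.movie_streamed >= date '2020-01-01'
  AND u.movie_streamed <  date '2020-02-01'
GROUP BY u.user_id
HAVING COUNT(DISTINCT date_trunc('week', u.movie_streamed::timestamp))
       = (SELECT COUNT(*) FROM month_weeks);
```

For the sample rows, January 2020 touches five weeks (starting 2019-12-30, then 2020-01-06, 01-13, 01-20 and 01-27). User 1 has a stream in each of them, while user 2 only has streams in the week of 2020-01-13.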

Calculating average time interval length

I have prepared a simple SQL Fiddle demonstrating my problem.
In PostgreSQL 10.3 I store user information, two-player games and the moves in the following 3 tables:
CREATE TABLE players (
uid SERIAL PRIMARY KEY,
name text NOT NULL
);
CREATE TABLE games (
gid SERIAL PRIMARY KEY,
player1 integer NOT NULL REFERENCES players ON DELETE CASCADE,
player2 integer NOT NULL REFERENCES players ON DELETE CASCADE
);
CREATE TABLE moves (
mid BIGSERIAL PRIMARY KEY,
uid integer NOT NULL REFERENCES players ON DELETE CASCADE,
gid integer NOT NULL REFERENCES games ON DELETE CASCADE,
played timestamptz NOT NULL
);
Let's assume that 2 players, Alice and Bob have played 3 games with each other:
INSERT INTO players (name) VALUES ('Alice'), ('Bob');
INSERT INTO games (player1, player2) VALUES (1, 2);
INSERT INTO games (player1, player2) VALUES (1, 2);
INSERT INTO games (player1, player2) VALUES (1, 2);
And let's assume that the 1st game was played quickly, with moves being played every minute.
But then they chilled :-) and played 2 slow games, with moves every 10 minutes:
INSERT INTO moves (uid, gid, played) VALUES
(1, 1, now() + interval '1 min'),
(2, 1, now() + interval '2 min'),
(1, 1, now() + interval '3 min'),
(2, 1, now() + interval '4 min'),
(1, 1, now() + interval '5 min'),
(2, 1, now() + interval '6 min'),
(1, 2, now() + interval '10 min'),
(2, 2, now() + interval '20 min'),
(1, 2, now() + interval '30 min'),
(2, 2, now() + interval '40 min'),
(1, 2, now() + interval '50 min'),
(2, 2, now() + interval '60 min'),
(1, 3, now() + interval '110 min'),
(2, 3, now() + interval '120 min'),
(1, 3, now() + interval '130 min'),
(2, 3, now() + interval '140 min'),
(1, 3, now() + interval '150 min'),
(2, 3, now() + interval '160 min');
At a web page with gaming statistics I would like to display average time passing between moves for each player.
So I suppose I have to use the LAG window function of PostgreSQL.
Since several games can be played simultaneously, I am trying to PARTITION BY gid (i.e. by the "game id").
Unfortunately, I get the syntax error "window function calls cannot be nested" with my SQL query:
SELECT AVG(played - LAG(played) OVER (PARTITION BY gid order by played))
OVER (PARTITION BY gid order by played)
FROM moves
-- trying to calculate average thinking time for player Alice
WHERE uid = 1;
UPDATE:
Since the number of games in my database is large and grows day by day, I have tried (here the new SQL Fiddle) adding a condition to the inner select query:
SELECT AVG(played - prev_played)
FROM (SELECT m.*,
LAG(m.played) OVER (PARTITION BY m.gid ORDER BY played) AS prev_played
FROM moves m
JOIN games g ON (m.uid in (g.player1, g.player2))
WHERE m.played > now() - interval '1 month'
) m
WHERE uid = 1;
However for some reason this changes the returned value quite radically to 1 min 45 sec.
And I wonder, why does the inner SELECT query suddenly return much more rows, is maybe some condition missing in my JOIN?
UPDATE 2:
Oh ok, I get why the average value decreases: through multiple rows with same timestamps (i.e. played - prev_played = 0), but how to fix the JOIN?
UPDATE 3:
Nevermind, I was missing the m.gid = g.gid AND condition in my SQL JOIN, now it works:
SELECT AVG(played - prev_played)
FROM (SELECT m.*,
LAG(m.played) OVER (PARTITION BY m.gid ORDER BY played) AS prev_played
FROM moves m
JOIN games g ON (m.gid = g.gid AND m.uid in (g.player1, g.player2))
WHERE m.played > now() - interval '1 month'
) m
WHERE uid = 1;
You need subqueries to nest the window functions. I think this does what you want:
select avg(played - prev_played)
from (select m.*,
lag(m.played) over (partition by gid order by played) as prev_played
from moves m
) m
where uid = 1;
Note: The where needs to go in the outer query, so it doesn't affect the lag().
Probably @gordon's answer is good enough, but it isn't the result you asked for in your comment. It only works because the data have the same number of rows for each game, so the average of the per-game averages equals the overall average. If you want the average over games, you need one additional level.
With cte as (
SELECT gid, AVG(played - prev_played) as play_avg
FROM (select m.*,
lag(m.played) over (partition by gid order by played) as prev_played
from moves m
) m
WHERE uid = 1
GROUP BY gid
)
SELECT AVG(play_avg)
FROM cte
;

How can I select stats on lateness of a record expected x days after previous record?

I have entities with config info in one table. If the 'vendor' doesn't do something within 'reminder_days' of the last time of doing it, then it becomes overdue.
CREATE TABLE t_vendors
(
vendor_id NUMBER,
vendor_name VARCHAR2 (250),
reminder_days NUMBER
);
Insert into T_VENDORS (vendor_id, vendor_name, reminder_days)
Values (12, 'sanity-test', 7);
and an app records what they do whenever they do it into this table with this sort of data:
CREATE TABLE t_vendor_events
(
vendor_event_id NUMBER (19,0), -- type missing in the original; assumed to match vendor_id
vendor_id NUMBER (19,0),
description VARCHAR2 (250),
event_date DATE
);
Insert into t_vendor_events (vendor_event_id, vendor_id, description, event_date)
Values (10015, 12, 'one', TO_DATE('11/9/2015 21:22:55', 'MM/DD/YYYY HH24:MI:SS'));
Insert into t_vendor_events (vendor_event_id, vendor_id, description, event_date)
Values (10016, 12, 'two', TO_DATE('11/16/2015 21:23:55', 'MM/DD/YYYY HH24:MI:SS'));
Insert into t_vendor_events (vendor_event_id, vendor_id, description, event_date)
Values (10017, 12, 'three', TO_DATE('11/30/2015 21:24:55', 'MM/DD/YYYY HH24:MI:SS'));
Insert into t_vendor_events (vendor_event_id, vendor_id, description, event_date)
Values (10018, 12, 'four', TO_DATE('12/01/2015 21:25:55', 'MM/DD/YYYY HH24:MI:SS'));
Once I've got the comparative values, I need to aggregate the data to quantify the lateness:
how many events occurred
how often they were overdue
what was expected (the reminder days value)
how much they were late on average
how much they were late at worst (max)
I need to see all the vendors in the result, including those that failed to produce an event at all.
All the solutions that I can think of involve creating extra columns and storing some kind of 'lateness' data on every event. This though strikes me as a redundancy, since I know the required interval (reminder_days) but I don't know what kind of nested selects would produce what I need.
I would prefer to stick to standard SQL and I'm not using PL-SQL, but am able to use Oracle-specific syntax in selects where necessary.
The result would look something like this (Expected Days is the 'reminder days' column):
Vendor  Event  Overdue  Expected  Avg      Max
        Count  Count    Days      Elapsed  Elapsed
Mega1   5      2        10        12       20
Ole!    6      0        10        9        10
GoPunk  0      0        0         0        0
X-Dan   0      0        0         0        0
RetroB  1      1        30        60       60
You can use lag to get the previous event_date and calculate the difference with the current event_date. Then select rows where the difference is > reminder_days by vendor. Just aggregate the final result, to know how often a vendor was late.
with prev as
(select lag(event_date) over(partition by vendor_id order by event_date) prevdt
, t.* from t_vendor_events t)
select v.vendor_id, v.vendor_name, event_date - nvl(prevdt, event_date) diff
from prev p
join t_vendors v on p.vendor_id = v.vendor_id
where event_date - nvl(prevdt, event_date) > v.reminder_days
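The aggregation step could look roughly like the sketch below, in Oracle syntax. It rests on assumptions that are not in the question: elapsed time is measured only between consecutive events of the same vendor (so a vendor's first event has no gap, and the time since the last event is not counted), results are rounded to whole days, and the column aliases are made up for illustration. The LEFT JOIN keeps vendors that have no events at all:

```sql
-- Per-vendor lateness statistics; gaps are in days (Oracle DATE - DATE).
WITH prev AS (
    SELECT t.*,
           LAG(event_date) OVER (PARTITION BY vendor_id
                                 ORDER BY event_date) AS prevdt
    FROM t_vendor_events t
)
SELECT v.vendor_name,
       COUNT(p.vendor_event_id) AS event_count,
       -- CASE without ELSE yields NULL, which COUNT skips
       COUNT(CASE WHEN p.event_date - p.prevdt > v.reminder_days
                  THEN 1 END) AS overdue_count,
       v.reminder_days AS expected_days,
       NVL(ROUND(AVG(p.event_date - p.prevdt)), 0) AS avg_elapsed,
       NVL(ROUND(MAX(p.event_date - p.prevdt)), 0) AS max_elapsed
FROM t_vendors v
LEFT JOIN prev p ON p.vendor_id = v.vendor_id
GROUP BY v.vendor_id, v.vendor_name, v.reminder_days;
```

Under these assumptions a vendor with a single event reports 0 for the elapsed columns, which differs from the RetroB row in the expected output. Matching that row would require also counting the time since the last event.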

Retrieve IDs with a minimum time gap between consecutive rows

I have the following event table in Postgres 9.3:
CREATE TABLE event (
event_id integer PRIMARY KEY,
user_id integer,
event_type varchar,
event_time timestamptz
);
My goal is to retrieve all user_id's with a gap of at least 30 days between any of their events (or between their last event and the current time). An additional complication is that I only want the users who have one of these gaps occur at a later time than them performing a certain event_type 'convert'. How can this be done easily?
Some example data in the event table might look like:
INSERT INTO event (event_id, user_id, event_type, event_time)
VALUES
(10, 1, 'signIn', '2015-05-05 00:11'),
(11, 1, 'browse', '2015-05-05 00:12'), -- no 'convert' event
(20, 2, 'signIn', '2015-06-07 02:35'),
(21, 2, 'browse', '2015-06-07 02:35'),
(22, 2, 'convert', '2015-06-07 02:36'), -- only 'convert' event
(23, 2, 'signIn', '2015-08-10 11:00'), -- gap of >= 30 days
(24, 2, 'signIn', '2015-08-11 11:00'),
(30, 3, 'convert', '2015-08-07 02:36'), -- starting with 1st 'convert' event
(31, 3, 'signIn', '2015-08-07 02:36'),
(32, 3, 'convert', '2015-08-08 02:36'),
(33, 3, 'signIn', '2015-08-12 11:00'), -- all gaps below 30 days
(34, 3, 'browse', '2015-08-12 11:00'), -- gap until today (2015-08-20) too small
(40, 4, 'convert', '2015-05-07 02:36'),
(41, 4, 'signIn', '2015-05-12 11:00'); -- gap until today (2015-08-20) >= 30 days
Expected result:
user_id
--------
2
4
One way to do it:
SELECT user_id
FROM (
SELECT user_id
, lead(e.event_time, 1, now()) OVER (PARTITION BY e.user_id ORDER BY e.event_time)
- event_time AS gap
FROM ( -- only users with 'convert' event
SELECT user_id, min(event_time) AS first_time
FROM event
WHERE event_type = 'convert'
GROUP BY 1
) e1
JOIN event e USING (user_id)
WHERE e.event_time >= e1.first_time
) sub
WHERE gap >= interval '30 days'
GROUP BY 1;
The window function lead() lets you provide a default value for when there is no "next row", which is convenient to cover your additional requirement "or between their last event and the current time".
Indexes
You should at least have an index on (user_id, event_time) if your table is big:
CREATE INDEX event_user_time_idx ON event(user_id, event_time);
If you do that often and the event_type 'convert' is rare, add another partial index:
CREATE INDEX event_user_time_convert_idx ON event(user_id, event_time)
WHERE event_type = 'convert';
For many events per user
And only if gaps of 30 days are common (not a rare case).
Indexes become even more important.
Try this recursive CTE for better performance:
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT DISTINCT ON (user_id)
user_id, event_time, interval '0 days' AS gap
FROM event
WHERE event_type = 'convert'
ORDER BY user_id, event_time
)
UNION ALL
SELECT c.user_id, e.event_time, COALESCE(e.event_time, now()) - c.event_time
FROM cte c
LEFT JOIN LATERAL (
SELECT e.event_time
FROM event e
WHERE e.user_id = c.user_id
AND e.event_time > c.event_time
ORDER BY e.event_time
LIMIT 1 -- the next later event
) e ON true -- add 1 row after last to consider gap till "now"
WHERE c.event_time IS NOT NULL
AND c.gap < interval '30 days'
)
SELECT * FROM cte
WHERE gap >= interval '30 days';
It has considerably more overhead, but can stop - per user - at the first gap that's big enough. If that happens to be the gap between the last event and now, then event_time in the result is NULL.
New SQL Fiddle with more revealing test data demonstrating both queries.
Detailed explanation in these related answers:
Optimize GROUP BY query to retrieve latest record per user
Select first row in each GROUP BY group?
This is another way, probably not as neat as @Erwin's, but all the steps are separated so it is easy to adapt.
include_today: adds a dummy event to indicate the current date.
event_convert: calculates the first time the 'convert' event appears for each user_id (in this case only user_id = 2222).
event_row: assigns a unique consecutive id to each event, starting from 1 for each user_id.
The last part joins everything together, using rnum = rnum + 1 to calculate the date difference.
The result also shows both events involved in the 30-day range, so you can see whether that is the result you want.
WITH include_today as (
(SELECT 'xxxx' event_id, user_id, 'today' event_type, current_date as event_time
FROM users)
UNION
(SELECT *
FROM event)
),
event_convert as (
SELECT user_id, MIN(event_time) min_time
FROM event
WHERE event_type = 'convert'
GROUP BY user_id
),
event_row as (
SELECT *, row_number() OVER (PARTITION BY user_id ORDER BY event_time desc) as rnum
FROM
include_today
)
SELECT
A.user_id,
A.event_id eventA,
A.event_type typeA,
A.event_time timeA,
B.event_id eventB,
B.event_type typeB,
B.event_time timeB,
(B.event_time - A.event_time) days
FROM
event_convert e
Inner Join event_row A
ON e.user_id = A.user_id and e.min_time <= a. event_time
Inner Join event_row B
ON A.rnum = B.rnum + 1
AND A.user_id = B.user_id
WHERE
(B.event_time - A.event_time) > interval '30 days'
ORDER BY 1,4