Finding gaps in huge event streams?

I have about 1 million events in a PostgreSQL database that are of this format:
 id | stream_id | timestamp
----+-----------+-----------
  1 |         7 | ....
  2 |         8 | ....
There are about 50,000 unique streams.
I need to find all of the events where the time between two consecutive events is over a certain time period. In other words, I need to find the event pairs that immediately surround a span with no events in it.
For example:
a b c d e f               g h i j k
| | | | | |               | | | | |
           \____2 mins____/
In this scenario, I would want to find the pair (f, g) since those are the events immediately surrounding a gap.
I don't care if the query is (that) slow, i.e. on 1 million records it's fine if it takes an hour or so. However, the data set will keep growing, so hopefully if it's slow it scales sanely.
I also have the data in MongoDB.
What's the best way to perform this query?

You can do this with the lag() window function over a partition by stream_id, ordered by timestamp. The lag() function gives you access to previous rows in the partition; without an explicit offset, it returns the value from the immediately preceding row. So if the partition on stream_id is ordered by time, then the previous row is the previous event for that stream_id. Note that a window function's output can't be referenced in the WHERE clause of the same query level, so the gap filter has to go in an outer query:
SELECT stream_id, start_id, end_id, diff
FROM (
  SELECT stream_id,
         lag(id) OVER pair AS start_id,
         id AS end_id,
         ("timestamp" - lag("timestamp") OVER pair) AS diff
  FROM my_table
  WINDOW pair AS (PARTITION BY stream_id ORDER BY "timestamp")
) AS gaps
WHERE diff > interval '2 minutes';
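Since the data set will keep growing, an index that matches the window's partitioning and ordering lets Postgres read each stream's events already sorted instead of re-sorting everything on each run. A minimal sketch, assuming the my_table name used above:

CREATE INDEX ON my_table (stream_id, "timestamp");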

In Postgres it can be done very easily with the help of the lag() window function. Check the fiddle below as an example:
SQL Fiddle
PostgreSQL 9.3 Schema Setup:
CREATE TABLE Table1
("id" int, "stream_id" int, "timestamp" timestamp)
;
INSERT INTO Table1
("id", "stream_id", "timestamp")
VALUES
(1, 7, '2015-06-01 15:20:30'),
(2, 7, '2015-06-01 15:20:31'),
(3, 7, '2015-06-01 15:20:32'),
(4, 7, '2015-06-01 15:25:30'),
(5, 7, '2015-06-01 15:25:31')
;
Query 1:
with c as (
  select *,
         lag("timestamp") over (partition by stream_id order by id) as pre_time,
         lag(id) over (partition by stream_id order by id) as pre_id
  from Table1
)
select * from c where "timestamp" - pre_time > interval '2 sec'
Results:
| id | stream_id | timestamp | pre_time | pre_id |
|----|-----------|------------------------|------------------------|--------|
| 4 | 7 | June, 01 2015 15:25:30 | June, 01 2015 15:20:32 | 3 |

Postgres query for difference between latest and first record of the day

Postgres data like this:
| id    | read_at                | value_1 |
|-------|------------------------|---------|
| 16239 | 2021-11-28 16:13:00+00 | 1509    |
| 16238 | 2021-11-28 16:12:00+00 | 1506    |
| 16237 | 2021-11-28 16:11:00+00 | 1505    |
| 16236 | 2021-11-28 16:10:00+00 | 1501    |
| 16235 | 2021-11-28 16:09:00+00 | 1501    |
| ..... | ...................... | ....    |
| 15266 | 2021-11-28 00:00:00+00 | 1288    |
A value is added every minute and increases over time.
I would like to get the current total for the day and have this in a Grafana stat panel. Above it would be: 221 (1509-1288). Latest record minus first record of today.
SELECT id,read_at,value_1
FROM xyz
ORDER BY id DESC
LIMIT 1;
With this the latest record is given (A).
SELECT id,read_at,value_1
FROM xyz
WHERE read_at = CURRENT_DATE
ORDER BY id DESC
LIMIT 1;
With this the first record of the day is given (B).
Grafana cannot do the math on this (A - B); a single query would be best.
Sadly my database knowledge is limited, and my attempts at building queries have not succeeded despite taking all afternoon.
Theoretical ideas to solve this:
Subtract the min from the max value where time frame is today.
Using a lag, lag it for the count of records that are recorded today. Subtract lag value from latest value.
Window function.
What is the best way (performance wise) forward and how would such query be written?
Calculate the cumulative total last_value - first_value for each record for the current day using window functions (this is the t subquery) and then pick the latest one.
select current_total, read_at::date as read_at_date
from
(
select last_value(value_1) over w - first_value(value_1) over w as current_total,
read_at
from the_table
where read_at >= current_date and read_at < current_date + 1
window w as (partition by read_at::date order by read_at)
) as t
order by read_at desc limit 1;
However, if it is certain that value_1 only "increases over time", then simple grouping will do, and that is by far the best way performance-wise:
select max(value_1) - min(value_1) as current_total,
read_at::date as read_at_date
from the_table
where read_at >= current_date and read_at < current_date + 1
group by read_at::date;
Please check whether it works.
Since you intend to publish it in Grafana, the query does not impose a period filter.
https://www.db-fiddle.com/f/4jyoMCicNSZpjMt4jFYoz5/3080
create table g (id int, read_at timestamp, value_1 int);
insert into g
values
(16239, '2021-11-28 16:13:00+00', 1509),
(16238, '2021-11-28 16:12:00+00', 1506),
(16237, '2021-11-28 16:11:00+00', 1505),
(16236, '2021-11-28 16:10:00+00', 1501),
(16235, '2021-11-28 16:09:00+00', 1501),
(15266, '2021-11-28 00:00:00+00', 1288);
select date(read_at), max(value_1) - min(value_1)
from g
group by date(read_at);
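If the panel should follow Grafana's time picker rather than always showing the current day, the Grafana PostgreSQL data source can inject the dashboard's time range for you. A hedged sketch, assuming its $__timeFilter macro (which expands to a BETWEEN condition on the given column):

select max(value_1) - min(value_1) as current_total
from g
where $__timeFilter(read_at);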
Since your data contains the same value (1501) at two distinct times (16:09 and 16:10), the value does not always increase within the time interval, which leaves open the possibility of a decrease. So do you want the max - min reading, or the difference between the readings at the min/max times? The following gets the difference between the first and latest reading of the day, as indicated in the title.
with parm(dt) as
( values (date '2021-11-28') )
, first_read (f_read,f_value) as
( select read_at, value_1
from test_tbl
where read_at at time zone 'UTC'=
( select min(read_at at time zone 'UTC')
from test_tbl
join parm
on ((read_at at time zone 'UTC')::date = dt)
)
)
, last_read (l_read, l_value) as
( select read_at,value_1
from test_tbl
where read_at at time zone 'UTC'=
( select max(read_at at time zone 'UTC')
from test_tbl
join parm
on ((read_at at time zone 'UTC')::date = dt)
)
)
select l_read, f_read, l_value, f_value, l_value - f_value as "Day Difference"
from last_read
join first_read on true;
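For completeness, the same first-vs-latest difference can also be sketched without CTEs by letting two ordered LIMIT 1 subqueries pick the endpoints; a minimal alternative, assuming the same test_tbl and date:

select l.value_1 - f.value_1 as "Day Difference"
from (select value_1 from test_tbl
      where read_at::date = date '2021-11-28'
      order by read_at asc limit 1) f
cross join
     (select value_1 from test_tbl
      where read_at::date = date '2021-11-28'
      order by read_at desc limit 1) l;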

Sum duration of overlapping periods with priority by excluding the overlap itself

I have R code that I am trying to rewrite in PostgreSQL to feed a Grafana dashboard. I have the basics down, so I am almost done with the other parts of the script, but what I am trying to accomplish now in PostgreSQL is beyond my league. I see very similar solved issues on Stack Overflow, but I can't seem to get them to work for me. Here are some links with code that I was trying to adapt:
https://stackoverflow.com/a/54370027/7885817
https://stackoverflow.com/a/44139381/7885817
I apologize for posting a repetitive issue.
Any help is highly appreciated!
So, my issue is:
I have messages with overlapping timestamps. These messages have priorities, A and B (A is more important), a start time, and an end time.
Strictly speaking: I would like to sum the durations for A and B,
BUT if there is an overlap I want to find the duration between the first start time and the last end time of messages with priority A, and the same for messages with priority B. And if an A message overlaps with a B message, I want to split this duration at the end time of the A message; until that point, the duration of the B message is allocated to A.
I made a visual to support my cryptic explanations, and a simplified version of my data:
CREATE TABLE activities(
id int,
name text,
start timestamp,
"end" timestamp
);
INSERT INTO activities VALUES
(1, 'A', '2018-01-09 17:00:00', '2018-01-09 20:00:00'),
(2, 'A', '2018-01-09 18:00:00', '2018-01-09 20:30:00'),
(3, 'B', '2018-01-09 19:00:00', '2018-01-09 21:30:00'),
(4, 'B', '2018-01-09 22:00:00', '2018-01-09 23:00:00');
SELECT * FROM activities;
Thank you very much for your time!
Update
My original solution was not correct. The consolidation of ranges cannot be handled in a regular window. I confused myself by using the same name, trange, forgetting that the window is over the source rows rather than the result rows. Please see the updated SQL Fiddle with the full query as well as an added record to illustrate the problem.
You can simplify the overlapping requirement as well as identifying gaps and islands using PostgreSQL range types.
The following query is intentionally verbose to show each step of the process. A number of steps can be combined.
SQL Fiddle
First, add an inclusive [start, end] range to each record.
with add_ranges as (
select id, name, tsrange(start, "end", '[]') as t_range
from activities
),
id | name | t_range
----+------+-----------------------------------------------
1 | A | ["2018-01-09 17:00:00","2018-01-09 20:00:00"]
2 | A | ["2018-01-09 18:00:00","2018-01-09 20:30:00"]
3 | B | ["2018-01-09 19:00:00","2018-01-09 21:30:00"]
4 | B | ["2018-01-09 22:00:00","2018-01-09 23:00:00"]
(4 rows)
Identify overlapping ranges as determined by the && operator and mark the beginning of new islands with a 1.
mark_islands as (
select id, name, t_range,
case
when t_range && lag(t_range) over w then 0
else 1
end as new_range
from add_ranges
window w as (partition by name order by t_range)
),
id | name | t_range | new_range
----+------+-----------------------------------------------+-----------
1 | A | ["2018-01-09 17:00:00","2018-01-09 20:00:00"] | 1
2 | A | ["2018-01-09 18:00:00","2018-01-09 20:30:00"] | 0
3 | B | ["2018-01-09 19:00:00","2018-01-09 21:30:00"] | 1
4 | B | ["2018-01-09 22:00:00","2018-01-09 23:00:00"] | 1
(4 rows)
Number the groups using a running sum of new_range within each name.
group_nums as (
select id, name, t_range,
sum(new_range) over (partition by name order by t_range) as group_num
from mark_islands
),
id | name | t_range | group_num
----+------+-----------------------------------------------+-----------
1 | A | ["2018-01-09 17:00:00","2018-01-09 20:00:00"] | 1
2 | A | ["2018-01-09 18:00:00","2018-01-09 20:30:00"] | 1
3 | B | ["2018-01-09 19:00:00","2018-01-09 21:30:00"] | 1
4 | B | ["2018-01-09 22:00:00","2018-01-09 23:00:00"] | 2
Group by name, group_num to get the total time spent on the island as well as a complete t_range to be used in overlap deduction.
islands as (
select name,
tsrange(min(lower(t_range)), max(upper(t_range)), '[]') as t_range,
max(upper(t_range)) - min(lower(t_range)) as island_time_interval
from group_nums
group by name, group_num
),
name | t_range | island_time_interval
------+-----------------------------------------------+----------------------
A | ["2018-01-09 17:00:00","2018-01-09 20:30:00"] | 03:30:00
B | ["2018-01-09 19:00:00","2018-01-09 21:30:00"] | 02:30:00
B | ["2018-01-09 22:00:00","2018-01-09 23:00:00"] | 01:00:00
(3 rows)
For the requirement to count overlap time between A messages and B messages, find occurrences where an A message overlaps a B message, and use the * intersection operator to compute the overlap.
priority_overlaps as (
select b.name, a.t_range * b.t_range as overlap_range
from islands a
join islands b
on a.t_range && b.t_range
and a.name = 'A' and b.name != 'A'
),
name | overlap_range
------+-----------------------------------------------
B | ["2018-01-09 19:00:00","2018-01-09 20:30:00"]
(1 row)
Sum the total time of each overlap by name.
overlap_time as (
select name, sum(upper(overlap_range) - lower(overlap_range)) as total_overlap_interval
from priority_overlaps
group by name
),
name | total_overlap_interval
------+------------------------
B | 01:30:00
(1 row)
Calculate the total time for each name.
island_times as (
select name, sum(island_time_interval) as name_time_interval
from islands
group by name
)
name | name_time_interval
------+--------------------
B | 03:30:00
A | 03:30:00
(2 rows)
Join the total time for each name to adjustments from the overlap_time CTE, and subtract the adjustment for the final duration value.
select i.name,
i.name_time_interval - coalesce(o.total_overlap_interval, interval '0') as duration
from island_times i
left join overlap_time o
on o.name = i.name
;
name | duration
------+----------
B | 02:00:00
A | 03:30:00
(2 rows)
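As an aside, on PostgreSQL 14 or later the island-consolidation steps can be collapsed with the range_agg() aggregate, which merges overlapping ranges into a multirange; a sketch under that version assumption:

-- PostgreSQL 14+ only: merge overlapping ranges per name, then unnest one row per island
select name, unnest(range_agg(tsrange(start, "end", '[]'))) as t_range
from activities
group by name;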
This is a type of gaps-and-islands problem. To solve this, find where the "islands" begin and then aggregate. So, to get the islands:
select a.name, min(start) as startt, max("end") as endt
from (select a.*,
count(*) filter (where prev_end is null or prev_end < start) over (partition by name order by start, id) as grp
from (select a.*,
max("end") over (partition by name
order by start, id
rows between unbounded preceding and 1 preceding
) as prev_end
from activities a
) a
) a
group by name, grp;
The next step is just to aggregate again:
with islands as (
select a.name, min(start) as startt, max("end") as endt
from (select a.*,
count(*) filter (where prev_end is null or prev_end < start) over (partition by name order by start, id) as grp
from (select a.*,
max("end") over (partition by name
order by start, id
rows between unbounded preceding and 1 preceding
) as prev_end
from activities a
) a
) a
group by name, grp
)
select name, sum(endt - startt)
from islands i
group by name;
Here is a db<>fiddle.
Note that this uses a cumulative trailing maximum to define the overlaps. This is the most general method for determining overlaps. I think this will work on all edge cases, including:
1----------2---2----3--3-----1
It also handles ties on the start time.
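To see why the cumulative trailing maximum matters, consider adding hypothetical rows shaped like that diagram. Comparing each row only against the immediately preceding row's end would wrongly start a new island at id 7 (13:00 is after id 6 ends at 12:00), while the running max of "end" (16:00 from id 5) correctly keeps it inside the outer interval:

-- hypothetical extra rows (not part of the original data)
INSERT INTO activities VALUES
(5, 'C', '2018-01-10 10:00:00', '2018-01-10 16:00:00'), -- outer 1........1
(6, 'C', '2018-01-10 11:00:00', '2018-01-10 12:00:00'), -- nested 2--2
(7, 'C', '2018-01-10 13:00:00', '2018-01-10 14:00:00'); -- nested 3--3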

Unable to calculate difference between CTE subquery outputs for use in larger PostgreSQL query output column

Using PostgreSQL v9.4.5 from the shell, I created a database called moments in psql by running create database moments. I then created a moments table:
CREATE TABLE moments
(
id SERIAL4 PRIMARY KEY,
moment_type BIGINT NOT NULL,
flag BIGINT NOT NULL,
time TIMESTAMP NOT NULL,
UNIQUE(moment_type, time)
);
INSERT INTO moments (moment_type, flag, time) VALUES (1, 7, '2016-10-29 12:00:00');
INSERT INTO moments (moment_type, flag, time) VALUES (1, -30, '2016-10-29 13:00:00');
INSERT INTO moments (moment_type, flag, time) VALUES (3, 5, '2016-10-29 14:00:00');
INSERT INTO moments (moment_type, flag, time) VALUES (2, 9, '2016-10-29 18:00:00');
INSERT INTO moments (moment_type, flag, time) VALUES (2, -20, '2016-10-29 17:00:00');
INSERT INTO moments (moment_type, flag, time) VALUES (3, 10, '2016-10-29 16:00:00');
I run select * from moments to view the table:
Moments Table
id | moment_type | flag | time
----+-------------+------+---------------------
1 | 1 | 7 | 2016-10-29 12:00:00
2 | 1 | -30 | 2016-10-29 13:00:00
3 | 3 | 5 | 2016-10-29 14:00:00
4 | 2 | 9 | 2016-10-29 18:00:00
5 | 2 | -20 | 2016-10-29 17:00:00
6 | 3 | 10 | 2016-10-29 16:00:00
I then try to write an SQL query that produces the following output: for each pair of rows sharing a moment_type, it returns the difference between the flag value of the row with the most recent time and the flag value of the row with the second most recent time, listing the results in ascending order by moment_type.
Expected SQL Query Output
moment_type | flag |
------------+------+
1 | -37 | (i.e. -30 - 7)
2 | 29 | (i.e. 9 - -20)
3 | 5 | (i.e. 10 - 5)
The SQL query that I came up with is as follows. It uses a WITH query to define multiple Common Table Expression (CTE) subqueries for use as temporary tables in the larger SELECT query at the end. I also use an SQL function to calculate the difference between two of the subquery outputs (alternatively, I think I could have just used DIFFERENCE(most_recent_flag, second_most_recent_flag) AS flag instead of the function):
CREATE FUNCTION difference(most_recent_flag numeric, second_most_recent_flag numeric) RETURNS numeric AS $$
SELECT $1 - $2;
$$ LANGUAGE SQL;
-- get two flags that have the most recent timestamps
WITH two_most_recent_flags AS (
SELECT moments.flag
FROM moments
ORDER BY moments.time DESC
LIMIT 2
),
-- get one flag that has the most recent timestamp
most_recent_flag AS (
SELECT *
FROM two_most_recent_flags
ORDER BY flag DESC
LIMIT 1
),
-- get one flag that has the second most recent timestamp
second_most_recent_flag AS (
SELECT *
FROM two_most_recent_flags
ORDER BY flag ASC
LIMIT 1
)
SELECT DISTINCT ON (moments.moment_type)
moments.moment_type,
difference(most_recent_flag, second_most_recent_flag) AS flag
FROM moments
ORDER BY moment_type ASC
LIMIT 2;
But when I run the above SQL query in PostgreSQL, it returns the following error:
ERROR: column "most_recent_flag" does not exist
LINE 21: difference(most_recent_flag, second_most_recent_flag) AS fla...
Question
What techniques can I use and how may I apply them to overcome this error, and calculate and display the differences in the flag column to achieve the Expected SQL Query Output?
Note: perhaps a window function may be used somehow, as it performs calculations across table rows
Use the lag() window function:
select moment_type, difference
from (
select *, flag - lag(flag) over w as difference
from moments
window w as (partition by moment_type order by time)
) s
where difference is not null
order by moment_type
moment_type | difference
-------------+------------
1 | -37
2 | 29
3 | 5
(3 rows)
One method is to use conditional aggregation. The window function row_number() can be used to identify the first and last time values:
select m.moment_type,
(max(case when seqnum_desc = 1 then flag end) -
min(case when seqnum_asc = 1 then flag end)
)
from (select m.*,
row_number() over (partition by m.moment_type order by m.time) as seqnum_asc,
row_number() over (partition by m.moment_type order by m.time desc) as seqnum_desc
from moments m
) m
group by m.moment_type;

Rolling counts based on rolling cohorts

Using Postgres 9.5. Test data:
create temp table rental (
customer_id smallint
,rental_date timestamp without time zone
,customer_name text
);
insert into rental values
(1, '2006-05-01', 'james'),
(1, '2006-06-01', 'james'),
(1, '2006-07-01', 'james'),
(1, '2006-07-02', 'james'),
(2, '2006-05-02', 'jacinta'),
(2, '2006-05-03', 'jacinta'),
(3, '2006-05-04', 'juliet'),
(3, '2006-07-01', 'juliet'),
(4, '2006-05-03', 'julia'),
(4, '2006-06-01', 'julia'),
(5, '2006-05-05', 'john'),
(5, '2006-06-01', 'john'),
(5, '2006-07-01', 'john'),
(6, '2006-07-01', 'jacob'),
(7, '2006-07-02', 'jasmine'),
(7, '2006-07-04', 'jasmine');
I am trying to understand the behaviour of existing customers. I am trying to answer this question:
What is the likelihood of a customer to order again based on when their last order was (current month, previous month (m-1)...to m-12)?
Likelihood is calculated as:
distinct count of people who ordered in current month /
distinct count of people in their cohort.
Thus, I need to generate a table that lists a count of the people who ordered in the current month, who belong in a given cohort.
Thus, what are the rules for being in a cohort?
- current month cohort: >1 order in month OR (1 order in month given no previous orders)
- m-1 cohort: <=1 order in current month and >=1 order in m-1
- m-2 cohort: <=1 order in current month and 0 orders in m-1 and >=1 order in m-2
- etc
I am using the DVD Store database as sample data to develop the query: http://linux.dell.com/dvdstore/
Here is an example of cohort rules and aggregations, based on July being the
"month's orders being analysed" (please notice: the "month's orders being analysed" column is the first column in the 'Desired output' table below):
customer_id | jul-16| jun-16| may-16|
------------|-------|-------|-------|
james | 1 1 | 1 | 1 | <- member of jul cohort, made order in jul
jasmine | 1 1 | | | <- member of jul cohort, made order in jul
jacob | 1 | | | <- member of jul cohort, did NOT make order in jul
john | 1 | 1 | 1 | <- member of jun cohort, made order in jul
julia | | 1 | 1 | <- member of jun cohort, did NOT make order in jul
juliet | 1 | | 1 | <- member of may cohort, made order in jul
jacinta | | | 1 1 | <- member of may cohort, did NOT make order in jul
This data would output the following table:
--where m = month's orders being analysed
month's orders |how many people |how many people from |how many people |how many people from |how many people |how many people from |
being analysed |are in cohort m |cohort m ordered in m |are in cohort m-1 |cohort m-1 ordered in m |are in cohort m-2 |cohort m-2 ordered in m |...m-12
---------------|----------------|----------------------|------------------|------------------------|------------------|------------------------|
may-16 |5 |1 | | | | |
jun-16 | | |5 |3 | | |
jul-16 |3 |2 |2 |1 |2 |1 |
My attempts so far have been on variations of:
generate_series()
and
row_number() over (partition by customer_id order by rental_id desc)
I haven't been able to get everything to come together yet (I've tried for many hours and haven't yet solved it).
For readability, I think posting my work in parts is better (if anyone wants me to post the sql query in its entirety please comment - and I'll add it).
series query:
(select
generate_series(date_trunc('month', min(rental_date)), date_trunc('month', max(rental_date)), '1 month') as month_being_analysed
from
rental) as series
rank query:
(select
*,
row_number() over (partition by customer_id order by rental_id desc) as rnk
from
rental
where
date_trunc('month',rental_date) <= series.month_being_analysed) as orders_ranked
I want to do something like: run the orders_ranked query for every row returned by the series query, and then base aggregations on each return of orders_ranked.
Something like:
(--this query counts the customers in cohort m-1
select
count(distinct customer_id)
from
(--this query ranks the orders that have occured <= to the date in the row of the 'series' table
select
*,
row_number() over (partition by customer_id order by rental_id desc) as rnk
from
rental
where
date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
where
(rnk=1 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
OR
(rnk=2 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
) as people_2nd_last_booking_in_m_1,
(--this query counts the customers in cohort m-1 who ordered in month m
select
count(distinct customer_id)
from
(--this query returns the orders by customers in cohort m-1
select
count(distinct customer_id)
from
(--this query ranks the orders that have occured <= to the date in the row of the 'series' table
select
*,
row_number() over (partition by customer_id order by rental_id desc) as rnk
from
rental
where
date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
where
(rnk=1 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
OR
(rnk=2 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
where
rnk=1 in series.month_being_analysed
) as people_who_booked_in_m_whose_2nd_last_booking_was_in_m_1,
...
from
(select
generate_series(date_trunc('month', min(rental_date)), date_trunc('month', max(rental_date)), '1 month') as month_being_analysed
from
rental) as series
This query does everything. It operates on the whole table and works for any time range.
Based on some assumptions and assuming current Postgres version 9.5. Should work with pg 9.1 at least. Since your definition of "cohort" is unclear to me, I skipped the "how many people in cohort" columns.
I would expect it to be faster than anything you tried so far. By orders of magnitude.
SELECT *
FROM crosstab (
$$
SELECT mon
, sum(count(*)) OVER (PARTITION BY mon)::int AS m0
, gap -- count of months since last order
, count(*) AS gap_ct
FROM (
SELECT mon
, mon_int - lag(mon_int) OVER (PARTITION BY c_id ORDER BY mon_int) AS gap
FROM (
SELECT DISTINCT ON (1,2)
date_trunc('month', rental_date)::date AS mon
, customer_id AS c_id
, extract(YEAR FROM rental_date)::int * 12
+ extract(MONTH FROM rental_date)::int AS mon_int
FROM rental
) dist_customer
) gap_to_last_month
GROUP BY mon, gap
ORDER BY mon, gap
$$
, 'SELECT generate_series(1,12)'
) ct (mon date, m0 int
, m01 int, m02 int, m03 int, m04 int, m05 int, m06 int
, m07 int, m08 int, m09 int, m10 int, m11 int, m12 int);
Result:
mon | m0 | m01 | m02 | m03 | m04 | m05 | m06 | m07 | m08 | m09 | m10 | m11 | m12
------------+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
2015-01-01 | 63 | 36 | 15 | 5 | 3 | 3 | | | | | | |
2015-02-01 | 56 | 35 | 9 | 9 | 2 | | 1 | | | | | |
...
m0 .. customers with >= 1 order this month
m01 .. customers with >= 1 order this month and >= 1 order 1 month before (nothing in between)
m02 .. customers with >= 1 order this month and >= 1 order 2 months before and no order in between
etc.
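Note that crosstab() is not built into core Postgres; it ships with the additional tablefunc module, which has to be installed once per database before the query above will run:

CREATE EXTENSION IF NOT EXISTS tablefunc;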
How?
In subquery dist_customer reduce to one row per month and customer_id (mon, c_id) with DISTINCT ON:
Select first row in each GROUP BY group?
To simplify later calculations add a count of months for the date (mon_int). Related:
How do you do date math that ignores the year?
If there are many orders per (month, customer), there are faster query techniques for the first step:
Optimize GROUP BY query to retrieve latest record per user
In subquery gap_to_last_month add the column gap indicating the time gap between this month and the last month with any orders of the same customer. Using the window function lag() for this. Related:
PostgreSQL window function: partition by comparison
In the outer SELECT aggregate per (mon, gap) to get the counts you are after. In addition, get the total count of distinct customers for this month m0.
Feed this query to crosstab() to pivot the result into the desired tabular form for the result. Basics:
PostgreSQL Crosstab Query
About the "extra" column m0:
Pivot on Multiple Columns using Tablefunc

Select first & last date in window

I'm trying to select the first & last date in a window, based on the month & year of the date supplied.
Here is example data:
F.rates
| id | c_id | date       | rate |
|----|------|------------|------|
| 1  | 1    | 01-01-1991 | 1    |
| 1  | 1    | 15-01-1991 | 0.5  |
| 1  | 1    | 30-01-1991 | 2    |
| .. | .... | .......... | .... |
| 1  | 1    | 01-11-2014 | 1    |
| 1  | 1    | 15-11-2014 | 0.5  |
| 1  | 1    | 30-11-2014 | 2    |
Here is the pgSQL SELECT I came up with:
SELECT c_id, first_value(date) OVER w, last_value(date) OVER w FROM F.rates
WINDOW w AS (PARTITION BY EXTRACT(YEAR FROM date), EXTRACT(MONTH FROM date), c_id
ORDER BY date ASC)
Which gives me a result pretty close to what I want:
| c_id | first_date | last_date |
----------------------------------
| 1 | 01-01-1991 | 15-01-1991 |
| 1 | 01-01-1991 | 30-01-1991 |
.................................
Should be:
| c_id | first_date | last_date |
----------------------------------
| 1 | 01-01-1991 | 30-01-1991 |
.................................
For some reason, last_value(date) returns a value for every record in the window, which makes me think I'm misunderstanding how windows in SQL work. It's as if SQL forms a new window for each row it iterates through, rather than multiple windows for the entire table based on YEAR and MONTH.
So could anyone be kind enough to explain whether I'm wrong, and how I can achieve the result I want?
There is a reason why I'm not using MAX/MIN with a GROUP BY clause. My next step would be to retrieve the associated rates for the dates I selected, like:
| c_id | first_date | last_date | first_rate | last_rate | avg rate |
-----------------------------------------------------------------------
| 1 | 01-01-1991 | 30-01-1991 | 1 | 2 | 1.1 |
.......................................................................
If you want your output to become grouped into a single (or just fewer) row(s), you should use simple aggregation (i.e. GROUP BY), if avg_rate is enough:
SELECT c_id, min(date), max(date), avg(rate)
FROM F.rates
GROUP BY c_id, date_trunc('month', date)
More about window functions in PostgreSQL's documentation:
But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities.
...
There is another important concept associated with window functions: for each row, there is a set of rows within its partition called its window frame. Many (but not all) window functions act only on the rows of the window frame, rather than of the whole partition. By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause. When ORDER BY is omitted the default frame consists of all rows in the partition.
...
There are options to define the window frame in other ways ... See Section 4.2.8 for details.
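That default frame is exactly why last_value() seemed broken above: with ORDER BY in the window, the frame ends at the current row, so last_value() just returns the current row's date. Widening the frame to the whole partition makes the original query behave as expected; a sketch:

SELECT c_id, first_value(date) OVER w AS first_date, last_value(date) OVER w AS last_date
FROM F.rates
WINDOW w AS (PARTITION BY EXTRACT(YEAR FROM date), EXTRACT(MONTH FROM date), c_id
             ORDER BY date ASC
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);

This still returns one row per source row (with identical first/last values inside each month), which is why the aggregation below remains the better fit.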
EDIT:
If you want to collapse (min/max aggregation) your data and want to collect more columns than those listed in GROUP BY, you have two choices:
The SQL way
Select the min/max value(s) in a sub-query, then join their original rows back (but this way you have to deal with the fact that the min/max-ed column(s) are usually not unique):
SELECT agg.c_id,
       agg.min AS first_date,
       agg.max AS last_date,
       first.rate AS first_rate,
       last.rate AS last_rate,
       agg.avg AS avg_rate
FROM (SELECT c_id, min(date), max(date), avg(rate)
      FROM F.rates
      GROUP BY c_id, date_trunc('month', date)) agg
JOIN F.rates first ON agg.c_id = first.c_id AND agg.min = first.date
JOIN F.rates last ON agg.c_id = last.c_id AND agg.max = last.date
PostgreSQL's DISTINCT ON
DISTINCT ON is typically meant for this task, but it relies heavily on ordering (only one extremum can be searched for this way at a time):
SELECT DISTINCT ON (c_id, date_trunc('month', date))
c_id,
date first_date,
rate first_rate
FROM F.rates
ORDER BY c_id, date_trunc('month', date), date
You can join this query with other aggregated sub-queries of F.rates, but at this point (if you really need both minimum & maximum, and in your case even an average) the SQL-compliant way is a better fit.
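To illustrate the one-extremum-at-a-time limitation: fetching the month's last reading would need a separate DISTINCT ON pass with the date order reversed, e.g.:

SELECT DISTINCT ON (c_id, date_trunc('month', date))
c_id,
date last_date,
rate last_rate
FROM F.rates
ORDER BY c_id, date_trunc('month', date), date DESC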
Windowing functions aren't appropriate for this. Use aggregate functions instead.
select
c_id, date_trunc('month', date)::date,
min(date) first_date, max(date) last_date
from rates
group by c_id, date_trunc('month', date)::date;
c_id | date_trunc | first_date | last_date
------+------------+------------+------------
1 | 2014-11-01 | 2014-11-01 | 2014-11-30
1 | 1991-01-01 | 1991-01-01 | 1991-01-30
create table rates (
id integer not null,
c_id integer not null,
date date not null,
rate numeric(2, 1),
primary key (id, c_id, date)
);
insert into rates values
(1, 1, '1991-01-01', 1),
(1, 1, '1991-01-15', 0.5),
(1, 1, '1991-01-30', 2),
(1, 1, '2014-11-01', 1),
(1, 1, '2014-11-15', 0.5),
(1, 1, '2014-11-30', 2);