Use generate_series - SQL

I'm writing a PL/pgSQL procedure to read a source table, then aggregate and write into an aggregate table.
My source table contains 2 columns, beg and end, which refer to the client's connection to the website and the client's disconnection.
I want to calculate, for each client, the time spent. The purpose of using generate_series is to handle events that span more than one day.
My pseudocode is below:
execute $$SELECT MAX(date_) FROM $$||aggregate_table INTO max_date;
IF max_date is not NULL THEN
execute $$DELETE FROM $$||aggregate_table||$$ WHERE date_ >= $$||quote_literal(max_date);
ELSE
max_date := 'XXXXXXX';
end if;
SELECT * from (
select
Id, gs.date_,
(case
When TRIM(set) ~ '^OPT[0-9]{3}/MINUTE/$'
Then 'minute'
When TRIM(set) ~ '^OPT[0-9]{3}/SECOND/$'
Then 'second'
end) as TIME,
sum(extract(epoch from (least(s.end, gs.date_ + interval '1 day') -
greatest(s.beg, gs.date_)
)
) / 60) as Timing
from source s cross join lateral
generate_series(date_trunc('day', s.beg), date_trunc('day',
least(s.end,
CASE WHEN $$||quote_literal(max_date)||$$ = 'XXXXXXX'
THEN (current_date)
ELSE $$||quote_literal(max_date)||$$
END)
), interval '1 day') gs(date_)
where ( (beg, end) overlaps ($$||quote_literal(max_date)||$$'00:00:00', $$||quote_literal(max_date)||$$'23:59:59'))
group by id, gs.date_, TIME
) as X
where ($$||quote_literal(max_date)||$$ = X.date_ and $$||quote_literal(max_date)||$$ != 'XXXXXXX')
OR ($$||quote_literal(max_date)||$$ ='XXXXXXX')
Data of table source
number, beg, end, id, set
(10, '2019-10-25 13:00:00', '2019-10-25 13:30:00', 1234, 'OPT111/MINUTE/'),
(11, '2019-10-25 13:00:00', '2019-10-25 14:00:00', 1234, 'OPT111/MINUTE/'),
(12, '2019-11-04 09:19:00', '2019-11-04 09:29:00', 1124, 'OPT111/SECOND/'),
(13, '2019-11-04 22:00:00', '2019-11-05 02:00:00', 1124, 'OPT111/MINUTE/')
Expected output (aggregate table):
2019-10-25, 1234, MINUTE, 90(1h30)
2019-11-04, 1124, SECOND, 10
2019-11-04, 1124, MINUTE, 120
2019-11-05, 1124, MINUTE, 120
The problem with my code is that it doesn't work if a new row is added tomorrow, for example (14, '2019-11-06 12:00:00', '2019-11-06 13:00:00', 1124, 'OPT111/MINUTE/').
Can anyone help?
Thank you.

Here is my solution. I have changed the column names in order to avoid reserved words. You may need to adjust the formatting of the duration.
with mycte as
(
select -- the first / first and only days
id, col_beg,
case when col_beg::date = col_end::date then col_end else col_beg::date + 1 end as col_end
from mytable
union all
select -- the last days of multi-day periods
id, date_trunc('day', col_end) as col_beg, col_end
from mytable
where col_end::date > col_beg::date
union all
select -- the middle days of multi-day periods
id, rd as col_beg, rd::date + 1 as col_end
from mytable
cross join lateral generate_series(col_beg::date + 1, col_end::date - 1, interval '1 day') g(rd)
where col_end::date > col_beg::date + 1
)
select
col_beg::date as start_time, id, sum(col_end - col_beg) as duration
from mycte group by 1, 2 order by 1;
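The sum(col_end - col_beg) above yields an interval. If you need the duration in minutes, as in the expected output in the question, one option is (a sketch reusing mycte from above):
select
col_beg::date as start_time, id,
extract(epoch from sum(col_end - col_beg)) / 60 as duration_minutes
from mycte group by 1, 2 order by 1;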


Group By Timestamp_Trunc including empty rows with '0' count

I have a BigQuery query as follows:
SELECT
timestamp_trunc(timestamp,
hour) hour,
statusCode,
CAST(AVG(durationMs) as integer) averageDurationMs,
COUNT(*) count
FROM
`project.dataset.table`
WHERE timestamp > DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY
hour,
statusCode
And it works great, returning one row per hour and status code present in the data.
However, my charting component needs empty rows for empty 'hours' (e.g. 18:00 should be 0, 19:00 should be 0, etc.).
Is there an elegant way to do this in BigQuery SQL or do I have to do it in code before returning to my UI?
Try generating an array of the hours needed, cross joining it with all the status codes, and left joining with your results:
with mytable as (
select timestamp '2021-10-18 19:00:00' as hour, 200 as statusCode, 1234 as averageDurationMs, 25 as count union all
select '2021-10-18 21:00:00', 500, 4978, 6015 union all
select '2021-10-18 21:00:00', 404, 4987, 5984 union all
select '2021-10-18 21:00:00', 200, 5048, 11971 union all
select '2021-10-18 21:00:00', 401, 4976, 6030
)
select myhour, allCodes.statusCode, IFNULL(mytable.averageDurationMs, 0) as averageDurationMs, IFNULL(mytable.count, 0) as count
from
UNNEST(GENERATE_TIMESTAMP_ARRAY(TIMESTAMP_SUB(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), HOUR), INTERVAL 23 HOUR), TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), HOUR), INTERVAL 1 HOUR)) as myhour
CROSS JOIN
(SELECT DISTINCT statusCode FROM mytable) as allCodes
LEFT JOIN mytable ON myHour = mytable.hour AND allCodes.statusCode = mytable.statusCode
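The GENERATE_TIMESTAMP_ARRAY call produces one row per hour for the last 24 hours, the CROSS JOIN pairs every hour with every status code that appears in the data, and the LEFT JOIN plus IFNULL fills in zeros for combinations that have no rows.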

Dynamic LAG function (Standard SQL, BigQuery). Is it possible?

I am trying hard to find a solution for this. I've attached an image with an overview of what I want, but I will describe it here too.
In LAG function, is it possible to have a dynamic number in the syntax?
LAG(sessions, 3)
Instead of using 3, I need the value of the column minutosdelift, which is 3 in this example but will be different for each row.
I've tried to use LAG(sessions, minutosdelift) but it is not possible. I've tried LAG(sessions, COUNT(minutosdelift)) and it is not possible either.
The final goal is to calculate the difference between 52 and 6, i.e. (52/6)-1, which gives me 767%. But to do that I need a dynamic number in the LAG function (or another idea for how to do it).
I've tried using ROWS PRECEDING AND ROWS UNBOUNDED PRECEDING, but again it needs a literal number.
Please, any idea about how to do it? Thanks!
This screenshot might explain it:
My code (this is the last query I've tried; I have 7 previous views):
SELECT
DATE, HOUR, MINUTE, SESSIONS, PROGRAMA_2,
janela_lift_teste, soma_sessao_programa_2, minutosdelift,
CASE
WHEN minutosdelift != 0
THEN LAG(sessions, 3) OVER(ORDER BY DATE, HOUR, MINUTE ASC)
END AS lagtest,
CASE
WHEN programa_2 = "#N/A" OR programa_2 is null
THEN LAST_VALUE(sessions) OVER (PARTITION BY programa_2 ORDER BY DATE, HOUR, MINUTE ASC)
END AS firstvaluetest,
FROM
tbl8
GROUP BY
DATE, HOUR, MINUTE, SESSIONS, PROGRAMA_2,
janela_lift_teste, minutosdelift, soma_sessao_programa_2
ORDER BY
DATE, HOUR, MINUTE ASC
In BigQuery (as in some other databases), the argument to lag() has to be a constant.
One method to get around this uses a self join. I find it hard to follow your query, but the idea is:
with tt as (
select row_number() over (order by sessions) as seqnum,
t.*
from t
)
select t.*, tprev.*
from tt t join
tt tprev
on tprev.seqnum = t.seqnum - t.minutosdelift;
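Applied to the table from the question, a sketch might look like the following (assuming the tbl8 view and the column names shown above, and numbering the rows by time rather than by sessions):
with tt as (
  select row_number() over (order by DATE, HOUR, MINUTE) as seqnum, t.*
  from tbl8 t
)
select t.*, tprev.sessions as lagtest
from tt t
left join tt tprev
  on tprev.seqnum = t.seqnum - t.minutosdelift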
Consider the example below - hopefully you can apply this approach to your use case:
#standardSQL
with `project.dataset.table` as (
select 1 session, timestamp '2021-01-01 00:01:00' ts, 10 minutosdelift union all
select 2, '2021-01-01 00:02:00', 1 union all
select 3, '2021-01-01 00:03:00', 2 union all
select 4, '2021-01-01 00:04:00', 3 union all
select 5, '2021-01-01 00:05:00', 4 union all
select 6, '2021-01-01 00:06:00', 5 union all
select 7, '2021-01-01 00:07:00', 3 union all
select 8, '2021-01-01 00:08:00', 1 union all
select 9, '2021-01-01 00:09:00', 2 union all
select 10, '2021-01-01 00:10:00', 8 union all
select 11, '2021-01-01 00:11:00', 6 union all
select 12, '2021-01-01 00:12:00', 4 union all
select 13, '2021-01-01 00:13:00', 2 union all
select 14, '2021-01-01 00:14:00', 1 union all
select 15, '2021-01-01 00:15:00', 11 union all
select 16, '2021-01-01 00:16:00', 1 union all
select 17, '2021-01-01 00:17:00', 8
)
select a.*, b.session as lagtest
from `project.dataset.table` a
left join `project.dataset.table` b
on b.ts = timestamp_sub(a.ts, interval a.minutosdelift minute)
The output contains, for each row, the session value from minutosdelift minutes earlier (null when no such earlier row exists).
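If the final goal is the percentage difference mentioned in the question ((52/6)-1, i.e. about 767%), the joined value can be used directly; here is a sketch reusing the aliases above (SAFE_DIVIDE guards against a missing or zero lagged value):
select a.*,
  round((safe_divide(a.session, b.session) - 1) * 100) as lift_pct -- e.g. (52/6 - 1) * 100 ≈ 767
from `project.dataset.table` a
left join `project.dataset.table` b
on b.ts = timestamp_sub(a.ts, interval a.minutosdelift minute)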

SQL query (of minute time series data points) to get all data points within a given hour plus the first data point of the following hour?

I have a table of data values keyed by a stream id and a time stamp; basically, each row represents a minute of data for a specific stream at a specific minute, and the table has many streams and many minutes.
So I'm trying to query, over a set of streams, any data points within a given hour plus the (chronologically) first data point of the following hour (this is the part I'm having trouble with).
It's also difficult because any of the 60+1 minute rows could be missing, and I want the single data point even if it is in the middle of the hour, as long as it's the first one. So I can't just query over '2019-12-06 00:00:00' - '2019-12-06 01:01:00'.
Sorry, this is probably unclear, but if you look at my examples, I think it will make sense.
I made a couple of attempts that work on my test cases, but I have a feeling they are not universal, or that I could be doing it a better way.
SELECT stream_id, time_stamp, my_data
FROM data_points_minutes
WHERE
time_stamp >= '2019-12-06 00:00:00'
AND time_stamp < '2019-12-06 01:00:00'
AND stream_id IN (123, 456, 789)
UNION
SELECT DISTINCT ON (stream_id) stream_id, time_stamp, my_data
FROM data_points_minutes
WHERE
time_stamp >= '2019-12-06 01:00:00'
AND time_stamp < '2019-12-06 02:00:00'
AND stream_id IN (123, 456, 789)
ORDER BY
stream_id, time_stamp;
This works for my test data, but I'm worried that the SELECT DISTINCT is only working because the rows are already sorted by timestamp and would not work if they weren't, which led me to:
SELECT *
FROM(
SELECT stream_id, time_stamp, my_value
FROM
data_points_minutes
WHERE
time_stamp >= '2019-12-06 00:00:00'
AND time_stamp < '2019-12-06 01:00:00'
AND stream_id IN (123, 456, 789)
) as q1
UNION
SELECT *
FROM(
SELECT
DISTINCT ON (stream_id) stream_id, time_stamp, my_value
FROM
data_points_minutes
WHERE
time_stamp >= '2019-12-06 01:00:00'
AND time_stamp < '2019-12-06 02:00:00'
AND stream_id IN (123, 456, 789)
ORDER BY
stream_id, time_stamp ASC
) AS q2
ORDER BY
stream_id, time_stamp;
I think this is mostly working, and I might go with it, but nesting this way seems a little awkward to me, so I'm hoping someone can suggest something more elegant.
You could OR the condition on the upper bound of the date range with an equality check on the next timestamp, which can be computed with a subquery:
select stream_id, time_stamp, my_data
from data_points_minutes
where
stream_id in (123, 456, 789)
and time_stamp >= '2019-12-06 00:00:00'
and (
time_stamp < '2019-12-06 01:00:00'
or time_stamp = (
select min(d1.time_stamp)
from data_points_minutes d1
where d1.stream_id in (123, 456, 789) and d1.time_stamp >= '2019-12-06 01:00:00'
)
)
Or possibly, if you want the next data point for each stream_id, you can correlate the subquery:
select stream_id, time_stamp, my_data
from data_points_minutes d
where
stream_id in (123, 456, 789)
and time_stamp >= '2019-12-06 00:00:00'
and (
time_stamp < '2019-12-06 01:00:00'
or time_stamp = (
select min(d1.time_stamp)
from data_points_minutes d1
where d1.stream_id = d.stream_id and d1.time_stamp >= '2019-12-06 01:00:00'
)
)
What you basically want is the minimum value of the timestamp for each stream in the given row set (a selection from the next hour) and the argmin, the row on which the minimum value is achieved. There are a few ways to solve it, but probably the most readable way is using window functions.
Here is a query which generates some test values:
WITH Data AS (
select * from (values
(NOW() , 1),
(NOW() + interval '1m', 1),
(NOW() + interval '1m', 2),
(NOW() + interval '2m', 2)
) T(ts, stream)
)
SELECT * FROM Data;
ts | stream
-------------------------------+--------
2019-12-14 01:08:07.556573+00 | 1
2019-12-14 01:09:07.556573+00 | 1
2019-12-14 01:09:07.556573+00 | 2
2019-12-14 01:10:07.556573+00 | 2
A query which calculates the minimum timestamp and its argmin for each stream:
WITH Data AS (
select * from (values
(NOW() , 1),
(NOW() + interval '1m', 1),
(NOW() + interval '1m', 2),
(NOW() + interval '2m', 2)
) T(ts, stream)
),
RankedData AS (
SELECT ts,
RANK() OVER (PARTITION BY stream ORDER BY ts),
stream
FROM Data
)
SELECT * FROM RankedData WHERE rank=1;
ts | rank | stream
-------------------------------+------+--------
2019-12-14 01:12:08.676228+00 | 1 | 1
2019-12-14 01:13:08.676228+00 | 1 | 2
If you build Data as a selection of rows from the next hour, it will solve your problem:
SELECT stream_id, time_stamp, my_data
FROM data_points_minutes
WHERE
time_stamp >= '2019-12-06 00:00:00'
AND time_stamp < '2019-12-06 01:00:00'
AND stream_id IN (123, 456, 789)
UNION (
WITH Data AS (
SELECT stream_id, time_stamp, my_data
FROM data_points_minutes
WHERE
time_stamp >= '2019-12-06 01:00:00'
AND time_stamp < '2019-12-06 02:00:00'
AND stream_id IN (123, 456, 789)
),
RankedData AS (
SELECT time_stamp, my_data,
RANK() OVER (PARTITION BY stream_id ORDER BY time_stamp),
stream_id
FROM Data
)
SELECT stream_id, time_stamp, my_data FROM RankedData WHERE rank=1
)

Select overlapped hours

I have a table:
CREATE TABLE card_tab
(card_no NUMBER,
emp_no VARCHAR2(100),
DATA DATE,
start_time DATE,
end_time DATE)
insert into card_tab (CARD_NO, EMP_NO, DATA, START_TIME, END_TIME)
values (1, '100', to_date('15-11-2019', 'dd-mm-yyyy'), to_date('15-11-2019 20:00:00', 'dd-mm-yyyy hh24:mi:ss'), to_date('16-11-2019 03:00:00', 'dd-mm-yyyy hh24:mi:ss'));
insert into card_tab (CARD_NO, EMP_NO, DATA, START_TIME, END_TIME)
values (2, '100', to_date('15-11-2019', 'dd-mm-yyyy'), to_date('15-11-2019 22:00:00', 'dd-mm-yyyy hh24:mi:ss'), to_date('15-11-2019 23:00:00', 'dd-mm-yyyy hh24:mi:ss'));
The card_no is just a sequence number. emp_no 100 was working 7 hours:
SELECT t.*, (t.end_time - t.start_time) * 24 work_hours FROM card_tab t;
CARD_NO EMP_NO DATA START_TIME END_TIME WORK_HOURS
1 100 15.11.2019 15.11.2019 20:00:00 16.11.2019 03:00:00 7
2 100 15.11.2019 15.11.2019 22:00:00 15.11.2019 23:00:00 1
If hours overlap, then they should be divided by 2;
in this example it will be:
CARD_NO WORK_HOURS
1 6,5
2 0,5
The sum of working hours is 7 so it's correct.
There can be more than two overlapping records. I wrote a lot of loops, but I think this can be done more easily. It looks like a gaps-and-islands problem, but I don't know how to solve it.
You can try this:
with data as
(
select exact_hour
, card_no
, emp_no
, ( select count(1)
from card_tab
where exact_hour >= start_time
and exact_hour < end_time) cnt
from
(
select distinct start_time + (level - 1)/24 exact_hour
, tab.card_no
, emp_no
from card_tab tab
connect by nocycle level <= (end_time - start_time) * 24
)
)
select card_no, sum(1 / cnt)
from data
group by card_no
;
Some explanation:
(level - 1)
It is written that way because we only refer to the start of each hour.
exact_hour >= start_time and exact_hour < end_time
We can't use the BETWEEN keyword because in your data the start of each period is equal to the end of the previous one, so for example the period 22:00-23:00 would be counted for two start hours.
sum(1 / cnt)
We want to sum every part of an hour, so we divide each hour by the number of different cards that were being worked during that hour (?? - I don't know the exact business case).
I suppose everything except those three points is clear.
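For the sample data above, the only hour covered by both cards is 22:00-23:00 (cnt = 2); the remaining six hours of card 1 have cnt = 1. So card 1 sums to 6 * 1 + 1 * 0.5 = 6.5 and card 2 to 1 * 0.5 = 0.5, matching the expected output.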

Querying count on daily basis with date constraints over multiple weeks

I'm trying to find the # active users over time on a daily basis.
A user is active when he has made more than 10 requests per week for 4 consecutive weeks.
ie. On Oct 31, 2014, a user is active if he has made more than 10 requests in total per week between:
Oct 24-Oct 30, 2014 AND
Oct 17-Oct 23, 2014 AND
Oct 10-Oct 16, 2014 AND
Oct 3-Oct 9, 2014
I have a table of requests:
CREATE TABLE requests (
id text PRIMARY KEY, -- id of the request
amount bigint, -- sum of requests made by accounts_id to recipient_id,
-- aggregated on a daily basis based on "date"
accounts_id text, -- id of the user
recipient_id text, -- id of the recipient
date timestamp -- date that the request was made in YYYY-MM-DD
);
Sample values:
INSERT INTO requests
VALUES
('1', 19, 'a1', 'b1', '2014-10-05 00:00:00'),
('2', 19, 'a2', 'b2', '2014-10-06 00:00:00'),
('3', 85, 'a3', 'b3', '2014-10-07 00:00:00'),
('4', 11, 'a1', 'b4', '2014-10-13 00:00:00'),
('5', 2, 'a2', 'b5', '2014-10-14 00:00:00'),
('6', 50, 'a3', 'b5', '2014-10-15 00:00:00'),
('7', 787323, 'a1', 'b6', '2014-10-17 00:00:00'),
('8', 33, 'a2', 'b8', '2014-10-18 00:00:00'),
('9', 14, 'a3', 'b9', '2014-10-19 00:00:00'),
('10', 11, 'a4', 'b10', '2014-10-19 00:00:00'),
('11', 1628, 'a1', 'b11', '2014-10-25 00:00:00'),
('13', 101, 'a2', 'b11', '2014-10-25 00:00:00');
Example output:
Date | # Active users
-----------+---------------
10-01-2014 | 600
10-02-2014 | 703
10-03-2014 | 891
Here's what I tried to do to find the number of active users for a certain date (e.g. 10-01-2014):
SELECT count(*)
FROM
(SELECT accounts_id
FROM requests
WHERE "date" BETWEEN '2014-10-01'::date - interval '2 weeks' AND '2014-10-01'::date - interval '1 week'
GROUP BY accounts_id HAVING sum(amount) > 10) week_1
JOIN
(SELECT accounts_id
FROM requests
WHERE "date" BETWEEN '2014-10-01'::date - interval '3 weeks' AND '2014-10-01'::date - interval '2 week'
GROUP BY accounts_id HAVING sum(amount) > 10) week_2 ON week_1.accounts_id = week_2.accounts_id
JOIN
(SELECT accounts_id
FROM requests
WHERE "date" BETWEEN '2014-10-01'::date - interval '4 weeks' AND '2014-10-01'::date - interval '3 week'
GROUP BY accounts_id HAVING sum(amount) > 10) week_3 ON week_2.accounts_id = week_3.accounts_id
JOIN
(SELECT accounts_id
FROM requests
WHERE "date" BETWEEN '2014-10-01'::date - interval '5 weeks' AND '2014-10-01'::date - interval '4 week'
GROUP BY accounts_id HAVING sum(amount) > 10) week_4 ON week_3.accounts_id = week_4.accounts_id
Since this is just the query to get the number for one day, I need to get this number on a daily basis over time. I think the idea is to do a join to get the date, so I tried something like this:
SELECT week_1."Date_series",
count(*)
FROM
(SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
accounts_id
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '2 weeks' AND requests.date::date - interval '1 week'
GROUP BY "Date_series",
accounts_id HAVING sum(amount) > 10) week_1
JOIN
(SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
accounts_id
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '3 weeks' AND requests.date::date - interval '2 week'
GROUP BY "Date_series",
accounts_id HAVING sum(amount) > 10) week_2 ON week_1.accounts_id = week_2.accounts_id
AND week_1."Date_series" = week_2."Date_series"
JOIN
(SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
accounts_id
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '4 weeks' AND requests.date::date - interval '3 week'
GROUP BY "Date_series",
accounts_id HAVING sum(amount) > 10) week_3 ON week_2.accounts_id = week_3.accounts_id
AND week_2."Date_series" = week_3."Date_series"
JOIN
(SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
accounts_id
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '5 weeks' AND requests.date::date - interval '4 week'
GROUP BY "Date_series",
accounts_id HAVING sum(amount) > 10) week_4 ON week_3.accounts_id = week_4.accounts_id
AND week_3."Date_series" = week_4."Date_series"
GROUP BY week_1."Date_series"
However, I don't think I'm getting the right answer, and I'm not sure why. Any tips/guidance/pointers are much appreciated! :) :)
PS. I'm using Postgres 9.3
Here is a long answer on how to make your queries short. :)
Table
Building on my table (you later provided a table definition with different (odd!) data types):
CREATE TABLE requests (
id int
, accounts_id int -- (id of the user)
, recipient_id int -- (id of the recipient)
, date date -- (date that the request was made in YYYY-MM-DD)
, amount int -- (# of requests by accounts_id for the day)
);
Active Users for given day
The list of "active users" for one given day:
SELECT accounts_id
FROM (
SELECT w.w, r.accounts_id
FROM (
SELECT w
, day - 6 - 7 * w AS w_start
, day - 7 * w AS w_end
FROM (SELECT '2014-10-31'::date - 1 AS day) d -- effective date here
, generate_series(0,3) w
) w
JOIN requests r ON r."date" BETWEEN w_start AND w_end
GROUP BY w.w, r.accounts_id
HAVING sum(r.amount) > 10
) sub
GROUP BY 1
HAVING count(*) = 4;
Step 1
In the innermost subquery w (for "week") build the bounds of the 4 weeks of interest from a CROSS JOIN of the given day - 1 with the output of generate_series(0, 3).
To add / subtract days to / from a date (not from a timestamp!) just add / subtract integer numbers. The expression day - 7 * w subtracts 0-3 times 7 days from the given date, arriving at the end dates for each week (w_end).
Subtract another 6 days (not 7!) from each to compute the respective week start (w_start).
Additionally, keep the week number w (0-3) for the later aggregation.
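For the effective date 2014-10-31 (so day = 2014-10-30), subquery w therefore produces exactly the four ranges listed in the question:
w | w_start    | w_end
--+------------+-----------
0 | 2014-10-24 | 2014-10-30
1 | 2014-10-17 | 2014-10-23
2 | 2014-10-10 | 2014-10-16
3 | 2014-10-03 | 2014-10-09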
Step 2
In subquery sub join rows from requests to the set of 4 weeks, where the date lies between start and end date. GROUP BY the week number w and the accounts_id.
Only weeks with more than 10 requests total qualify.
Step 3
In the outer SELECT, count the number of weeks each user (accounts_id) qualified. It must be 4 to qualify as "active user".
Count of active users per day
This is dynamite.
Wrapped in a simple SQL function to simplify general use, but the query can be used on its own just as well:
CREATE FUNCTION f_active_users (_now date = now()::date, _days int = 3)
RETURNS TABLE (day date, users int) AS
$func$
WITH r AS (
SELECT accounts_id, date, sum(amount)::int AS amount
FROM requests
WHERE date BETWEEN _now - (27 + _days) AND _now - 1
GROUP BY accounts_id, date
)
SELECT date + 1, count(w_ct = 4 OR NULL)::int
FROM (
SELECT accounts_id, date
, count(w_amount > 10 OR NULL)
OVER (PARTITION BY accounts_id, dow ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) AS w_ct
FROM (
SELECT accounts_id, date, dow
, sum(amount) OVER (PARTITION BY accounts_id ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING) AS w_amount
FROM (SELECT _now - i AS date, i%7 AS dow
FROM generate_series(1, 27 + _days) i) d -- period of interest
CROSS JOIN (
SELECT accounts_id FROM r
GROUP BY 1
HAVING count(*) > 3 AND sum(amount) > 39 -- enough rows & requests
AND max(date) > min(date) + 15) a -- can cover 4 weeks
LEFT JOIN r USING (accounts_id, date)
) sub1
WHERE date > _now - (22 + _days) -- cut off 6 trailing days now - useful?
) sub2
GROUP BY date
ORDER BY date DESC
LIMIT _days
$func$ LANGUAGE sql STABLE;
The function takes any day (_now), "today" by default, and the number of days (_days) in the result, 3 by default. Call:
SELECT * FROM f_active_users('2014-10-31', 5);
Or without parameters to use defaults:
SELECT * FROM f_active_users();
The approach is different from the first query.
SQL Fiddle with both queries and variants for your table definition.
Step 0
In the CTE r pre-aggregate amounts per (accounts_id, date) for only the period of interest, for better performance. The table is only scanned once; the suggested index (see below) will kick in here.
Step 1
In the inner subquery d generate the necessary list of days: 27 + _days rows, where _days is the desired number of rows in the output, effectively 28 days or more.
While at it, compute the day of the week (dow) to be used for aggregating in step 3. i%7 coincides with weekly intervals; the query would work for any interval, though.
In the inner subquery a generate a unique list of users (accounts_id) that exist in CTE r and pass some first superficial tests (sufficient rows spanning sufficient time with sufficient total requests).
Step 2
Generate a Cartesian product from d and a with a CROSS JOIN to have one row for every relevant day for every relevant user. LEFT JOIN to r to append the amount of requests (if any). No WHERE condition, because we want every day in the result, even if there are no active users at all.
Compute the total amount for the past week (w_amount) in the same step using a window function with a custom frame. Example:
How to use a ring data structure in window functions
Step 3
Cut off the last 6 days; this is optional and may or may not help performance. Test it: WHERE date >= _now - (21 + _days).
Count the weeks where the minimum amount is met (w_ct) in a similar window function, this time additionally partitioned by dow so that the frame only contains the same weekday for the past 4 weeks (each row carrying the sum of its respective past week).
The expression count(w_amount > 10 OR NULL) only counts rows with more than 10 requests. Detailed explanation:
Compute percents from SUM() in the same SELECT sql query
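A minimal standalone illustration of the count(condition OR NULL) trick (hypothetical values):
select count(x > 10 or null) as over_10, count(*) as total
from (values (5), (11), (42)) t(x);  -- over_10 = 2, total = 3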
Step 4
In the outer SELECT group by date and count users that passed all 4 weeks (count(w_ct = 4 OR NULL)). Add 1 to the date to compensate for the off-by-one, then ORDER and LIMIT to the requested number of days.
Performance and outlook
The perfect index for both queries would be:
CREATE INDEX foo ON requests (date, accounts_id, amount);
Performance should be good, but should get even (much) better with the upcoming Postgres 9.4, due to the new moving-aggregate support:
Moving-aggregate support in the Postgres Wiki.
Moving aggregates in the 9.4 manual
Aside: don't call a timestamp column "date", it's a timestamp, not a date. Better yet, never use basic type names like date or timestamp as identifier. Ever.