I've got the following append-only table in psql:
CREATE TABLE IF NOT EXISTS data (
    id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
    test_id UUID NOT NULL,
    user_id UUID NOT NULL,
    completed BOOL NOT NULL DEFAULT FALSE,
    inserted_at TIMESTAMPTZ NOT NULL DEFAULT (NOW() AT TIME ZONE 'UTC')
);
CREATE INDEX some_idx ON data (user_id, test_id, inserted_at DESC);
CREATE INDEX some_idx2 ON data (test_id, inserted_at DESC);
A single user_id might have multiple entries for a given test_id, but only one can be completed (the completed entry is also the last one).
I'm querying for a given test_id. What I need is time-series-like data for each day in the past week. For each day, I should have the following:
total - the number of unique users with an entry WHERE inserted_at < "day"
completed - the number of unique users with a completed entry WHERE inserted_at < "day"
Ultimately, total and completed are like counters and I'm simply trying to take their values for each day in the past week. For instance:
| date | total | completed |
|------------|-------|-----------|
| 2022.01.19 | 100 | 50 |
| 2022.01.18 | 90 | 45 |
| ... | | |
What would be a query with an efficient query-plan? I can consider adding new indexes or modifying the existing one.
PS: I've got a working version here:
SELECT date, entered, completed
FROM (
    SELECT d::date AS date
    FROM generate_series('2023-01-12', now(), INTERVAL '1 day') AS d
) AS dates
CROSS JOIN LATERAL (
    SELECT COUNT(DISTINCT user_id) AS entered,
           COUNT(1) FILTER (WHERE completed) AS completed -- no DISTINCT needed: completed occurs at most once per user
    FROM data
    WHERE
        test_id = 'someId' AND
        inserted_at < dates.date
) AS vals
I don't think this is a good/performant solution as it rescans the table with every lateral join iteration. Here's the query plan:
QUERY PLAN
Nested Loop  (cost=185.18..185218.25 rows=1000 width=28) (actual time=0.928..7.687 rows=8 loops=1)
  ->  Function Scan on generate_series d  (cost=0.01..10.01 rows=1000 width=8) (actual time=0.009..0.012 rows=8 loops=1)
  ->  Aggregate  (cost=185.17..185.18 rows=1 width=16) (actual time=0.957..0.957 rows=1 loops=8)
        ->  Bitmap Heap Scan on data  (cost=12.01..183.36 rows=363 width=38) (actual time=0.074..0.197 rows=779 loops=8)
              Recheck Cond: ((test_id = 'someId'::uuid) AND (inserted_at < (d.d)::date))
              Heap Blocks: exact=629
              ->  Bitmap Index Scan on some_idx2  (cost=0.00..11.92 rows=363 width=0)
                    Index Cond: ((test_id = 'someId'::uuid) AND (inserted_at < (d.d)::date))
Planning Time: 0.261 ms
Execution Time: 7.733 ms
I'm sure I'm missing some convenient functions here that will help out. All help is appreciated :pray:
After a chat discussion, we found a solution:
SELECT
date_trunc('day',inserted_at) AS adate,
COUNT(user_id) OVER (
ORDER BY inserted_at ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) -
SUM(CASE WHEN completed THEN 1 ELSE 0 END) OVER (
ORDER BY date_trunc('day', inserted_at) ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as user_cnt,
SUM(CASE WHEN completed THEN 1 ELSE 0 END) OVER (
ORDER BY date_trunc('day', inserted_at) ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS completed
FROM executions
WHERE journey_id = 'd2be0e01-19b1-403e-8659-ce6222f074fd'
ORDER BY date_trunc('day', inserted_at) ASC
You can see we are using the same SUM window function twice; SQL engines will typically compute it once. Subtracting it gives the expected result for the second column, since completed entries would otherwise be counted as additional users.
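For reference, here is a minimal sketch (my adaptation, not part of the chat solution) of the same single-pass idea written against the data / test_id schema from the question. It counts each user only on the day of their first entry, so the running sum equals the number of distinct users seen so far, and it relies on the question's guarantee that completed occurs at most once per user. Caveats: it counts rows up to and including each day rather than strictly before it, and days with no rows are omitted (they could be filled by left-joining a generate_series of days, as in the working version above).
SELECT day::date AS date,
       sum(new_users)     OVER w AS total,
       sum(completed_day) OVER w AS completed
FROM (
    SELECT date_trunc('day', inserted_at) AS day,
           count(*) FILTER (WHERE rn = 1)    AS new_users,     -- first entry per user
           count(*) FILTER (WHERE completed) AS completed_day
    FROM (
        SELECT inserted_at, completed,
               row_number() OVER (PARTITION BY user_id ORDER BY inserted_at) AS rn
        FROM data
        WHERE test_id = 'someId'
    ) x
    GROUP BY 1
) d
WINDOW w AS (ORDER BY day)
ORDER BY day;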
Prior answers below.
OK, when I looked at it you don't need a window function after all -- just the trick of a CASE expression inside SUM() with GROUP BY:
SELECT COUNT(DISTINCT user_id) AS entered,
SUM(CASE WHEN completed THEN 1 ELSE 0 END) AS completed
FROM data
WHERE test_id = 'someId'
GROUP BY inserted_at
To get all prior entries for a given date, it looks something like this:
SELECT date_trunc('day', inserted_at) AS date,
       DENSE_RANK()
           OVER (PARTITION BY user_id ORDER BY inserted_at ASC) AS user_cnt,
       SUM(CASE WHEN completed THEN 1 ELSE 0 END)
           OVER (ORDER BY inserted_at ASC
                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS completed
FROM data
WHERE test_id = 'someId'
ORDER BY inserted_at ASC
Related
I have the following table:
create table account_values
(
account_id bigint not null,
timestamp timestamp not null,
value1 numeric not null,
value2 numeric not null,
primary key (timestamp, account_id)
);
I also have the following query which produces an array of every value1+value2 of the row with the closest (before) timestamp to an evenly spaced generated series:
select array [(trunc(extract(epoch from gs) * 1000))::text, COALESCE((equity.value1 + equity.value2), 0.000000)::text]
from generate_series((now() - '1 year'::interval)::timestamp, now(), interval '1 day') gs
left join lateral (select value1, value2
                   from account_values
                   where timestamp <= gs and account_id = ?
                   order by timestamp desc
                   limit 1) equity on (TRUE);
The issue with this method of generating such an array becomes apparent when inspecting the output of explain analyse:
Nested Loop Left Join (cost=0.45..3410.74 rows=1000 width=32) (actual time=0.134..3948.546 rows=366 loops=1)
-> Function Scan on generate_series gs (cost=0.02..10.02 rows=1000 width=8) (actual time=0.075..0.244 rows=366 loops=1)
-> Limit (cost=0.43..3.36 rows=1 width=26) (actual time=10.783..10.783 rows=1 loops=366)
-> Index Scan Backward using account_values_pkey on account_values (cost=0.43..67730.27 rows=23130 width=26) (actual time=10.782..10.782 rows=1 loops=366)
" Index Cond: ((""timestamp"" <= gs.gs) AND (account_id = 459))"
Planning Time: 0.136 ms
Execution Time: 3948.659 ms
Specifically: loops=366
This problem will only get worse if I ever decide to decrease my generated series interval time.
Is there a way to flatten this looped select into a more efficient query?
If not, what are some other approaches I can take to improving the performance?
Edit:
One hard requirement is that the result of the statement cannot be altered. For example, I don't want the range to round to the closest day. The range should always start the second the statement is invoked, and each interval should be precisely one day before the previous one.
Based on Edouard's answer:
with a(_timestamp, values_agg) as
(
  select _timestamp, array_agg(lpad((value1 + value2)::text, 6, '0')) as values_agg
  from account_values
  where account_id = 1
    and _timestamp <@ tsrange(now()::timestamp - interval '1 year', now()::timestamp)
  group by 1
)
select jsonb_agg(jsonb_build_object
       (
         '_timestamp', trunc(extract(epoch from _timestamp) * 1000)::text
       , 'values', values_agg )
       ) AS item
from a;
Not sure you will get the exact same result, but it should be faster:
select array [ (trunc(extract(epoch from date_trunc('day', timestamp)) * 1000))::text
             , (array_agg(value1 + value2 ORDER BY timestamp DESC))[1]::text
             ]
from account_values
where account_id = ?
  and timestamp <@ tsrange(now()::timestamp - interval '1 year', now()::timestamp)
group by date_trunc('day', timestamp)
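If the hard requirement to keep the exact now()-anchored timestamps rules out bucketing by calendar day, another option (my suggestion, not from the answers above) is to keep the original lateral query and give it an index that matches its access pattern, so each of the 366 lookups becomes a short index descent. With the primary key on (timestamp, account_id), the backward index scan has to walk past rows of other accounts, which is what makes each loop slow. A sketch, with an illustrative index name:
-- account_id equality first, then timestamps, so the latest row before gs is found immediately
CREATE INDEX account_values_account_ts_idx
    ON account_values (account_id, "timestamp" DESC);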
I have a table with the columns below:
| Customer | Time_Start          |
|----------|---------------------|
| A        | 01/20/2020 01:25:00 |
| A        | 01/22/2020 14:15:00 |
| A        | 01/20/2020 03:23:00 |
| A        | 01/21/2020 20:37:00 |
I am trying to get a table that outputs counts by minute (including zeros) for a given day, i.e.
| Customer | Time_Start          | Count |
|----------|---------------------|-------|
| A        | 01/20/2020 00:01:00 | 5     |
| A        | 01/20/2020 00:02:00 | 2     |
| A        | 01/20/2020 00:03:00 | 0     |
| A        | 01/20/2020 00:04:00 | 12    |
I would like to have it show only 1 day for 1 customer at a time.
Here is what I have so far:
select
customer,
cast(time_start as time) + ((cast(time_start as time) - cast('00:00:00' as time)) hour(2)) as TimeStampHour,
count(*) as count
from Table_1
where customer in ('A')
group by customer, TimeStampHour
order by TimeStampHour
In Teradata 16.20 this would be a simple task using the new time series aggregation:
SELECT
customer
,Cast(Begin($Td_TimeCode_Range) AS VARCHAR(16))
,Count(*)
FROM table_1
WHERE customer = 'A'
AND time_start BETWEEN TIMESTAMP'2020-01-20 00:00:00'
AND Prior(TIMESTAMP'2020-01-20 00:00:00' + INTERVAL '1' DAY)
GROUP BY TIME(Minutes(1) AND customer)
USING timecode(time_start)
FILL (0)
;
Before that, you must implement it similar to Ramin Faracov's answer: create a list of all minutes first and then left join to it. But I prefer doing the count before joining:
WITH all_minutes AS
( -- create a list of all minutes
SELECT
Begin(pd) AS bucket
FROM
( -- EXPAND ON requires FROM and TRUNC materializes the FROM avoiding error
-- "9303. EXPAND ON clause must not be specified in a query expression with no table references."
SELECT Cast(Trunc(DATE '2020-01-20') AS TIMESTAMP(0)) AS start_date
) AS dt
EXPAND ON PERIOD(start_date, start_date + INTERVAL '1' DAY) AS pd
BY INTERVAL '1' MINUTE
)
SELECT customer
,bucket
,Coalesce(Cnt, 0)
FROM all_minutes
LEFT JOIN
(
SELECT customer
,time_start
- (Extract (SECOND From time_start) * INTERVAL '1' SECOND) AS time_minute
,Count(*) AS Cnt
FROM table_1
WHERE customer = 'A'
AND time_start BETWEEN TIMESTAMP'2020-01-20 00:00:00'
AND Prior(TIMESTAMP'2020-01-20 00:00:00' + INTERVAL '1' DAY)
GROUP BY customer, time_minute
) AS counts
ON counts.time_minute = bucket
ORDER BY bucket
;
First, we create a function that returns a table with a single column, containing timestamps starting from the time_start value and increasing in one-minute steps.
CREATE OR REPLACE FUNCTION get_dates(time_start timestamp without time zone)
RETURNS TABLE(list_dates timestamp without time zone)
LANGUAGE plpgsql
AS $function$
declare
time_end timestamp;
begin
time_end = time_start + interval '1 day';
return query
SELECT t1.dates
FROM generate_series(time_start, time_end, interval '1 min') t1(dates);
END;
$function$
;
Note that this function returns a list of timestamps with an hour and a minute, but whose seconds and milliseconds are always zero. We must join the function's result to our customer table using the timestamp fields. But our customer table's timestamp field holds a full DateTime, where seconds and milliseconds are not empty, so the join condition would not match correctly. We need every value of the timestamp field in the customer table to have empty seconds and milliseconds for the join to be correct. For this, I created an immutable function: when you send it a fully formatted DateTime such as 2021-02-03 18:24:51.203, it returns a timestamp with the seconds and milliseconds cleared, in this format: 2021-02-03 18:24:00.000
CREATE OR REPLACE FUNCTION clear_seconds_from_datetime(dt timestamp without time zone)
RETURNS timestamp without time zone
LANGUAGE plpgsql
IMMUTABLE
AS $function$
DECLARE
str_date text;
str_hour text;
str_min text;
v_ret timestamp;
begin
str_date = (dt::date)::text;
str_hour = (extract(hour from dt))::text;
str_min = (extract(min from dt))::text;
str_date = str_date || ' ' || str_hour || ':' || str_min || ':' || '00';
v_ret = str_date::timestamp;
return v_ret;
END;
$function$
;
Why is our function immutable? For high performance, I want to use a function-based index, and PostgreSQL only allows immutable functions in indexes. We use this function in the join condition. Now let's create the function-based index:
CREATE INDEX customer_date_function_idx ON customer USING btree ((clear_seconds_from_datetime(datestamp)));
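As an aside (my suggestion, not part of this answer): the built-in date_trunc('minute', ...) performs the same truncation, is already immutable for timestamp without time zone, and can be indexed without a custom function. A sketch, with an illustrative index name:
-- hypothetical equivalent using the built-in date_trunc
CREATE INDEX customer_date_trunc_idx ON customer USING btree ((date_trunc('minute', datestamp)));
-- the join condition would then be:
--   on date_trunc('minute', cc.datestamp) = t1.list_dates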
Let's write the main query that we need:
select
t1.list_dates,
cc.userid,
count(cc.id) as count_as
from
get_dates('2021-02-03 00:01:00'::timestamp) t1 (list_dates)
left join
customer cc on clear_seconds_from_datetime(cc.datestamp) = t1.list_dates
group by t1.list_dates, cc.userid
I inserted 30 million sample rows into the customer table for testing, and this query runs in 33 milliseconds. Here is the query plan from the explain analyze command:
HashAggregate (cost=309264.30..309432.30 rows=16800 width=20) (actual time=26.076..26.958 rows=2022 loops=1)
Group Key: t1.list_dates, lg.userid
-> Nested Loop Left Join (cost=0.68..253482.75 rows=7437540 width=16) (actual time=18.870..24.882 rows=2023 loops=1)
-> Function Scan on get_dates t1 (cost=0.25..10.25 rows=1000 width=8) (actual time=18.699..18.906 rows=1441 loops=1)
-> Index Scan using log_date_cast_idx on log lg (cost=0.43..179.09 rows=7438 width=16) (actual time=0.003..0.003 rows=1 loops=1441)
Index Cond: (clear_seconds_from_datetime(datestamp) = t1.list_dates)
Planning Time: 0.398 ms
JIT:
Functions: 12
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 3.544 ms, Inlining 0.000 ms, Optimization 0.816 ms, Emission 16.709 ms, Total 21.069 ms
Execution Time: 31.429 ms
Result:
| list_dates              | group_count |
|-------------------------|-------------|
| 2021-02-03 09:41:00.000 | 1           |
| 2021-02-03 09:42:00.000 | 3           |
| 2021-02-03 09:43:00.000 | 1           |
| 2021-02-03 09:44:00.000 | 3           |
| 2021-02-03 09:45:00.000 | 1           |
| 2021-02-03 09:46:00.000 | 5           |
| 2021-02-03 09:47:00.000 | 2           |
| 2021-02-03 09:48:00.000 | 1           |
| 2021-02-03 09:49:00.000 | 1           |
| 2021-02-03 09:50:00.000 | 1           |
| 2021-02-03 09:51:00.000 | 4           |
| 2021-02-03 09:52:00.000 | 0           |
| 2021-02-03 09:53:00.000 | 0           |
| 2021-02-03 09:54:00.000 | 2           |
| 2021-02-03 09:55:00.000 | 1           |
I'm trying to fetch user_counts and new_user_counts by date. new_user_counts is defined by the condition that the date of the timestamp event_timestamp equals the date of the timestamp user_first_touch_timestamp, while user_counts should be the distinct count of the user_pseudo_id field over the same date range. How can I do this in the same query? Here's how my current query looks.
Eventually, I'd like the result to be as:
| Date     | new_user_count | user_counts |
|----------|----------------|-------------|
| 20200820 | X              | Y           |
Here is the error I'm getting at line 8 of code:
Syntax error: Function call cannot be applied to this expression. Function calls require a path, e.g. a.b.c() at [8:5]
Thanks.
SELECT
event_date,
COUNT (DISTINCT(user_pseudo_id)) AS new_user_counts FROM
`my-google-analytics-table-name.*`
WHERE DATE(TIMESTAMP_MICROS(event_timestamp)) =
DATE(TIMESTAMP_MICROS(user_first_touch_timestamp))
AND event_date BETWEEN '20200820' AND '20200831'
(SELECT
COUNT (DISTINCT(user_pseudo_id)) AS user_counts
FROM `my-google-analytics-table-name.*`
WHERE event_date BETWEEN '20200820' AND '20200831'
)
GROUP BY event_date
ORDER BY event_date ASC
Try the query below (solely based on your original query, just fixing the syntax/logic):
SELECT
event_date,
COUNT(DISTINCT IF(
DATE(TIMESTAMP_MICROS(event_timestamp)) = DATE(TIMESTAMP_MICROS(user_first_touch_timestamp)),
user_pseudo_id,
NULL
)) AS new_user_counts,
COUNT(DISTINCT(user_pseudo_id)) AS user_counts
FROM `my-google-analytics-table-name.*`
GROUP BY event_date
ORDER BY event_date ASC
Using Oracle 11g and having a table like:
USER | TIME
----- | --------
User1 | 08:15:50
User1 | 10:42:22
User1 | 10:42:24
User1 | 10:42:35
User1 | 10:50:01
User2 | 13:23:05
User2 | 13:23:34
User2 | 13:24:01
User2 | 13:24:02
For each user I need to get (if available) exactly 3 records where the deviation between the first and the last is less than a minute. If there are more than 3 such rows, they don't match the criteria. Could you give me some clue?
The result should look like:
User1 | 10:42:22
User1 | 10:42:24
User1 | 10:42:35
Here's my stab at this. I don't have a live Oracle and SQLFiddle isn't working, so please advise how it turns out:
CREATE TABLE t (
u VARCHAR(5),
t DATETIME
);
INSERT INTO t
(u, t)
VALUES
('User1', '2001-01-01 08:15:50'),
('User1', '2001-01-01 10:42:22'),
('User1', '2001-01-01 10:42:24'),
('User1', '2001-01-01 10:42:35'),
('User1', '2001-01-01 10:50:01'),
('User2', '2001-01-01 13:23:05'),
('User2', '2001-01-01 13:23:34'),
('User2', '2001-01-01 13:24:01'),
('User2', '2001-01-01 13:24:02');
SELECT
z.u,
min(z.t) evt_start,
max(z.t) evt_end
FROM
(
SELECT y.*, SUM(prev_or_2prev_not_within) OVER(PARTITION BY u ORDER BY t ROWS UNBOUNDED PRECEDING) as ctr
FROM
(
SELECT
t.*,
CASE WHEN
t - LAG(t) OVER(PARTITION BY u ORDER BY t) < 1.0/1440.0 OR
t - LAG(t, 2) OVER(PARTITION BY u ORDER BY t) < 1.0/1440.0
THEN 0 ELSE 1
END as prev_or_2prev_not_within
FROM
t
) y
) z
GROUP BY
z.u,
z.ctr
HAVING COUNT(*) = 3
I believe it will establish an incrementing counter that doesn't increment when the previous or previous-previous row is within a minute of the current row. It does this by classing rows as 0 or 1; when a 0 occurs, the sum-over-all-preceding-rows operation produces a counter that doesn't change. It then groups on this counter, keeping groups with exactly 3 occurrences. The partition makes the counter work per user.
You can see it in action here: https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=018125210ecd071f3d11e3d4b3d3e670
It's SQL Server (as noted, I don't have an Oracle instance), but the terms used for SQL Server and the logic should be broadly similar for Oracle: Oracle supports LAG, unbounded sums, HAVING, etc., and it does date math in terms of dateA - dateB giving a floating-point number representing whole or fractional days (with 1440 minutes per day, 1/1440 represents one minute). The data types SQL Server uses might differ slightly from Oracle's, and this query depends on the TIME column (I called it t; I dislike column names that are reserved words/keywords) being a datetime, not a string that looks like a time. If your data is a string, sort that out first (use an inner subquery to generate a datetime, or change your data storage so it's stored as a datetime type).
You said you wanted a result that tells the user and the event time; the simplest way to do that was to use MIN and MAX to give you the date range. If you're desperate to have all 3 rows on show, you can join the output of this query back to the table with the date between evt_start and evt_end, or you can use some sort of string-aggregate function to produce a list of times straight out of the outermost group operation.
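For completeness, a sketch of that join-back idea (my addition, reusing the grouped query above unchanged inside a CTE; untested against a live server):
WITH grouped AS (
    SELECT z.u, MIN(z.t) AS evt_start, MAX(z.t) AS evt_end
    FROM (
        SELECT y.*,
               SUM(prev_or_2prev_not_within) OVER (PARTITION BY u ORDER BY t ROWS UNBOUNDED PRECEDING) AS ctr
        FROM (
            SELECT t.*,
                   CASE WHEN t - LAG(t)    OVER (PARTITION BY u ORDER BY t) < 1.0/1440.0
                          OR t - LAG(t, 2) OVER (PARTITION BY u ORDER BY t) < 1.0/1440.0
                        THEN 0 ELSE 1 END AS prev_or_2prev_not_within
            FROM t
        ) y
    ) z
    GROUP BY z.u, z.ctr
    HAVING COUNT(*) = 3
)
-- list the individual rows instead of just the time range
SELECT t.u, t.t
FROM t
JOIN grouped g
  ON g.u = t.u
 AND t.t BETWEEN g.evt_start AND g.evt_end
ORDER BY t.u, t.t;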
I would use analytical count() with range clause:
SQL Fiddle demo
select user_, to_char(time_, 'hh24:mi:ss') time_
from (
select user_, time_,
count(1) over (partition by user_ order by time_
range between interval '1' minute preceding
and interval '1' minute following) cnt
from (select user_, to_date(time_, 'hh24:mi:ss') time_ from tbl))
where cnt = 3
Result:
USER_ TIME_
----- --------
User1 10:42:22
User1 10:42:24
User1 10:42:35
Edit:
As @CaiusJard noticed, the first answer may show incorrect values when there are intervals like 10:52:01, 10:53:00, 10:53:59. There are some ways to correct this. The first is to find the min and max time in the group and check the condition numtodsinterval(max - min, 'day') <= interval '1' minute. The second is to number all rows, then flag the rows where the prior, current and following counts all equal 3.
Finally, show the flagged rows joined with the original table:
with t as (
select row_number() over (order by user_, time_) rn, tbl.*,
count(1) over (partition by user_ order by time_
range between interval '1' minute preceding
and interval '1' minute following) cnt
from (select user_, to_date(time_, 'hh24:mi:ss') time_ from tbl) tbl),
r as (select rn,
case when 3 = lag(cnt) over (partition by user_ order by time_)
and 3 = cnt
and 3 = lead(cnt) over (partition by user_ order by time_)
then 1
end flag
from t )
select * from t
join (select rn-1 r1, rn r2, rn+1 r3 from r where flag = 1) r
on rn in (r1, r2, r3)
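For reference, a rough sketch of the first idea (my reading of it, untested): reuse the same ±1 minute window to also take min and max, and keep only rows where the whole window stays within a minute:
select user_, to_char(time_, 'hh24:mi:ss') time_
from (
  select user_, time_,
         count(1) over (partition by user_ order by time_
                        range between interval '1' minute preceding
                              and interval '1' minute following) cnt,
         min(time_) over (partition by user_ order by time_
                          range between interval '1' minute preceding
                                and interval '1' minute following) mn,
         max(time_) over (partition by user_ order by time_
                          range between interval '1' minute preceding
                                and interval '1' minute following) mx
  from (select user_, to_date(time_, 'hh24:mi:ss') time_ from tbl))
where cnt = 3
  and numtodsinterval(mx - mn, 'day') <= interval '1' minute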
I want to build a daily time series from a certain date and calculate a few statistics for each day. However, this query is very slow... Any way to speed it up? (For example, select from the table once in the subquery and compute the various stats on it for each day.)
In code this would look like:
for i, day in series:
previous_days = series[0...i]
some_calculation_a = some_operation_on(previous_days)
some_calculation_b = some_other_operation_on(previous_days)
Here is an example for a time series looking for users with <= 5 messages up to that date:
with
days as
(
select date::Timestamp with time zone from generate_series('2015-07-09',
now(), '1 day'::interval) date
),
msgs as
(
select days.date,
(select count(customer_id) from daily_messages where sum < 5 and date_trunc('day'::text, created_at) <= days.date) as LT_5,
(select count(customer_id) from daily_messages where sum = 1 and date_trunc('day'::text, created_at) <= days.date) as EQ_1
from days, daily_messages
where date_trunc('day'::text, created_at) = days.date
group by days.date
)
select * from msgs;
Query breakdown:
CTE Scan on msgs (cost=815579.03..815583.03 rows=200 width=24)
Output: msgs.date, msgs.lt_5, msgs.eq_1
CTE days
-> Function Scan on pg_catalog.generate_series date (cost=0.01..10.01 rows=1000 width=8)
Output: date.date
Function Call: generate_series('2015-07-09 00:00:00+00'::timestamp with time zone, now(), '1 day'::interval)
CTE msgs
-> Group (cost=6192.62..815569.02 rows=200 width=8)
Output: days.date, (SubPlan 2), (SubPlan 3)
Group Key: days.date
-> Merge Join (cost=6192.62..11239.60 rows=287970 width=8)
Output: days.date
Merge Cond: (days.date = (date_trunc('day'::text, daily_messages_2.created_at)))
-> Sort (cost=69.83..72.33 rows=1000 width=8)
Output: days.date
Sort Key: days.date
-> CTE Scan on days (cost=0.00..20.00 rows=1000 width=8)
Output: days.date
-> Sort (cost=6122.79..6266.78 rows=57594 width=8)
Output: daily_messages_2.created_at, (date_trunc('day'::text, daily_messages_2.created_at))
Sort Key: (date_trunc('day'::text, daily_messages_2.created_at))
-> Seq Scan on public.daily_messages daily_messages_2 (cost=0.00..1568.94 rows=57594 width=8)
Output: daily_messages_2.created_at, date_trunc('day'::text, daily_messages_2.created_at)
SubPlan 2
-> Aggregate (cost=2016.89..2016.90 rows=1 width=32)
Output: count(daily_messages.customer_id)
-> Seq Scan on public.daily_messages (cost=0.00..2000.89 rows=6399 width=32)
Output: daily_messages.created_at, daily_messages.customer_id, daily_messages.day_total, daily_messages.sum, daily_messages.elapsed
Filter: ((daily_messages.sum < '5'::numeric) AND (date_trunc('day'::text, daily_messages.created_at) <= days.date))
SubPlan 3
-> Aggregate (cost=2001.13..2001.14 rows=1 width=32)
Output: count(daily_messages_1.customer_id)
-> Seq Scan on public.daily_messages daily_messages_1 (cost=0.00..2000.89 rows=96 width=32)
Output: daily_messages_1.created_at, daily_messages_1.customer_id, daily_messages_1.day_total, daily_messages_1.sum, daily_messages_1.elapsed
Filter: ((daily_messages_1.sum = '1'::numeric) AND (date_trunc('day'::text, daily_messages_1.created_at) <= days.date))
In addition to being very inefficient, I suspect the query is also incorrect. Assuming current Postgres 9.6, my educated guess:
SELECT created_at::date
, sum(count(customer_id) FILTER (WHERE sum < 5)) OVER w AS lt_5
, sum(count(customer_id) FILTER (WHERE sum = 1)) OVER w AS eq_1
FROM daily_messages m
WHERE created_at >= timestamptz '2015-07-09' -- sargable!
AND created_at < now() -- probably redundant
GROUP BY 1
WINDOW w AS (ORDER BY created_at::date);
All those correlated subqueries are probably not needed. I replaced them with window functions combined with aggregate FILTER clauses. You can have a window function over an aggregate function. Related answers with more explanation:
Postgres window function and group by exception
Conditional lead/lag function PostgreSQL?
The CTEs don't help either (unnecessary overhead). You would only need a single subquery - or not even that, just the result from the set-returning function generate_series(). generate_series() can deliver timestamptz directly. Be aware of the implications, though: your query depends on the time zone setting of the session. Details:
Ignoring timezones altogether in Rails and PostgreSQL
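A tiny illustration of that point (my example, not from the linked answers); how the timestamps map to days depends on the session's TimeZone setting:
SELECT g::date AS day
FROM generate_series(timestamptz '2015-07-09', now(), interval '1 day') g;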
On second thought, I removed generate_series() completely. As long as you have an INNER JOIN to daily_messages, only days with actual rows remain in the result anyway, so there is no need for generate_series() at all. It would make sense with a LEFT JOIN. Not enough information in the question.
Related answer explaining "sargable":
Return Table Type from A function in PostgreSQL
You might replace count(customer_id) with count(*). Not enough information in the question.
Might be optimized further, but there is not enough information to be more specific in the answer.
Include days without new entries in result
SELECT day
, sum(lt_5_day) OVER w AS lt_5
, sum(eq_1_day) OVER w AS eq_1
FROM (
SELECT day::date
FROM generate_series(date '2015-07-09', current_date, interval '1 day') day
) d
LEFT JOIN (
SELECT created_at::date AS day
, count(customer_id) FILTER (WHERE sum < 5) AS lt_5_day
, count(customer_id) FILTER (WHERE sum = 1) AS eq_1_day
FROM daily_messages m
WHERE created_at >= timestamptz '2015-07-09'
GROUP BY 1
) m USING (day)
WINDOW w AS (ORDER BY day);
Aggregate daily sums in subquery m.
Generate series of all days in time range in subquery d.
Use LEFT [OUTER] JOIN to retain all days in the result, even without new rows for the day.