PostgreSQL GROUP BY that includes zeros - postgresql-9.5

I have a SQL query (postgresql) that looks something like this:
SELECT
my_timestamp::timestamp::date as the_date,
count(*) as count
FROM my_table
WHERE ...
GROUP BY the_date
ORDER BY the_date
The result is a table of YYYY-MM-DD, count pairs.
Now I've been asked to fill in the empty dates with zero. So if I was previously providing
2022-03-15 3
2022-03-17 1
I'd now want to return
2022-03-15 3
2022-03-16 0
2022-03-17 1
Now I can easily do this client-side (relative to the database) and let my program compute and return the zero-augmented list to its clients based on the original list from postgres. But perhaps it would better if I could just tell postgresql to include zeros.
I suspect this isn't easy at all, because postgres has no obvious way of knowing what I'm up to. But in the interests of learning more about postgres and SQL, I thought I'd have try. The try isn't too promising thus far...
Any pointers before I conclude that I was right to leave this to my (postgres client) program?
Update
This is an interesting case where my simplification of the problem led to a correct answer that didn't work for me. For those who come after, I thought it worth documenting what followed, because it take some fun twists through constructing SQL queries.
#a_horse_with_no_name responded with a query that I've verified works if I simplify my own query to match. Unfortunately, my query had some extra baggage that I didn't think pertinent, and so had trimmed out when posting the original question.
Here's my real (original) query, with all names preserved (if shortened):
-- current query
SELECT
LEAST(time1, time2, time3, time4)::timestamp::date as the_date,
count(*) as count
FROM reading_group_reader rgr
INNER JOIN ( SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
WHERE LEAST(time1, time2, time3, time4) > current_date - 30
GROUP BY the_date
ORDER BY the_date;
If I translate that directly into the proposed solution, however, the inner join between reading_group_reader and the temporary table TT causes the left join to become inner (I think) and the date sequence drops its zeros again. Fwiw, the table TT is a table because sometimes it actually is a subselect.
So I transformed my query into this:
SELECT
g.dt::date as the_date,
count(*) as count
FROM generate_series(date '2022-03-06', date '2022-04-06', interval '1 day') as g(dt)
LEFT JOIN (
SELECT
LEAST(rgr.time1, rgr.time2, rgr.time3, rgr.time4)::timestamp::date as the_date
FROM reading_group_reader rgr
INNER JOIN (
SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
) rgrt
ON rgrt.the_date = g.dt::date
GROUP BY g.dt
ORDER BY the_date;
but this outputs 1's instead of 0's at the places that should be 0.
The reason for that, however, is because I've now selected every date, so, of course, there's one of each. I need to include an additional field (which will be NULL) and count that.
So this query finally does what I want:
SELECT
g.dt::date as the_date,
count(rgrt.device_id) as count
FROM generate_series(date '2022-03-06', date '2022-04-06', interval '1 day') as g(dt)
LEFT JOIN (
SELECT
LEAST(rgr.time1, rgr.time2, rgr.time3, rgr.time4)::timestamp::date as the_date,
rgr.device_id
FROM reading_group_reader rgr
INNER JOIN (
SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)
) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
) rgrt(the_date)
ON rgrt.the_date = g.dt::date
GROUP BY g.dt
ORDER BY g.dt;
And, of course, on re-reading the accepted answer, I eventually saw that he did count an unrelated field, which I'd simply missed on my first several readings.

You will need to join to a list of dates. This can e.g. be done using generate_series()
SELECT g.dt::date as the_date,
count(t.my_timestamp) as count
FROM generate_series(date '2022-03-01',
date '2022-03-31',
interval '1 day') as g(dt)
LEFT JOIN my_table as t
ON t.my_timestamp::date = g.dt::date
AND ... -- the original WHERE clause goes here!
GROUP BY the_date
ORDER BY the_date;
Note that the original WHERE conditions need to go into the join condition of the LEFT JOIN. You can't put them into a WHERE clause because that would turn the outer join back into an inner join (which means the missing dates wouldn't be returned).

Related

Add missing months with values from previous month

I need to use this SQL query for a software and get the time in a particular format hence the reason for the Time column however I need the query to insert the months that are missing with the value from the previous month. This is the query I currently have.
SELECT [accountnumber],SUM([postingamount]) AS Amount, [accountingdate],
convert(varchar(4),year(accountingdate))+'M'+ Format(DATEPART( MONTH, accountingdate) , '00')
AS [Time]
FROM [7 GL Detail MACL]
where [accountingdate]>='2019-01-01'
GROUP BY [accountingdate],[postingamount],[accountnumber]
Current Results
Expected Results
Since you didn't specify the RDBMS system you're using, I can't guarantee that this logic will work because every system uses slightly different SQL syntax.
However I used Rasgo datespine function to generate this SQL, as it is quite complex to wrap your head around, and tested it on Snowflake.
The main differences between Snowflake and other systems are: DATEADD and TABLE (GENERATOR())
In case you can't modify this to work in your system, here are the basic steps which you'll want to follow:
Select unique accountnumbers
Select unique dates (month beginnings?) This is where Snowflake uses GENERATOR but other systems might actually have a Calendar table you can select from
Cross Join (cartesian join) these to create every possible combination of accountnumber and date
Outer Join #3 to your data (might have to truncate your date to month-begin)
Filter out rows that dont apply. Like for instance you might have just inserted a row for 1/1/2019 for an account that didn't even begin until 12/12/2020.
WITH GLOBAL_SPINE AS (
SELECT
ROW_NUMBER() OVER (ORDER BY NULL) as INTERVAL_ID,
DATEADD('MONTH', (INTERVAL_ID - 1), '2019-01-01'::timestamp_ntz) as SPINE_START,
DATEADD('MONTH', INTERVAL_ID, '2022-06-01'::timestamp_ntz) as SPINE_END
FROM TABLE (GENERATOR(ROWCOUNT => 42))
),
GROUPS AS (
SELECT
accountnumber,
MIN(DESIRED_INTERVAL) AS LOCAL_START,
MAX(DESIRED_INTERVAL) AS LOCAL_END
FROM [7 GL Detail MACL]
GROUP BY
accountnumber
),
GROUP_SPINE AS (
SELECT
accountnumber,
SPINE_START AS GROUP_START,
SPINE_END AS GROUP_END
FROM GROUPS G
CROSS JOIN LATERAL (
SELECT
SPINE_START, SPINE_END
FROM GLOBAL_SPINE S
WHERE S.SPINE_START >= G.LOCAL_START
)
)
SELECT
G.accountnumber AS GROUP_BY_accountnumber,
GROUP_START,
GROUP_END,
T.*
FROM GROUP_SPINE G
LEFT JOIN {{ your_table }} T
ON DESIRED_INTERVAL >= G.GROUP_START
AND DESIRED_INTERVAL < G.GROUP_END
AND G.accountnumber = T.accountnumber;
You were also doing an aggregation step, but I figure once you get this complicated part down, you can figure out how to finally aggregate it the way you want it.

Get apps with the highest review count since a dynamic series of days

I have two tables, apps and reviews (simplified for the sake of discussion):
apps table
id int
reviews table
id int
review_date date
app_id int (foreign key that points to apps)
2 questions:
1. How can I write a query / function to answer the following question?:
Given a series of dates from the earliest reviews.review_date to the latest reviews.review_date (incrementing by a day), for each date, D, which apps had the most reviews if the app's earliest review was on or later than D?
I think I know how to write a query if given an explicit date:
SELECT
apps.id,
count(reviews.*)
FROM
reviews
INNER JOIN apps ON apps.id = reviews.app_id
group by
1
having
min(reviews.review_date) >= '2020-01-01'
order by 2 desc
limit 10;
But I don't know how to query this dynamically given the desired date series and compile all this information in a single view.
2. What's the best way to model this data?
It would be nice to have the # of reviews at the time for each date as well as the app_id. As of now I'm thinking something that might look like:
... 2020-01-01_app_id | 2020-01-01_review_count | 2020-01-02_app_id | 2020-01-02_review_count ...
But I'm wondering if there's a better way to do this. Stitching the data together also seems like a challenge.
I think this is what you are looking for:
Postgres 13 or newer
WITH cte AS ( -- MATERIALIZED
SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS total_ct
FROM reviews
GROUP BY 1
)
SELECT *
FROM (
SELECT generate_series(min(review_date)
, max(review_date)
, '1 day')::date
FROM reviews
) d(review_window_start)
LEFT JOIN LATERAL (
SELECT total_ct, array_agg(app_id) AS apps
FROM (
SELECT app_id, total_ct
FROM cte c
WHERE c.earliest_review >= d.review_window_start
ORDER BY total_ct DESC
FETCH FIRST 1 ROWS WITH TIES -- new & hot
) sub
GROUP BY 1
) a ON true;
WITH TIES makes it a bit cheaper. Added in Postgres 13 (currently beta). See:
Get top row(s) with highest value, with ties
Postgres 12 or older
WITH cte AS ( -- MATERIALIZED
SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS total_ct
FROM reviews
GROUP BY 1
)
SELECT *
FROM (
SELECT generate_series(min(review_date)
, max(review_date)
, '1 day')::date
FROM reviews
) d(review_window_start)
LEFT JOIN LATERAL (
SELECT total_ct, array_agg(app_id) AS apps
FROM (
SELECT total_ct, app_id
, rank() OVER (ORDER BY total_ct DESC) AS rnk
FROM cte c
WHERE c.earliest_review >= d.review_window_start
) sub
WHERE rnk = 1
GROUP BY 1
) a ON true;
db<>fiddle here
Same as above, but without WITH TIES.
We don't need to involve the table apps at all. The table reviews has all information we need.
The CTE cte computes earliest review & current total count per app. The CTE avoids repeated computation. Should help quite a bit.
It is always materialized before Postgres 12, and should be materialized automatically in Postgres 12 since it is used many times in the main query. Else you could add the keyword MATERIALIZED in Postgres 12 or later to force it. See:
How to force evaluation of subquery before joining / pushing down to foreign server
The optimized generate_series() call produces the series of days from earliest to latest review. See:
Generating time series between two dates in PostgreSQL
Join a count query on generate_series() and retrieve Null values as '0'
Finally, the LEFT JOIN LATERAL you already discovered. But since multiple apps can tie for the most reviews, retrieve all winners, which can be 0 - n apps. The query aggregates all daily winners into an array, so we get a single result row per review_window_start. Alternatively, define tiebreaker(s) to get at most one winner. See:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
If you are looking for hints, then here are a few:
Are you aware of generate_series() and how to use it to compose a table of dates given a start and end date? If not, then there are plenty of examples on this site.
To answer this question for any given date, you need to have only two measures for each app, and only one of these is used to compare an app against other apps. Your query in part 1 shows that you know what these two measures are.
Hints 1 and 2 should be enough to get this done. The only thing I can add is for you not to worry about making the database do "too much work." That is what it is there to do. If it does not do it quickly enough, then you can think about optimizations, but before you get to that step, concentrate on getting the answer that you want.
Please comment if you need further clarification on this.
The missing piece for me was lateral join.
I can accomplish just about what I want using the following:
select
review_windows.review_window_start,
id,
review_total,
earliest_review
from
(
select
date_trunc('day', review_windows.review_windows) :: date as review_window_start
from
generate_series(
(
SELECT
min(reviews.review_date)
FROM
reviews
),
(
SELECT
max(reviews.review_date)
FROM
reviews
),
'1 year'
) review_windows
order by
1 desc
) review_windows
left join lateral (
SELECT
apps.id,
count(reviews.*) as review_total,
min(reviews.review_date) as earliest_review
FROM
reviews
INNER JOIN apps ON apps.id = reviews.app_id
where
reviews.review_date >= review_windows.review_window_start
group by
1
having
min(reviews.review_date) >= review_windows.review_window_start
order by
2 desc,
3 desc
limit
2
) apps_most_reviews on true;

Recognize if not enough resources during timeframe

i need an idea how to solve the following problem.
Lets say i have one group with given timeframe (8:00-12:00) and i can assign resources (people) to it. Each resource can have a custom timeframe (like 9-10, 9-12,8-12 etc.) and could be assigned multiple times.
Tables
Groups
ID,
TITLE,
START_TIME,
END_TIME,
REQUIRED_PEOPLE:INTEGER
PeopleAssignments
ID,
USER_ID,
GROUP_ID,
START_TIME,
END_TIME
Now i have the rule that at any given time during the group timeframe that there have to be like like 4 people assigned. Otherwise i want to get a warning.
I am working with ruby & sql (Postgres) here.
is there an elegant way without iterating through the whole timeframe and checking if count(assignments) > REQUIRED_PEOPLE
You can solve this with only SQL too (if you are interested in such answers).
Range types offers great functions and operators to calculate with.
These solutions will give you rows, when there are sub-ranges, where there is some missing people from a given group (and it will give you which sub-range it is exactly & how many people is missing from the required number).
The easy way:
You wanted to try something similar to this. You'll need to pick some interval in which the count() is based on (I picked 5 minutes):
select g.id group_id, i start_time, i + interval '5 minutes' end_time, g.required_people - count(a.id)
from groups g
cross join generate_series(g.start_time, g.end_time, interval '5 minutes') i
left join people_assignments a on a.group_id = g.id
where tsrange(a.start_time, a.end_time) && tsrange(i, i + interval '5 minutes')
group by g.id, i
having g.required_people - count(a.id) > 0
order by g.id, i
But note that you won't be able to detect missing sub-ranges, when they are less than 5 minutes. F.ex. user1 has assignment for 11:00-11:56 and user2 has one for 11:59-13:00, they will appear to be "in" the group for 11:00-13:00 (so the missing sub-range of 11:56-11:59 will go unnoticed).
Note: the more short the interval is (what you've picked) the more precise (and slow!) the results will be.
http://rextester.com/GRC64969
The hard way:
You can accumulate the result on-the-fly with custom aggregates or recursive CTEs
with recursive r as (
-- start with "required_people" as "missing_required_people" in the whole range
select 0 iteration,
id group_id,
array[]::int[] used_assignment_ids,
-- build a json map, where keys are the time ranges
-- and values are the number of missing people for that range
jsonb_build_object(tsrange(start_time, end_time), required_people) required_people_per_time_range
from groups
where required_people > 0
and id = 1 -- query parameter
union all
select r.iteration + 1,
r.group_id,
r.used_assignment_ids || a.assignment_id,
d.required_people_per_time_range
from r
-- join a single assignment to the previous iteration, where
-- the assigment's time range overlaps with (at least one) time range,
-- where there is still missing people. when there are no such time range is
-- found in assignments, the "recursion" (which is really just a loop) stops
cross join lateral (
select a.id assignment_id, tsrange(start_time, end_time) time_range
from people_assignments a
cross join (select key::tsrange time_range from jsonb_each(r.required_people_per_time_range)) j
where a.group_id = r.group_id
and a.id <> ALL (r.used_assignment_ids)
and tsrange(start_time, end_time) && j.time_range
limit 1
) a
-- "partition" && accumulate all remaining time ranges with
-- the one found in the previous step
cross join lateral (
-- accumulate "partition" results
select jsonb_object_agg(u.time_range, u.required_people) required_people_per_time_range
from (select key::tsrange time_range, value::int required_people
from jsonb_each_text(r.required_people_per_time_range)) j
cross join lateral (
select u time_range, j.required_people - case when u && a.time_range then 1 else 0 end required_people
-- "partition" the found time range with all existing ones, one-by-one
from unnest(case
when j.time_range #> a.time_range
then array[tsrange(lower(j.time_range), lower(a.time_range)), a.time_range, tsrange(upper(a.time_range), upper(j.time_range))]
when j.time_range && a.time_range
then array[j.time_range * a.time_range, j.time_range - a.time_range]
else array[j.time_range]
end) u
where not isempty(u)
) u
) d
),
-- select only the last iteration
l as (
select group_id, required_people_per_time_range
from r
order by iteration desc
limit 1
)
-- unwind the accumulated json map
select l.group_id, lower(time_range) start_time, upper(time_range) end_time, missing_required_people
from l
cross join lateral (
select key::tsrange time_range, value::int missing_required_people
from jsonb_each_text(l.required_people_per_time_range)
) j
-- select only where there is still some missing people
-- this is optional, if you omit it you'll also see row(s) for sub-ranges where
-- there is enough people in the group (these rows will have zero,
-- or negative amount of "missing_required_people")
where j.missing_required_people > 0
http://rextester.com/GHPD52861
In any case you need to query number of assignment in DB. There is no other way to find how many times a group assign to people.
There might be ways to find number of assignment but in the end you have to fire a query to DB.
#group = Group.find(id)
if #group.people_assignments.count >= REQUIRED_PEOPLE
pus 'warning'
end
You can add extra column in group that hold information how many times that group assign to people. In this way one query to server reduced.
#group = Group.find(id)
if #group.count_people_assigned >= REQUIRED_PEOPLE
puts 'warning'
end
In second case count_people_assigned is column so no extra query will execute while in first case people_assignments is association so one extra query will fire.
But in second case you have you update group each time you assign group to people. Ultimately extra query. Your choice where you want to reduce query.
My opinion is second case, It will happen rare than first.

Filtering a large table by date

I have a table, VISIT_INFO, with these columns:
pers_key - unique identifyer for each person
pers_name - name of person
visit_date - date at which they visited a business
And another table, VALID_DATES, with these columns:
condition - string
start_date - date
end_date - date
I currently have the following query:
select pers_key, pers_name from VISIT_INFO a
CROSS JOIN
(select start_date, end_date from VALID_DATES where condition = 'condition1') b
WHERE (a.visit_date >= b.start_date and a.visit_date <= b.end_date)
GROUP BY a.pers_key
So 'condition1' has a specific start_date and end_date. I need to filter VISIT_INFO for visits that are between the two dates. I'm wondering if there is a more efficient way to do this. From my current understanding, it currently has to go through the entire table (millions of rows) and add start_date and end_date to each row. Then does it have to go through each row again and test against the WHERE condition?
I ask this because when I remove the cross join and hardcode the start_date and end_date for condition1, it takes substantially less time. I'm trying to avoid hardcoding in the dates because it will lead to serious tedium down the road.
So to reiterate, is there a better way to filter VISIT_INFO by specific dates in VALID_DATES?
Edit: I just realized I left out a pretty huge piece of information, being that this is all in HIVE. So EXISTS and joins on (a between b and c) are out of the question.
How about:
SELECT DISTINCT pers_key, pers_name
FROM visit_info
WHERE EXISTS
(
SELECT 1
FROM valid_dates
WHERE condition = 'condition1'
AND visit_date BETWEEN start_date AND end_date
);
?
with dt as (select start_date, end_date from VALID_DATES where condition = 'condition1')
select a.pers_key, a.pers_name
from VISIT_INFO a
JOIN dt on a.visit_date between dt.start_date and dt.end_date
GROUP BY a.pers_key
Trying the exists version is definitely a possibility. However, you might be better off expanding the VALID_DATES table, so there is one row per date.
Then, the query:
select vi.*
from VISIT_INFO vi JOIN
VALID_DATES_expanded vde
ON vi.visit_date = vde.valid_date
where vde.condition = 'condition1';
can make use of an index on VISIT_INFO(visit_date) and on VALID_DATES_expanded(condition, valid_date). This is likely to be the fastest approach to solving this problem, if VISIT_INFO is very large and relatively few rows are being selected by the query.

How do you select data from PostgreSQL database, but if no data is present for a given day, then return 0?

I have the following query:
SELECT created_at::DATE, count (*)
FROM messages
WHERE city = 'los angeles'
GROUP BY created_at::DATE
Which works great. The challenge is that if there are no messages for a given date, then it returns no record for that date. How do you make the above query return the date and 0 if there are no messages on that date, for all days between a given date and today?
Working in PostgreSQL 8.3.
Thanks!
It sounds like you need a table of all the dates you are interested in, as it may contain dates not in your messages table. If you have, or build, this table then left join with the messages table and do count on a column that table--it will return 0 where nothing matches the join.
select d.created_at, count(m.messageId)
from possibleDates d
left join messages m
on d.created_at = m.created_at
group by d.created_at
Typical way is to have a separate calendar table with all of the dates in it, left joined to your table on date column, and then some sort of ifnull(x, 0) statement [whatever the function is for PostgreSQL] or case statement to return 0 when the left-join on the date returns null or 1 when it is not null. Then you can do your normal group by and use SUM(x) instead of count().
Very often, when you want to fill in zeroes for missing entries in a series, the answer in PostgreSQL involves the generate_series function. (Search Stackoverflow for lots of similar questions and answers.) In your case, use something like this:
SELECT ts::date AS date, coalesce(count, 0) AS count
FROM
(SELECT created_at::date, count(*)
FROM messages
WHERE city = 'los angeles'
GROUP BY created_at::date) AS m
RIGHT JOIN
(SELECT *
FROM generate_series(timestamp '2011-07-01',
timestamp 'today',
interval '1 day')) AS series(ts)
ON m.created_at = series.ts
ORDER BY 1;