PostgreSQL, summing multiple date columns over multiple rows - sql

Let's assume the following table:
CREATE TABLE assumption ( datetime_1 TIMESTAMP, datetime_2 TIMESTAMP, datetime_3 TIMESTAMP);
Now I want to know the total number of times each month occurs in any row and any column combined, counting only dates that are in the future.
Right now I have:
SELECT
COALESCE(
SUM(
COALESCE(
datetime_1 > NOW() AND EXTRACT(MONTH FROM datetime_1) = 1,
FALSE
)::INT +
COALESCE(
datetime_2 > NOW() AND EXTRACT(MONTH FROM datetime_2) = 1,
FALSE
)::INT +
COALESCE(
datetime_3 > NOW() AND EXTRACT(MONTH FROM datetime_3) = 1,
FALSE
)::INT
), 0) AS january_count,
.... AS february_count,
.... AS march_count,
.... AS etc
FROM assumption;
This works and returns the right result, yet it is rather bloated and it returns a single row with a column for every month.
In real life this query is a bit more complex, and I would rather get one row per month (so I can add more fields to every monthly row).
Is there anything I am missing, or any way I can improve this?

Do you want a lateral join?
select date_trunc('month', d.dt) dt_month, count(*) cnt
from assumption a
cross join lateral (values (datetime_1), (datetime_2), (datetime_3)) d(dt)
where dt > now()
group by date_trunc('month', d.dt)
Truncating the date to the first day of the month seems more useful than extracting the month (if your data spreads over several years in the future, the results differ). But if you do mean extracting the month, then:
select extract(month from d.dt) dt_month, count(*) cnt
from assumption a
cross join lateral (values (datetime_1), (datetime_2), (datetime_3)) d(dt)
where dt > now()
group by extract(month from d.dt)
Finally, if you want a row for each month that appears in the data, even when none of its timestamps are in the future (months that never occur in the table still produce no row), then:
select extract(month from d.dt) dt_month, count(*) filter(where d.dt > now()) cnt
from assumption a
cross join lateral (values (datetime_1), (datetime_2), (datetime_3)) d(dt)
group by extract(month from d.dt)
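To sanity-check what the lateral-join queries compute, the same unpivot-then-count logic can be sketched outside the database. This is only an illustration in Python: the sample rows and the fixed `now` value are made up, and `count_future_months` is a hypothetical helper, not part of any library.

```python
from collections import Counter
from datetime import datetime

def count_future_months(rows, now):
    """Count future timestamps per (year, month) across all columns.

    Mirrors: CROSS JOIN LATERAL (VALUES ...) d(dt) WHERE dt > now()
             GROUP BY date_trunc('month', dt)
    """
    counts = Counter()
    for row in rows:
        for dt in row:                        # unpivot: one value per column
            if dt is not None and dt > now:   # NULLs and past dates are dropped
                counts[(dt.year, dt.month)] += 1
    return dict(counts)

# Made-up sample data: three timestamp columns per row.
rows = [
    (datetime(2030, 1, 5), datetime(2030, 1, 20), None),
    (datetime(2030, 2, 1), datetime(2020, 1, 1), datetime(2031, 1, 9)),
]
result = count_future_months(rows, now=datetime(2025, 6, 1))
print(result)
```

Keying on (year, month) corresponds to the date_trunc variant; keying on `dt.month` alone would correspond to the extract(month ...) variant, where January 2030 and January 2031 fall into the same bucket.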

Related

using LAG to compare the data from today and 7 days ago (not between)

I am currently trying to compare aggregated numbers from today and exactly 7 days ago (not between today and 7 days ago, but instead simply comparing these two discrete dates).
I already have a way of doing it using a lot of subqueries, but the performance is bad, and I am now trying to optimize.
This is what I have come up with so far (sample query, not with real table names and columns due to confidentiality):
Select current_date, previous_date, current_sum, previous_sum, percentage
From (Select date as current_date, sum(numbers) as current_sum,
lag (sum(numbers)) over (partition by date order by date) as previous_sum,
(Select max(date)-7 From t1 ) as previous_date,
(current_sum - previous_sum)*100/current_sum as percentage
From t1 where date>=sysdate-7 group by date,previous_date)
But I am definitely doing something wrong since in the output the previous_sum appears null, and naturally the percentage too.
Any ideas on what I am doing wrong? I haven't used LAG before so it must be something there.
Thanks!
Using Join of pre-aggregated subqueries.
with agg as (
select sum(numbers) as sum_numbers, date from t1 group by date
)
select curr.sum_numbers as current_sum,
prev.sum_numbers as prev_sum,
curr.date as curr_date,
prev.date as prev_date
from agg curr
left join agg prev on curr.date-7=prev.date
Using lag:
with agg as (
select sum(numbers) as sum_numbers, date from t1 group by date
)
select sum_numbers as current_sum,
lag(sum_numbers, 7) over(order by date) as prev_sum,
a.date as curr_date,
lag(a.date,7) over(order by date) as prev_date
from agg a
If you want exactly 2 dates only (today and today-7) then it can be done much simpler using conditional aggregation and filter:
select sum(case when date = trunc(sysdate) then numbers else null end) as current_sum,
sum(case when date = trunc(sysdate-7) then numbers else null end) as previous_sum,
trunc(sysdate) as curr_date,
trunc(sysdate-7) as prev_date,
(current_sum - previous_sum)*100/current_sum as percentage
from t1 where date = trunc(sysdate) or date = trunc(sysdate-7)
You can do this with window (analytic) functions, which should be the fastest method. Your actual aggregation query is a bit unclear, but I think it is:
select date as current_date, sum(numbers) as current_sum
from t1
group by date;
If you have values for all dates, then use:
select date as current_date, sum(numbers) as current_sum,
lag(sum(numbers), 7) over (order by date) as prev_7_sum
from t1
group by date;
If you don't have data for all days, then use a window frame:
select date as current_date, sum(numbers) as current_sum,
max(sum(numbers)) over (order by date range between interval '7' day preceding and interval '7' day preceding) as prev_7_sum
from t1
group by date;
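The difference between "7 rows back" (lag) and "exactly 7 days back" (the range frame or the self join) can be illustrated with a small Python sketch. The function name and the sample records are assumptions made up for this example; the percentage formula is the one from the question.

```python
from collections import defaultdict
from datetime import date, timedelta

def compare_with_week_before(records):
    """records: iterable of (day, number).

    Returns {day: (current_sum, previous_sum, percentage)} where previous_sum
    is the total for exactly day - 7, or None if that day has no data.
    """
    totals = defaultdict(int)
    for day, number in records:
        totals[day] += number                 # sum(numbers) ... group by date

    out = {}
    for day, current in sorted(totals.items()):
        previous = totals.get(day - timedelta(days=7))  # exact date lookup
        pct = None
        if previous is not None and current != 0:
            pct = (current - previous) * 100 / current
        out[day] = (current, previous, pct)
    return out

# Made-up data with missing days between Jan 3 and Jan 8.
records = [
    (date(2024, 1, 1), 10),
    (date(2024, 1, 1), 30),
    (date(2024, 1, 3), 5),
    (date(2024, 1, 8), 50),
]
comparison = compare_with_week_before(records)
print(comparison[date(2024, 1, 8)])
```

With these gaps, a plain lag(..., 7) would find no row seven positions back, while the date-based lookup still pairs Jan 8 with Jan 1.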

Month over Month percent change in user registrations

I am trying to write a query to find the month-over-month percent change in user registrations.
The users table holds the log of user registrations:
user_id - pk, integer
created_at - account created date, varchar
activated_at - account activated date, varchar
state - active or pending, varchar
I found the number of users for each year and month. How do I find month over month percent change in user registration? I think I need a window function?
SELECT
EXTRACT(month from created_at::timestamp) as created_month
,EXTRACT(year from created_at::timestamp) as created_year
,count(distinct user_id) as number_of_registration
FROM users
GROUP BY 1,2
ORDER BY 1,2
This is the output of above query:
Then I wrote this to find the difference in user registrations compared with the same month of the previous year.
SELECT
*
,number_of_registration - lag(number_of_registration) over (partition by created_month) as difference_in_previous_year
FROM (
SELECT
EXTRACT(month from created_at::timestamp) as created_month
,EXTRACT(year from created_at::timestamp) as created_year
,count( user_id) as number_of_registration
FROM users as u
GROUP BY 1,2
ORDER BY 1,2) as temp
The output is this:
You want an order by clause that contains created_year.
number_of_registration
- lag(number_of_registration) over (partition by created_month order by created_year) as difference_in_previous_year
Note that you don't actually need a subquery for this. You can do:
select
extract(year from created_at) as created_year,
extract(month from created_at) as created_month,
count(*) as number_of_registration,
count(*) - lag(count(*)) over(partition by extract(month from created_at) order by extract(year from created_at)) as difference_in_previous_year
from users as u
group by created_year, created_month
order by created_year, created_month
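I used count(*) instead of count(user_id), because I assume that user_id is not nullable (in which case count(*) is equivalent, and more efficient). The query above also omits the cast to timestamp, which assumes created_at is a real date or timestamp column; if it is stored as varchar as in the question, keep the created_at::timestamp cast.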
These queries work as long as you have data for every month. If you have gaps, then the problem should be addressed differently - but this is not the question you asked here.
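Why the ORDER BY inside the window matters can be seen in a short Python sketch of what lag() does per partition. The helper name and the sample counts are invented for illustration only.

```python
from collections import defaultdict

def year_over_year_diff(monthly_counts):
    """monthly_counts: {(year, month): registrations}.

    Returns {(year, month): difference vs. the previous year available for
    that month}, with None where no earlier row exists (lag() gives NULL).
    """
    by_month = defaultdict(list)
    for (year, month), n in monthly_counts.items():
        by_month[month].append((year, n))     # partition by created_month

    diffs = {}
    for month, rows in by_month.items():
        rows.sort()                           # order by created_year
        prev = None
        for year, n in rows:
            diffs[(year, month)] = None if prev is None else n - prev
            prev = n
    return diffs

counts = {(2013, 1): 100, (2014, 1): 150, (2013, 2): 80, (2014, 2): 60}
diffs = year_over_year_diff(counts)
print(diffs)
```

Without the sort step, "previous" would depend on whatever order the rows happen to arrive in, which is the situation the original query was in without ORDER BY.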
I can get the registrations from each year as two derived tables and join them, but it is not very efficient:
SELECT
t1.created_year as year_2013
,t2.created_year as year_2014
,t1.created_month as month_of_year
,t1.number_of_registration_2013
,t2.number_of_registration_2014
,100.0 * (t2.number_of_registration_2014 - t1.number_of_registration_2013) / t1.number_of_registration_2013 as percent_change_in_previous_year_month
FROM
(select
extract(year from created_at) as created_year
,extract(month from created_at) as created_month
,count(*) as number_of_registration_2013
from users
where extract(year from created_at) = '2013'
group by 1,2) t1
inner join
(select
extract(year from created_at) as created_year
,extract(month from created_at) as created_month
,count(*) as number_of_registration_2014
from users
where extract(year from created_at) = '2014'
group by 1,2) t2
on t1.created_month = t2.created_month
First off, why are you using strings to hold date/time values? Your first step should be to define created_at and activated_at as proper timestamps. The query below assumes this correction. If you do not correct it, cast the string to a timestamp in the CTE that generates the date range and in the join condition. But keep in mind that if you leave it as text, you will at some point get a conversion exception.
To calculate month-over-month use the formula "100*(Nt - Nl)/Nl" where Nt is the number of users this month and Nl is the number of users last month. There are 2 potential issues:
There are gaps in the data.
Nl is 0 (would incur divide by 0 exception)
The following handles this by first generating the months from the earliest date to the latest date, then outer joining the monthly counts to the generated months. When Nl = 0 the query returns NULL, indicating that the percent change could not be calculated.
with full_range(the_month) as
(select generate_series(low_month, high_month, interval '1 month')
from (select min(date_trunc('month',created_at)) low_month
, max(date_trunc('month',created_at)) high_month
from users
) m
)
select to_char(the_month,'yyyy-mm')
, users_this_month
, case when users_last_month = 0
then null::float
else round((100.00*(users_this_month-users_last_month)/users_last_month),2)
end percent_change
from (
select the_month, users_this_month , lag(users_this_month) over(order by the_month) users_last_month
from ( select f.the_month, count(u.created_at) users_this_month
from full_range f
left join users u on date_trunc('month',u.created_at) = f.the_month
group by f.the_month
) mc
) pc
order by the_month;
NOTE: There are several places where the above can be shortened, but the longer form is intentional to show how the final values are derived.
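As a cross-check of the formula and the two edge cases (gaps and Nl = 0), here is a small Python sketch of the same computation. It is an illustration under assumptions: `month_over_month` is a hypothetical helper, and the sample counts are made up.

```python
def month_over_month(counts):
    """counts: {(year, month): registrations}, possibly with missing months.

    Returns a list of (label, users_this_month, percent_change) covering every
    month from the earliest to the latest key.
    """
    if not counts:
        return []
    first, last = min(counts), max(counts)
    rows = []
    prev = None
    y, m = first
    while (y, m) <= last:
        current = counts.get((y, m), 0)       # gap months count as 0 users
        if prev is None or prev == 0:
            change = None                     # no previous month, or Nl = 0
        else:
            change = round(100.0 * (current - prev) / prev, 2)
        rows.append((f"{y:04d}-{m:02d}", current, change))
        prev = current
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return rows

counts = {(2013, 11): 50, (2013, 12): 75, (2014, 2): 30}
rows = month_over_month(counts)
for row in rows:
    print(row)
```

The missing January 2014 shows up as a month with 0 users, so February 2014 gets no percentage because its previous month is 0.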

Calculating differences in average monthly between years

I have a table which contains average monthly values from a sensor over the last 3 years
Is there a way in which I can calculate the differences between, for example, the monthly values in 2019 and the monthly values in 2018, and perhaps create a new table or view that includes the 2018 dates in one column, 2019 dates in another and the difference in sensor reading value in a third ?
Thanks
TP
Assuming that your data has no missing month/year, one option uses window functions:
select
t.*,
lag(average) over(
partition by sensor_id, extract(month from m)
order by extract(year from m)
) last_year_average
from mytable t
This puts all rows that belong to the same sensor and the same month in the same partition. You can then use the year part of the timestamp as an ordering column.
You can use the new column as needed to compare it to the current average.
If you have a value for every month, you can just use a 12-month lag:
select t.*,
lag(average, 12) over (partition by sensor_id
order by m
) as last_year_average
from t;
Filtering this to just 2019/2018 requires a subquery:
select t.*
from (select t.*,
lag(average, 12) over (partition by sensor_id
order by m
) as last_year_average
from t
) t
where m >= '2019-01-01'::date and
m < '2020-01-01'::date
If you are missing months, then neither this nor GMB's answer will work correctly. Instead, you can use a join, aggregation, or window function:
select t.*, tprev.average as last_year_average
from t left join
t tprev
on tprev.sensor_id = t.sensor_id and
tprev.m = t.m - interval '12 month'
where t.m >= '2019-01-01'::date and t.m < '2020-01-01'::date;
Two other methods are:
select t.sensor_id, extract(month from t.m) as mon,
max(average) filter (where extract(year from t.m) = 2019) as avg_2019,
max(average) filter (where extract(year from t.m) = 2018) as avg_2018
from t
group by t.sensor_id, extract(month from t.m);
And to use window functions safely if there is the possibility of missing months:
select t.*,
max(average) over (partition by sensor_id
order by m
range between interval '1 year' preceding and interval '1 year' preceding
) as average_prev
from t;
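The gap-safe approaches all come down to looking up the exact same month one year earlier instead of counting rows. A minimal Python sketch of that idea follows; the function name, the key layout and the sample readings are assumptions for illustration.

```python
def with_last_year(readings):
    """readings: {(sensor_id, year, month): average}.

    Returns rows (sensor_id, year, month, average, last_year_average, diff);
    the last two are None when the same month of the previous year is missing.
    """
    rows = []
    for (sensor, year, month), avg in sorted(readings.items()):
        prev = readings.get((sensor, year - 1, month))  # same month, one year back
        diff = None if prev is None else avg - prev
        rows.append((sensor, year, month, avg, prev, diff))
    return rows

readings = {
    ("s1", 2018, 1): 10.0,
    ("s1", 2019, 1): 12.5,
    ("s1", 2019, 2): 11.0,   # no 2018-02 reading: a gap
}
rows = with_last_year(readings)
for row in rows:
    print(row)
```

A 12-row lag over this data would pair the wrong months once a reading is missing, whereas the keyed lookup simply returns nothing for February 2019.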

grouping by column but getting multiple results for each

I am trying to calculate the median response time for conversations on each date for the last X days.
I use the following query below, but for some reason, it will generate multiple rows with the same date.
with grouping as (
SELECT a.id, d.date, extract(epoch from (first_response_at - started_at)) as response_time
FROM (
select to_char(date_trunc('day', (current_date - offs)), 'YYYY-MM-DD') AS date
FROM generate_series(0, 2) AS offs
) d
LEFT OUTER JOIN apps a on true
LEFT OUTER JOIN conversations c ON (d.date=to_char(date_trunc('day'::varchar, c.started_at), 'YYYY-MM-DD')) and a.id = c.app_id
and c.app_id = a.id and c.first_response_at > (current_date - (2 || ' days')::interval)::date
)
select
*
from grouping
where grouping.id = 'ASnYW1-RgCl0I'
Any ideas?
First a number of issues with your query, assuming there aren't any parts you haven't shown us:
You don't need a CTE for this query.
From table apps you only use column id whose value is the same as c.app_id. You can remove the table apps and select c.app_id for the same result.
When you use to_char() you do not first have to date_trunc() to a date, the to_char() function handles that.
generate_series() also works with timestamps. Just enter day values with an interval and cast the end result to date before using it.
So, removing all the flotsam we end up with this which does exactly the same as the query in your question but now we can at least see what is going on.
SELECT c.app_id, d.date::date AS date,
extract(epoch from (first_response_at - started_at)) AS response_time
FROM generate_series(CURRENT_DATE - 2, CURRENT_DATE, interval '1 day') d(date)
LEFT JOIN conversations c ON d.date::date = c.started_at::date
AND c.app_id = 'ASnYW1-RgCl0I'
AND c.first_response_at > CURRENT_DATE - 2;
You don't calculate the median response time anywhere, so that is a big problem you need to solve. This only requires data from table conversations and would look somewhat like this to calculate the median response time for the past 2 days:
SELECT app_id, started_at::date AS start_date,
percentile_disc(0.5) WITHIN GROUP (ORDER BY first_response_at - started_at) AS median_response
FROM conversations
WHERE app_id = 'ASnYW1-RgCl0I'
AND first_response_at > CURRENT_DATE - 2
GROUP BY 1, 2;
When we fold the two queries, and put the parameters handily in a single place, this is the final result:
SELECT p.id, d.date::date AS date,
extract(epoch from (c.median_response)) AS response_time
FROM (VALUES ('ASnYW1-RgCl0I', 2)) p(id, days)
JOIN generate_series(CURRENT_DATE - p.days, CURRENT_DATE, interval '1 day') d(date) ON true
LEFT JOIN LATERAL (
SELECT started_at::date AS start_date,
percentile_disc(0.5) WITHIN GROUP (ORDER BY first_response_at - started_at) AS median_response
FROM conversations
WHERE app_id = p.id
AND first_response_at > CURRENT_DATE - p.days
GROUP BY 1) c ON d.date::date = c.start_date;
If you want to change the id of the app or the number of days to look back, you only have to change the VALUES clause accordingly. You can also wrap the whole thing in a SQL function and convert the VALUES clause into two parameters.
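To verify the output of such a query on a small sample, the per-day discrete median can be reproduced in a few lines of Python. This is a simplified sketch: it ignores the first_response_at filter and the app id, and `discrete_median` and `median_per_day` are hypothetical helper names. The selection rule follows percentile_disc(0.5), which always returns one of the input values.

```python
import math
from collections import defaultdict
from datetime import date, timedelta

def discrete_median(values):
    """Same rule as percentile_disc(0.5): the first sorted value whose
    cumulative position reaches 0.5 (never an interpolated value)."""
    ordered = sorted(values)
    return ordered[math.ceil(0.5 * len(ordered)) - 1]

def median_per_day(conversations, today, days):
    """conversations: iterable of (started_day, response_seconds)."""
    by_day = defaultdict(list)
    for started_day, seconds in conversations:
        by_day[started_day].append(seconds)

    result = []
    for offset in range(days, -1, -1):        # today - days ... today
        day = today - timedelta(days=offset)
        values = by_day.get(day)
        result.append((day, discrete_median(values) if values else None))
    return result

# Made-up sample: four conversations on Mar 1, none on Mar 2, one on Mar 3.
conversations = [
    (date(2024, 3, 1), 30), (date(2024, 3, 1), 10),
    (date(2024, 3, 1), 50), (date(2024, 3, 1), 20),
    (date(2024, 3, 3), 40),
]
medians = median_per_day(conversations, today=date(2024, 3, 3), days=2)
print(medians)
```

Each date appears exactly once, and days without conversations are kept with an empty median, which is what the generate_series plus LEFT JOIN combination provides in SQL.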

Aggregates for today and the previous day depending on data

I'm having trouble putting together a query to pull the aggregate values of a given timestamp and the timestamp before it. Given the following schema:
name TEXT,
ts TIMESTAMP,
X NUMERIC,
Y NUMERIC
where there are gaps in the ts column due to gaps in data, I'm trying to construct a query to produce
name,
date_trunc('day', q1.ts),
avg(q1.X),
sum(q1.Y),
date_trunc('day', q2.ts),
avg(q2.X),
sum(q2.Y)
The first half is straightforward:
SELECT q1.name, date_trunc('day', q1.ts), avg(q1.X), sum(q1.Y)
FROM data as q1
GROUP BY 1, 2
ORDER BY 1, 2;
But not sure how to generate the relation to find the "day" before for each row. I'm trying to work an inner join like this:
SELECT q1.name, q1.day, q1.avg, q1.sum, q2.day, q2.avg, q2.sum
FROM (
SELECT name, date_trunc('day', ts) AS day, avg(X) AS avg, sum(Y) as sum
FROM data
GROUP BY 1,2
ORDER BY 1,2
) q1 INNER JOIN (
SELECT name, date_trunc('day', ts) AS day, avg(X) AS avg, sum(Y) as sum
FROM data
GROUP BY 1,2
ORDER BY 1,2
) q2 ON (
q1.name = q2.name
AND q2.day = q1.day - interval '1 day'
);
The problem with this is that it doesn't cover the cases where the previous "day" with data is more than 1 day before the current day.
The special difficulty here is that you need to number days after aggregating rows. You can do this in a single query level with the window function row_number(), since window functions are applied after aggregation by GROUP BY.
Also, use a CTE to avoid executing the same subquery multiple times:
WITH q AS (
SELECT name, ts::date AS day
,avg(x) AS avg_x, sum(y) AS sum_y
,row_number() OVER (PARTITION BY name ORDER BY ts::date) AS rn
FROM data
GROUP BY 1,2
)
SELECT q1.name, q1.day, q1.avg_x, q1.sum_y
,q2.day AS day2, q2.avg_x AS avg_x2, q2.sum_y AS sum_y2
FROM q q1
LEFT JOIN q q2 ON q1.name = q2.name
AND q1.rn = q2.rn + 1
ORDER BY 1,2;
This uses the simpler cast to date (ts::date) instead of date_trunc('day', ts) to get "days".
LEFT [OUTER] JOIN (as opposed to [INNER] JOIN) is instrumental to preserve the corner case of the first row, where there is no previous day.
And ORDER BY should be applied to the outer query.
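The numbering-then-joining idea can be sketched in Python to show why it copes with gaps: each day is paired with the previous row in order, not with the previous calendar day. The helper name and the sample aggregates are made up for this illustration.

```python
from collections import defaultdict
from datetime import date

def pair_with_previous_day(daily):
    """daily: {(name, day): (avg_x, sum_y)}, already aggregated per day.

    Returns rows (name, day, agg, prev_day, prev_agg); prev_day and prev_agg
    are None for the first day of each name (the LEFT JOIN corner case).
    """
    by_name = defaultdict(list)
    for (name, day), agg in daily.items():
        by_name[name].append((day, agg))

    rows = []
    for name in sorted(by_name):
        ordered = sorted(by_name[name])       # row_number() per name, by day
        for rn, (day, agg) in enumerate(ordered):
            prev_day, prev_agg = ordered[rn - 1] if rn > 0 else (None, None)
            rows.append((name, day, agg, prev_day, prev_agg))
    return rows

# Made-up aggregates with a four-day gap between the two days.
daily = {
    ("a", date(2024, 1, 1)): (1.0, 10),
    ("a", date(2024, 1, 5)): (2.0, 20),
}
pairs = pair_with_previous_day(daily)
for row in pairs:
    print(row)
```

Jan 5 is paired with Jan 1 even though they are four days apart, which the `q1.day - interval '1 day'` join condition in the question could not do.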
The question isn't crystal clear, but it sounds like you're actually trying to fill gaps while keeping track of leading/lagging rows.
To fill the gaps, look into generate_series() and left join it with your table:
select d
from generate_series(timestamp '2013-12-01', timestamp '2013-12-31', interval '1 day') d;
http://www.postgresql.org/docs/current/static/functions-srf.html
For previous and next row values, look into lead() and lag() window functions:
select date_trunc('day', ts) as curr_row_day,
lag(date_trunc('day', ts)) over w as prev_row_day
from data
window w as (order by ts)
http://www.postgresql.org/docs/current/static/tutorial-window.html