PostgreSQL Serial Daily Count of Records

I am trying to get a daily count of records from 1st Jan to the current date, without skipping any dates and returning 0 for dates that have no records.
I have tried the following (orders is an example table and orderdate is a timestamp column):
with days as (
select generate_series(
date_trunc('day','2020-01-01'::timestamp),
date_trunc('day', now()),
'1 day'::interval
) as day
)
select
days.day,
count(orders.id)
from days
left outer join orders on date_trunc('day', orders.orderdate) = days.day
where orders.orders_type='C'
group by 1
The issue is that dates are skipped.
Yet if I execute:
select generate_series(
date_trunc('day','2020-01-01'::timestamp),
date_trunc('day', now()),
'1 day'::interval
)
I get the right series with no dates skipped.

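The where condition belongs in the on clause of the left join. In the where clause it is applied after the join, so days with no matching order (where orders.orders_type is null) are filtered out and the left join effectively behaves like an inner join. That is, this: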
from days
left outer join orders on date_trunc('day', orders.orderdate) = days.day
where orders.orders_type='C'
Should be written:
from days
left outer join orders
on date_trunc('day', orders.orderdate) = days.day
and orders.orders_type='C'
Notes:
you don't actually need a cte here, you can put generate_series() directly in the from clause
the date join condition can be optimized to an index-friendly expression that avoids date_trunc()
table aliases make the query easier to read and write
You could write the query as:
select d.day, count(o.id)
from generate_series(date '2020-01-01', now(), '1 day') as d(day)
left outer join orders o
on o.orderdate >= d.day
and o.orderdate < d.day + '1 day'::interval
and o.orders_type='C'
group by d.day
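To see the difference between the two placements on a tiny data set, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption made only so the example is self-contained), a hand-filled days table instead of generate_series(), and SQLite's date() instead of date_trunc(). The sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE days (day TEXT PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, orderdate TEXT, orders_type TEXT);
    INSERT INTO days VALUES ('2020-01-01'), ('2020-01-02'), ('2020-01-03');
    INSERT INTO orders (orderdate, orders_type) VALUES
        ('2020-01-01 09:30:00', 'C'),
        ('2020-01-02 14:00:00', 'X');
""")

# Filter in WHERE: rows without a matching 'C' order are dropped after the join.
filter_in_where = conn.execute("""
    SELECT d.day, COUNT(o.id)
    FROM days d
    LEFT JOIN orders o ON date(o.orderdate) = d.day
    WHERE o.orders_type = 'C'
    GROUP BY d.day
    ORDER BY d.day
""").fetchall()

# Filter in ON: every day is kept, days without a 'C' order count as 0.
filter_in_on = conn.execute("""
    SELECT d.day, COUNT(o.id)
    FROM days d
    LEFT JOIN orders o
           ON date(o.orderdate) = d.day
          AND o.orders_type = 'C'
    GROUP BY d.day
    ORDER BY d.day
""").fetchall()

print(filter_in_where)  # only the day that has a 'C' order
print(filter_in_on)     # all three days, with zero counts where nothing matched
```

With the filter in the where clause only 2020-01-01 comes back; with the filter in the on clause all three days come back, with counts 1, 0 and 0.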

Related

SELF JOIN a query to obtain the number of reactivated users

Assume you have the table given below containing information on Facebook user logins. Write a query to obtain the number of reactivated users (which are dormant users who did not log in the previous month, who then logged in during the current month). Output the current month and number of reactivated users.
I have tried this question by first making an inner join combining a user's previous month to current month with this code.
WITH CTE as
(SELECT user_id,
EXTRACT(month from login_date) as current_month,
EXTRACT(month from login_date)-1 as prev_month
FROM user_logins)
SELECT a.user_id as user_id, a.current_month, a.prev_month,
b.user_id as prev_month_user
FROM CTE a LEFT JOIN CTE b
ON a.prev_month = b.current_month
My idea is to use a case statement
CASE WHEN a.user_id IN
(SELECT b.user_id
WHERE b.current_month = a.prev_month)
THEN 0 ELSE 1 END
But that gives me the wrong output for user_id 245 in current_month 4.
https://drive.google.com/file/d/1dOQQxaJWv7j7o7M1Q98nlj77KCzIHxKl/view?usp=sharing
How to fix this?
This gets you the first day of the current month:
select date_trunc('month', current_date)
You can add or subtract an interval of one month to get the previous or next month's starting date.
The complete query:
select *
from users
where user_id in
(
select user_id
from user_logins
where login_date >= date_trunc('month', current_date)
and login_date < date_trunc('month', current_date) + interval '1 month'
)
and user_id not in
(
select user_id
from user_logins
where login_date >= date_trunc('month', current_date) - interval '1 month'
and login_date < date_trunc('month', current_date)
)
Well, admittedly
and login_date < date_trunc('month', current_date) + interval '1 month'
is probably unnecessary here, because the table won't contain future logins :-) So, keep it or remove it, as you like.
If you want a self join, you should get distinct user/month pairs first. Then, since you want the user/month pairs for which no pair exists for the same user in the previous month (a case where NOT EXISTS would also be appropriate), your join must be an anti join. This means you outer join the previous-month pair and keep only the outer-joined rows, i.e. the non-matches.
WITH cte AS
(
SELECT DISTINCT user_id, DATE_TRUNC('month', login_date) AS month
FROM user_logins
)
SELECT mon.month, mon.user_id
FROM cte mon
LEFT JOIN cte prev ON prev.user_id = mon.user_id
AND prev.month = mon.month - INTERVAL '1 month'
WHERE prev.month IS NULL -- anti join
ORDER BY mon.month, mon.user_id;
I don't find anti joins very readable and would use NOT EXISTS instead. But that's a matter of personal preference, I guess. The query gives you all users who logged in during a month, but not during the previous month. You can of course limit this to the current month. Or you can aggregate per month and count. Or remove the WHERE clause and count repeating users vs. new ones (COUNT(*) = all that month, COUNT(prev.month) = all repeating users, COUNT(*) - COUNT(prev.month) = all new users).
Well having said this, ... wasn't the task about reactivated users? Then you are looking for users who were active once, then paused a month, then became active again. Here is a simple query to get this for users who paused last month:
select user_id
from user_logins
group by user_id
having min(login_date) < date_trunc('month', current_date) - interval '1 month'
and max(login_date) >= date_trunc('month', current_date)
and count(*) filter (where login_date >= date_trunc('month', current_date) - interval '1 month'
and login_date < date_trunc('month', current_date)) = 0;
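As a rough sketch of how the anti join and the "was active before" condition could be combined into a per-month count, here is a small runnable example. It assumes SQLite (via Python's sqlite3) in place of PostgreSQL, so date(x, 'start of month') stands in for date_trunc('month', x) and date(x, '-1 month') for x - interval '1 month'. The sample logins are invented for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_logins (user_id INTEGER, login_date TEXT)")
conn.executemany(
    "INSERT INTO user_logins VALUES (?, ?)",
    [
        (1, "2022-01-10"),  # user 1 is active in January,
        (1, "2022-03-05"),  # skips February and returns in March
        (2, "2022-02-01"),  # user 2 is active in February and March
        (2, "2022-03-02"),
        (3, "2022-03-09"),  # user 3 logs in for the first time in March
    ],
)

reactivated_per_month = conn.execute("""
    WITH months AS (
        -- one row per user and month with at least one login
        SELECT DISTINCT user_id, date(login_date, 'start of month') AS month
        FROM user_logins
    )
    SELECT cur.month, COUNT(*) AS reactivated_users
    FROM months cur
    LEFT JOIN months prev
           ON prev.user_id = cur.user_id
          AND prev.month = date(cur.month, '-1 month')
    WHERE prev.month IS NULL              -- anti join: no login in the previous month
      AND EXISTS (                        -- but some login in an earlier month
            SELECT 1
            FROM months older
            WHERE older.user_id = cur.user_id
              AND older.month < cur.month)
    GROUP BY cur.month
    ORDER BY cur.month
""").fetchall()

print(reactivated_per_month)
```

Only user 1 counts as reactivated here: user 2 was also active in February, and user 3 has no earlier login at all.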

Filtering query with join is not returning the correct results

I am trying to filter my query, which looks at payments data, by joining it with another table (the accounts table), as I want the data filtered by the condition accounts.provider = 'z'. However, the results returned are exact multiples of the real figures (times 13, 20, etc.), with a different multiple for different dates. The query is also really slow, so I am looking for advice on making it run quicker too.
SELECT
distinct on (t.day) t.day as day,
coalesce(collected_payments,0)
from
( SELECT day::date
FROM generate_series(timestamp '2017-03-13', current_date + interval '1 week', interval '1 day') day
) d
left JOIN (
SELECT date_trunc('day', t.payment_date)::date AS day,
sum(case when t.payment_amount > 0
and t.description not ilike '%credit%'
and t.state = 'success'
then t.payment_amount end) as collected_payments
FROM payments t
inner join payments p on p.payment_date = date_trunc('day', t.payment_date)::date
inner join accounts on accounts.id = p.account_id and accounts.provider = 'z'
where date_trunc('day', t.payment_date)::date <= current_date + interval '1 week'
and date_trunc('day', t.payment_date)::date >= current_date - interval'1 months'
GROUP BY 1
) t USING (day)
ORDER BY day desc

Speed up query where results with count(*) = 0 are included

I have a table squitters with, amongst others, a column parsed_time. I want to know the number of records per hour for the last two days and used this query:
SELECT date_trunc('hour', parsed_time) AS hour , count(*)
FROM squitters
WHERE parsed_time > date_trunc('hour', now()) - interval '2 day'
GROUP BY hour
ORDER BY hour DESC;
This works, but hours with zero records do not appear in the result. I want hours with zero records to appear in the result too, with a count equal to zero, so I wrote this query using the generate_series function:
SELECT bins.hour, count(squitters.parsed_time)
FROM generate_series(date_trunc('hour', now() - interval '2 day'), now(), '1 hour') bins(hour)
LEFT OUTER JOIN squitters ON bins.hour = date_trunc('hours', squitters.parsed_time)
GROUP BY bins.hour
ORDER BY bins.hour DESC;
This works, and the results include hour bins with counts equal to zero, but it is considerably slower.
How can I have the speed of the first query with the count=zero results of the second query?
(btw. there is an index on parsed_time)
You could try and change the join condition so no date function is applied on column parsed_time:
SELECT b.hour, COUNT(s.parsed_time) cnt
FROM generate_series(date_trunc('hour', now() - interval '2 day'), now(), '1 hour') b(hour)
LEFT OUTER JOIN squitters s
ON s.parsed_time >= b.hour
AND s.parsed_time < b.hour + interval '1 hour'
GROUP BY b.hour
ORDER BY b.hour DESC;
Alternatively, you could also try using a correlated subquery (or a lateral join) instead of a left join - this avoids the need for outer aggregation:
SELECT
b.hour,
(
SELECT COUNT(*)
FROM squitters s
WHERE s.parsed_time >= b.hour AND s.parsed_time < b.hour + interval '1 hour'
) cnt
FROM generate_series(date_trunc('hour', now() - interval '2 day'), now(), '1 hour') b(hour)
ORDER BY b.hour desc
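Here is a small sketch of the correlated-subquery variant with half-open hour ranges, so you can check that empty bins come back with a count of 0. It assumes SQLite (through Python's sqlite3) rather than PostgreSQL, so the hour bins come from a recursive CTE instead of generate_series() and datetime(x, '+1 hour') replaces x + interval '1 hour'. The timestamps are fixed sample values, not now().

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE squitters (parsed_time TEXT)")
conn.execute("CREATE INDEX squitters_parsed_time_idx ON squitters (parsed_time)")
conn.executemany(
    "INSERT INTO squitters VALUES (?)",
    [
        ("2024-01-01 00:15:00",),
        ("2024-01-01 00:45:00",),
        ("2024-01-01 02:00:00",),
    ],
)

hourly_counts = conn.execute("""
    WITH RECURSIVE bins(hour) AS (
        -- four hourly bins: 00:00, 01:00, 02:00 and 03:00
        SELECT '2024-01-01 00:00:00'
        UNION ALL
        SELECT datetime(hour, '+1 hour')
        FROM bins
        WHERE hour < '2024-01-01 03:00:00'
    )
    SELECT b.hour,
           (SELECT COUNT(*)
            FROM squitters s
            WHERE s.parsed_time >= b.hour                      -- inclusive lower bound
              AND s.parsed_time < datetime(b.hour, '+1 hour')  -- exclusive upper bound
           ) AS cnt
    FROM bins b
    ORDER BY b.hour
""").fetchall()

for hour, cnt in hourly_counts:
    print(hour, cnt)
```

Because the comparison is made on the bare parsed_time column, an index on that column can be used for each bin, which is the point of avoiding date_trunc() on the column.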
You could take advantage of Common Table Expressions to divide your problem into small chunks:
WITH cte AS (
--First query your table
SELECT date_trunc('hour', parsed_time) AS sq_hour , count(*)
FROM squitters
WHERE parsed_time > date_trunc('hour', now()) - interval '2 day'
GROUP BY sq_hour
ORDER BY sq_hour DESC
), series AS (
--Create the series without the data returned from 1st query
SELECT
bins.series_hour,
0
FROM
generate_series(date_trunc('hour', now() - interval '2 day'), now(), '1 hour') bins(series_hour)
WHERE
series_hour not in (SELECT sq_hour FROM cte)
)
--Union the result
SELECT * FROM cte
UNION
SELECT * FROM series
ORDER BY 1

Left Outer Join with 2 columns

I'm having an issue with a PostgreSQL join. I might be approaching it incorrectly, but here's the scenario.
I have a table which contains two relevant columns: dates and months (along with other data). Each date should have the next 5 months associated with it, inclusive. This isn't always the case; I want to find when this isn't the case. Additionally, there is no guarantee that each date is in the table (for which there should be 5 months), but I have another table which contains these dates.
The table should contain (for one date):
However, due to many possibilities the table may only contain:
I have attempted to find the missing dates by generating a series for the expected dates and joining a series of months that should be associated with the date. I'm running into an issue because I need to join the tables on the two columns I need, so if one doesn't exist, it doesn't make it through the ON or WHERE clause.
I might need to approach this differently, but here is my current attempt.
SELECT
D.date, JOINMONTH::date, DT.month
FROM
day D
CROSS JOIN
generate_series(date_trunc('month', D.date),
date_trunc('month', D.date) + INTERVAL '4 months',
'1 month') AS JOINMONTH
LEFT JOIN
dates_table DT ON D.date = DT.date
AND JOINMONTH::date = DT.month
WHERE
D.date >= '2018-01-01';
What I would like to see:
EDIT:
This db-fiddle gives my full query. I omitted some of the where clause because I thought it was irrelevant, but it seems to be part of the problem. With this in mind, my selected answer solves my problem as represented by this structure/query, but @Gordon Linoff's answer is correct for the original question.
Is this what you are looking for?
SELECT D.date, JOINMONTH::date, DT.month
FROM day D CROSS JOIN LATERAL
generate_series(date_trunc('month', D.date),
date_trunc('month', D.date) + INTERVAL '4 months',
'1 month') AS JOINMONTH LEFT JOIN
dates_table DT
ON D.date = DT.date AND JOINMONTH::date = DT.month
WHERE D.date >= '2018-01-01' AND DT.date IS NULL;
SELECT D.date, JOINMONTH::date, DT.month
FROM day D
CROSS JOIN LATERAL
generate_series(date_trunc('month', D.date),
date_trunc('month', D.date) + INTERVAL '4 months',
'1 month') AS JOINMONTH
LEFT JOIN dates_table DT
ON D.date = DT.date
AND JOINMONTH::date = DT.month
AND DT.source = 'S1' AND
DT.tf = TRUE
WHERE
D.date = '2018-11-02';
I needed to move parts of my where clause into the join itself.

Redshift FULL OUTER JOIN doesn't output NULL

We have a 'numbers' table that holds the values 0-10000 in its single column 'n'.
We have tableX that has a calculated_at datetime and a term.
We are trying to fill the holes where tableX doesn't have matches for the given dates. However, this doesn't seem to yield NULL or 0 for the non-matching dates...
select term
, avg(total::float)
, date_trunc('day', series.date) as date1
, date_trunc('day', calculated_at) as date2
from (select
(current_timestamp - interval '1 day' * numbers.n)::date as date
from numbers) as series
full outer join terms
on series.date = date_trunc('day', calculated_at)
where series.date BETWEEN '2017-07-01' AND '2017-07-30'
AND (term in ('term111') or term is null)
group by term
, date_trunc('day', series.date)
, date_trunc('day', calculated_at)
order by date_trunc('day', series.date) asc
The full outer join is fine. The problem is the filters. These are really tricky with a full outer join. I would recommend:
select t.term, avg(total::float),
date_trunc('day', series.date) as date1,
date_trunc('day', calculated_at) as date2
from (select (current_timestamp - interval '1 day' * numbers.n)::date as date
from numbers
where (current_timestamp - interval '1 day' * numbers.n)::date BETWEEN '2017-07-01' AND '2017-07-30'
) series full outer join
(select t.*
from terms t
where term = 'term111'
) t
on series.date = date_trunc('day', t.calculated_at)
group by t.term, date_trunc('day', series.date), date_trunc('day', calculated_at)
order by date_trunc('day', series.date) asc;
My guess though is that a left join would do what you want. I doubt a full outer join is what you really intend. If you have doubts, ask another question and provide sample data and desired results.