Left Outer Join with 2 columns - sql

I'm having an issue with a PostgreSQL join. I might be approaching it incorrectly, but here's the scenario.
I have a table which contains two relevant columns: dates and months (along with other data). Each date should have the next 5 months associated with it, inclusive. This isn't always the case, and I want to find the dates for which it isn't. Additionally, there is no guarantee that each date is in the table at all (each one should have 5 months), but I have another table which contains these dates.
The table should contain (for one date):
However, due to many possibilities the table may only contain:
I have attempted to find the missing dates by generating a series for the expected dates and joining a series of months that should be associated with the date. I'm running into an issue because I need to join the tables on the two columns I need, so if one doesn't exist, it doesn't make it through the ON or WHERE clause.
I might need to approach this differently, but here is my current attempt.
SELECT
D.date, JOINMONTH::date, DT.month
FROM
day D
CROSS JOIN
generate_series(date_trunc('month', D.date),
date_trunc('month', D.date) + INTERVAL '4 months',
'1 month') AS JOINMONTH
LEFT JOIN
dates_table DT ON D.date = DT.date
AND JOINMONTH::date = DT.month
WHERE
D.date >= '2018-01-01';
What I would like to see:
EDIT:
This db-fiddle gives my full query. I omitted part of the WHERE clause because I thought it was irrelevant, but it turns out to be part of the problem. With this in mind, my selected answer solves the problem represented by this structure/query, but @Gordon Linoff's answer is correct for the original question.

Is this what you are looking for?
SELECT D.date, JOINMONTH::date, DT.month
FROM day D CROSS JOIN LATERAL
generate_series(date_trunc('month', D.date),
date_trunc('month', D.date) + INTERVAL '4 months',
'1 month') AS JOINMONTH LEFT JOIN
dates_table DT
ON D.date = DT.date AND JOINMONTH::date = DT.month
WHERE D.date >= '2018-01-01' AND DT.date IS NULL;

SELECT D.date, JOINMONTH::date, DT.month
FROM day D
CROSS JOIN LATERAL
generate_series(date_trunc('month', D.date),
date_trunc('month', D.date) + INTERVAL '4 months',
'1 month') AS JOINMONTH
LEFT JOIN dates_table DT
ON D.date = DT.date
AND JOINMONTH::date = DT.month
AND DT.source = 'S1' AND
DT.tf = TRUE
WHERE
D.date = '2018-11-02';
I needed to move parts of my where clause into the join itself.
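The difference between filtering in WHERE and filtering in ON can be reproduced on a tiny data set. The following is a minimal sketch, not the original schema: it uses Python's sqlite3 module as a stand-in for PostgreSQL, and the `expected` table is a hypothetical, pre-expanded replacement for the `day` CROSS JOIN `generate_series(...)` part of the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- hypothetical stand-in for day CROSS JOIN generate_series(...)
    CREATE TABLE expected (date TEXT, month TEXT);
    CREATE TABLE dates_table (date TEXT, month TEXT, source TEXT);
    INSERT INTO expected VALUES
        ('2018-11-02', '2018-11-01'),
        ('2018-11-02', '2018-12-01'),
        ('2018-11-02', '2019-01-01');
    INSERT INTO dates_table VALUES
        ('2018-11-02', '2018-11-01', 'S1'),
        ('2018-11-02', '2018-12-01', 'S2');
""")

# Filter on the right-hand table in WHERE: rows without an 'S1' match
# have dt.source NULL (or 'S2') and are discarded after the join.
filter_in_where = conn.execute("""
    SELECT e.month, dt.month
    FROM expected e
    LEFT JOIN dates_table dt
           ON e.date = dt.date AND e.month = dt.month
    WHERE dt.source = 'S1'
    ORDER BY e.month
""").fetchall()

# Same filter moved into ON: every expected row survives, and the
# missing pairs show up with NULL (None) on the right-hand side.
filter_in_on = conn.execute("""
    SELECT e.month, dt.month
    FROM expected e
    LEFT JOIN dates_table dt
           ON e.date = dt.date AND e.month = dt.month
          AND dt.source = 'S1'
    ORDER BY e.month
""").fetchall()

print(filter_in_where)
print(filter_in_on)
```

With the filter in WHERE only the matching month is returned; with the filter in ON all three expected months are returned, two of them with None in the second column, which is what makes the gaps visible.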

Related

SELF JOIN a query to obtain the number of reactivated users

Assume you have the table given below containing information on Facebook user logins. Write a query to obtain the number of reactivated users (which are dormant users who did not log in the previous month, who then logged in during the current month). Output the current month and number of reactivated users.
I have tried this question by first making an inner join combining a user's previous month to current month with this code.
WITH CTE as
(SELECT user_id,
EXTRACT(month from login_date) as current_month,
EXTRACT(month from login_date)-1 as prev_month
FROM user_logins)
SELECT a.user_id as user_id, a.current_month, a.prev_month,
b.user_id as prev_month_user
FROM CTE a LEFT JOIN CTE b
ON a.prev_month = b.current_month
My idea is to use a case statement
CASE WHEN a.user_id IN
(SELECT b.user_id
WHERE b.current_month = a.prev_month)
THEN 0 ELSE 1 END
But that gives me the wrong output for user_id 245 in current_month 4.
https://drive.google.com/file/d/1dOQQxaJWv7j7o7M1Q98nlj77KCzIHxKl/view?usp=sharing
How to fix this?
This gets you the first day of the current month:
select date_trunc('month', current_date)
You can add or subtract an interval of one month to get the previous or next month's starting date.
The complete query:
select *
from users
where user_id in
(
select user_id
from user_logins
where login_date >= date_trunc('month', current_date)
and login_date < date_trunc('month', current_date) + interval '1 month'
)
and user_id not in
(
select user_id
from user_logins
where login_date >= date_trunc('month', current_date) - interval '1 month'
and login_date < date_trunc('month', current_date)
)
Well, admittedly
and login_date < date_trunc('month', current_date) + interval '1 month'
is probably unnecessary here, because the table won't contain future logins :-) So, keep it or remove it, as you like.
If you want a self join, you should get distinct user/month pairs first. Then, as you want the user/month pairs for which no user/month-1 pair exists (a case where NOT EXISTS would be appropriate), your join must be an anti join. This means you outer join the user/month-1 pair and keep only the outer-joined rows, i.e. the non-matches.
WITH cte AS
(
SELECT DISTINCT user_id, DATE_TRUNC('month', login_date) AS month
FROM user_logins
)
SELECT mon.month, mon.user_id
FROM cte mon
LEFT JOIN cte prev ON prev.user_id = mon.user_id
AND prev.month = mon.month - INTERVAL '1 month'
WHERE prev.month IS NULL -- anti join
ORDER BY mon.month, mon.user_id;
I don't find anti joins very readable and would use NOT EXISTS instead. But that's a matter of personal preference, I guess. The query gives you all users who logged in during a month, but not during the previous month. You can of course limit this to the current month. Or you can aggregate per month and count. Or remove the WHERE clause and count repeating users vs. new ones (COUNT(*) = all that month, COUNT(prev.month) = all repeating users, COUNT(*) - COUNT(prev.month) = all new users).
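The counting variant described above (no WHERE clause, aggregate per month) can be sketched as follows. This is only an illustration under assumptions: it runs on SQLite through Python's sqlite3 module rather than PostgreSQL, so `date(login_date, 'start of month')` stands in for `DATE_TRUNC('month', login_date)` and `date(mon.month, '-1 month')` for `mon.month - INTERVAL '1 month'`. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_logins (user_id INTEGER, login_date TEXT);
    -- made-up sample data
    INSERT INTO user_logins VALUES
        (1, '2022-01-05'), (1, '2022-02-10'), (1, '2022-02-20'),
        (2, '2022-01-15'), (2, '2022-03-03'),
        (3, '2022-02-07'), (3, '2022-03-09');
""")

rows = conn.execute("""
    WITH cte AS (
        -- one row per user and month, however many logins there were
        SELECT DISTINCT user_id, date(login_date, 'start of month') AS month
        FROM user_logins
    )
    SELECT mon.month,
           COUNT(*)                     AS all_users,
           COUNT(prev.month)            AS repeating_users,
           COUNT(*) - COUNT(prev.month) AS not_active_prev_month
    FROM cte mon
    LEFT JOIN cte prev
           ON prev.user_id = mon.user_id
          AND prev.month = date(mon.month, '-1 month')
    GROUP BY mon.month
    ORDER BY mon.month
""").fetchall()

for row in rows:
    print(row)
```

COUNT(prev.month) skips the NULLs produced by the outer join, so the difference to COUNT(*) is the number of users without a login in the previous month.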
Well having said this, ... wasn't the task about reactivated users? Then you are looking for users who were active once, then paused a month, then became active again. Here is a simple query to get this for users who paused last month:
select user_id
from user_logins
group by user_id
having min(login_date) < date_trunc('month', current_date) - interval '1 month'
and max(login_date) >= date_trunc('month', current_date)
and count(*) filter (where login_date >= date_trunc('month', current_date) - interval '1 month'
and login_date < date_trunc('month', current_date)) = 0;

Filtering query with join is not returning the correct results

I am trying to filter my query that looks at payments data joining with another table (accounts table) as I want the data filtered by the condition accounts.provider = 'z'. However, the results I'm returned are exact multiples of the real figures (times 13, 20 etc) - different dates are a different multiple. The query is also really slow, so looking for advice to make it run quicker too.
SELECT
distinct on (t.day) t.day as day,
coalesce(collected_payments,0)
from
( SELECT day::date
FROM generate_series(timestamp '2017-03-13', current_date + interval '1 week', interval '1 day') day
) d
left JOIN (
SELECT date_trunc('day', t.payment_date)::date AS day,
sum(case when t.payment_amount > 0
and t.description not ilike '%credit%'
and t.state = 'success'
then t.payment_amount end) as collected_payments
FROM payments t
inner join payments p on p.payment_date = date_trunc('day', t.payment_date)::date
inner join accounts on accounts.id = p.account_id and accounts.provider = 'z'
where date_trunc('day', t.payment_date)::date <= current_date + interval '1 week'
and date_trunc('day', t.payment_date)::date >= current_date - interval'1 months'
GROUP BY 1
) t USING (day)
ORDER BY day desc

Postgresql Serial Daily Count of Records

I am trying to get a total count of records from 1st Jan till date, without skipping dates and returning 0 for dates that have no records.
I have tried the following: orders is an example table and orderdate is a timestamp column
with days as (
select generate_series(
date_trunc('day','2020-01-01'::timestamp),
date_trunc('day', now()),
'1 day'::interval
) as day
)
select
days.day,
count(orders.id)
from days
left outer join orders on date_trunc('day', orders.orderdate) = days.day
where orders.orders_type='C'
group by 1
The issue is that dates are skipped.
Yet if I execute:
select generate_series(
date_trunc('day','2020-01-01'::timestamp),
date_trunc('day', now()),
'1 day'::interval
)
I get the right series with no dates skipped.
The where condition should belong to the on clause of the left join, i.e. this:
from days
left outer join orders on date_trunc('day', orders.orderdate) = days.day
where orders.orders_type='C'
Should be written:
from days
left outer join orders
on date_trunc('day', orders.orderdate) = days.day
and orders.orders_type='C'
Notes:
you don't actually need a cte here, you can put generate_series() directly in the from clause
the date join condition can be optimized to an index-friendly expression that avoids date_trunc()
table aliases make the query easier to read and write
You could write the query as:
select d.day, count(o.id)
from generate_series(date '2020-01-01', now(), '1 day') as d(day)
left outer join orders o
on o.orderdate >= d.day
and o.orderdate < d.day + '1 day'::interval
and o.orders_type='C'
group by d.day
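To check that days without matching orders come back with a count of 0, here is a small sketch of the same join shape. It is an approximation under assumptions: it uses SQLite via Python's sqlite3 module instead of PostgreSQL, a recursive CTE replaces generate_series(), and the order rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, orderdate TEXT, orders_type TEXT);
    -- invented sample rows
    INSERT INTO orders (orderdate, orders_type) VALUES
        ('2020-01-01 09:30:00', 'C'),
        ('2020-01-01 17:45:00', 'C'),
        ('2020-01-02 11:00:00', 'X'),
        ('2020-01-03 08:15:00', 'C');
""")

rows = conn.execute("""
    WITH RECURSIVE days(day) AS (
        -- stand-in for generate_series(): 2020-01-01 .. 2020-01-04
        SELECT '2020-01-01'
        UNION ALL
        SELECT date(day, '+1 day') FROM days WHERE day < '2020-01-04'
    )
    SELECT d.day, COUNT(o.id)
    FROM days d
    LEFT JOIN orders o
           ON o.orderdate >= d.day                     -- half-open range,
          AND o.orderdate <  date(d.day, '+1 day')     -- no function on the column
          AND o.orders_type = 'C'                      -- filter stays in ON
    GROUP BY d.day
    ORDER BY d.day
""").fetchall()

for row in rows:
    print(row)
```

Because COUNT(o.id) ignores the NULL ids of unmatched days, 2020-01-02 (only an 'X' order) and 2020-01-04 (no orders) are reported with 0 instead of being dropped.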

Redshift FULL OUTER JOIN doesn't output NULL

We have a 'numbers' table that holds the values 0-10000 in its single column 'n'.
We have tableX, which has a calculated_at datetime and a term.
We are trying to fill the holes where tableX doesn't have matches for the given dates. However, this doesn't seem to yield NULL or 0 for the non-matching dates...
select term
, avg(total::float)
, date_trunc('day', series.date) as date1
, date_trunc('day', calculated_at) as date2
from (select
(current_timestamp - interval '1 day' * numbers.n)::date as date
from numbers) as series
full outer join terms
on series.date = date_trunc('day', calculated_at)
where series.date BETWEEN '2017-07-01' AND '2017-07-30'
AND (term in ('term111') or term is null)
group by term
, date_trunc('day', series.date)
, date_trunc('day', calculated_at)
order by date_trunc('day', series.date) asc
The full outer join is fine. The problem is the filters. These are really tricky with a full outer join. I would recommend:
select t.term, avg(total::float),
date_trunc('day', series.date) as date1,
date_trunc('day', calculated_at) as date2
from (select (current_timestamp - interval '1 day' * numbers.n)::date as date
from numbers
where (current_timestamp - interval '1 day' * numbers.n)::date BETWEEN '2017-07-01' AND '2017-07-30'
) series full outer join
(select *
from terms
where term = 'term111'
) t
on series.date = date_trunc('day', t.calculated_at)
group by t.term, date_trunc('day', series.date), date_trunc('day', calculated_at)
order by date_trunc('day', series.date) asc;
My guess though is that a left join would do what you want. I doubt a full outer join is what you really intend. If you have doubts, ask another question and provide sample data and desired results.

Postgres - Fast way to sum over rows from last day of month

I want to query a table and sum a column for all of the rows from the last day of the month.
Let's use the following table as an example:
CREATE TABLE example(dt date, value int)
(The real table has many more columns and is relatively large, and the real query is more complicated)
I have the following query:
SELECT dt, SUM(value)
FROM example
WHERE dt IN (SELECT DISTINCT
date_trunc('MONTH', generate_series('2012-01-01'::date,
'2016-12-01'::date,
interval '1 day') + INTERVAL '1 MONTH - 1 day')::date)
GROUP BY dt
It runs in about ~2 seconds on my real table.
However, if I generate the full list of end-of-month days in my range and parameterise the query like so:
SELECT dt, SUM(value)
FROM example
WHERE dt IN ('2012-01-31', ...)
GROUP BY dt
It's much quicker, ~750ms.
I would prefer not to generate the dates and pass them through to the query like that, is there a way I can do this entirely in SQL and make it as fast as the latter version?
The sub-select is needlessly complicated. It can be simplified to:
SELECT dt, SUM(value)
FROM example
WHERE dt IN (SELECT (d + interval '1 month - 1 day')::date -- last day of each month
from generate_series('2012-01-01'::date, '2016-12-01'::date, interval '1 month') dates (d))
GROUP BY dt; --<< the group by is necessary
Maybe that speeds up the query.
You can also try to put the date generation into a CTE:
with dates (d) as (
SELECT (t + interval '1 month - 1 day')::date -- last day of each month
from generate_series('2012-01-01'::date, '2016-12-01'::date, interval '1 month') t
)
SELECT dt, SUM(value)
FROM example
WHERE dt IN ( select d from dates)
GROUP BY dt;
Sometimes doing a JOIN is also more efficient:
with dates (d) as (
SELECT (t + interval '1 month - 1 day')::date -- last day of each month
from generate_series('2012-01-01'::date, '2016-12-01'::date, interval '1 month') t
)
SELECT dt, SUM(value)
FROM example
JOIN dates on example.dt = dates.d
GROUP BY dt;
The performance problem in your query comes from the fact that you are generating a daily series. Change it to monthly, remove the DISTINCT, and add a GROUP BY:
select dt, sum(value)
from
example
inner join (
select date_trunc('month', dt) + interval '1 month - 1 day' as dt
from generate_series('2012-01-01'::date, '2016-12-01', '1 month') gs (dt)
) d using (dt)
group by dt
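The month-end arithmetic (first day of the month, plus one month, minus one day) can be verified on a few rows. This is a rough sketch under assumptions, run on SQLite via Python's sqlite3 module rather than PostgreSQL: a recursive CTE replaces the monthly generate_series(), `date(first_day, '+1 month', '-1 day')` replaces `+ interval '1 month - 1 day'`, and the range is shortened to four months with invented values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE example (dt TEXT, value INTEGER);
    -- invented sample rows; only some fall on a month end
    INSERT INTO example VALUES
        ('2012-01-31', 5), ('2012-01-31', 7),
        ('2012-02-01', 100),
        ('2012-02-29', 3),
        ('2012-03-15', 50),
        ('2012-04-30', 1);
""")

rows = conn.execute("""
    WITH RECURSIVE months(first_day) AS (
        -- monthly series: 2012-01-01 .. 2012-04-01
        SELECT '2012-01-01'
        UNION ALL
        SELECT date(first_day, '+1 month') FROM months
        WHERE first_day < '2012-04-01'
    ),
    month_ends(dt) AS (
        -- first of month + 1 month - 1 day = last day of that month
        SELECT date(first_day, '+1 month', '-1 day') FROM months
    )
    SELECT e.dt, SUM(e.value)
    FROM example e
    JOIN month_ends m ON m.dt = e.dt
    GROUP BY e.dt
    ORDER BY e.dt
""").fetchall()

for row in rows:
    print(row)
```

Only one series row per month is generated, so no DISTINCT is needed, and the leap-day month end 2012-02-29 is handled by the date arithmetic itself.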