Redshift FULL OUTER JOIN doesn't output NULL - sql

We have a 'numbers' table that holds the values 0-10000 in its single column 'n'.
We have tableX, which has a calculated_at datetime and a term.
We are trying to fill the holes where tableX doesn't have matches on the given dates. HOWEVER, this doesn't seem to yield NULL or 0 for the non-matching dates...
select term
, avg(total::float)
, date_trunc('day', series.date) as date1
, date_trunc('day', calculated_at) as date2
from (select
(current_timestamp - interval '1 day' * numbers.n)::date as date
from numbers) as series
full outer join terms
on series.date = date_trunc('day', calculated_at)
where series.date BETWEEN '2017-07-01' AND '2017-07-30'
AND (term in ('term111') or term is null)
group by term
, date_trunc('day', series.date)
, date_trunc('day', calculated_at)
order by date_trunc('day', series.date) asc

The full outer join is fine. The problem is the filters: the WHERE clause is applied after the join, so it throws away the NULL-extended rows that the outer join produces. Filters are really tricky with a full outer join; I would recommend moving them into the subqueries:
select t.term, avg(t.total::float),
       date_trunc('day', series.date) as date1,
       date_trunc('day', t.calculated_at) as date2
from (select (current_timestamp - interval '1 day' * numbers.n)::date as date
      from numbers
      where (current_timestamp - interval '1 day' * numbers.n)::date BETWEEN '2017-07-01' AND '2017-07-30'
     ) series full outer join
     (select *
      from terms
      where term = 'term111'
     ) t
     on series.date = date_trunc('day', t.calculated_at)
group by t.term, date_trunc('day', series.date), date_trunc('day', t.calculated_at)
order by date_trunc('day', series.date) asc;
My guess, though, is that a left join would do what you want; I doubt a full outer join is what you really intend. If you have doubts, ask another question and provide sample data and desired results.
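For reference, here is a minimal sketch of that left-join variant (same numbers and terms tables as above, assuming term = 'term111' is the only term of interest); the date series drives the output, so days without a match still appear, with a NULL average:
select date_trunc('day', series.date) as date1,
       t.term,
       avg(t.total::float) as avg_total
from (select (current_timestamp - interval '1 day' * numbers.n)::date as date
      from numbers
      where (current_timestamp - interval '1 day' * numbers.n)::date BETWEEN '2017-07-01' AND '2017-07-30'
     ) series
left join terms t                                 -- left join keeps every day from the series
  on date_trunc('day', t.calculated_at) = series.date
 and t.term = 'term111'                           -- term filter lives in the ON clause, not WHERE
group by date_trunc('day', series.date), t.term
order by date_trunc('day', series.date);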

Related

Filtering query with join is not returning the correct results

I am trying to filter a query over payments data by joining to another table (accounts), because I want the data restricted to accounts.provider = 'z'. However, the results I get back are exact multiples of the real figures (times 13, times 20, etc.), and different dates show a different multiple. The query is also really slow, so I'm looking for advice on making it run quicker too.
SELECT
distinct on (t.day) t.day as day,
coalesce(collected_payments,0)
from
( SELECT day::date
FROM generate_series(timestamp '2017-03-13', current_date + interval '1 week', interval '1 day') day
) d
left JOIN (
SELECT date_trunc('day', t.payment_date)::date AS day,
sum(case when t.payment_amount > 0
and t.description not ilike '%credit%'
and t.state = 'success'
then t.payment_amount end) as collected_payments
FROM payments t
inner join payments p on p.payment_date = date_trunc('day', t.payment_date)::date
inner join accounts on accounts.id = p.account_id and accounts.provider = 'z'
where date_trunc('day', t.payment_date)::date <= current_date + interval '1 week'
and date_trunc('day', t.payment_date)::date >= current_date - interval'1 months'
GROUP BY 1
) t USING (day)
ORDER BY day desc
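Looking at the query, the multiplied figures are most likely caused by the extra self-join on payments: every payment row for a day is matched against every other payment on that day, so the daily sum is repeated once per match. A minimal sketch of a rewrite, assuming payments itself has the account_id column that the inner join to accounts implies, joins accounts directly and drops the self-join:
SELECT d.day,
       coalesce(t.collected_payments, 0) AS collected_payments
FROM (SELECT day::date
      FROM generate_series(timestamp '2017-03-13', current_date + interval '1 week', interval '1 day') day
     ) d
LEFT JOIN (
      SELECT p.payment_date::date AS day,
             sum(p.payment_amount) AS collected_payments
      FROM payments p
      JOIN accounts a ON a.id = p.account_id      -- join accounts once, directly on the payment row
                     AND a.provider = 'z'
      WHERE p.payment_amount > 0
        AND p.description NOT ILIKE '%credit%'
        AND p.state = 'success'
        AND p.payment_date >= current_date - interval '1 month'
        AND p.payment_date < current_date + interval '1 week'
      GROUP BY 1
     ) t USING (day)
ORDER BY day DESC;
The range conditions on payment_date are also sargable now (no date_trunc() around the column), which should help with the speed as well.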

Postgresql Serial Daily Count of Records

I am trying to get a count of records per day from 1st Jan till date, without skipping dates and returning 0 for dates that have no records.
I have tried the following (orders is an example table and orderdate is a timestamp column):
with days as (
select generate_series(
date_trunc('day','2020-01-01'::timestamp),
date_trunc('day', now()),
'1 day'::interval
) as day
)
select
days.day,
count(orders.id)
from days
left outer join orders on date_trunc('day', orders.orderdate) = days.day
where orders.orders_type='C'
group by 1
The issue is that dates are skipped.
Yet if I execute:
select generate_series(
date_trunc('day','2020-01-01'::timestamp),
date_trunc('day', now()),
'1 day'::interval
)
I get the right series with no dates skipped.
The where condition belongs in the on clause of the left join, i.e. this:
from days
left outer join orders on date_trunc('day', orders.orderdate) = days.day
where orders.orders_type='C'
Should be written:
from days
left outer join orders
on date_trunc('day', orders.orderdate) = days.day
and orders.orders_type='C'
Notes:
- you don't actually need a CTE here; you can put generate_series() directly in the from clause
- the date join condition can be optimized to an index-friendly expression that avoids date_trunc()
- table aliases make the query easier to read and write
You could write the query as:
select d.day, count(o.id)
from generate_series(date '2020-01-01', now(), '1 day') as d(day)
left outer join orders o
on o.orderdate >= d.day
and o.orderdate < d.day + '1 day'::interval
and o.orders_type='C'
group by d.day
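As a usage note on the "index-friendly" point above: the range condition on orderdate can only pay off if a suitable index exists. Assuming none is in place yet, a composite B-tree index along these lines (the name is arbitrary) would serve both the type filter and the date range:
create index orders_type_orderdate_idx on orders (orders_type, orderdate);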

Speed up query where results with count(*) = 0 are included

I have a table squitters with, amongst others, a column parsed_time. I want to know the number of records per hour for the last two days and used this query:
SELECT date_trunc('hour', parsed_time) AS hour , count(*)
FROM squitters
WHERE parsed_time > date_trunc('hour', now()) - interval '2 day'
GROUP BY hour
ORDER BY hour DESC;
This works, but hours with zero records do not appear in the result. I want hours with zero records to appear in the result as well, with a count equal to zero, so I wrote this query using the generate_series function:
SELECT bins.hour, count(squitters.parsed_time)
FROM generate_series(date_trunc('hour', now() - interval '2 day'), now(), '1 hour') bins(hour)
LEFT OUTER JOIN squitters ON bins.hour = date_trunc('hours', squitters.parsed_time)
GROUP BY bins.hour
ORDER BY bins.hour DESC;
This works, and the results do include hour bins with a count of zero, but it is considerably slower.
How can I have the speed of the first query with the count=zero results of the second query?
(BTW, there is an index on parsed_time.)
You could try changing the join condition so that no date function is applied to the column parsed_time:
SELECT b.hour, COUNT(s.parsed_time) cnt
FROM generate_series(date_trunc('hour', now() - interval '2 day'), now(), '1 hour') b(hour)
LEFT OUTER JOIN squitters s
ON s.parsed_time >= b.hour
AND s.parsed_time < b.hour + interval '1 hour'
GROUP BY b.hour
ORDER BY b.hour DESC;
Alternatively, you could try a correlated subquery (or a lateral join) instead of a left join; this avoids the need for outer aggregation:
SELECT
b.hour,
(
SELECT COUNT(*)
FROM squitters s
WHERE s.parsed_time >= b.hour AND s.parsed_time < b.hour + interval '1 hour'
) cnt
FROM generate_series(date_trunc('hour', now() - interval '2 day'), now(), '1 hour') b(hour)
ORDER BY b.hour desc
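The lateral join mentioned above would look roughly like this (a sketch, using the same tables and columns):
SELECT b.hour, c.cnt
FROM generate_series(date_trunc('hour', now() - interval '2 day'), now(), '1 hour') b(hour)
LEFT JOIN LATERAL (
    SELECT COUNT(*) AS cnt
    FROM squitters s
    WHERE s.parsed_time >= b.hour
      AND s.parsed_time < b.hour + interval '1 hour'
) c ON true
ORDER BY b.hour DESC;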
You could take advantage of Common Table Expressions to divide your problem into small chunks:
WITH cte AS (
--First query your table
SELECT date_trunc('hour', parsed_time) AS sq_hour , count(*)
FROM squitters
WHERE parsed_time > date_trunc('hour', now()) - interval '2 day'
GROUP BY sq_hour
ORDER BY sq_hour DESC
), series AS (
--Create the series without the data returned from 1st query
SELECT
bins.series_hour,
0
FROM
generate_series(date_trunc('hour', now() - interval '2 day'), now(), '1 hour') bins(series_hour)
WHERE
series_hour not in (SELECT sq_hour FROM cte)
)
--Union the result
SELECT * FROM cte
UNION
SELECT * FROM series
ORDER BY 1

Netezza range of dates

I want to reduce the manual labour in a specific query. I hope I can phrase this correctly.
In Netezza, I want to generate a (date) value and run the query for every different value specified.
What I want to do is replace all the unions with one query.
SELECT DISTINCT COUNT(DISTINCT a.x) AS NO_OF_X, a.column1, 'JAN 2019'
FROM my_table a
WHERE 1=1
AND current_date BETWEEN a.date_from and a.date_to
GROUP BY 2,3
UNION ALL
SELECT DISTINCT COUNT(DISTINCT a.x) AS NO_OF_X, a.column1, 'DEC 2018'
FROM my_table a
WHERE 1=1
AND '2018-12-31' BETWEEN a.date_from and a.date_to
GROUP BY 2,3
UNION ALL
SELECT DISTINCT COUNT(DISTINCT a.x) AS NO_OF_X, a.column1, 'NOV 2018'
FROM my_table a
WHERE 1=1
AND '2018-11-30' BETWEEN a.date_from and a.date_to
GROUP BY 2,3
What I want to do is something like this:
SELECT DISTINCT COUNT(DISTINCT a.x) AS NO_OF_X, a.column1, last_day( date ) as "MONTH"
FROM my_table a
WHERE 1=1
AND /*run the query for all last_days in a range */
GROUP BY 2,3
Is this possible? I tried to make a CTE, but it is really important to get results for each specific last day of the month, because our data warehouse is designed to store different time slices for each transaction, and I want to get only transactions with time slices on a specific last_day().
Cheers.
You need to generate a list of dates. Here is one method:
select to_char(m.dte, 'MMM YYYY'), t.column1,
count(distinct a.x) AS NO_OF_X
from (SELECT current_date as dte UNION ALL
SELECT date_trunc('month', current_date) - interval '1 day' as dte UNION ALL
SELECT date_trunc('month', current_date) - interval '1 day' - interval '1 month' as dte UNION ALL
SELECT date_trunc('month', current_date) - interval '1 day' - interval '2 month' as dte
) m left join
my_table t
on m.dte between t.date_from and t.date_to
group by to_char(m.dte, 'MMM YYYY'), t.column1
order by min(m.dte), t.column1;
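If the month-end list needs to cover a longer range than a few hand-written UNION ALL branches, it could be generated instead. A sketch, assuming a small numbers table (values 0, 1, 2, ...) is available and that the ADD_MONTHS() and LAST_DAY() functions from the Netezza SQL extensions toolkit are enabled:
select last_day(add_months(current_date, -n.n)) as month_end
from numbers n
where n.n between 0 and 11   -- the last 12 month-ends; widen the range as needed
order by month_end;
That derived list can then take the place of the UNION ALL block in the subquery m above.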

Sum of values from 3rd previous month

I'm having difficulty grabbing rows from December (anything from the 3rd previous month). I'm attempting to count the number of products sold within a certain time period. This is my current query:
SELECT
    a.id,
    a.default_code,
    (
        SELECT SUM(product_uom_qty) AS "Total Sold"
        FROM sale_order_line c
        WHERE c.product_id = a.id
    ),
    (
        SELECT SUM(product_uom_qty) AS "Month 3"
        FROM sale_order_line c
        WHERE c.product_id = a.id
          AND MONTH(c.create_date) = MONTH(CURRENT_DATE - INTERVAL '3 Months')
          AND YEAR(c.create_date) = YEAR(CURRENT_DATE - INTERVAL '3 Months')
    )
FROM product_product a
This is what the DB looks like:
sale_order_line
product_id product_uom_qty create_date
33 230 2014-07-01 16:47:45.294313
product_product
id default_code
33 WHDXEB33
Here's the error I'm receiving:
ERROR: function month(timestamp without time zone) does not exist
LINE 21: MONTH(c.create_date) = MONTH(CURRENT_DATE - INTERVAL
Any help pointing me in the right direction?
Use date_trunc() to calculate timestamp bounds:
SELECT id, default_code
, (SELECT SUM(product_uom_qty)
FROM sale_order_line c
WHERE c.product_id = a.id
) AS "Total Sold"
, (SELECT SUM(product_uom_qty)
FROM sale_order_line c
WHERE c.product_id = a.id
AND c.create_date >= date_trunc('month', now()) - interval '2 month'
AND c.create_date < date_trunc('month', now()) - interval '1 month'
) AS "Month 3"
FROM product_product a;
To get December (now being February), use these expressions:
AND c.create_date >= date_trunc('month', now()) - interval '2 month'
AND c.create_date < date_trunc('month', now()) - interval '1 month'
date_trunc('month', now()) yields '2015-02-01 00:00'; after subtracting 2 months, you get '2014-12-01 00:00'. So "3 months" can be deceiving.
Also, be sure to use sargable expressions like demonstrated for faster performance and to allow index usage.
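A quick way to double-check the bounds (assuming the current date falls in February 2015, as in the example):
SELECT date_trunc('month', now()) - interval '2 month' AS lower_bound   -- 2014-12-01 00:00
     , date_trunc('month', now()) - interval '1 month' AS upper_bound;  -- 2015-01-01 00:00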
Alternatives
Depending on your actual DB design and data distribution, this may be faster:
SELECT a.id, a.default_code, c."Total Sold", c."Month 3"
FROM product_product a
LEFT JOIN (
SELECT product_id AS id
, SUM(product_uom_qty) AS "Total Sold"
, SUM(CASE WHEN c.create_date >= date_trunc('month', now()) - interval '2 month'
AND c.create_date < date_trunc('month', now()) - interval '1 month'
THEN product_uom_qty ELSE 0 END) AS "Month 3"
FROM sale_order_line c
GROUP BY 1
) c USING (id);
Since you are selecting all rows, this is probably faster than correlated subqueries. And while we are at it: aggregating before you join is cheaper, too.
When selecting only a single product or a few, this may actually be slower, though! Compare:
Aggregate a single column in query with many columns
Optimize GROUP BY query to retrieve latest record per user
Or with the FILTER clause in Postgres 9.4+:
...
, SUM(product_uom_qty)
FILTER (WHERE c.create_date >= date_trunc('month', now()) - interval '2 month'
AND c.create_date < date_trunc('month', now()) - interval '1 month'
) AS "Month 3"
...
Details:
Select multiple row values into single row with multi-table clauses
This will avoid the costly correlated subqueries:
select
pp.id, pp.default_code,
sum(sol.product_uom_qty) as "Total Sold",
sum((
date_trunc('month', sol.create_date) =
date_trunc('month', current_date) - interval '3 months'
)::int * sol.product_uom_qty
) as "Month 3"
from
product_product pp
left join
sale_order_line sol on pp.id = sol.product_id
group by 1, 2
The cast from boolean to integer results in 0 or 1, which can conveniently be multiplied by the value to be summed.
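A tiny illustration of the trick with hypothetical values:
select (true)::int  * 230 as counted,      -- 1 * 230 = 230: the row's quantity is summed
       (false)::int * 230 as not_counted;  -- 0 * 230 = 0: the row contributes nothing
When the left join finds no sale row at all, the comparison is NULL, NULL * product_uom_qty is NULL, and sum() simply ignores it.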