PostgreSQL left join date_trunc with default values - sql

I have 3 tables that I query under different conditions. I have from and to params, which I use to define the time range in which I look for data in those tables.
For instance, if from is '2020-07-01' and to is '2020-08-01', I expect to receive the rows of the tables grouped by week. If a week has no records I want to return 0, and if several tables have records for the same week I'd like to sum them.
Currently I have this:
SELECT d.day, COALESCE(t.total, 0)
FROM (
SELECT day::date
FROM generate_series(timestamp '2020-07-01',
timestamp '2020-08-01',
interval '1 week') day
) d
LEFT JOIN (
SELECT date AS day,
SUM(total) AS total
FROM table1
WHERE id = '1'
AND date BETWEEN '2020-07-01' AND '2020-08-01'
GROUP BY day
) t USING (day)
ORDER BY d.day;
I'm generating a series of dates one week apart, and on top of that I'm adding a left join. For some reason it only works if the dates match exactly; otherwise COALESCE(t.total, 0) returns 0 even if SUM(total) for that week is not 0.
I apply other left joins with other tables in the same query in the same way, so I'm running into the same problem there.

Please see if this works for you. Whenever you find yourself aggregating more than once, ask yourself whether it is necessary.
Rather than try to match on discrete days, use time ranges.
with limits as (
select '2020-07-01'::timestamp as dt_start,
'2020-08-01'::timestamp as dt_end
), weeks as (
SELECT x.day::date as day, least(x.day::date + 7, dt_end::date) as day_end
FROM limits l
CROSS JOIN LATERAL
generate_series(l.dt_start, l.dt_end, interval '1 week') as x(day)
WHERE x.day::date != least(x.day::date + 7, dt_end::date)
), t1 as (
select w.day,
sum(coalesce(t.total, 0)) as t1total
from weeks w
left join table1 t
on t.id = 1
and t.date >= w.day
and t.date < w.day_end
group by w.day
), t2 as (
select w.day,
sum(coalesce(t.sum_measure, 0)) as t2total
from weeks w
left join table2 t
on t.something = 'whatever'
and t.date >= w.day
and t.date < w.day_end
group by w.day
)
select t1.day,
t1.t1total,
t2.t2total
from t1
join t2 on t2.day = t1.day;
You can keep adding tables like that with CTEs.
My earlier example with multiple left join was bad because it blows out the rows due to a lack of join conditions between the left-joined tables.
There is an interesting corner case for e.g. 2019-02-01 to 2019-03-01 which returns an empty interval as the last week. I have updated to filter that out.
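To make the range logic above easier to check by hand, here is a minimal Python sketch of what the weeks CTE and the range join compute. The function names and the (date, total) row shape are my own assumptions for illustration, not part of the original query. Each week is a half-open range [day, day_end), the last range is clipped to the end date, and empty ranges are dropped, which covers the 2019-02-01 to 2019-03-01 corner case.

```python
from datetime import date, timedelta


def week_ranges(start, end):
    """Return half-open (day, day_end) ranges, one per week, clipped to `end`."""
    ranges = []
    day = start
    while day <= end:
        day_end = min(day + timedelta(days=7), end)
        # Skip the empty range produced when `end` falls exactly on a week boundary.
        if day != day_end:
            ranges.append((day, day_end))
        day += timedelta(days=7)
    return ranges


def weekly_totals(rows, start, end):
    """Sum `total` per week; weeks without rows get 0 (like the LEFT JOIN + COALESCE)."""
    return [
        (lo, sum(total for d, total in rows if lo <= d < hi))
        for lo, hi in week_ranges(start, end)
    ]


if __name__ == "__main__":
    rows = [(date(2020, 7, 3), 5), (date(2020, 7, 7), 2), (date(2020, 7, 8), 1)]
    for week_start, total in weekly_totals(rows, date(2020, 7, 1), date(2020, 8, 1)):
        print(week_start, total)
```

Because the ranges are half-open, a row dated exactly on a week boundary is counted in one week only, which is the same reason the SQL uses >= and < rather than BETWEEN.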

Related

How can I get the count to display zero for months that have no records

I am pulling transactions that happen on an attribute (attribute ID 4205 in table 1235) by the date that a change happened to the attribute (found in the History table) and counting up the number of changes that occurred by month. So far I have
SELECT TOP(100) PERCENT MONTH(H.transactiondate) AS Month, COUNT(*) AS Count
FROM hsi.rmObjectInstance1235 AS O LEFT OUTER JOIN
hsi.rmObjectHistory AS H ON H.objectID = O.objectID
WHERE H.attributeid = 4205 AND YEAR(H.transactiondate) = '2020'
GROUP BY MONTH(H.transactiondate)
And I get
Month Count
---------------
1 9
2 4
3 11
4 14
5 1
I need to display a zero for months June - December instead of excluding those months.
One option uses a recursive query to generate the dates, and then brings the original query with a left join:
with all_dates as (
select cast('2020-01-01' as date) dt
union all
select dateadd(month, 1, dt) from all_dates where dt < '2020-12-01'
)
select
month(d.dt) as month,
count(h.objectid) as cnt
from all_dates d
left join hsi.rmobjecthistory as h
on h.attributeid = 4205
and h.transactiondate >= d.dt
and h.transactiondate < dateadd(month, 1, d.dt)
and exists (select 1 from hsi.rmObjectInstance1235 o where o.objectID = h.objectID)
group by month(d.dt)
I am quite unclear about the intent of the table hsi.rmObjectInstance1235 in the query, as none of its columns are used in the select and group by clauses. If it is meant to filter hsi.rmobjecthistory by objectID, then you can rewrite this as an exists condition, as shown in the above solution. You might even be able to remove that part of the query altogether.
Also, note that:
top without order by does not really make sense
top (100) percent is a no-op
As a consequence, I removed that row-limiting clause.
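If you want to try the recursive-CTE-plus-left-join pattern without a SQL Server instance, the following is a rough sketch using Python's built-in sqlite3 module. The table history and its columns object_id and changed_on are hypothetical stand-ins for hsi.rmObjectHistory, and the date arithmetic uses SQLite's date() function rather than dateadd(), so treat it as an illustration of the technique only.

```python
import sqlite3

MONTHLY_SQL = """
WITH RECURSIVE all_dates(dt) AS (
    SELECT date(:start)
    UNION ALL
    SELECT date(dt, '+1 month')
    FROM all_dates
    WHERE dt < date(:start, '+11 months')
)
SELECT CAST(strftime('%m', d.dt) AS INTEGER) AS month,
       COUNT(h.object_id) AS cnt          -- counts only matched rows
FROM all_dates d
LEFT JOIN history h
       ON h.changed_on >= d.dt
      AND h.changed_on < date(d.dt, '+1 month')
GROUP BY d.dt
ORDER BY d.dt
"""


def make_demo_db():
    """Build an in-memory database with a hypothetical history table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE history (object_id INTEGER, changed_on TEXT)")
    conn.executemany(
        "INSERT INTO history VALUES (?, ?)",
        [(1, "2020-01-15"), (2, "2020-01-31"), (3, "2020-03-01")],
    )
    return conn


def monthly_counts(conn, year):
    """Return (month, count) for all 12 months of `year`, zeros included."""
    return conn.execute(MONTHLY_SQL, {"start": f"{year:04d}-01-01"}).fetchall()


if __name__ == "__main__":
    for month, cnt in monthly_counts(make_demo_db(), 2020):
        print(month, cnt)
```

The dates are stored as ISO-8601 text here, so plain string comparison orders them correctly; that is a property of this sketch, not of the original schema.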

How to show a row for the dates not in records of a table as zero

I am trying to show the records as zero for the dates not found.
Below is my basic query:
Select date_col, count(distinct file_col), count(*) from tab1
where date_col between 'date1' and 'date2'
group by date_col;
The output contains only one date.
I want all the dates in the range to be shown in the result.
The general way to deal with this type of problem is to use something called a calendar table. This calendar table contains all the dates which you want to appear in your report. We can create a crude one by using a subquery:
SELECT
t1.date,
COUNT(DISTINCT t2.file_col) AS d_cnt,
COUNT(t2.file_col) AS cnt
FROM
(
SELECT '2018-06-01' AS date UNION ALL
SELECT '2018-06-02' UNION ALL
...
) t1
LEFT JOIN tab1 t2
ON t1.date = t2.date_col
WHERE
t1.date BETWEEN 'date1' and 'date2'
GROUP BY
t1.date;
Critical here is that we left join the calendar table to your table containing the actual data, but we count a column in your data table. This means that zero would be reported for any day not having matching data.
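The difference between counting a data-table column and counting rows can be seen in a small experiment. This is a sketch using Python's sqlite3 module with a made-up calendar table and three sample rows (all names and values here are assumptions for the demo). COUNT(t.file_col) ignores the NULLs that the left join produces for unmatched dates, while COUNT(*) counts the calendar row itself and so reports 1.

```python
import sqlite3


def count_per_day(conn):
    """Return (date, distinct files, COUNT(column), COUNT(*)) per calendar day."""
    return conn.execute(
        """
        SELECT c.date,
               COUNT(DISTINCT t.file_col) AS d_cnt,
               COUNT(t.file_col)          AS cnt,       -- 0 when nothing matches
               COUNT(*)                   AS row_cnt    -- 1 when nothing matches
        FROM calendar c
        LEFT JOIN tab1 t ON c.date = t.date_col
        GROUP BY c.date
        ORDER BY c.date
        """
    ).fetchall()


def make_demo_db():
    """In-memory database with a three-day calendar and sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE calendar (date TEXT PRIMARY KEY);
        CREATE TABLE tab1 (date_col TEXT, file_col TEXT);
        INSERT INTO calendar VALUES ('2018-06-01'), ('2018-06-02'), ('2018-06-03');
        INSERT INTO tab1 VALUES
            ('2018-06-01', 'a.csv'),
            ('2018-06-01', 'a.csv'),
            ('2018-06-03', 'b.csv');
        """
    )
    return conn


if __name__ == "__main__":
    for row in count_per_day(make_demo_db()):
        print(row)
```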
If you are using PostgreSQL, you can generate the series of dates you need with generate_series.
SELECT
t1.date,
COUNT(DISTINCT t2.file_col) AS d_cnt,
COUNT(t2.file_col) AS cnt
FROM
(
SELECT to_char('?'::date + (interval '1 month' * g), 'yyyy-mm') AS date
FROM generate_series(0, 11) AS g
) t1
LEFT JOIN tab1 t2
ON t1.date = to_char(t2.date_col,'yyyy-mm')
WHERE
t1.date BETWEEN 'date1' and 'date2'
GROUP BY
t1.date;
This example shows how to generate a sequence of months.

SQL - Unequal left join BigQuery

New here. I am trying to get the daily and weekly active users over time. Users have 30 days before they are considered inactive. My goal is to create graphs that can be split by user_id to show cohorts, regions, categories, etc.
I have created a date table to get every day for the time period and I have the simplified orders table with the base info that I need to calculate this.
I am trying to do a Left Join to get the status by date using the following SQL Query:
WITH daily_use AS (
SELECT
__key__.id AS user_id
, DATE_TRUNC(date(placeOrderDate), day) AS activity_date
FROM `analysis.Order`
where isBuyingGroupOrder = TRUE
AND testOrder = FALSE
GROUP BY 1, 2
),
dates AS (
SELECT DATE_ADD(DATE "2016-01-01", INTERVAL d.d DAY) AS date
FROM
(
SELECT ROW_NUMBER() OVER(ORDER BY __key__.id) -1 AS d
FROM `analysis.Order`
ORDER BY __key__.id
LIMIT 1096
) AS d
ORDER BY 1 DESC
)
SELECT
daily_use.user_id
, wd.date AS date
, MIN(DATE_DIFF(wd.date, daily_use.activity_date, DAY)) AS days_since_last_action
FROM dates AS wd
LEFT JOIN daily_use
ON wd.date >= daily_use.activity_date
AND wd.date < DATE_ADD(daily_use.activity_date, INTERVAL 30 DAY)
GROUP BY 1,2
I am getting this error in BigQuery: LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join. How can I get around this? I am using Standard SQL within BigQuery.
Thank you
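Below is for BigQuery Standard SQL. It mostly reproduces the logic in your query, except that it does not include days on which no activity at all is found. Note that BETWEEN is inclusive on both ends, so the window is one day longer than with the < in your query.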
#standardSQL
SELECT
daily_use.user_id
, wd.date AS DATE
, MIN(DATE_DIFF(wd.date, daily_use.activity_date, DAY)) AS days_since_last_action
FROM dates AS wd
CROSS JOIN daily_use
WHERE wd.date BETWEEN
daily_use.activity_date AND DATE_ADD(daily_use.activity_date, INTERVAL 30 DAY)
GROUP BY 1,2
-- ORDER BY 1,2
If for whatever reason you still need to reproduce your logic exactly, you can wrap the above in a final left join as below:
#standardSQL
SELECT *
FROM dates AS wd
LEFT JOIN (
SELECT
daily_use.user_id
, wd.date AS date
, MIN(DATE_DIFF(wd.date, daily_use.activity_date, DAY)) AS days_since_last_action
FROM dates AS wd
CROSS JOIN daily_use
WHERE wd.date BETWEEN
daily_use.activity_date AND DATE_ADD(daily_use.activity_date, INTERVAL 30 DAY)
GROUP BY 1,2
) AS daily_use
USING (date)
-- ORDER BY 1,2
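For checking the expected output on a small sample, the cross-join-then-filter idea can be sketched in plain Python. The function name and the (user_id, activity_date) tuple shape are assumptions for illustration. This version keeps the half-open 30-day window from the question (>= activity_date and < activity_date + 30 days) rather than the inclusive BETWEEN.

```python
from datetime import date, timedelta


def days_since_last_action(dates, activity, window_days=30):
    """Map (user_id, date) to the smallest gap in days to an activity in the window.

    `activity` is an iterable of (user_id, activity_date) pairs. Pairs of
    (user, date) with no activity inside the window are simply absent,
    like the rows dropped by the CROSS JOIN ... WHERE version.
    """
    result = {}
    activity = list(activity)
    for d in dates:                      # every date ...
        for user_id, act in activity:    # ... against every activity (cross join)
            if act <= d < act + timedelta(days=window_days):
                gap = (d - act).days
                key = (user_id, d)
                if key not in result or gap < result[key]:
                    result[key] = gap    # MIN(DATE_DIFF(...))
    return result


if __name__ == "__main__":
    all_dates = [date(2016, 1, 1) + timedelta(days=i) for i in range(46)]
    activity = [("u1", date(2016, 1, 1)), ("u1", date(2016, 1, 10))]
    gaps = days_since_last_action(all_dates, activity)
    print(gaps[("u1", date(2016, 1, 12))])
```

The nested loop is quadratic, just like the cross join, so it is only meant for verifying results on small inputs.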

grouping by column but getting multiple results for each

I am trying to calculate the median response time for conversations on each date for the last X days.
I use the following query below, but for some reason, it will generate multiple rows with the same date.
with grouping as (
SELECT a.id, d.date, extract(epoch from (first_response_at - started_at)) as response_time
FROM (
select to_char(date_trunc('day', (current_date - offs)), 'YYYY-MM-DD') AS date
FROM generate_series(0, 2) AS offs
) d
LEFT OUTER JOIN apps a on true
LEFT OUTER JOIN conversations c ON (d.date=to_char(date_trunc('day'::varchar, c.started_at), 'YYYY-MM-DD')) and a.id = c.app_id
and c.app_id = a.id and c.first_response_at > (current_date - (2 || ' days')::interval)::date
)
select
*
from grouping
where grouping.id = 'ASnYW1-RgCl0I'
Any ideas?
First a number of issues with your query, assuming there aren't any parts you haven't shown us:
You don't need a CTE for this query.
From table apps you only use column id whose value is the same as c.app_id. You can remove the table apps and select c.app_id for the same result.
When you use to_char() you do not first have to date_trunc() to a date, the to_char() function handles that.
generate_series() also works with timestamps. Just enter day values with an interval and cast the end result to date before using it.
So, removing all the flotsam we end up with this which does exactly the same as the query in your question but now we can at least see what is going on.
SELECT c.app_id, d.date::date AS date,
extract(epoch from (first_response_at - started_at)) AS response_time
FROM generate_series(CURRENT_DATE - 2, CURRENT_DATE, interval '1 day') d(date)
LEFT JOIN conversations c ON d.date::date = c.started_at::date
AND c.app_id = 'ASnYW1-RgCl0I'
AND c.first_response_at > CURRENT_DATE - 2;
You don't calculate the median response time anywhere, so that is a big problem you need to solve. This only requires data from table conversations and would look somewhat like this to calculate the median response time for the past 2 days:
SELECT app_id, started_at::date AS start_date,
percentile_disc(0.5) WITHIN GROUP (ORDER BY first_response_at - started_at) AS median_response
FROM conversations
WHERE app_id = 'ASnYW1-RgCl0I'
AND first_response_at > CURRENT_DATE - 2
GROUP BY 1, 2;
When we fold the two queries, and put the parameters handily in a single place, this is the final result:
SELECT p.id, d.date::date AS date,
extract(epoch from (c.median_response)) AS response_time
FROM (VALUES ('ASnYW1-RgCl0I', 2)) p(id, days)
JOIN generate_series(CURRENT_DATE - p.days, CURRENT_DATE, interval '1 day') d(date) ON true
LEFT JOIN LATERAL (
SELECT started_at::date AS start_date,
percentile_disc(0.5) WITHIN GROUP (ORDER BY first_response_at - started_at) AS median_response
FROM conversations
WHERE app_id = p.id
AND first_response_at > CURRENT_DATE - p.days
GROUP BY 1) c ON d.date::date = c.start_date;
If you want to change the id of the app or the number of days to look back, you only have to change the VALUES clause accordingly. You can also wrap the whole thing in a SQL function and convert the VALUES clause into two parameters.
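To sanity-check the numbers the query returns, here is a small Python sketch of the same idea: one row per day in the look-back window, with a discrete median per day and None for days without conversations. It is a simplification under my own assumptions: conversations are given as (start_date, response_seconds) pairs, the filter on first_response_at is left out, and percentile_disc below is a hand-written imitation of the PostgreSQL aggregate (it picks the first value whose position in the ordering reaches the requested fraction).

```python
import math
from datetime import date, timedelta


def percentile_disc(values, fraction=0.5):
    """Discrete percentile: always returns one of the input values, or None if empty."""
    ordered = sorted(values)
    if not ordered:
        return None
    index = max(math.ceil(fraction * len(ordered)) - 1, 0)
    return ordered[index]


def median_response_per_day(conversations, today, days_back):
    """Return (day, median seconds or None) for each day from today - days_back to today."""
    by_day = {}
    for started_on, seconds in conversations:
        by_day.setdefault(started_on, []).append(seconds)
    result = []
    for offset in range(days_back, -1, -1):
        day = today - timedelta(days=offset)
        result.append((day, percentile_disc(by_day.get(day, []))))
    return result


if __name__ == "__main__":
    sample = [
        (date(2021, 1, 1), 30.0),
        (date(2021, 1, 1), 10.0),
        (date(2021, 1, 1), 20.0),
        (date(2021, 1, 3), 5.0),
        (date(2021, 1, 3), 15.0),
    ]
    for day, median in median_response_per_day(sample, date(2021, 1, 3), 2):
        print(day, median)
```

With an even number of values the discrete median is the lower of the two middle values, not their average; use percentile_cont in SQL if you want the interpolated value.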

Hits per day in Google Big Query

I am using Google BigQuery to find hits per day. Here is my query:
SELECT COUNT(*) AS Key,
DATE(EventDateUtc) AS Value
FROM [myDataSet.myTable]
WHERE .....
GROUP BY Value
ORDER BY Value DESC
LIMIT 1000;
This works fine, but it ignores dates with 0 hits. I want to include them. I cannot create a temp table in Google BigQuery. How can I fix this?
I tested the following and got the error Field 'day' not found.
SELECT COUNT(*) AS Key,
DATE(t.day) AS Value from (
select date(date_add(day, i, "DAY")) day
from (select '2015-05-01 00:00' day) a
cross join
(select
position(
split(
rpad('', datediff(CURRENT_TIMESTAMP(),'2015-05-01 00:00')*2, 'a,'))) i
from (select NULL)) b
) d
left join [sample_data.requests] t on d.day = t.day
GROUP BY Value
ORDER BY Value DESC
LIMIT 1000;
You can only query data that exists in your tables; the query cannot guess which dates are missing. You need to handle this either in your programming language, or by joining with a numbers table and generating the dates on the fly.
If you know the date range you have in your query, you can generate the days:
select date(date_add(day, i, "DAY")) day
from (select '2015-01-01' day) a
cross join
(select
position(
split(
rpad('', datediff('2015-01-15','2015-01-01')*2, 'a,'))) i
from (select NULL)) b;
Then you can join this result with your query table:
SELECT COUNT(t.day) AS Key,
d.day AS Value from (...the.above.query.pasted.here...) d
left join [myDataSet.myTable] t on d.day = t.day
WHERE .....
GROUP BY Value
ORDER BY Value DESC
LIMIT 1000;
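The other option mentioned above, handling the missing dates in your programming language, could look roughly like this in Python. It is only a sketch: it assumes you have already fetched the query result into a dict mapping each date to its hit count, and the function name is made up for the example.

```python
from datetime import date, timedelta


def fill_missing_days(counts, start, end):
    """Return (day, hits) for every day from start to end inclusive, using 0 where absent."""
    days = (end - start).days
    return [
        (start + timedelta(days=i), counts.get(start + timedelta(days=i), 0))
        for i in range(days + 1)
    ]


if __name__ == "__main__":
    # Hypothetical query result: only days with at least one hit are present.
    hits = {date(2015, 1, 1): 12, date(2015, 1, 3): 4}
    for day, n in fill_missing_days(hits, date(2015, 1, 1), date(2015, 1, 5)):
        print(day, n)
```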