Conditional Aggregation for multiple days of data - sql

I can do conditional aggregation for a single day using the SQL below, but I am wondering how to accomplish it in a single query for multiple days. I am trying a Cartesian product between logs_20190715 and dates, but I can't work out how to take it further. Any input would be appreciated.
--1
SELECT CAST('2018-11-19' AS TIMESTAMP ) AS time_id,
city_id,
COUNT( DISTINCT CASE WHEN DATE_TRUNC('DAY',logged_at) = CAST( '2018-11-19' AS TIMESTAMP ) THEN user_id END ) AS A,
COUNT( DISTINCT CASE WHEN logged_at >= CAST( '2018-11-19' AS TIMESTAMP )
AND logged_at < CAST( '2018-11-19' AS TIMESTAMP ) + interval '7' DAY
THEN user_id
END
) AS B,
COUNT( DISTINCT CASE WHEN logged_at < CAST( '2018-11-19' AS TIMESTAMP )
AND logged_at >= CAST( '2018-11-19' AS TIMESTAMP ) - interval '7' DAY
THEN user_id
END
) AS C,
COUNT( DISTINCT CASE WHEN logged_at < CAST( '2018-11-19' AS TIMESTAMP )
AND logged_at >= CAST( '2018-11-19' AS TIMESTAMP ) - interval '28' DAY
THEN user_id
END
) AS D,
'2018-11-19'
FROM logs_20190715
WHERE logged_at <= CAST('2018-11-19' AS TIMESTAMP) + interval '10' DAY
AND logged_at >= CAST('2018-11-19' AS TIMESTAMP) - interval '40' DAY
GROUP BY 1,2;
--2
SELECT CAST('2018-11-18' AS TIMESTAMP ) AS time_id,
city_id,
COUNT( DISTINCT CASE WHEN DATE_TRUNC('DAY',logged_at) = CAST( '2018-11-18' AS TIMESTAMP ) THEN user_id END ) AS A,
COUNT( DISTINCT CASE WHEN logged_at >= CAST( '2018-11-18' AS TIMESTAMP )
AND logged_at < CAST( '2018-11-18' AS TIMESTAMP ) + interval '7' DAY
THEN user_id
END
) AS B,
COUNT( DISTINCT CASE WHEN logged_at < CAST( '2018-11-18' AS TIMESTAMP )
AND logged_at >= CAST( '2018-11-18' AS TIMESTAMP ) - interval '7' DAY
THEN user_id
END
) AS C,
COUNT( DISTINCT CASE WHEN logged_at < CAST( '2018-11-18' AS TIMESTAMP )
AND logged_at >= CAST( '2018-11-18' AS TIMESTAMP ) - interval '28' DAY
THEN user_id
END
) AS D,
'2018-11-18'
FROM logs_20190715
WHERE logged_at <= CAST('2018-11-18' AS TIMESTAMP) + interval '10' DAY
AND logged_at >= CAST('2018-11-18' AS TIMESTAMP) - interval '40' DAY
GROUP BY 1,2;
How can I combine the above two queries into a single query that produces the same results? (I have a date dimension table that contains all dates.)
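One possible approach, sketched here under assumptions rather than taken from an answer in the thread: join the log table to the date dimension on a range wide enough to cover every window, then replace each hard-coded date literal with the dimension column and group by it. The sketch below runs against an in-memory SQLite database so it is self-contained; the `dates.time_id` column name and the sample rows are hypothetical, and SQLite's `date()` modifiers stand in for the `interval` arithmetic of the original dialect. Only the A, B and C metrics are shown; D follows the same pattern with a 28-day window.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logs_20190715 (user_id INTEGER, city_id INTEGER, logged_at TEXT);
CREATE TABLE dates (time_id TEXT);  -- hypothetical date dimension, one row per day
INSERT INTO dates VALUES ('2018-11-18'), ('2018-11-19');
INSERT INTO logs_20190715 VALUES
  (1, 10, '2018-11-18 09:00:00'),
  (2, 10, '2018-11-19 12:00:00'),
  (3, 10, '2018-11-12 08:00:00'),
  (1, 10, '2018-11-19 23:00:00');
""")

QUERY = """
SELECT d.time_id,
       l.city_id,
       -- A: users active on the day itself
       COUNT(DISTINCT CASE WHEN date(l.logged_at) = d.time_id THEN l.user_id END) AS a,
       -- B: users active in the 7 days starting at the day
       COUNT(DISTINCT CASE WHEN l.logged_at >= d.time_id
                            AND l.logged_at <  date(d.time_id, '+7 days')
                           THEN l.user_id END) AS b,
       -- C: users active in the 7 days before the day
       COUNT(DISTINCT CASE WHEN l.logged_at <  d.time_id
                            AND l.logged_at >= date(d.time_id, '-7 days')
                           THEN l.user_id END) AS c
FROM dates AS d
JOIN logs_20190715 AS l
  -- coarse prefilter, wide enough to cover every window above
  ON l.logged_at >= date(d.time_id, '-40 days')
 AND l.logged_at <  date(d.time_id, '+11 days')
GROUP BY d.time_id, l.city_id
ORDER BY d.time_id
"""
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Each log row is repeated once per date whose window it falls into, so it is worth restricting `dates` to the days actually needed before joining.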

Related

Difference between two dates in business days? Google Bigquery

How do I calculate the difference between two dates in business days in Google Bigquery?
I want to replicate this example below:
I have tried these examples but they do not give the expected results:
DATE_DIFF but only counting business days
I also used this logic, and it did not work:
CREATE TEMP FUNCTION BusinessDateDiff(start_date DATE, end_date DATE) AS (
(SELECT COUNTIF(MOD(EXTRACT(DAYOFWEEK FROM date), 7) > 1)
FROM UNNEST(GENERATE_DATE_ARRAY(
start_date, DATE_SUB(end_date, INTERVAL 1 DAY))) AS date)
);
Consider below
create temp function BusinessDateDiff(delivery DATE, eta DATE) AS ((
select if(delivery > eta, 1, -1) * count(*)
from unnest(generate_date_array(
least(delivery, eta), greatest(delivery, eta) - 1
)) day
where not extract(dayofweek from day) in (1, 7)
));
select *,
BusinessDateDiff(DELIVERY_DATE, ORIGINAL_ETA_DATE) as BUSINESS_DAYS
from your_table
if applied to sample data as in your question - output is
I got the desired result as follows:
CREATE TEMP FUNCTION BusinessDateDiff(start_date DATE, end_date DATE) AS (
(SELECT -1*COUNTIF(MOD(EXTRACT(DAYOFWEEK FROM date), 7) > 1)
FROM UNNEST(GENERATE_DATE_ARRAY( start_date , DATE_SUB(end_date,INTERVAL 1 DAY))) AS date));
CREATE TEMP FUNCTION BusinessDateDiff1( end_date DATE, start_date DATE) AS (
(SELECT COUNTIF(MOD(EXTRACT(DAYOFWEEK FROM date), 7) > 1)
FROM UNNEST(GENERATE_DATE_ARRAY( end_date , DATE_SUB(start_date,INTERVAL 1 DAY))) AS date));
WITH OrdersTable AS (
SELECT DATE '2022-06-28' AS DELIVERY_DATE,
DATE '2022-08-17' AS ORIGINAL_ETA_DATE
UNION ALL
SELECT DATE '2022-07-01' AS DELIVERY_DATE,
DATE '2022-07-14' AS ORIGINAL_ETA_DATE
UNION ALL
SELECT DATE '2022-06-30' AS DELIVERY_DATE,
DATE '2022-07-08' AS ORIGINAL_ETA_DATE
UNION ALL
SELECT DATE '2022-06-30' AS DELIVERY_DATE,
DATE '2022-07-08' AS ORIGINAL_ETA_DATE
UNION ALL
SELECT DATE '2022-06-29' AS DELIVERY_DATE,
DATE '2022-07-06' AS ORIGINAL_ETA_DATE
UNION ALL
SELECT DATE '2022-06-27' AS DELIVERY_DATE,
DATE '2022-07-01' AS ORIGINAL_ETA_DATE
UNION ALL
SELECT DATE '2022-06-30' AS DELIVERY_DATE,
DATE '2022-07-05' AS ORIGINAL_ETA_DATE
UNION ALL
SELECT DATE '2022-06-30' AS DELIVERY_DATE,
DATE '2022-06-28' AS ORIGINAL_ETA_DATE
)
SELECT
DELIVERY_DATE,
ORIGINAL_ETA_DATE,
case when DELIVERY_DATE < ORIGINAL_ETA_DATE then
BusinessDateDiff(DELIVERY_DATE, ORIGINAL_ETA_DATE)
when DELIVERY_DATE > ORIGINAL_ETA_DATE then
BusinessDateDiff1(ORIGINAL_ETA_DATE, DELIVERY_DATE)
else 0 end AS BUSINESS_DAYS
FROM OrdersTable
[![Desired Result][1]][1]
[1]: https://i.stack.imgur.com/efmw3.png
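For checking results outside BigQuery, the signed business-day logic of the first answer (count weekdays in the half-open range from the earlier date up to, but excluding, the later date, then apply a sign) can be sketched in plain Python. The function name and the choice of Python are illustrative assumptions; like the SQL versions, it ignores public holidays.

```python
from datetime import date, timedelta


def business_date_diff(delivery: date, eta: date) -> int:
    """Signed count of weekdays in [min(delivery, eta), max(delivery, eta)).

    Positive when delivery is after eta, negative when it is before,
    and 0 when both dates are equal.
    """
    start, end = min(delivery, eta), max(delivery, eta)
    weekdays = sum(
        1
        for offset in range((end - start).days)
        if (start + timedelta(days=offset)).weekday() < 5  # Mon=0 .. Fri=4
    )
    return weekdays if delivery > eta else -weekdays


# Two rows from the sample data in the answer above.
print(business_date_diff(date(2022, 6, 30), date(2022, 6, 28)))  # late delivery
print(business_date_diff(date(2022, 6, 27), date(2022, 7, 1)))   # early delivery
```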

Get columns of data with two different date range

I would like to get the average rating for last 7 days and last 14 days.
I tried using WITH ... AS to get the data, but it takes far too long to run. Is there a better way that would reduce the run time?
syntax:
WITH last_7_days AS (
SELECT item, rating
FROM sales
WHERE (
rating IS NOT NULL
AND (entry_date >= CAST((CAST(now() AS timestamp) + (INTERVAL '-7 day')) AS date) AND entry_date < CAST((CAST(now() AS timestamp) + (INTERVAL '1 day')) AS date))
)
),
last_14_days AS (
SELECT item, rating
FROM sales
WHERE (
rating IS NOT NULL
AND (entry_date >= CAST((CAST(now() AS timestamp) + (INTERVAL '-14 day')) AS date) AND entry_date < CAST((CAST(now() AS timestamp) + (INTERVAL '1 day')) AS date))
)
)
SELECT last_7_days.item, avg(last_7_days.rating) as "avg_last_7_days", avg(last_14_days.rating) as "avg_last_14_days", count(*) AS "count"
FROM last_7_days, last_14_days
WHERE last_7_days.item = last_14_days.item
GROUP BY last_7_days.item
ORDER BY "avg_last_7_days" DESC, last_7_days.item ASC
Result should be something like this:
item|avg_last_7_days|avg_last_14_days|count|
thank you
Use conditional aggregation:
SELECT item,
AVG(rating) FILTER (WHERE entry_date >= NOW() + interval '-7 day' AND entry_date < NOW() + interval '1 day') AS avg_rating_last_seven_days,
AVG(rating) FILTER (WHERE entry_date >= NOW() + interval '-14 day' AND entry_date < NOW() + interval '1 day') AS avg_rating_last_fourteen_days
FROM sales
WHERE rating IS NOT NULL AND
(entry_date >= NOW() + interval '-14 day' AND entry_date < NOW() + interval '1 day')
GROUP BY item;
Note: If you only care about the date, then perhaps you should use CURRENT_DATE or even NOW()::date.
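To see the single-scan idea in isolation, here is a minimal sketch against an in-memory SQLite database. It uses `CASE` inside `AVG` instead of `FILTER` so that it also runs on older SQLite builds, and it pins "today" to a fixed, made-up date so the result is reproducible; the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (item TEXT, rating REAL, entry_date TEXT);
INSERT INTO sales VALUES
  ('apple', 4.0,  '2024-01-14'),
  ('apple', 2.0,  '2024-01-05'),
  ('pear',  5.0,  '2024-01-03'),
  ('pear',  NULL, '2024-01-14');
""")

TODAY = "2024-01-15"  # assumption: fixed reference date instead of NOW()

QUERY = """
SELECT item,
       -- rows older than 7 days yield NULL here, and AVG ignores NULLs
       AVG(CASE WHEN entry_date >= date(:today, '-7 days') THEN rating END) AS avg_last_7_days,
       -- the WHERE clause already limits the scan to the last 14 days
       AVG(rating) AS avg_last_14_days,
       COUNT(*)    AS n
FROM sales
WHERE rating IS NOT NULL
  AND entry_date >= date(:today, '-14 days')
  AND entry_date <  date(:today, '+1 day')
GROUP BY item
ORDER BY item
"""
rows = conn.execute(QUERY, {"today": TODAY}).fetchall()
for row in rows:
    print(row)
```

The table is read once, and an item with no ratings in the last 7 days still appears, with a NULL 7-day average, rather than being dropped by a join.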
Getting rid of all the casts and aggregating directly inside the CTEs should help. Try the following:
WITH last_7_days AS (
SELECT
item,
AVG(rating) AS avg_rating_last_seven_days
FROM
sales
WHERE
rating IS NOT NULL AND
(entry_date >= NOW() + interval '-7 day' AND entry_date < NOW() + interval '1 day')
GROUP BY
1
),
last_14_days AS (
SELECT
item,
AVG(rating) AS avg_rating_last_fourteen_days
FROM
sales
WHERE
rating IS NOT NULL AND
(entry_date >= NOW() + interval '-14 day' AND entry_date < NOW() + interval '1 day')
GROUP BY
1
)
SELECT
lsd.item,
avg_rating_last_seven_days,
avg_rating_last_fourteen_days
FROM
last_7_days AS lsd
INNER JOIN
last_14_days AS lfd ON lsd.item = lfd.item
Let me know whether this improves your current performance!

Oracle SQL - How to retrieve the ID Count difference between today vs yesterday

I have a table that captures when a customer purchases a product. It captures a unique purchase id along with a timestamp of when the purchase was made.
I want to query the difference between the number of purchases made today and the number made yesterday.
I am not sure how to write this query in Oracle.
You can use conditional aggregation:
select sum(case when trunc(datecol) = trunc(sysdate - 1) then 1 else 0 end) as num_yesterday,
sum(case when trunc(datecol) = trunc(sysdate) then 1 else 0 end) as num_today,
sum(case when trunc(datecol) = trunc(sysdate) then 1
when trunc(datecol) = trunc(sysdate - 1) then -1
end) as diff
from t
where datecol >= trunc(sysdate - 1);
You can use GROUP BY to group the purchases by day (truncating the timestamp) and count the purchase IDs.
select trunc(purchase_ts) Day, count(purchase_id) Count
from purchase
group by trunc(purchase_ts)
order by 1
Using TRUNC on the column prevents Oracle from using an ordinary index on that column (you would need a separate function-based index). Instead, use a CASE expression to test whether the date falls between the start of the day and the start of the next day, and COUNT the values within those ranges:
SELECT COUNT(
CASE
WHEN TRUNC( SYSDATE ) - INTERVAL '1' DAY <= your_date_column
AND your_date_column < TRUNC( SYSDATE )
THEN 1
END
) AS count_for_yesterday,
COUNT(
CASE
WHEN TRUNC( SYSDATE ) <= your_date_column
AND your_date_column < TRUNC( SYSDATE ) + INTERVAL '1' DAY
THEN 1
END
) AS count_for_today
FROM your_table
WHERE TRUNC( SYSDATE ) - INTERVAL '1' DAY <= your_date_column
AND your_date_column < TRUNC( SYSDATE ) + INTERVAL '1' DAY
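The half-open range technique above can be tried out without an Oracle instance. The sketch below is an illustration on SQLite, not Oracle syntax: it computes the day boundaries once, compares the raw timestamp column against them, and derives the today-minus-yesterday difference the question asks for. The `purchase` table layout, the sample rows, and the fixed "today" are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchase (purchase_id INTEGER PRIMARY KEY, purchase_ts TEXT);
INSERT INTO purchase (purchase_ts) VALUES
  ('2024-03-09 23:59:59'),
  ('2024-03-10 00:00:00'),
  ('2024-03-10 18:30:00'),
  ('2024-03-11 00:00:00'),
  ('2024-03-11 07:15:00'),
  ('2024-03-11 23:59:59');
""")

TODAY = "2024-03-11"  # assumption: fixed stand-in for TRUNC(SYSDATE)

QUERY = """
WITH bounds AS (
  SELECT date(:today, '-1 day') AS y_start,  -- start of yesterday
         date(:today)           AS t_start,  -- start of today
         date(:today, '+1 day') AS t_end     -- start of tomorrow
)
SELECT COUNT(CASE WHEN p.purchase_ts >= b.y_start AND p.purchase_ts < b.t_start THEN 1 END),
       COUNT(CASE WHEN p.purchase_ts >= b.t_start AND p.purchase_ts < b.t_end   THEN 1 END)
FROM purchase AS p
CROSS JOIN bounds AS b
-- the column is compared directly, never wrapped in a function
WHERE p.purchase_ts >= b.y_start AND p.purchase_ts < b.t_end
"""
num_yesterday, num_today = conn.execute(QUERY, {"today": TODAY}).fetchone()
diff = num_today - num_yesterday
print(num_yesterday, num_today, diff)
```

Rows exactly at midnight land in the day that starts at that instant, which is the point of using `>=` on the lower bound and `<` on the upper one.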

Calculate all days, group by week and include empty weeks?

I need a query that sums all the values for each day in a given week and groups by week including empty weeks.
This query groups by week and includes empty weeks but it isn't summing all days in the week as expected:
Expected Output:
[
...
{"week"=>"2019-02-28", "amount_net"=>"0"},
{"week"=>"2019-03-07", "amount_net"=>"300"}
]
Actual Output:
[
...
{"week"=>"2019-02-28", "amount_net"=>"0"},
{"week"=>"2019-03-07", "amount_net"=>"0"}
]
Here is the query I came up with:
SELECT
week,
COALESCE (amount_net, 0) as amount_net
FROM
(
SELECT
to_char(
generate_series(
timestamp '2018-12-13 22:34:31 UTC',
timestamp '2019-03-14', interval '1 week'
):: date,
'YYYY-MM-DD'
) as week
) d
LEFT JOIN (
SELECT
to_char(
date_trunc('week', created_at),
'YYYY-MM-DD'
) AS week,
SUM(
ROUND(
(
coalesce(cost_items.base_price, 0) - coalesce(cost_items.base_discount, 0) + coalesce(cost_items.base_fee, 0) + coalesce(cost_items.base_taxes_total, 0) + coalesce(
cost_items.base_commission_included,
0
) - coalesce(cost_items.base_voided_price, 0) + coalesce(
cost_items.base_voided_discount,
0
) - coalesce(cost_items.base_voided_fee, 0) - coalesce(
cost_items.base_voided_taxes_total,
0
) - coalesce(
cost_items.base_voided_commission_included,
0
)
):: numeric,
2
)
) as amount_net
FROM
cost_items
WHERE
id IN ('0', '1', '2')
GROUP BY
1
) t USING (week)
ORDER BY
week;
How do I adjust this query to properly sum all values for each day in the week?
Figured it out:
with host_weeks as (
SELECT
generate_series(
timestamp '2018-12-01',
timestamp '2019-04-01', interval '1 day'
)::date as host_week )
select date_trunc('week', day)::date as week, sum(amount_net) from
(
select hw.host_week as day,
SUM(
ROUND(
(
coalesce(ci.base_price, 0) - coalesce(ci.base_discount, 0) + coalesce(ci.base_fee, 0) + coalesce(ci.base_taxes_total, 0) + coalesce(
ci.base_commission_included,
0
) - coalesce(ci.base_voided_price, 0) + coalesce(
ci.base_voided_discount,
0
) - coalesce(ci.base_voided_fee, 0) - coalesce(
ci.base_voided_taxes_total,
0
) - coalesce(
ci.base_voided_commission_included,
0
)
):: numeric,
2
)
) as amount_net
from host_weeks hw left join cost_items ci on hw.host_week = ci.created_at::date and ci.id in (....)
group by 1 order by 1) t group by 1 order by 1;
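A note on why the original query returned zeros, offered as a likely reading rather than something the thread states: the series starts on 2018-12-13, a Thursday, so every generated "week" is a Thursday, while `date_trunc('week', ...)` returns Mondays, and the two keys never match in the join. Besides going day by day as above, another option is to align the series itself to the week start. The sketch below shows that variant on SQLite, where a recursive CTE stands in for `generate_series` and `date(x, 'weekday 0', '-6 days')` stands in for `date_trunc('week', x)`; the single `amount` column and the sample rows are simplifying assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cost_items (id INTEGER, created_at TEXT, amount REAL);
INSERT INTO cost_items VALUES
  (0, '2019-03-05 10:00:00', 100),
  (1, '2019-03-07 15:30:00', 150),
  (2, '2019-03-10 08:00:00', 50);
""")

QUERY = """
WITH RECURSIVE weeks(week) AS (
  -- Monday of the week containing :start, then one row every 7 days
  SELECT date(:start, 'weekday 0', '-6 days')
  UNION ALL
  SELECT date(week, '+7 days') FROM weeks WHERE date(week, '+7 days') <= :stop
),
totals AS (
  -- same Monday-based key on the data side, so the join keys line up
  SELECT date(created_at, 'weekday 0', '-6 days') AS week,
         SUM(amount) AS amount_net
  FROM cost_items
  GROUP BY 1
)
SELECT w.week, COALESCE(t.amount_net, 0) AS amount_net
FROM weeks AS w
LEFT JOIN totals AS t USING (week)
ORDER BY w.week
"""
rows = conn.execute(QUERY, {"start": "2019-02-20", "stop": "2019-03-14"}).fetchall()
for row in rows:
    print(row)
```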

left join causes huge increase in time for query resolution

The following query helps me to calculate the average of historical values distributed on even time intervals.
EXPLAIN ANALYZE SELECT start_date as date, AVG(hv.value::float) as value
FROM generate_series(cast('2017-01-01' as abstime), cast('2017-12-01' as abstime), interval '86400 seconds') start_date
LEFT JOIN history_values hv
ON (
hv.variable_id = 3 AND
hv.created_at BETWEEN start_date AND start_date + interval '86400 seconds'
)
GROUP BY start_date
ORDER BY start_date
Here is the report of the query: https://explain.depesz.com/s/q29a
Now if I try to add an extra column value2 pointing to another variable_id the query time goes from 2 seconds to 150 seconds:
EXPLAIN ANALYZE SELECT start_date as date,
AVG(hv1.value::float) as value1,
AVG(hv2.value::float) as value2
FROM generate_series(cast('2017-01-01' as abstime), cast('2017-12-01' as abstime), interval '86400 seconds') start_date
LEFT JOIN history_values hv1
ON (
hv1.variable_id = 2 AND
hv1.created_at BETWEEN start_date AND start_date + interval '86400 seconds'
)
LEFT JOIN history_values hv2
ON (
hv2.variable_id = 3 AND
hv2.created_at BETWEEN start_date AND start_date + interval '86400 seconds'
)
GROUP BY start_date
ORDER BY start_date
Here is the report: https://explain.depesz.com/s/V1sV
Could anybody tell me why? I was really expecting the time to be around 4 seconds, not almost 75 times more.
Also note that:
SELECT COUNT(*) FROM history_values WHERE variable_id = 2 -- ~25k records
SELECT COUNT(*) FROM history_values WHERE variable_id = 3 -- ~25k records
You're not adding an extra column, you're adding another join. And you don't need that extra join anyway.
Instead, try just filtering the avg():
EXPLAIN ANALYZE
SELECT start_date as date,
AVG(hv1.value::float) FILTER ( WHERE hv1.variable_id = 1 ) as value1,
AVG(hv1.value::float) FILTER ( WHERE hv1.variable_id = 2 ) as value2
FROM generate_series(
cast('2017-01-01' as abstime)
, cast('2017-12-01' as abstime)
, interval '86400 seconds'
) AS start_date
LEFT JOIN history_values hv1
ON (
hv1.created_at >= cast('2017-01-01' as abstime) AND
hv1.created_at <= cast('2017-12-01' as abstime) AND
hv1.created_at >= start_date AND
hv1.created_at < start_date + interval '86400 seconds'
)
GROUP BY start_date
ORDER BY start_date
As a side note, you should not ever be using abstime. That should be for internal use only. Instead, I would use
EXPLAIN ANALYZE
SELECT start_date::date AS date,
AVG(hv1.value::float) FILTER ( WHERE hv1.variable_id = 1 ) as value1,
AVG(hv1.value::float) FILTER ( WHERE hv1.variable_id = 2 ) as value2
FROM generate_series(
timestamp with time zone '2017-01-01',
timestamp with time zone '2017-12-01',
interval '1 day'
) AS start_date
LEFT JOIN history_values hv1
ON (
hv1.created_at BETWEEN timestamp with time zone '2017-01-01'
AND timestamp with time zone '2017-12-01' AND
hv1.created_at >= start_date AND
hv1.created_at < start_date + interval '1 day' AND
hv1.variable_id IN (1,2)
)
GROUP BY start_date
ORDER BY start_date
I would also think you could collapse those ranges down:
EXPLAIN ANALYZE
SELECT start_date::date AS date,
AVG(hv1.value::float) FILTER ( WHERE hv1.variable_id = 1 ) as value1,
AVG(hv1.value::float) FILTER ( WHERE hv1.variable_id = 2 ) as value2
FROM generate_series(
timestamp with time zone '2017-01-01',
timestamp with time zone '2017-12-01' - interval '1 day',
interval '1 day'
) AS start_date
LEFT JOIN history_values hv1
ON hv1.created_at BETWEEN start_date AND (start_date + interval '1 day' )
AND hv1.variable_id IN (1,2)
GROUP BY start_date
ORDER BY start_date
In the future, please ask questions specific to PostgreSQL on http://dba.stackexchange.com. I would flag this for migration there. The admins will gladly move it.
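A plausible way to read the slowdown, as an estimate rather than something read off the linked plans: with about 25k rows per variable spread over roughly 335 days, each day matches about 75 rows per variable, so the second LEFT JOIN pairs every hv1 row of a day with every hv2 row of that day and multiplies the intermediate row count by about 75. The small SQLite sketch below, with made-up rows and `CASE` in place of `FILTER`, shows the fan-out next to the single-join form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE days (start_date TEXT);
CREATE TABLE history_values (variable_id INTEGER, value REAL, created_at TEXT);
INSERT INTO days VALUES ('2017-01-01');
INSERT INTO history_values VALUES
  (2, 1.0,  '2017-01-01 01:00:00'),
  (2, 2.0,  '2017-01-01 02:00:00'),
  (2, 3.0,  '2017-01-01 03:00:00'),
  (3, 10.0, '2017-01-01 04:00:00'),
  (3, 20.0, '2017-01-01 05:00:00');
""")

# Half-open "falls on this day" predicate, instantiated per table alias.
IN_DAY = "{a}.created_at >= d.start_date AND {a}.created_at < date(d.start_date, '+1 day')"

# Two joins: every variable-2 row is paired with every variable-3 row of the day.
two_joins = conn.execute(f"""
SELECT COUNT(*), AVG(hv1.value), AVG(hv2.value)
FROM days AS d
LEFT JOIN history_values AS hv1 ON hv1.variable_id = 2 AND {IN_DAY.format(a='hv1')}
LEFT JOIN history_values AS hv2 ON hv2.variable_id = 3 AND {IN_DAY.format(a='hv2')}
""").fetchone()

# One join plus conditional aggregation: each matching row appears once.
one_join = conn.execute(f"""
SELECT COUNT(*),
       AVG(CASE WHEN hv.variable_id = 2 THEN hv.value END),
       AVG(CASE WHEN hv.variable_id = 3 THEN hv.value END)
FROM days AS d
LEFT JOIN history_values AS hv ON hv.variable_id IN (2, 3) AND {IN_DAY.format(a='hv')}
""").fetchone()

print(two_joins)  # 3 * 2 joined rows for the day
print(one_join)   # 3 + 2 joined rows for the day
```

In this sample both forms return the same averages, but the two-join form builds a product of rows per day while the single join builds a sum.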