Hi, I am trying to calculate the average of the previous 4 Tuesdays. I have daily sales data, and I want to calculate the average over the previous 4 weeks for the same weekday.
Attached is a snapshot of how my dataset looks.
For March 6, I would like to know the average for the previous 4 weeks (namely Feb 6, Feb 13, Feb 20 and Feb 27). This value needs to be assigned to the Monthly Average column.
I am using a Postgres DB.
Thanks
You can use window functions:
select t.*,
avg(dailycount) over (partition by seller_name, day
order by date
rows between 3 preceding and current row
) as avg_4_weeks
from t
where day = 'Tuesday';
This assumes that "previous 4 weeks" is the current date plus the previous three weeks. If it starts the week before, only the windowing clause needs to change:
select t.*,
avg(dailycount) over (partition by seller_name, day
order by date
rows between 4 preceding and 1 preceding
) as avg_4_weeks
from t
where day = 'Tuesday';
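If the table has no precomputed weekday column, the same partition can be derived on the fly. A minimal sketch, assuming the date column is named date (in Postgres, extract(dow from ...) returns 0 for Sunday through 6 for Saturday, so Tuesday is 2):
select t.*,
       avg(dailycount) over (partition by seller_name, extract(dow from date)
                             order by date
                             rows between 3 preceding and current row
                            ) as avg_4_weeks
from t
where extract(dow from date) = 2;  -- 2 = Tuesday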
I decided to post my answer as well, for anyone else searching. It will allow you to put in any date and get the average for the previous 4 weeks (the current day plus the same weekday in the previous 3 weeks).
SQL Fiddle
PostgreSQL 9.3 Schema Setup:
CREATE TABLE sales (sellerName varchar(10), dailyCount int, saleDay date) ;
INSERT INTO sales (sellerName, dailyCount, saleDay)
SELECT 'ABC',10,to_date('2018-03-15','YYYY-MM-DD') UNION ALL /* THIS ONE */
SELECT 'ABC',11,to_date('2018-03-14','YYYY-MM-DD') UNION ALL
SELECT 'ABC',12,to_date('2018-03-12','YYYY-MM-DD') UNION ALL
SELECT 'ABC',13,to_date('2018-03-11','YYYY-MM-DD') UNION ALL
SELECT 'ABC',14,to_date('2018-03-10','YYYY-MM-DD') UNION ALL
SELECT 'ABC',15,to_date('2018-03-09','YYYY-MM-DD') UNION ALL
SELECT 'ABC',16,to_date('2018-03-08','YYYY-MM-DD') UNION ALL /* THIS ONE */
SELECT 'ABC',17,to_date('2018-03-07','YYYY-MM-DD') UNION ALL
SELECT 'ABC',18,to_date('2018-03-06','YYYY-MM-DD') UNION ALL
SELECT 'ABC',19,to_date('2018-03-05','YYYY-MM-DD') UNION ALL
SELECT 'ABC',20,to_date('2018-03-04','YYYY-MM-DD') UNION ALL
SELECT 'ABC',21,to_date('2018-03-03','YYYY-MM-DD') UNION ALL
SELECT 'ABC',22,to_date('2018-03-02','YYYY-MM-DD') UNION ALL
SELECT 'ABC',23,to_date('2018-03-01','YYYY-MM-DD') UNION ALL /* THIS ONE */
SELECT 'ABC',24,to_date('2018-02-28','YYYY-MM-DD') UNION ALL
SELECT 'ABC',25,to_date('2018-02-22','YYYY-MM-DD') UNION ALL /* THIS ONE */
SELECT 'ABC',26,to_date('2018-02-15','YYYY-MM-DD') UNION ALL
SELECT 'ABC',27,to_date('2018-02-08','YYYY-MM-DD') UNION ALL
SELECT 'ABC',28,to_date('2018-02-01','YYYY-MM-DD')
;
Now For The Query:
WITH theDay AS (
SELECT to_date('2018-03-15','YYYY-MM-DD') AS inDate
)
SELECT AVG(dailyCount) AS totalCount /* 18.5 = (10(3/15)+16(3/8)+23(3/1)+25(2/22))/4 */
FROM sales
CROSS JOIN theDay
WHERE extract(dow from saleDay) = extract(dow from theDay.inDate)
AND saleDay <= theDay.inDate
AND saleDay >= theDay.inDate-INTERVAL '3 weeks' /* Since we want to include the entered
day, for the INTERVAL we need 1 less week than we want */
Results:
| totalcount |
|------------|
| 18.5 |
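To fill the Monthly Average column for every row at once, rather than for one entered date, the same date logic can be moved into a correlated subquery. A minimal sketch against the sales table above (untested; it assumes at most one row per seller per day):
SELECT s.sellerName, s.saleDay, s.dailyCount,
       (SELECT AVG(s2.dailyCount)
        FROM sales s2
        WHERE s2.sellerName = s.sellerName
          AND extract(dow from s2.saleDay) = extract(dow from s.saleDay)
          AND s2.saleDay <= s.saleDay
          AND s2.saleDay >= s.saleDay - INTERVAL '3 weeks'
       ) AS monthly_average
FROM sales s
ORDER BY s.saleDay;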
Hi, I just want to ask how to resolve this problem. An example is in the query indicated below.
In the next query I will prepare, I want to filter to the last 7 days of the delivery date. Don't use current_date, because the maximum date in the data lags well behind it.
Say the current date is 7/12/2022 but the query shows a maximum date of 7/07/2022. How can I filter the dates from 7/1/2022 to 7/07/2022?
, Datas1 as
(select distinct (delivery_due_date) as delivery_date
, Specialist
, Id_number
, Staff_Total as Total_Items
from joining
where Delivery_Due_Date is not null
)
Actually I used the max function in the WHERE clause, but I get an error. Please help me.
Created examples of such data in the first block.
Performed the select on that data in the second block.
Extracted the maximum delivery date in the third block.
Restricted the last block to the 7 days of data ending at the date collected from the third block.
WITH joining AS(
SELECT '2022-07-01' AS delivery_due_date, 'ABC' as Specialist,222 as Id_number, 21 as Staff_Total union all
SELECT '2022-07-07' AS delivery_due_date, 'ABC2' as Specialist,223 as Id_number, 01 as Staff_Total union all
SELECT '2022-07-15' AS delivery_due_date, 'ABC4' as Specialist,212 as Id_number, 25 as Staff_Total union all
SELECT '2022-07-20' AS delivery_due_date, 'AB5C' as Specialist,224 as Id_number, 15 as Staff_Total union all
SELECT '2022-07-05' AS delivery_due_date, 'ABC7' as Specialist,226 as Id_number, 87 as Staff_Total ),
Datas1 as (select distinct (delivery_due_date) as delivery_date , Specialist
, Id_number , Staff_Total as Total_Items from joining where Delivery_Due_Date is not null ),
Datas2 as (
select max(delivery_date) as ddd from Datas1)
select Datas1.* from Datas1,Datas2 where date(delivery_date) between date_sub(date(Datas2.ddd), interval 7 day) and date(Datas2.ddd)
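The Datas2 block can also be folded away by computing the maximum with an analytic function instead. A rough sketch in BigQuery syntax, under the same assumptions and using the same joining data as above:
select * except(max_d)
from (
  select j.*, max(date(delivery_due_date)) over () as max_d
  from joining j
  where delivery_due_date is not null
)
where date(delivery_due_date) between date_sub(max_d, interval 7 day) and max_d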
I need to query a large transaction dataset in an Oracle database (via SQL Developer), but unfortunately I haven't come up with the correct solution yet - hopefully someone can help me.
The dataset consists of the following:
Customer_ID, Counterparty_ID, Transaction ID, Transaction_Amount, Date
Task:
Flag the transactions of a customer when the customer has made transactions (>= 1000 € each) with >= 5 different counterparties in a 7-day time window.
The time window should be a moving time window: e.g. if the transaction date is 17.06., the time window would be +- 6 days (11.06. - 23.06.).
Within a time window, only distinct counterparties should be counted. E.g. if a customer made 5 transactions with counterparty X in time window A, that counts as 1. If the customer made additional transactions with counterparty X in time window B, it again counts as 1 for that window.
So far I have only been able to solve the task with calendar weeks as the time window, but that is not as intended.
You haven't provided any sample data or expected output, so it's difficult to know exactly what you want to output; however, you can immediately filter out all the rows with transaction_amount < 1000 and then, from Oracle 12, use MATCH_RECOGNIZE to perform row-by-row matching:
SELECT *
FROM (
SELECT m.*,
COUNT(DISTINCT counterparty_id) OVER (
PARTITION BY customer_id, match
) AS num_counterparties
FROM (
SELECT *
FROM table_name
WHERE transaction_amount >= 1000
)
MATCH_RECOGNIZE (
PARTITION BY customer_id
ORDER BY "DATE"
MEASURES
MATCH_NUMBER() AS match
ALL ROWS PER MATCH
AFTER MATCH SKIP TO NEXT ROW
PATTERN ( within_week+ )
DEFINE
within_week AS "DATE" < FIRST("DATE") + INTERVAL '7' DAY
) m
)
WHERE num_counterparties >= 5;
Which, for the sample data:
CREATE TABLE table_name (
Customer_ID, Counterparty_ID, Transaction_ID, Transaction_Amount, "DATE"
) AS
SELECT 1, 1, 1, 1000, DATE '1970-01-01' FROM DUAL UNION ALL
SELECT 1, 2, 2, 1000, DATE '1970-01-02' FROM DUAL UNION ALL
SELECT 1, 3, 3, 1000, DATE '1970-01-03' FROM DUAL UNION ALL
SELECT 1, 4, 4, 1000, DATE '1970-01-04' FROM DUAL UNION ALL
SELECT 1, 5, 5, 1000, DATE '1970-01-05' FROM DUAL;
Outputs:
| CUSTOMER_ID | DATE      | MATCH | COUNTERPARTY_ID | TRANSACTION_ID | TRANSACTION_AMOUNT | NUM_COUNTERPARTIES |
|-------------|-----------|-------|-----------------|----------------|--------------------|--------------------|
| 1           | 01-JAN-70 | 1     | 1               | 1              | 1000               | 5                  |
| 1           | 02-JAN-70 | 1     | 2               | 2              | 1000               | 5                  |
| 1           | 03-JAN-70 | 1     | 3               | 3              | 1000               | 5                  |
| 1           | 04-JAN-70 | 1     | 4               | 4              | 1000               | 5                  |
| 1           | 05-JAN-70 | 1     | 5               | 5              | 1000               | 5                  |
db<>fiddle here
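If MATCH_RECOGNIZE is not available (before Oracle 12), a rough sketch of the +- 6 day window described in the question, using a correlated count over the same column names (untested, and likely slower on a large table):
SELECT t.*
FROM   table_name t
WHERE  t.transaction_amount >= 1000
AND    (SELECT COUNT(DISTINCT t2.counterparty_id)
        FROM   table_name t2
        WHERE  t2.customer_id = t.customer_id
        AND    t2.transaction_amount >= 1000
        AND    t2."DATE" BETWEEN t."DATE" - 6 AND t."DATE" + 6
       ) >= 5;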
Example relevant table schema:
+---------------------------+-------------------+
| activity_date - TIMESTAMP | user_id - STRING |
+---------------------------+-------------------+
| 2017-02-22 17:36:08 UTC | fake_id_i24385787 |
+---------------------------+-------------------+
| 2017-02-22 04:27:08 UTC | fake_id_234885747 |
+---------------------------+-------------------+
| 2017-02-22 08:36:08 UTC | fake_id_i24385787 |
+---------------------------+-------------------+
I need to count active distinct users over a large data set over a rolling time period (90 days), and am running into issues due to the size of the dataset.
At first, I attempted to use a window function, similar to the answer here.
https://stackoverflow.com/a/27574474
WITH
daily AS (
SELECT
DATE(activity_date) day,
user_id
FROM
`fake-table`)
SELECT
day,
SUM(APPROX_COUNT_DISTINCT(user_id)) OVER (ORDER BY day ROWS BETWEEN 89 PRECEDING AND CURRENT ROW) ninety_day_window_apprx
FROM
daily
GROUP BY
1
ORDER BY
1 DESC
However, this resulted in getting the distinct number of users per day and then summing those up - but a user who appears on several days within the window is counted once for each of those days. So this is not an accurate measure of distinct users over 90 days.
The next thing I tried was the following solution
https://stackoverflow.com/a/47659590
- concatenating all the distinct user_ids for each window into one string and then counting the distincts within it.
WITH daily AS (
SELECT date(activity_date) day, STRING_AGG(DISTINCT user_id) users
FROM `fake-table`
GROUP BY day
), temp2 AS (
SELECT
day,
STRING_AGG(users) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) users
FROM daily
)
SELECT day,
(SELECT APPROX_COUNT_DISTINCT(id) FROM UNNEST(SPLIT(users)) AS id) Unique90Days
FROM temp2
order by 1 desc
However this quickly ran out of memory with anything large.
Next was to use an HLL sketch to represent the distinct IDs in a much smaller value, so memory would be less of an issue. I thought my problems were solved, but I'm getting an error when running the following. The error is simply "Function MERGE_PARTIAL is not supported." I tried MERGE as well and got the same error. It only happens when using the window function; creating the sketches for each day's values works fine.
I read through the BigQuery Standard SQL documentation and don't see anything about HLL_COUNT.MERGE_PARTIAL and HLL_COUNT.MERGE with window functions. Presumably this should take the 90 sketches and combine them into one HLL sketch, representing the distinct values between the 90 original sketches?
WITH
daily AS (
SELECT
DATE(activity_date) day,
HLL_COUNT.INIT(user_id) sketch
FROM
`fake-table`
GROUP BY
1
ORDER BY
1 DESC),
rolling AS (
SELECT
day,
HLL_COUNT.MERGE_PARTIAL(sketch) OVER (ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) rolling_sketch
FROM daily)
SELECT
day,
HLL_COUNT.EXTRACT(rolling_sketch)
FROM
rolling
ORDER BY
1
"Image of the error - Function MERGE_PARTIAL is not supported"
Any ideas why this error happens or how to adjust?
Below is for BigQuery Standard SQL and does exactly what you want with the use of a window function:
#standardSQL
SELECT day,
(SELECT HLL_COUNT.MERGE(sketch) FROM UNNEST(rolling_sketch_arr) sketch) rolling_sketch
FROM (
SELECT day,
ARRAY_AGG(ids_sketch) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) rolling_sketch_arr
FROM (
SELECT day, HLL_COUNT.INIT(id) ids_sketch
FROM `project.dataset.table`
GROUP BY day
)
)
You can test and play with the above using [totally] dummy data, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, DATE '2019-01-01' day UNION ALL
SELECT 2, '2019-01-01' UNION ALL
SELECT 3, '2019-01-01' UNION ALL
SELECT 1, '2019-01-02' UNION ALL
SELECT 4, '2019-01-02' UNION ALL
SELECT 2, '2019-01-03' UNION ALL
SELECT 3, '2019-01-03' UNION ALL
SELECT 4, '2019-01-03' UNION ALL
SELECT 5, '2019-01-03' UNION ALL
SELECT 1, '2019-01-04' UNION ALL
SELECT 4, '2019-01-04' UNION ALL
SELECT 2, '2019-01-05' UNION ALL
SELECT 3, '2019-01-05' UNION ALL
SELECT 5, '2019-01-05' UNION ALL
SELECT 6, '2019-01-05'
)
SELECT day,
(SELECT HLL_COUNT.MERGE(sketch) FROM UNNEST(rolling_sketch_arr) sketch) rolling_sketch
FROM (
SELECT day,
ARRAY_AGG(ids_sketch) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 2 PRECEDING AND CURRENT ROW) rolling_sketch_arr
FROM (
SELECT day, HLL_COUNT.INIT(id) ids_sketch
FROM `project.dataset.table`
GROUP BY day
)
)
-- ORDER BY day
with result
Row day rolling_sketch
1 2019-01-01 3
2 2019-01-02 4
3 2019-01-03 5
4 2019-01-04 5
5 2019-01-05 6
Combine HLL_COUNT.INIT and HLL_COUNT.MERGE. This solution uses a 90-day cross join with GENERATE_ARRAY(1, 90) instead of OVER.
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, HLL_COUNT.MERGE(sketch) unique_90_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
FROM (
SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp
#EDIT - Following the comments, I have rephrased my question.
I have a BigQuery table that i want to use to get some KPI of my application.
In this table, I save each create or update as a new line in order to keep a better history.
So I have several times the same data with a different state.
Example of the table :
uuid |status |date
––––––|–––––––––––|––––––––––
3 |'inactive' |2018-05-12
1 |'active' |2018-05-10
1 |'inactive' |2018-05-08
2 |'active' |2018-05-08
3 |'active' |2018-05-04
2 |'inactive' |2018-04-22
3 |'inactive' |2018-04-18
We can see that there are multiple rows for the same uuid, each with a different state.
What I would like to get:
I would like to have the number of currently 'active' entries (so there must be no later 'inactive' entry with the same uuid). And to complicate everything, I need this total per day.
So for each day, the number of 'active' entries, including those from previous days.
So with this example I should have this result :
date | actives
____________|_________
2018-05-02 | 0
2018-05-03 | 0
2018-05-04 | 1
2018-05-05 | 1
2018-05-06 | 1
2018-05-07 | 1
2018-05-08 | 2
2018-05-09 | 2
2018-05-10 | 3
2018-05-11 | 3
2018-05-12 | 2
Actually I've managed to get the right number of actives for one day. But my problem is when I want the results for each day.
What I've tried:
I'm stuck with two solutions that each return a different error.
First solution :
WITH
dates AS(
SELECT GENERATE_DATE_ARRAY(
DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), CURRENT_DATE(), INTERVAL 1 DAY)
arr_dates )
SELECT
i_date date,
(
SELECT COUNT(uuid)
FROM (
SELECT
uuid, status, date,
RANK() OVER(PARTITION BY uuid ORDER BY date DESC) rank
FROM users
WHERE
PARSE_DATE("%Y-%m-%d", FORMAT_DATETIME("%Y-%m-%d",date)) <= i_date
)
WHERE
status = 'active'
and rank = 1
## rank is the condition which causes the error
) users
FROM
dates, UNNEST(arr_dates) i_date
ORDER BY i_date;
The SELECT with the RANK() OVER correctly returns the users with a rank column that lets me know which entry is the last one for each uuid.
But when I try this, I get a Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN. error, because of the rank = 1 condition.
Second solution :
WITH
dates AS(
SELECT GENERATE_DATE_ARRAY(
DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), CURRENT_DATE(), INTERVAL 1 DAY)
arr_dates )
SELECT
i_date date,
(
SELECT
COUNT(t1.uuid)
FROM
users t1
WHERE
t1.date = (
SELECT MAX(t2.date)
FROM users t2
WHERE
t2.uuid = t1.uuid
## Here it's the i_date condition which causes the problem
AND PARSE_DATE("%Y-%m-%d", FORMAT_DATETIME("%Y-%m-%d", t2.date)) <= i_date
)
AND status='active' ) users
FROM
dates,
UNNEST(arr_dates) i_date
ORDER BY i_date;
Here, the second select works too and correctly returns the number of active users for a given day.
But the problem comes when I try to use i_date to retrieve data across the multiple days.
And here I get a LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join. error...
Which solution is more likely to succeed? What should I change?
And, if my way of storing the data isn't good, how should I proceed in order to keep a precise history?
Below is for BigQuery Standard SQL
#standardSQL
SELECT date, COUNT(DISTINCT uuid) total_active
FROM `project.dataset.table`
WHERE status = 'active'
GROUP BY date
-- ORDER BY date
Update to address your "rephrased" question :o)
The example below uses the dummy data from your question:
#standardSQL
WITH `project.dataset.users` AS (
SELECT 3 uuid, 'inactive' status, DATE '2018-05-12' date UNION ALL
SELECT 1, 'active', '2018-05-10' UNION ALL
SELECT 1, 'inactive', '2018-05-08' UNION ALL
SELECT 2, 'active', '2018-05-08' UNION ALL
SELECT 3, 'active', '2018-05-04' UNION ALL
SELECT 2, 'inactive', '2018-04-22' UNION ALL
SELECT 3, 'inactive', '2018-04-18'
), dates AS (
SELECT day FROM UNNEST((
SELECT GENERATE_DATE_ARRAY(MIN(date), MAX(date))
FROM `project.dataset.users`
)) day
), active_users AS (
SELECT uuid, status, date first, DATE_SUB(next_status.date, INTERVAL 1 DAY) last FROM (
SELECT uuid, date, status, LEAD(STRUCT(status, date)) OVER(PARTITION BY uuid ORDER BY date ) next_status
FROM `project.dataset.users` u
)
WHERE status = 'active'
)
SELECT day, COUNT(DISTINCT uuid) actives
FROM dates d JOIN active_users u
ON day BETWEEN first AND IFNULL(last, day)
GROUP BY day
-- ORDER BY day
with result
Row day actives
1 2018-05-04 1
2 2018-05-05 1
3 2018-05-06 1
4 2018-05-07 1
5 2018-05-08 2
6 2018-05-09 2
7 2018-05-10 3
8 2018-05-11 3
9 2018-05-12 2
I think this -- or something similar -- will do what you want:
SELECT day,
       SUM(coalesce(actives, 0) - coalesce(inactives, 0)) OVER (ORDER BY day) AS actives
FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2015-05-11'), DATE('2018-06-29'), INTERVAL 1 DAY)
     ) AS day left join
     (select date,
             countif(status = 'active') as actives,
             countif(status = 'inactive') as inactives
      from t
      group by date
     ) a
     on a.date = day
order by day;
Note two fixes versus the first draft of this idea: the second count must test status = 'inactive', and the running sum has to be taken over the generated days after the join, otherwise days with no rows in t would show 0 instead of carrying the previous total forward.
The exact solution depends on whether the "inactive" takes effect on the same day (as above) or on the next day. Either is handled the same way, by using cumulative sums of actives and inactives and then taking the difference.
In order to get data for all days, this generates the days using arrays and unnest(). If you have data on all days, that step may be unnecessary.
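If an 'inactive' should instead only take effect the following day, a minimal sketch of the change: shift each 'inactive' row forward one day before aggregating, so only the derived table in the query above differs:
select if(status = 'inactive', date_add(date, interval 1 day), date) as date,
       countif(status = 'active') as actives,
       countif(status = 'inactive') as inactives
from t
group by 1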
I have a table that looks like this.
Month | Book_Type | sold_in_Dollars
Jan   | A         | 100
Jan   | B         | 120
Feb   | A         | 50
Mar   | A         | 60
Mar   | B         | 30
and so on
I have to calculate the expected sales for each month and book type based on the last 2 months sales.
So for March and type A it would be (100+50)/2 = 75
For March and type B it is 120/1, since there is no data for Feb.
I was trying to use the lag function but it wouldn't work since there is data missing in a few rows.
Any ideas on this?
Since you want missing months to be ignored, this should probably work. I don't have a database to test it on at the moment, but will give it another go in the morning:
select
month,
book_type,
sold_in_dollars,
avg(sold_in_dollars) over (partition by book_type order by month
range between interval '2' month preceding and interval '1' month preceding) as avg_sales
from myTable;
This sort of assumes that month has a date datatype and can be sorted on... if it's just a text string then you'll need something else.
Normally you could just use rows between 2 preceding and 1 preceding, but this will take the two previous data points and not necessarily the two previous months if there are rows missing.
You could work it out with lag but it would be a bit more complicated.
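For the record, a rough sketch of that lag() route (Oracle syntax; it assumes month is a real DATE with at most one row per book_type per month, and only counts a lagged value when it genuinely lies 1 or 2 calendar months back):
select month, book_type, sold_in_dollars,
       (nvl(prev1, 0) + nvl(prev2, 0))
         / nullif(nvl2(prev1, 1, 0) + nvl2(prev2, 1, 0), 0) as expected_sales
from (
  select t.*,
         -- previous row, counted only if it lies 1 or 2 months back
         case when lag(month, 1) over (partition by book_type order by month)
                   in (add_months(month, -1), add_months(month, -2))
              then lag(sold_in_dollars, 1) over (partition by book_type order by month)
         end as prev1,
         -- row before that, counted only if it lies exactly 2 months back
         case when lag(month, 2) over (partition by book_type order by month)
                   = add_months(month, -2)
              then lag(sold_in_dollars, 2) over (partition by book_type order by month)
         end as prev2
  from myTable t
)
order by book_type, month;
For March/type A this gives (50 + 100) / 2 = 75, and for March/type B it gives 120 / 1 = 120, matching the expected values in the question.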
As far as I know, you can give a default value to lag():
SELECT Month, Book_Type,
       (lag(sold_in_Dollars, 1, 0) OVER (PARTITION BY Book_Type ORDER BY Month)
        + lag(sold_in_Dollars, 2, 0) OVER (PARTITION BY Book_Type ORDER BY Month)) / 2 AS expected_sales
FROM your_table
(Assuming Month column doesn't really contain JAN or FEB but real, orderable dates.)
What about something like this (forgive the SQL Server syntax, but you get the idea):
Select Book_type, AVG(sold_in_dollars)
from MyTable
where Month in (Month(DATEADD(mm, -1, GETDATE())), Month(DATEADD(mm, -2, GETDATE())))
group by Book_type
A partition outer join can help create the missing data. Create a set of months and join those values to each row by the month and perform the join once for each book type. I created the months January through April in this example:
with test_data as
(
select to_date('01-JAN-2010', 'DD-MON-YYYY') month, 'A' book_type, 100 sold_in_dollars from dual union all
select to_date('01-JAN-2010', 'DD-MON-YYYY') month, 'B' book_type, 120 sold_in_dollars from dual union all
select to_date('01-FEB-2010', 'DD-MON-YYYY') month, 'A' book_type, 50 sold_in_dollars from dual union all
select to_date('01-MAR-2010', 'DD-MON-YYYY') month, 'A' book_type, 60 sold_in_dollars from dual union all
select to_date('01-MAR-2010', 'DD-MON-YYYY') month, 'B' book_type, 30 sold_in_dollars from dual
)
select book_type, month, sold_in_dollars
,case when denominator = 0 then 'N/A' else to_char(numerator / denominator) end expected_sales
from
(
select test_data.book_type, all_months.month, sold_in_dollars
,count(sold_in_dollars) over
(partition by book_type order by all_months.month rows between 2 preceding and 1 preceding) denominator
,sum(sold_in_dollars) over
(partition by book_type order by all_months.month rows between 2 preceding and 1 preceding) numerator
from
(
select add_months(to_date('01-JAN-2010', 'DD-MON-YYYY'), level-1) month from dual connect by level <= 4
) all_months
left outer join test_data partition by (test_data.book_type) on all_months.month = test_data.month
)
order by book_type, month