Alternative to the LAG SQL command

I have a table with data like this:
Month  Book_Type  sold_in_Dollars
Jan    A          100
Jan    B          120
Feb    A          50
Mar    A          60
Mar    B          30
and so on
I have to calculate the expected sales for each month and book type based on the last 2 months sales.
So for March and type A it would be (100+50)/2 = 75
For March and type B it is 120/1, since there is no data for Feb.
I was trying to use the lag function but it wouldn't work since there is data missing in a few rows.
Any ideas on this?

Since this needs to ignore missing months rather than just missing rows, a range-based window should probably work. I don't have a database to test it on at the moment, but I'll give it another go in the morning:
select month,
       book_type,
       sold_in_dollars,
       avg(sold_in_dollars) over (partition by book_type order by month
                                  range between interval '2' month preceding
                                            and interval '1' month preceding) as avg_sales
from myTable;
This sort of assumes that month has a date datatype and can be sorted on... if it's just a text string then you'll need something else.
Normally you could just use rows between 2 preceding and 1 preceding, but this will take the two previous data points, and not necessarily the two previous months, if there are rows missing.
You could work it out with lag but it would be a bit more complicated.
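For the record, a lag-based version might look something like the sketch below (untested, and assuming Month is a real date as above): pull the two previous rows per book type, then count each one only if its month actually falls inside the two-month window.
select month, book_type, sold_in_dollars,
       -- sum the previous rows that fall inside the window...
       (  case when prev1_month >= add_months(month, -2) then prev1_sales else 0 end
        + case when prev2_month >= add_months(month, -2) then prev2_sales else 0 end )
       -- ...and divide by however many of them exist (null if neither does)
       / nullif(  case when prev1_month >= add_months(month, -2) then 1 else 0 end
                + case when prev2_month >= add_months(month, -2) then 1 else 0 end, 0) as expected_sales
from (
    select month, book_type, sold_in_dollars,
           lag(month, 1)           over (partition by book_type order by month) as prev1_month,
           lag(sold_in_dollars, 1) over (partition by book_type order by month) as prev1_sales,
           lag(month, 2)           over (partition by book_type order by month) as prev2_month,
           lag(sold_in_dollars, 2) over (partition by book_type order by month) as prev2_sales
    from myTable
) t;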

As far as I know, you can give a default value to lag():
SELECT Month,
       Book_Type,
       (lag(sold_in_Dollars, 1, 0) OVER (PARTITION BY Book_Type ORDER BY Month)
        + lag(sold_in_Dollars, 2, 0) OVER (PARTITION BY Book_Type ORDER BY Month)) / 2 AS expected_sales
FROM your_table
(Assuming Month column doesn't really contain JAN or FEB but real, orderable dates.)
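One caveat, though: with a default of 0 the missing month still counts in the divisor of 2, and the lagged rows are not necessarily from the right months. For type B in March this gives (120 + 0) / 2 = 60 rather than the expected 120 / 1; the range-window answer above avoids this by averaging only the rows that actually exist.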

What about something like this (forgive the SQL Server syntax, but you get the idea):
Select Book_type, AVG(sold_in_dollars)
from MyTable
where Month in (MONTH(DATEADD(mm, -1, GETDATE())), MONTH(DATEADD(mm, -2, GETDATE())))
group by Book_type

A partition outer join can help create the missing data. Create a set of months and join those values to each row by the month and perform the join once for each book type. I created the months January through April in this example:
with test_data as
(
select to_date('01-JAN-2010', 'DD-MON-YYYY') month, 'A' book_type, 100 sold_in_dollars from dual union all
select to_date('01-JAN-2010', 'DD-MON-YYYY') month, 'B' book_type, 120 sold_in_dollars from dual union all
select to_date('01-FEB-2010', 'DD-MON-YYYY') month, 'A' book_type, 50 sold_in_dollars from dual union all
select to_date('01-MAR-2010', 'DD-MON-YYYY') month, 'A' book_type, 60 sold_in_dollars from dual union all
select to_date('01-MAR-2010', 'DD-MON-YYYY') month, 'B' book_type, 30 sold_in_dollars from dual
)
select book_type, month, sold_in_dollars
,case when denominator = 0 then 'N/A' else to_char(numerator / denominator) end expected_sales
from
(
select test_data.book_type, all_months.month, sold_in_dollars
,count(sold_in_dollars) over
(partition by book_type order by all_months.month rows between 2 preceding and 1 preceding) denominator
,sum(sold_in_dollars) over
(partition by book_type order by all_months.month rows between 2 preceding and 1 preceding) numerator
from
(
select add_months(to_date('01-JAN-2010', 'DD-MON-YYYY'), level-1) month from dual connect by level <= 4
) all_months
left outer join test_data partition by (test_data.book_type) on all_months.month = test_data.month
)
order by book_type, month
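If I've traced the windows correctly, running this against the test data should return:

BOOK_TYPE  MONTH        SOLD_IN_DOLLARS  EXPECTED_SALES
A          01-JAN-2010  100              N/A
A          01-FEB-2010  50               100
A          01-MAR-2010  60               75
A          01-APR-2010                   55
B          01-JAN-2010  120              N/A
B          01-FEB-2010                   120
B          01-MAR-2010  30               120
B          01-APR-2010                   30

which matches the 75 and 120 the question expects for March.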


How to differentiate iterations using a date field in BigQuery

I have a process that occurs every 30 days but can take a few days.
How can I differentiate between each iteration in order to sum the output of the process?
For example, the output I expect is:
Name         Date        amount  iteration (optional)
Sophia Liu   2016-01-01  4       1
Sophia Liu   2016-02-01  5       2
Nikki Leith  2016-01-02  5       1
Nikki Leith  2016-02-01  10      2
I tried using the lag function on the date field and using the difference between that column and the date column:
WITH base AS
(SELECT 'Sophia Liu' as name, DATE '2016-01-01' as date, 3 as amount
UNION ALL SELECT 'Sophia Liu', DATE '2016-01-02', 1
UNION ALL SELECT 'Sophia Liu', DATE '2016-02-01', 3
UNION ALL SELECT 'Sophia Liu', DATE '2016-02-02', 2
UNION ALL SELECT 'Nikki Leith', DATE '2016-01-02', 5
UNION ALL SELECT 'Nikki Leith', DATE '2016-02-01', 5
UNION ALL SELECT 'Nikki Leith', DATE '2016-02-02', 3
UNION ALL SELECT 'Nikki Leith', DATE '2016-02-03', 1
UNION ALL SELECT 'Nikki Leith', DATE '2016-02-04', 1)
select
  name
  ,date
  ,lag(date) over (partition by name order by date) as lag_func
  ,date_diff(date, lag(date) over (partition by name order by date), day) as date_difference
  ,case when date_diff(date, lag(date) over (partition by name order by date), day) >= 10
          or date_diff(date, lag(date) over (partition by name order by date), day) is null
        then true else false end as new_iteration
  ,amount
from base
Edited answer
After your clarification, and looking at what's actually in your SQL code, I'm guessing you are looking for a solution to what's called a gaps-and-islands problem. That is, you want to identify the "islands" of activity and sum the amount for each iteration or island. Taking your example, you can first identify the start of a new session (the "gap") and then use that to create a unique iteration ("island") identifier for each user. You can then use that identifier to perform a SUM().
-- (these CTEs continue the WITH clause from the `base` CTE in the question)
gaps as (
select
name,
date,
amount,
if(date_diff(date, lag(date,1) over(partition by name order by date), DAY) >= 10, 1, 0) new_iteration
from base
),
islands as (
select
*,
1 + sum(new_iteration) over(partition by name order by date) iteration_id
from gaps
)
select
*,
sum(amount) over(partition by name, iteration_id) iteration_amount
from islands
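Run against the sample base data, the islands match the expected output above: Sophia Liu gets iteration 1 with amount 4 (3 + 1) and iteration 2 with amount 5 (3 + 2), and Nikki Leith gets iteration 1 with amount 5 and iteration 2 with amount 10 (5 + 3 + 1 + 1).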
Previous answer
Sounds like you just need a RANK() to count the iterations in your window functions. Depending on your need you can then sum cumulative or total amounts in a similar window function. Something like this:
select
name
,date
,rank() over (partition by name order by date) as iteration
,sum(amount) over (partition by name order by date) as cumulative_amount
,sum(amount) over (partition by name) as total_amount
,amount
from base
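(Note that rank() numbers every row within a name, so this matches the expected iteration numbers only when each iteration produces a single row; the gaps-and-islands version above is what handles multi-day iterations.)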

How to filter the last 7 days based on the previous query? - BigQuery

Hi, I just want to ask how to resolve this problem. An example is in the query indicated below.
In the next query I prepare, I want to filter to the last 7 days of the delivery date. I can't use current_date, because the maximum date in the data lags well behind it.
Assuming the current date is 7/12/2022 but the query shows a maximum date of 7/07/2022, how can I filter the dates from 7/1/2022 to 7/07/2022?
, Datas1 as
(select distinct (delivery_due_date) as delivery_date
, Specialist
, Id_number
, Staff_Total as Total_Items
from joining
where Delivery_Due_Date is not null
)
I actually tried using the max function in the where clause, but I got an error. Please help me.
Created examples of such data in the first block.
Performed the select on that data in the second block.
Extracted the maximum delivery date in the third block.
Restricted the last block to the 7 days of data leading up to the maximum collected in the third block.
WITH joining AS (
  SELECT '2022-07-01' AS delivery_due_date, 'ABC' AS Specialist, 222 AS Id_number, 21 AS Staff_Total UNION ALL
  SELECT '2022-07-07' AS delivery_due_date, 'ABC2' AS Specialist, 223 AS Id_number, 01 AS Staff_Total UNION ALL
  SELECT '2022-07-15' AS delivery_due_date, 'ABC4' AS Specialist, 212 AS Id_number, 25 AS Staff_Total UNION ALL
  SELECT '2022-07-20' AS delivery_due_date, 'AB5C' AS Specialist, 224 AS Id_number, 15 AS Staff_Total UNION ALL
  SELECT '2022-07-05' AS delivery_due_date, 'ABC7' AS Specialist, 226 AS Id_number, 87 AS Staff_Total
),
Datas1 AS (
  SELECT DISTINCT (delivery_due_date) AS delivery_date
       , Specialist
       , Id_number
       , Staff_Total AS Total_Items
  FROM joining
  WHERE Delivery_Due_Date IS NOT NULL
),
Datas2 AS (
  SELECT MAX(delivery_date) AS ddd FROM Datas1
)
SELECT Datas1.*
FROM Datas1, Datas2
WHERE DATE(delivery_date) BETWEEN DATE_SUB(DATE(Datas2.ddd), INTERVAL 7 DAY) AND DATE(Datas2.ddd)
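With this sample data the maximum delivery date is 2022-07-20, so the final select should return only the 2022-07-15 and 2022-07-20 rows (everything between 2022-07-13 and 2022-07-20).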

Calculate Revenue Recognition Per Month in Oracle SQL

I have a table with the order lines which show the Booking Amount and the booked date, but the revenue is recognised over 3 months (so 1/3 in the booked month and a further 1/3 in each of the next 2 months).
I need to create a query that would show the total revenue recognised in each month.
Is there an analytic function that could work this out? At the moment I have cobbled together 3 joined queries that give the numbers, but in 3 separate columns, where I need them in one column:
select TRUNC(OM.BOOKING_DATE, 'MONTH') as Month
, SUM(OM.BOOKED_VALUE)/3 as Month_1
, M2.Month_2
, M3.Month_3
from ORDERS.OM,
(select TRUNC(ADD_MONTHS(OM.BOOKING_DATE,1), 'MONTH') as Month
, SUM(OM.BOOKED_VALUE)/3 as Month_2
from ORDERS.OM
GROUP By TRUNC(ADD_MONTHS(OM.BOOKING_DATE,1), 'MONTH')) M2,
(select TRUNC(ADD_MONTHS(OM.BOOKING_DATE,2), 'MONTH') as Month
, SUM(OM.BOOKED_VALUE)/3 as Month_3
from ORDERS.OM
GROUP By TRUNC(ADD_MONTHS(OM.BOOKING_DATE,2), 'MONTH')) M3
WHERE TRUNC(OM.BOOKING_DATE, 'MONTH') = M2.MONTH
AND TRUNC(OM.BOOKING_DATE, 'MONTH') = M3.MONTH
GROUP By TRUNC(OM.BOOKING_DATE, 'MONTH'), M2.Month_2, M3.Month_3
Order by 1 DESC

Triple every row and sum:
select t.Month, SUM(t.Val) as Value
from ORDERS.OM
cross join lateral (
  select TRUNC(OM.BOOKING_DATE, 'MONTH') as Month, OM.BOOKED_VALUE/3.0 as Val from dual union all
  select TRUNC(ADD_MONTHS(OM.BOOKING_DATE, 1), 'MONTH'), OM.BOOKED_VALUE/3.0 from dual union all
  select TRUNC(ADD_MONTHS(OM.BOOKING_DATE, 2), 'MONTH'), OM.BOOKED_VALUE/3.0 from dual
) t
group by t.Month
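If you are on an Oracle version without LATERAL (pre-12c), the same fan-out can be done with a plain cross join against a three-row generator, along the lines of this (untested) sketch:
-- generate offsets 0, 1, 2 and shift each booking by that many months,
-- so each row contributes a third of its value to three consecutive months
select trunc(add_months(om.booking_date, n.offset), 'MONTH') as month
     , sum(om.booked_value / 3) as value
from ORDERS.OM om
cross join (select level - 1 as offset from dual connect by level <= 3) n
group by trunc(add_months(om.booking_date, n.offset), 'MONTH')
order by 1 desc;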

BigQuery: How to merge HLL Sketches over a window function? (Count distinct values over a rolling window)

Example relevant table schema:
+---------------------------+-------------------+
| activity_date - TIMESTAMP | user_id - STRING |
+---------------------------+-------------------+
| 2017-02-22 17:36:08 UTC | fake_id_i24385787 |
+---------------------------+-------------------+
| 2017-02-22 04:27:08 UTC | fake_id_234885747 |
+---------------------------+-------------------+
| 2017-02-22 08:36:08 UTC | fake_id_i24385787 |
+---------------------------+-------------------+
I need to count active distinct users over a large data set over a rolling time period (90 days), and am running into issues due to the size of the dataset.
At first, I attempted to use a window function, similar to the answer here.
https://stackoverflow.com/a/27574474
WITH
daily AS (
SELECT
DATE(activity_date) day,
user_id
FROM
`fake-table`)
SELECT
day,
SUM(APPROX_COUNT_DISTINCT(user_id)) OVER (ORDER BY day ROWS BETWEEN 89 PRECEDING AND CURRENT ROW) ninety_day_window_approx
FROM
daily
GROUP BY
1
ORDER BY
1 DESC
However, this resulted in getting the distinct number of users per day and then summing those up, but users could be double-counted within the window if they appeared on multiple days. So this is not a true, accurate measure of distinct users over 90 days.
The next thing I tried was the following solution, https://stackoverflow.com/a/47659590, concatenating all the distinct user_ids for each window into a string and then counting the distinct values within it.
WITH daily AS (
SELECT date(activity_date) day, STRING_AGG(DISTINCT user_id) users
FROM `fake-table`
GROUP BY day
), temp2 AS (
SELECT
day,
STRING_AGG(users) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) users
FROM daily
)
SELECT day,
(SELECT APPROX_COUNT_DISTINCT(id) FROM UNNEST(SPLIT(users)) AS id) Unique90Days
FROM temp2
order by 1 desc
However, this quickly ran out of memory with anything large.
Next was to use an HLL sketch to represent the distinct IDs in a much smaller value, so memory would be less of an issue. I thought my problems were solved, but I'm getting an error when running the following query. The error is simply "Function MERGE_PARTIAL is not supported." I tried MERGE as well and got the same error. It only happens when using the window function; creating the sketches for each day's values works fine.
I read through the BigQuery Standard SQL documentation and don't see anything about HLL_COUNT.MERGE_PARTIAL or HLL_COUNT.MERGE with window functions. Presumably this should take the 90 sketches and combine them into one HLL sketch representing the distinct values across the 90 original sketches?
WITH
daily AS (
SELECT
DATE(activity_date) day,
HLL_COUNT.INIT(user_id) sketch
FROM
`fake-table`
GROUP BY
1
ORDER BY
1 DESC),
rolling AS (
SELECT
day,
HLL_COUNT.MERGE_PARTIAL(sketch) OVER (ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) rolling_sketch
FROM daily)
SELECT
day,
HLL_COUNT.EXTRACT(rolling_sketch)
FROM
rolling
ORDER BY
1
"Image of the error - Function MERGE_PARTIAL is not supported"
Any ideas why this error happens or how to adjust?
Below is for BigQuery Standard SQL and does exactly what you want with the use of a window function:
#standardSQL
SELECT day,
(SELECT HLL_COUNT.MERGE(sketch) FROM UNNEST(rolling_sketch_arr) sketch) rolling_sketch
FROM (
SELECT day,
ARRAY_AGG(ids_sketch) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) rolling_sketch_arr
FROM (
SELECT day, HLL_COUNT.INIT(id) ids_sketch
FROM `project.dataset.table`
GROUP BY day
)
)
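The key point in this workaround: HLL_COUNT.MERGE cannot be used as an analytic function, so the OVER clause is only used to ARRAY_AGG the per-day sketches into a rolling array, and a correlated subquery then UNNESTs each array and merges it with a plain aggregate.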
You can test and play with the above using (totally) dummy data, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, DATE '2019-01-01' day UNION ALL
SELECT 2, '2019-01-01' UNION ALL
SELECT 3, '2019-01-01' UNION ALL
SELECT 1, '2019-01-02' UNION ALL
SELECT 4, '2019-01-02' UNION ALL
SELECT 2, '2019-01-03' UNION ALL
SELECT 3, '2019-01-03' UNION ALL
SELECT 4, '2019-01-03' UNION ALL
SELECT 5, '2019-01-03' UNION ALL
SELECT 1, '2019-01-04' UNION ALL
SELECT 4, '2019-01-04' UNION ALL
SELECT 2, '2019-01-05' UNION ALL
SELECT 3, '2019-01-05' UNION ALL
SELECT 5, '2019-01-05' UNION ALL
SELECT 6, '2019-01-05'
)
SELECT day,
(SELECT HLL_COUNT.MERGE(sketch) FROM UNNEST(rolling_sketch_arr) sketch) rolling_sketch
FROM (
SELECT day,
ARRAY_AGG(ids_sketch) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 2 PRECEDING AND CURRENT ROW) rolling_sketch_arr
FROM (
SELECT day, HLL_COUNT.INIT(id) ids_sketch
FROM `project.dataset.table`
GROUP BY day
)
)
-- ORDER BY day
with result
Row  day         rolling_sketch
1    2019-01-01  3
2    2019-01-02  4
3    2019-01-03  5
4    2019-01-04  5
5    2019-01-05  6
Combine HLL_COUNT.INIT and HLL_COUNT.MERGE. This solution uses a 90-day cross join with GENERATE_ARRAY(1, 90) instead of OVER:
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, HLL_COUNT.MERGE(sketch) unique_90_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
FROM (
SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp
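The idea in this variant: the cross join against GENERATE_ARRAY(1, 90) fans each daily sketch out to 90 different group keys, so an ordinary GROUP BY can merge the sketches for each window and no window function is needed at all; the IF(i < 31, ...) and IF(i < 8, ...) filters reuse the same fan-out for the 30- and 7-day counts.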

POSTGRES - Average for previous 4 weekdays

Hi, I am trying to calculate the average of the previous 4 Tuesdays. I have daily sales data, and I am trying to calculate what the average for the same weekday was over the previous 4 weeks.
Attached is a snapshot of what my dataset looks like.
Now for March 6, I would like to know the average for the previous 4 weeks (namely Feb 6, Feb 13, Feb 20 and Feb 27). This value needs to be assigned to the Monthly Average column.
I am using a Postgres DB.
Thanks
You can use window functions:
select t.*,
avg(dailycount) over (partition by seller_name, day
order by date
rows between 3 preceding and current row
) as avg_4_weeks
from t
where day = 'Tuesday';
This assumes that "previous 4 weeks" is the current date plus the previous three weeks. If it starts the week before, only the windowing clause needs to change:
select t.*,
avg(dailycount) over (partition by seller_name, day
order by date
rows between 4 preceding and 1 preceding
) as avg_4_weeks
from t
where day = 'Tuesday';
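Note that the WHERE clause is evaluated before the window functions, so once the rows are filtered to day = 'Tuesday', each step back in the window is exactly one week for that seller; that is what lets a rows between frame stand in for a date-range window here, assuming no Tuesday is missing from the data.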
I decided to post my answer also, for anyone else searching. My answer will let you put in any date and get the average for the previous 4 weeks (the current day plus the previous 3 weeks matching that weekday).
SQL Fiddle
PostgreSQL 9.3 Schema Setup:
CREATE TABLE sales (sellerName varchar(10), dailyCount int, saleDay date) ;
INSERT INTO sales (sellerName, dailyCount, saleDay)
SELECT 'ABC',10,to_date('2018-03-15','YYYY-MM-DD') UNION ALL /* THIS ONE */
SELECT 'ABC',11,to_date('2018-03-14','YYYY-MM-DD') UNION ALL
SELECT 'ABC',12,to_date('2018-03-12','YYYY-MM-DD') UNION ALL
SELECT 'ABC',13,to_date('2018-03-11','YYYY-MM-DD') UNION ALL
SELECT 'ABC',14,to_date('2018-03-10','YYYY-MM-DD') UNION ALL
SELECT 'ABC',15,to_date('2018-03-09','YYYY-MM-DD') UNION ALL
SELECT 'ABC',16,to_date('2018-03-08','YYYY-MM-DD') UNION ALL /* THIS ONE */
SELECT 'ABC',17,to_date('2018-03-07','YYYY-MM-DD') UNION ALL
SELECT 'ABC',18,to_date('2018-03-06','YYYY-MM-DD') UNION ALL
SELECT 'ABC',19,to_date('2018-03-05','YYYY-MM-DD') UNION ALL
SELECT 'ABC',20,to_date('2018-03-04','YYYY-MM-DD') UNION ALL
SELECT 'ABC',21,to_date('2018-03-03','YYYY-MM-DD') UNION ALL
SELECT 'ABC',22,to_date('2018-03-02','YYYY-MM-DD') UNION ALL
SELECT 'ABC',23,to_date('2018-03-01','YYYY-MM-DD') UNION ALL /* THIS ONE */
SELECT 'ABC',24,to_date('2018-02-28','YYYY-MM-DD') UNION ALL
SELECT 'ABC',25,to_date('2018-02-22','YYYY-MM-DD') UNION ALL /* THIS ONE */
SELECT 'ABC',26,to_date('2018-02-15','YYYY-MM-DD') UNION ALL
SELECT 'ABC',27,to_date('2018-02-08','YYYY-MM-DD') UNION ALL
SELECT 'ABC',28,to_date('2018-02-01','YYYY-MM-DD')
;
Now For The Query:
WITH theDay AS (
SELECT to_date('2018-03-15','YYYY-MM-DD') AS inDate
)
SELECT AVG(dailyCount) AS totalCount /* 18.5 = (10(3/15)+16(3/8)+23(3/1)+25(2/22))/4 */
FROM sales
CROSS JOIN theDay
WHERE extract(dow from saleDay) = extract(dow from theDay.inDate)
AND saleDay <= theDay.inDate
AND saleDay >= theDay.inDate - INTERVAL '3 weeks' /* since we want to include the entered day,
                                                     the INTERVAL needs one less week than we want */
Results:
| totalcount |
|------------|
| 18.5 |