How to differentiate iterations using a date field in BigQuery - SQL

I have a process that occurs every 30 days but can take a few days to complete.
How can I differentiate between each iteration in order to sum the output of the process?
For example, the output I expect is:
Name         Date        amount  iteration (optional)
Sophia Liu   2016-01-01  4       1
Sophia Liu   2016-02-01  5       2
Nikki Leith  2016-01-02  5       1
Nikki Leith  2016-02-01  10      2
I tried using the LAG function on the date field and taking the difference between that column and the date column.
WITH base AS
(SELECT 'Sophia Liu' as name, DATE '2016-01-01' as date, 3 as amount
UNION ALL SELECT 'Sophia Liu', DATE '2016-01-02', 1
UNION ALL SELECT 'Sophia Liu', DATE '2016-02-01', 3
UNION ALL SELECT 'Sophia Liu', DATE '2016-02-02', 2
UNION ALL SELECT 'Nikki Leith', DATE '2016-01-02', 5
UNION ALL SELECT 'Nikki Leith', DATE '2016-02-01', 5
UNION ALL SELECT 'Nikki Leith', DATE '2016-02-02', 3
UNION ALL SELECT 'Nikki Leith', DATE '2016-02-03', 1
UNION ALL SELECT 'Nikki Leith', DATE '2016-02-04', 1)
select
  name
  ,date
  ,lag(date) over (partition by name order by date) as lag_func
  ,date_diff(date, lag(date) over (partition by name order by date), day) as date_difference
  ,case when date_diff(date, lag(date) over (partition by name order by date), day) >= 10
        or date_diff(date, lag(date) over (partition by name order by date), day) is null
        then true else false end as new_iteration
  ,amount
from base

Edited answer
After your clarification and looking at what's actually in your SQL code, I'm guessing you are looking for a solution to what's called a gaps-and-islands problem. That is, you want to identify the "islands" of activity and sum the amount for each iteration or island. Taking your example, you can first flag the start of each new iteration (a "gap" of ten days or more) and then take a running sum of those flags to create a unique iteration ("island") identifier for each user. You can then use that identifier to perform a SUM().
-- continuing the WITH clause from the query above, where `base` is defined
, gaps as (
  select
    name,
    date,
    amount,
    if(date_diff(date, lag(date, 1) over(partition by name order by date), DAY) >= 10, 1, 0) as new_iteration
  from base
),
islands as (
  select
    *,
    1 + sum(new_iteration) over(partition by name order by date) as iteration_id
  from gaps
)
select
  *,
  sum(amount) over(partition by name, iteration_id) as iteration_amount
from islands
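If you want exactly the one-row-per-iteration output shown in the question, you can aggregate over the same islands CTE instead of using a window function (a small addition to the answer above):
select
  name,
  min(date) as date,
  sum(amount) as amount,
  iteration_id as iteration
from islands
group by name, iteration_id
order by name, iteration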
Previous answer
Sounds like you just need a RANK() to number the iterations in your window functions. Depending on your needs, you can then sum cumulative or total amounts with similar window functions. Something like this:
select
name
,date
,rank() over (partition by name order by date) as iteration
,sum(amount) over (partition by name order by date) as cumulative_amount
,sum(amount) over (partition by name) as total_amount
,amount
from base

Related

How to filter the last 7 days based on the previous query? - BigQuery

Hi, I just want to ask how to resolve this problem, using the query indicated below as an example.
In the next query I prepare, I want to filter to the last 7 days of the delivery date. I can't use current_date because the data's maximum date lags well behind it.
Assume the current date is 7/12/2022 but the query shows a maximum date of 7/07/2022: how can I filter the dates from 7/1/2022 to 7/07/2022?
, Datas1 as
(select distinct (delivery_due_date) as delivery_date
, Specialist
, Id_number
, Staff_Total as Total_Items
from joining
where Delivery_Due_Date is not null
)
I tried using the MAX function in the WHERE clause, but I get an error. Please help me.
You can't use an aggregate like MAX() directly in a WHERE clause, so compute it in its own block instead. In the query below I created examples of such data in the first block, performed the select on that data in the second block, extracted the maximum delivery date in the third block, and restricted the last block to the 7 days leading up to that maximum.
WITH joining AS (
SELECT '2022-07-01' AS delivery_due_date, 'ABC' as Specialist, 222 as Id_number, 21 as Staff_Total union all
SELECT '2022-07-07' AS delivery_due_date, 'ABC2' as Specialist, 223 as Id_number, 01 as Staff_Total union all
SELECT '2022-07-15' AS delivery_due_date, 'ABC4' as Specialist, 212 as Id_number, 25 as Staff_Total union all
SELECT '2022-07-20' AS delivery_due_date, 'AB5C' as Specialist, 224 as Id_number, 15 as Staff_Total union all
SELECT '2022-07-05' AS delivery_due_date, 'ABC7' as Specialist, 226 as Id_number, 87 as Staff_Total
),
Datas1 as (
select distinct delivery_due_date as delivery_date, Specialist, Id_number, Staff_Total as Total_Items
from joining
where Delivery_Due_Date is not null
),
Datas2 as (
select max(delivery_date) as ddd from Datas1
)
select Datas1.*
from Datas1, Datas2
-- a 6-day lookback yields a 7-day window inclusive of the max date
where date(delivery_date) between date_sub(date(Datas2.ddd), interval 6 day) and date(Datas2.ddd)
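As a variant (my own sketch, not part of the original answer), the maximum can also be pulled in with scalar subqueries instead of a separate CTE and cross join:
select *
from Datas1
-- the scalar subquery computes the max once; keep the 7 days up to and including it
where date(delivery_date) between
      date_sub((select max(date(delivery_date)) from Datas1), interval 6 day)
      and (select max(date(delivery_date)) from Datas1)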

date window function

I don't know how to solve this problem. Maybe you can point me in the right direction or give a link.
I have a table:
id Date
23 01.01.2020
23 03.01.2020
23 04.01.2020
56 07.01.2020
56 08.01.2020
87 11.01.2020
23 12.01.2020
23 18.01.2020
I want to aggregate the data to (id, Date_min) and add a new column, like this:
id Date_min Date_new
23 01.01.2020 07.01.2020
56 07.01.2020 11.01.2020
87 11.01.2020 12.01.2020
23 12.01.2020 18.01.2020
In the column Date_new I want to see the next user's first date. If there is no next user, use the user's max date.
LEAD will give you the next date, but we also have the slight sticking problem that your ID repeats, so we need something to make the second 23 distinct from the first. For that I guess we can establish a counter that ticks up every time the ID changes:
with a as(
select '23' as id, '01.01.2020' as "date" union all
select '23' as id, '03.01.2020' as "date" union all
select '23' as id, '04.01.2020' as "date" union all
select '56' as id, '07.01.2020' as "date" union all
select '56' as id, '08.01.2020' as "date" union all
select '87' as id, '11.01.2020' as "date" union all
select '23' as id, '12.01.2020' as "date" union all
select '23' as id, '18.01.2020' as "date"
), b as (
SELECT *, LAG(id) OVER(ORDER BY "date") as last_id FROM a
), c AS(
SELECT *,
LEAD("date") OVER(ORDER BY "date") as next_date,
SUM(CASE WHEN last_id <> id THEN 1 ELSE 0 END) OVER(ORDER BY "date" ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) id_ctr
FROM b
)
SELECT id, MIN("date"), MAX(next_date)
FROM c
GROUP BY id, id_ctr
I haven't got a PG instance to test this on, but it works in SQL Server and I'm pretty sure that PG supports everything used here - there isn't any SQL Server specific stuff.
a takes the place of your table - you can drop it from your query and start straight away with b as (select ... from yourtablenamehere)
b calculates the previous ID; we'll use this to detect whether the id has changed between the current row and the previous row. If it changes we put a 1, otherwise a 0. When these are summed as a running total, the counter effectively ticks up every time the ID changes, so we can group by this counter as well as the ID to split our two 23s apart. We need to do this in a separate step because window functions can't be nested.
c takes the last_id and does the running total. It also computes next_date with a simple window function that pulls the date from the following row (rows ordered by date). The ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is technically unnecessary, as it's the default frame for a SUM OVER with ORDER BY, but I find being explicit helps document the intent and makes it easier to change if needed.
Then all that is required is to select the id, min date and max next_date, grouping by the counter too so the 23s are split apart - you're allowed to group by more columns than you select, but not the other way round.
This is a particularly simple type of gaps-and-islands problem.
You can use lag() to determine the first row of each bunch of rows and then lead() to get date_new:
select id, date as date_min,
lead(date, 1, max_date) over (order by date) as date_max
from (select t.*,
lag(id) over (order by date) as prev_id,
max(date) over () as max_date
from t
) t
where prev_id is null or prev_id <> id;
Here is a db<>fiddle.
Three window functions and no aggregation: this should be by far the fastest approach to this problem.

BigQuery: How to merge HLL Sketches over a window function? (Count distinct values over a rolling window)

Example relevant table schema:
+---------------------------+-------------------+
| activity_date - TIMESTAMP | user_id - STRING |
+---------------------------+-------------------+
| 2017-02-22 17:36:08 UTC | fake_id_i24385787 |
+---------------------------+-------------------+
| 2017-02-22 04:27:08 UTC | fake_id_234885747 |
+---------------------------+-------------------+
| 2017-02-22 08:36:08 UTC | fake_id_i24385787 |
+---------------------------+-------------------+
I need to count active distinct users over a large data set over a rolling time period (90 days), and am running into issues due to the size of the dataset.
At first, I attempted to use a window function, similar to the answer here.
https://stackoverflow.com/a/27574474
WITH
daily AS (
SELECT
DATE(activity_date) day,
user_id
FROM
`fake-table`)
SELECT
day,
SUM(APPROX_COUNT_DISTINCT(user_id)) OVER (ORDER BY day ROWS BETWEEN 89 PRECEDING AND CURRENT ROW) ninety_day_window_approx
FROM
daily
GROUP BY
1
ORDER BY
1 DESC
However, this resulted in getting the distinct number of users per day and then summing those up - but a user active on several days within the window is counted once for each of those days, so this is not an accurate measure of distinct users over 90 days.
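To see the overcount concretely (a tiny made-up example, not from the original post): one user active on two consecutive days is counted once per day, so summing the daily distinct counts reports 2 while the true distinct count is 1.
WITH daily AS (
  SELECT DATE '2019-01-01' AS day, 'u1' AS user_id UNION ALL
  SELECT DATE '2019-01-02', 'u1'
)
SELECT
  (SELECT COUNT(DISTINCT user_id) FROM daily) AS true_distinct,  -- 1
  (SELECT SUM(cnt)
   FROM (SELECT APPROX_COUNT_DISTINCT(user_id) AS cnt
         FROM daily
         GROUP BY day)) AS summed_daily_distincts                -- 2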
The next thing I tried was to use the following solution
https://stackoverflow.com/a/47659590
- concatenating all the distinct user_ids for each window into one string and then counting the distinct values within it.
WITH daily AS (
SELECT date(activity_date) day, STRING_AGG(DISTINCT user_id) users
FROM `fake-table`
GROUP BY day
), temp2 AS (
SELECT
day,
STRING_AGG(users) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) users
FROM daily
)
SELECT day,
(SELECT APPROX_COUNT_DISTINCT(id) FROM UNNEST(SPLIT(users)) AS id) Unique90Days
FROM temp2
order by 1 desc
However this quickly ran out of memory with anything large.
Next was to use an HLL sketch to represent the distinct IDs in a much smaller value, so memory would be less of an issue. I thought my problems were solved, but I'm getting an error when running the following: "Function MERGE_PARTIAL is not supported." I tried MERGE as well and got the same error. It only happens when using the window function; creating the sketches for each day's values works fine.
I read through the BigQuery Standard SQL documentation and don't see anything about using HLL_COUNT.MERGE_PARTIAL or HLL_COUNT.MERGE with window functions. Presumably this should take the 90 sketches and combine them into one HLL sketch representing the distinct values across the 90 original sketches?
WITH
daily AS (
SELECT
DATE(activity_date) day,
HLL_COUNT.INIT(user_id) sketch
FROM
`fake-table`
GROUP BY
1
ORDER BY
1 DESC),
rolling AS (
SELECT
day,
HLL_COUNT.MERGE_PARTIAL(sketch) OVER (ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) rolling_sketch
FROM daily)
SELECT
day,
HLL_COUNT.EXTRACT(rolling_sketch)
FROM
rolling
ORDER BY
1
"Image of the error - Function MERGE_PARTIAL is not supported"
Any ideas why this error happens or how to adjust?
Below is for BigQuery Standard SQL and does exactly what you want with the use of window functions:
#standardSQL
SELECT day,
(SELECT HLL_COUNT.MERGE(sketch) FROM UNNEST(rolling_sketch_arr) sketch) rolling_sketch
FROM (
SELECT day,
ARRAY_AGG(ids_sketch) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) rolling_sketch_arr
FROM (
SELECT day, HLL_COUNT.INIT(id) ids_sketch
FROM `project.dataset.table`
GROUP BY day
)
)
You can test and play with the above using dummy data, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, DATE '2019-01-01' day UNION ALL
SELECT 2, '2019-01-01' UNION ALL
SELECT 3, '2019-01-01' UNION ALL
SELECT 1, '2019-01-02' UNION ALL
SELECT 4, '2019-01-02' UNION ALL
SELECT 2, '2019-01-03' UNION ALL
SELECT 3, '2019-01-03' UNION ALL
SELECT 4, '2019-01-03' UNION ALL
SELECT 5, '2019-01-03' UNION ALL
SELECT 1, '2019-01-04' UNION ALL
SELECT 4, '2019-01-04' UNION ALL
SELECT 2, '2019-01-05' UNION ALL
SELECT 3, '2019-01-05' UNION ALL
SELECT 5, '2019-01-05' UNION ALL
SELECT 6, '2019-01-05'
)
SELECT day,
(SELECT HLL_COUNT.MERGE(sketch) FROM UNNEST(rolling_sketch_arr) sketch) rolling_sketch
FROM (
SELECT day,
ARRAY_AGG(ids_sketch) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 2 PRECEDING AND CURRENT ROW) rolling_sketch_arr
FROM (
SELECT day, HLL_COUNT.INIT(id) ids_sketch
FROM `project.dataset.table`
GROUP BY day
)
)
-- ORDER BY day
with the result:
Row day rolling_sketch
1 2019-01-01 3
2 2019-01-02 4
3 2019-01-03 5
4 2019-01-04 5
5 2019-01-05 6
Combine HLL_COUNT.INIT and HLL_COUNT.MERGE. This solution uses a 90-day cross join with GENERATE_ARRAY(1, 90) instead of OVER.
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, HLL_COUNT.MERGE(sketch) unique_90_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
FROM (
SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp

Finding the lowest two minimum values and the difference between the two in SQL Server?

I have a transaction table where I have to find the first and second transaction date for every customer. Finding the first date is simple with the MIN() function, but finding the second - and in particular the difference between the two dates - is proving challenging, and I haven't found a feasible way:
select a.customer_id, a.transaction_date, a.Row_Count2
from ( select
transaction_date as transaction_date,
reference_no as customer_id,
row_number() over (partition by reference_no
ORDER BY reference_no, transaction_date) AS Row_Count2
from transaction_detail
) a
where a.Row_Count2 < 3
ORDER BY a.customer_id, a.transaction_date, a.Row_Count2
This gives me the first two transaction dates as separate rows per customer. What I want is the following columns:
CustomerID | FirstDateofPurchase | SecondDateofPurchase | Diff. b/w Second & First Date
You can use the window functions LEAD/LAG to return the results you are looking for.
First find each row's following transaction date by reference number using LEAD, and generate a row number for each row using your original logic. You can then take the date difference on the rows with row number 1 in the result set.
Ex (I'm not excluding same-day transactions here - they are treated as separate rows, with the row number generated over the result set from your query above; you can easily change the SQL below to treat them as one and remove them so that the next distinct date becomes the second date - see the sketch after this answer):
declare #tbl table(reference_no int, transaction_date datetime)
insert into #tbl
select 1000, '2018-07-11'
UNION ALL
select 1001, '2018-07-12'
UNION ALL
select 1001, '2018-07-12'
UNIOn ALL
select 1001, '2018-07-13'
UNIOn ALL
select 1002, '2018-07-11'
UNIOn ALL
select 1002, '2018-07-15'
select customer_id, transaction_date as firstdate,
transaction_date_next seconddate,
datediff(day, transaction_date, transaction_date_next) diff_in_days
from
(
select reference_no as customer_id, transaction_date,
lead(transaction_date) over (partition by reference_no
order by transaction_date) transaction_date_next,
row_number() over (partition by reference_no ORDER BY transaction_date) AS Row_Count
from @tbl
) src
where Row_Count = 1
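A hedged sketch of the same-day variant mentioned above - collapsing duplicate (customer, date) pairs first so that the second date is the next distinct day (reusing the @tbl sample data):
select customer_id, transaction_date as firstdate,
       transaction_date_next as seconddate,
       datediff(day, transaction_date, transaction_date_next) as diff_in_days
from
(
    select reference_no as customer_id, transaction_date,
           lead(transaction_date) over (partition by reference_no
                                        order by transaction_date) as transaction_date_next,
           row_number() over (partition by reference_no order by transaction_date) as Row_Count
    -- dedupe same-day transactions before windowing
    from (select distinct reference_no, transaction_date from @tbl) dedup
) src
where Row_Count = 1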
You can do this with CROSS APPLY.
SELECT td.reference_no AS customer_id,
       MIN(ca.transaction_date) AS first_date,
       MAX(ca.transaction_date) AS second_date,
       DATEDIFF(day, MIN(ca.transaction_date), MAX(ca.transaction_date)) AS diff_in_days
FROM transaction_detail td
CROSS APPLY (SELECT TOP 2 *
             FROM transaction_detail
             WHERE reference_no = td.reference_no
             ORDER BY transaction_date) ca
GROUP BY td.reference_no

alternative to lag SQL command

I have a table like this:
Month  Book_Type  sold_in_Dollars
Jan    A          100
Jan    B          120
Feb    A          50
Mar    A          60
Mar    B          30
and so on
I have to calculate the expected sales for each month and book type based on the previous two months' sales.
So for March and type A it would be (100+50)/2 = 75.
For March and type B it is 120/1, since there is no data for February.
I was trying to use the lag function but it wouldn't work since there is data missing in a few rows.
Any ideas on this?
Since it needs to ignore missing months, this should probably work. I don't have a database to test it on at the moment, but I'll give it another go in the morning:
select
month,
book_type,
sold_in_dollars,
avg(sold_in_dollars) over (partition by book_type order by month
range between interval '2' month preceding and interval '1' month preceding) as avg_sales
from myTable;
This sort of assumes that month has a date datatype and can be sorted on... if it's just a text string then you'll need something else.
Normally you could just use rows between 2 preceding and 1 preceding, but this will take the two previous data points, and not necessarily the two previous months if there are rows missing.
You could work it out with lag but it would be a bit more complicated.
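For what it's worth, a rough sketch of that lag-based route (my own illustration, not from the answers here; it assumes Oracle-style months_between and that month is a real date column): each lagged value is only counted when it actually falls within the two preceding months.
select month, book_type, sold_in_dollars,
       -- include each lagged value only if it lies within the two preceding months
       ( case when months_between(month, lag(month, 1) over (partition by book_type order by month)) <= 2
              then lag(sold_in_dollars, 1) over (partition by book_type order by month) else 0 end
       + case when months_between(month, lag(month, 2) over (partition by book_type order by month)) <= 2
              then lag(sold_in_dollars, 2) over (partition by book_type order by month) else 0 end )
       / nullif( case when months_between(month, lag(month, 1) over (partition by book_type order by month)) <= 2 then 1 else 0 end
               + case when months_between(month, lag(month, 2) over (partition by book_type order by month)) <= 2 then 1 else 0 end
               , 0) as expected_sales  -- null when there is no prior data at all
from myTable;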
As far as I know, you can give a default value to lag():
SELECT Book_Type, Month,
       (lag(sold_in_Dollars, 1, 0) OVER (PARTITION BY Book_Type ORDER BY Month)
      + lag(sold_in_Dollars, 2, 0) OVER (PARTITION BY Book_Type ORDER BY Month)) / 2 AS expected_sales
FROM your_table
(Assuming Month column doesn't really contain JAN or FEB but real, orderable dates.)
What about something like this (forgive the SQL Server syntax, but you get the idea):
Select Book_type, AVG(sold_in_dollars)
from MyTable
where Month in (MONTH(DATEADD(mm, -1, GETDATE())), MONTH(DATEADD(mm, -2, GETDATE())))
group by Book_type
A partition outer join can help create the missing data. Create a set of months, join those values to each row by month, and perform the join once for each book type. I created the months January through April in this example:
with test_data as
(
select to_date('01-JAN-2010', 'DD-MON-YYYY') month, 'A' book_type, 100 sold_in_dollars from dual union all
select to_date('01-JAN-2010', 'DD-MON-YYYY') month, 'B' book_type, 120 sold_in_dollars from dual union all
select to_date('01-FEB-2010', 'DD-MON-YYYY') month, 'A' book_type, 50 sold_in_dollars from dual union all
select to_date('01-MAR-2010', 'DD-MON-YYYY') month, 'A' book_type, 60 sold_in_dollars from dual union all
select to_date('01-MAR-2010', 'DD-MON-YYYY') month, 'B' book_type, 30 sold_in_dollars from dual
)
select book_type, month, sold_in_dollars
,case when denominator = 0 then 'N/A' else to_char(numerator / denominator) end expected_sales
from
(
select test_data.book_type, all_months.month, sold_in_dollars
,count(sold_in_dollars) over
(partition by book_type order by all_months.month rows between 2 preceding and 1 preceding) denominator
,sum(sold_in_dollars) over
(partition by book_type order by all_months.month rows between 2 preceding and 1 preceding) numerator
from
(
select add_months(to_date('01-JAN-2010', 'DD-MON-YYYY'), level-1) month from dual connect by level <= 4
) all_months
left outer join test_data partition by (test_data.book_type) on all_months.month = test_data.month
)
order by book_type, month