Partition by weeknumber and next weeknumber - sql

I would like to count, for each customer, how many times a specific product was purchased in the past. I want to mark repeat purchases of the same product (where the second order date is close to the first order date) with rn = 2, so that I can count only the rows with rn = 1.
I created the following query and also included the current output. It partitions by week number to filter out purchases of the same product within the same week. It works quite well, but the behaviour is not exactly what I was hoping for.
create table sandbox.hm_orders as
select o.customer_id
,o.product_id
,o.order_date
,ROW_NUMBER() over (partition by o.customer_id, o.product_id,concat(EXTRACT(year FROM order_date),EXTRACT(week FROM order_date)) order by o.order_date asc) as rn
,concat(EXTRACT(year FROM o.order_date),'_',EXTRACT(week FROM o.order_date)) as weeknr
from datamarts.orders o
where o.label_id = 1
and o.order_date > '2020-01-01'
and o.payment_status = 'PAID'
Current output:
customer_id  product_ID  order_date           rn  weeknr
4708818      128703      2020-05-11 20:19:25  1   2020_20
4708818      128703      2020-05-12 22:13:09  2   2020_20
4708818      128703      2020-06-06 21:45:04  1   2020_23
4708818      274578      2020-07-02 22:02:10  1   2020_27
4753958      137482      2021-03-14 18:13:04  1   2021_10
4753958      137482      2021-03-15 17:29:03  1   2021_11
As you can see in the first two rows, the difference between them is 1 day, so the second row is marked with row number 2. For the last two rows the difference between the orders is also 1 day, but since the week numbers differ, the second row does not get row number 2.
Therefore I would like to find a way to also take the next week number into account in the partition. In this case the order placed in week 2021_11 should get row number 2, and the one in week 2021_10 should get row number 1.
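For reference, the two dates straddle an ISO week boundary, which is why the week-based partition misses them. A minimal PostgreSQL sketch (assuming the same EXTRACT(week ...) logic as above) showing that consecutive days can fall in different weeks:
SELECT EXTRACT(week FROM DATE '2021-03-14') AS week_of_sunday -- 10
      ,EXTRACT(week FROM DATE '2021-03-15') AS week_of_monday -- 11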
Desired output:
customer_id  product_ID  order_date           rn  weeknr
4708818      128703      2020-05-11 20:19:25  1   2020_20
4708818      128703      2020-05-12 22:13:09  2   2020_20
4708818      128703      2020-06-06 21:45:04  1   2020_23
4708818      274578      2020-07-02 22:02:10  1   2020_27
4753958      137482      2021-03-14 18:13:04  1   2021_10
4753958      137482      2021-03-15 17:29:03  2   2021_11

A somewhat complicated way is to calculate a ranking by summing over a calculated flag, and then use that rank in the row_number:
SELECT *
, ROW_NUMBER() over (partition by customer_id, product_id, rnk order by order_date asc) as rn
, TO_CHAR(order_date, 'YYYY_WW') as weeknr
FROM
(
SELECT *
, SUM(flag) over (partition by customer_id, product_id order by order_date asc) as rnk
FROM
(
SELECT
o.customer_id
, o.product_id
, o.order_date
, CASE WHEN 1 >= DATE_PART('day', o.order_date - LAG(o.order_date) over (partition by o.customer_id, o.product_id order by o.order_date asc))
THEN 0 ELSE 1 END AS flag
FROM orders o
WHERE o.label_id = 1
AND o.order_date > '2020-01-01'
AND o.payment_status = 'PAID'
) q1
) q2
customer_id  product_id  order_date           flag  rnk  rn  weeknr
4708818      128703      2020-05-11 20:19:25  1     1    1   2020_19
4708818      128703      2020-05-12 22:13:09  0     1    2   2020_19
4708818      128703      2020-06-06 21:45:04  1     2    1   2020_23
4708818      274578      2020-07-02 22:02:10  1     1    1   2020_27
4753958      137482      2021-03-14 18:13:04  1     1    1   2021_11
4753958      137482      2021-03-15 17:29:03  0     1    2   2021_11
Test on db<>fiddle here
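To get the per-customer purchase counts the question is ultimately after, a wrapper like the following could then keep only the first row of each group (a sketch, assuming the result of the query above is stored as sandbox.hm_orders, as in the question's create table):
SELECT customer_id
      ,product_id
      ,COUNT(*) AS purchase_count -- near-duplicate orders within a group count once
FROM sandbox.hm_orders
WHERE rn = 1
GROUP BY customer_id, product_id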

Related

Calculating average time between customer orders and average order value in Postgres

In PostgreSQL I have an orders table that represents orders made by customers of a store:
SELECT * FROM orders
order_id  customer_id  value   created_at
1         1            188.01  2020-11-24
2         2            25.74   2022-10-13
3         1            159.64  2022-09-23
4         1            201.41  2022-04-01
5         3            357.80  2022-09-05
6         2            386.72  2022-02-16
7         1            200.00  2022-01-16
8         1            19.99   2020-02-20
For a specified time range (e.g. 2022-01-01 to 2022-12-31), I need to find the following:
Average 1st order value
Average 2nd order value
Average 3rd order value
Average 4th order value
E.g. the 1st purchases for each customer are:
for customer_id 1, order_id 8 is their first purchase
customer 2, order 6
customer 3, order 5
So, the 1st-purchase average order value is (19.99 + 386.72 + 357.80) / 3 = $254.84
This needs to be found for the 2nd, 3rd and 4th purchases also.
I also need to find the average time between purchases:
order 1 to order 2
order 2 to order 3
order 3 to order 4
The final result would ideally look something like this:
order_number  AOV     av_days_since_last_order
1             254.84  0
2             300.00  28
3             322.22  21
4             350.00  20
Note that average days since last order for order 1 would always be 0 as it's the 1st purchase.
Thanks.
select order_number
,round(avg(value),2) as AOV
,coalesce(round(avg(days_between_orders),0),0) as av_days_since_last_order
from
(
select *
,row_number() over(partition by customer_id order by created_at) as order_number
,created_at - lag(created_at) over(partition by customer_id order by created_at) as days_between_orders
from t
) t
where created_at between '2022-01-01' and '2022-12-31'
group by order_number
order by order_number
order_number  aov     av_days_since_last_order
1             372.26  0
2             25.74   239
3             200.00  418
4             201.41  75
5             159.64  175
Fiddle
I suppose it should be something like this:
WITH prep_data AS (
  SELECT order_id,
         customer_id,
         ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY created_at) AS purchase_num,
         created_at,
         value
  FROM orders
  WHERE created_at BETWEEN :date_from AND :date_to
), prep_data2 AS (
  SELECT pd1.order_id,
         pd1.customer_id,
         pd1.purchase_num,
         pd1.created_at - pd2.created_at AS date_diff, -- days since the previous purchase
         pd1.value
  FROM prep_data pd1
  LEFT JOIN prep_data pd2
         ON pd1.customer_id = pd2.customer_id
        AND pd1.purchase_num = pd2.purchase_num + 1
)
SELECT purchase_num,
       avg(value) AS avg_val,
       avg(date_diff) AS avg_date_diff
FROM prep_data2
GROUP BY purchase_num
ORDER BY purchase_num

SQL query to find continuous local max, min of date based on category column

I have the following data set
Customer_ID Category FROM_DATE TO_DATE
1 5 1/1/2000 12/31/2001
1 6 1/1/2002 12/31/2003
1 5 1/1/2004 12/31/2005
2 7 1/1/2010 12/31/2011
2 7 1/1/2012 12/31/2013
2 5 1/1/2014 12/31/2015
3 7 1/1/2010 12/31/2011
3 7 1/5/2012 12/31/2013
3 5 1/1/2014 12/31/2015
The result I want to achieve is to find continuous local min/max date for Customers with the same category and identify any gap in dates:
Customer_ID FROM_Date TO_Date Category
1 1/1/2000 12/31/2001 5
1 1/1/2002 12/31/2003 6
1 1/1/2004 12/31/2005 5
2 1/1/2010 12/31/2013 7
2 1/1/2014 12/31/2015 5
3 1/1/2010 12/31/2011 7
3 1/5/2012 12/31/2013 7
3 1/1/2014 12/31/2015 5
My code works fine for customer 1 (it returns all 3 rows) and for customer 2 (it returns 2 rows with the min and max date for each category), but for customer 3 it cannot identify the gap between 12/31/2011 and 1/5/2012 for category 7.
Customer_ID FROM_Date TO_Date Category
3 1/1/2010 12/31/2013 7
3 1/1/2014 12/31/2015 5
Here is my code:
SELECT Customer_ID, Category, min(From_Date), max(To_Date) FROM
(
SELECT Customer_ID, Category, From_Date,To_Date
,row_number() over (order by member_id, To_Date) - row_number() over (partition by Customer_ID order by Category) as p
FROM FFS_SAMP
) X
group by Customer_ID,Category,p
order by Customer_ID,min(From_Date),Max(To_Date)
This is a type of gaps-and-islands problem. Probably the safest method is to use a cumulative max() to look for overlaps with previous records; where there is no overlap, a new "island" of records starts. So:
select customer_id, min(from_date), max(to_date), category
from (select t.*,
sum(case when prev_to_date >= from_date then 0 else 1 end) over
(partition by customer_id, category
order by from_date
) as grp
from (select t.*,
max(to_date) over (partition by customer_id, category
order by from_date
rows between unbounded preceding and 1 preceding
) as prev_to_date
from t
) t
) t
group by customer_id, category, grp;
Your attempt is quite close. You just need to fix the over() clause of the window functions:
select customer_id, category, min(from_date), max(to_date)
from (
select
fs.*,
row_number() over (partition by customer_id order by from_date)
- row_number() over (partition by customer_id, category order by from_date) as grp
from ffs_samp fs
) x
group by customer_id, category, grp
order by customer_id, min(from_date)
Note that this method assumes no gaps or overlaps in the periods of a given customer, as shown in your sample data.

Unable to resolve Rank Over Partition with multiple variables

I am trying to analyse a bunch of transaction data and have set up a series of different ranks to help me. The one I can't get right is the beneficiary rank. I want it to partition where there is a change in beneficiary chronologically rather than alphabetically.
Where the same beneficiary is paid from January to March and then again in June I would like the June to be classed a separate 'session'.
I am using Teradata SQL if that makes a difference.
I thought the solution was going to be a DENSE_RANK, but if I PARTITION BY (CustomerID, Beneficiary) ORDER BY SystemDate it counts up the number of months, and if I PARTITION BY (CustomerID) ORDER BY Beneficiary then it is not chronological; I need the highest rank to be the latest Beneficiary.
SELECT CustomerID, Beneficiary, Amount, SystemDate, Month
,RANK() OVER(PARTITION BY CustomerID ORDER BY SystemDate ASC) AS PaymentRank
,RANK() OVER(PARTITION BY CustomerID ORDER BY PaymentMonth ASC) AS MonthRank
,RANK() OVER(PARTITION BY CustomerID , Beneficiary ORDER BY SystemDate ASC) AS Beneficiary
,RANK() OVER(PARTITION BY CustomerID , Beneficiary, ROUND(TRNSCN_AMOUNT, 0) ORDER BY SYSTEM_DATE ASC) AS TransRank
FROM table ORDER BY CustomerID, PaymentRank
CustomerID Beneficiary Amount DateStamp Month PaymentRank MonthRank BeneficiaryRank TransactionRank
a aa 10 Jan 1 1 1 1
a aa 20 Feb 2 2 2 1
a aa 20 Mar 3 3 3 2
a aa 20 Apr 4 4 4 3
a bb 20 May 5 5 1 1
a bb 30 Jun 6 6 2 1
a aa 30 Jul 7 7 5 2
a aa 30 Aug 8 8 6 1
a cc 5 Sep 9 9 1 1
a cc 5 Oct 10 10 2 2
a cc 5 Nov 11 11 3 3
b cc 5 Dec 1 1 1 1
This is what I have so far; I want a column alongside this which will look like the below:
CustomerID Beneficiary Amount DateStamp Month NewRank
a aa 10 Jan 1 1
a aa 20 Feb 2 1
a aa 20 Mar 3 1
a aa 20 Apr 4 1
a bb 20 May 5 2
a bb 30 Jun 6 2
a aa 30 Jul 7 3
a aa 30 Aug 8 3
a cc 5 Sep 9 4
a cc 5 Oct 10 4
a cc 5 Nov 11 4
b cc 5 Dec 1 1
This is a type of gaps-and-islands problem. I would recommend lag() and a cumulative sum:
select t.*,
sum(case when prev_systemdate > systemdate - interval '1' month then 0 else 1 end) over (partition by customerid, beneficiary order by systemdate)
from (select t.*,
lag(systemdate) over (partition by customerid, beneficiary order by systemdate) as prev_systemdate
from t
) t
Credits to #Gordon and #dnoeth for providing the ideas and code to get me on the right track.
The below is mostly ripped from dnoeth, but I needed to add ROWS UNBOUNDED PRECEDING to get the aggregation correct; without it, it was just showing the total for the partition. I also changed the SystemDate to PaymentRank, as I had to work around duplicate entries on the same day.
SELECT dt.*,
-- now do a Cumulative Sum over those 0/1
SUM(Flag) OVER(PARTITION BY CustomerID ORDER BY PaymentRank ASC ROWS UNBOUNDED PRECEDING) AS NewRank
FROM
 (
   SELECT CustomerID, Beneficiary, Amount, SystemDate, Month
   ,RANK() OVER(PARTITION BY CustomerID ORDER BY SystemDate ASC) AS PaymentRank
   -- assign a 0 if current & previous Beneficiary are the same, otherwise 1
   ,CASE WHEN Beneficiary = MIN(Beneficiary) OVER (PARTITION BY CustomerID ORDER BY PaymentRank ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) THEN 0 ELSE 1 END AS Flag
   FROM table
 ) AS dt
ORDER BY CustomerID, PaymentRank
The inner query sets a flag whenever the beneficiary changes. The outer query then does a cumulative sum on those.
I was unsure what the unbounded preceding was doing, and #dnoeth has a great explanation here. Below is taken from that explanation:
•UNBOUNDED PRECEDING, all rows before the current row -> fixed
•UNBOUNDED FOLLOWING, all rows after the current row -> fixed
•x PRECEDING, x rows before the current row -> relative
•y FOLLOWING, y rows after the current row -> relative
SELECT dt.*,
-- now do a Cumulative Sum over those 0/1
SUM(flag)
OVER(PARTITION BY CustomerID
ORDER BY SystemDate ASC
,flag DESC -- needed if the order by columns are not unique
ROWS UNBOUNDED PRECEDING) AS NewRank
FROM
(
SELECT CustomerID, Beneficiary, Amount, SystemDate, Month
,RANK() OVER(PARTITION BY CustomerID ORDER BY SystemDate ASC) AS PaymentRank
,RANK() OVER(PARTITION BY CustomerID ORDER BY PaymentMonth ASC) AS MonthRank
,RANK() OVER(PARTITION BY CustomerID , Beneficiary ORDER BY SystemDate ASC) AS Beneficiary
,RANK() OVER(PARTITION BY CustomerID , Beneficiary, ROUND(TRNSCN_AMOUNT, 0) ORDER BY SYSTEM_DATE ASC) AS TransRank
-- assign a 0 if current & previous Beneficiary are the same, otherwise 1
,CASE WHEN Beneficiary = LAG(Beneficiary) OVER(PARTITION BY CustomerID ORDER BY SystemDate) THEN 0 ELSE 1 END AS flag
FROM table
) AS dt
ORDER BY CustomerID, PaymentRank
Your problem with Gordon's query is probably caused by your Teradata release; LAG is only supported in 16.10+. But there's a simple workaround:
LAG(Beneficiary) OVER(PARTITION BY CustomerID ORDER BY SystemDate)
--is equivalent to
MIN(Beneficiary) OVER(PARTITION BY CustomerID ORDER BY SystemDate
ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
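The workaround relies on the fact that the ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING frame contains exactly one row, so MIN (or MAX) over it simply returns the previous row's value. A minimal sketch (hypothetical table t, not from the question) using it the same way as LAG:
SELECT CustomerID, SystemDate, Beneficiary
-- previous beneficiary per customer, without LAG
,MIN(Beneficiary) OVER(PARTITION BY CustomerID ORDER BY SystemDate
     ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS prev_beneficiary
FROM t
-- For a customer's first row the frame is empty, so prev_beneficiary is NULL; the
-- "Beneficiary = previous" comparison is then not true and the flag becomes 1,
-- which correctly starts a new session.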

Additional condition within partition over

https://www.db-fiddle.com/f/rgLXTu3VysD3kRwBAQK3a4/3
My problem here is that I want the partitioning to start counting the rows only from a certain time range.
In this example, if I added rn = 1 at the end, order_id = 5 would be excluded from the results (because the partition is ordered by paid_date and there's order_id = 6 with an earlier date), but it shouldn't be, as I want the time range for the partition to start from '2019-01-10'.
With the condition rn = 1 the expected output should be order_id 3, 5, 11, 15; currently it's only 3, 11, 15.
It should include only orders with is_paid = 0 that are the first ones within the given time range (if there's a preceding order with is_paid = 1 it shouldn't be counted).
Use a correlated subquery with NOT EXISTS:
DEMO
SELECT order_id, customer_id, amount, is_paid, paid_date, rn FROM (
    SELECT o.*,
           ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY paid_date, order_id) rn
    FROM orders o
    WHERE paid_date between '2019-01-10' and '2019-01-15'
) x
WHERE rn = 1
  AND NOT EXISTS (SELECT 1 FROM orders o1 WHERE x.order_id = o1.order_id AND o1.is_paid = 1)
OUTPUT:
order_id customer_id amount is_paid paid_date rn
3 101 30 0 10/01/2019 00:00:00 1
5 102 15 0 10/01/2019 00:00:00 1
11 104 31 0 10/01/2019 00:00:00 1
15 105 11 0 10/01/2019 00:00:00 1
If priority should be given to order_id, then put it before paid_date in the window function's ORDER BY clause; this will solve your issue.
SELECT order_id, customer_id, amount, is_paid, paid_date, rn FROM (
SELECT o.*,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY order_id,paid_date) rn
FROM orders o
) x WHERE is_paid = 0 and paid_date between '2019-01-10' and '2019-01-15' and rn=1
Since you need the paid date to be ordered first, you need to apply a WHERE condition inside the subquery so that dates outside the range don't interrupt the partition's ordering.
SELECT order_id, customer_id, amount, is_paid, paid_date, rn FROM (
SELECT o.*,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY paid_date, order_id) rn
FROM orders o
where paid_date between '2019-01-10' and '2019-01-15'
) x WHERE is_paid = 0 and rn=1

SQL Server order by closest value to zero

I have some duplicate values in a table and I'm trying to use ROW_NUMBER to filter them out. I want to order the rows using DATEDIFF and rank the results by the value closest to zero, but I'm struggling to account for negative values.
Below is a sample of the data and my current Row_Number field (rn) column:
PersonID SurveyDate DischargeDate DaysToSurvey rn
93638 10/02/2015 30/03/2015 -48 1
93638 27/03/2015 30/03/2015 -3 2
250575 23/10/2014 29/10/2014 -6 1
250575 19/11/2014 24/11/2014 -5 2
203312 23/01/2015 26/01/2015 -3 1
203312 26/01/2015 26/01/2015 0 2
387737 19/02/2015 26/02/2015 -7 1
387737 26/02/2015 26/02/2015 0 2
751915 02/04/2015 04/04/2015 -2 1
751915 10/04/2015 25/03/2015 16 2
712364 24/01/2015 30/01/2015 -6 1
712364 26/01/2015 30/01/2015 -4 2
My select statement for the above is:
select
PersonID, SurveyDate, DischargeDate,
datediff(dd,dischargedate,surveydate) days,
ROW_NUMBER () over (partition by PersonID
order by datediff(dd, dischargedate, surveydate) asc) as rn
from
Table 1
order by
PersonID, rn
What I want to do is change the sort order so it displays like this:
PersonID SurveyDate DischargeDate DaysToSurvey rn
93638 27/03/2015 30/03/2015 -3 1
93638 10/02/2015 30/03/2015 -48 2
250575 19/11/2014 24/11/2014 -5 1
250575 23/10/2014 29/10/2014 -6 2
So the DaysToSurvey value that is closest to the DischargeDate is ranked as rn 1.
Is this possible?
You're close. Just add ABS() to calculate absolute values of the differences:
ROW_NUMBER () OVER (
PARTITION BY PersonID
ORDER BY abs(datediff(dd, dischargedate, surveydate)) asc
) AS rn
You could use abs to get the distance from zero:
select PersonID, SurveyDate, DischargeDate, datediff(dd,dischargedate,surveydate) days,
ROW_NUMBER () over (partition by PersonID order by abs(datediff(dd,dischargedate,surveydate)) asc) as rn
from Table 1
order by PersonID, rn
Add ABS():
select PersonID, SurveyDate, DischargeDate, datediff(dd,dischargedate,surveydate) days,
ROW_NUMBER () over (partition by PersonID order by ABS(datediff(dd,dischargedate,surveydate)) asc) as rn
from Table 1
order by PersonID, rn
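For completeness, a self-contained sketch (SQL Server, using a hypothetical #survey temp table rather than the original Table 1) showing the ABS() ordering end to end:
CREATE TABLE #survey (PersonID int, SurveyDate date, DischargeDate date);
INSERT INTO #survey VALUES
    (93638, '20150210', '20150330'), -- 48 days before discharge
    (93638, '20150327', '20150330'); -- 3 days before discharge

SELECT PersonID, SurveyDate, DischargeDate,
       DATEDIFF(dd, DischargeDate, SurveyDate) AS DaysToSurvey,
       ROW_NUMBER() OVER (PARTITION BY PersonID
                          ORDER BY ABS(DATEDIFF(dd, DischargeDate, SurveyDate)) ASC) AS rn
FROM #survey;
-- The -3 row gets rn = 1 and the -48 row gets rn = 2, matching the desired output.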