Getting duplication WITH partition by - sql

I want to know which customers who ordered in June 2020 also ordered in June 2021. My code returns the correct DISTINCT orders, but discounted sales is incorrect for customers who placed more than one order in either year. For example, a customer who placed one order in 2020 and four orders in 2021 has 2020 discounted sales at 4x the actual amount. The four orders in 2021 have four rows, and the one 2020 order populates against each. I saw this by using ROW_NUMBER () which exposed the underlying problem. I cannot use DISTINCT with discounted sales because customers do place multiple orders for identical dollar amounts. How do I get the exact discounted sales using standard SQL for BQ?
SELECT
DISTINCT ly.cuid AS cuid,
COUNT(DISTINCT ly.order_id) OVER (PARTITION BY ly.cuid) AS ly_orders,
SUM(ly.discounted_sales) OVER (PARTITION BY ly.cuid) AS ly_demand,
COUNT(DISTINCT ty.order_id) OVER (PARTITION BY ty.cuid) AS ty_orders,
SUM(ty.discounted_sales) OVER (PARTITION BY ly.cuid) AS ty_demand
FROM table ly
LEFT JOIN table ty
ON ly.cuid = ty.cuid
WHERE ly.order_date BETWEEN '2020-06-01' AND '2020-06-30'
AND ty.order_date BETWEEN '2021-06-01' AND '2021-06-30'
AND ly.financial_status <> 'credit'
AND ty.financial_status <> 'credit'
AND ly.discounted_sales >0
AND ty.discounted_sales >0
AND ly.channel = 'b2b'
AND ty.channel = 'b2b'
ORDER BY ly.cuid asc
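[Results]
+------+-----------+-----------+-----------+------------+--------------------+
| cuid | ly_orders | ly_demand | ty_orders | ty_demand  | comments           |
+------+-----------+-----------+-----------+------------+--------------------+
| D    | 1         | 22,466.40 | 4         | 154,596.24 | ly is 4x actual    |
| F    | 2         | 2,573.20  | 1         | 1,944.40   | ty is 2x actual    |
| G    | 1         | 32,134.40 | 4         | 1,632.00   | ly is 4x actual    |
| I    | 2         | 757.56    | 1         | 730.56     | ty is 2x actual    |
| J    | 2         | 54,859.00 | 2         | 23,822.32  | both are 2x actual |
+------+-----------+-----------+-----------+------------+--------------------+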
THIS WORKED:
WITH prior_period AS (
SELECT
DISTINCT cuid AS ly_cuid,
COUNT(DISTINCT order_id) OVER (PARTITION BY cuid) AS ly_orders,
SUM(discounted_sales) OVER (PARTITION BY cuid) AS ly_demand
FROM TABLE
WHERE EXTRACT (YEAR FROM order_date) = 2020 AND EXTRACT(MONTH FROM order_date) = 6
AND financial_status <> 'credit'
AND discounted_sales >0
AND channel = 'b2b'
GROUP BY cuid, order_id, discounted_sales
ORDER BY cuid asc),
this_period AS (
SELECT
DISTINCT cuid AS ty_cuid,
COUNT(DISTINCT order_id) OVER (PARTITION BY cuid) AS ty_orders,
SUM(discounted_sales) OVER (PARTITION BY cuid) AS ty_demand
FROM TABLE
WHERE EXTRACT (YEAR FROM order_date) = 2021 AND EXTRACT(MONTH FROM order_date) = 6
AND financial_status <> 'credit'
AND discounted_sales >0
AND channel = 'b2b'
GROUP BY cuid, order_id, discounted_sales
ORDER BY cuid asc)
SELECT *
FROM prior_period ly
JOIN this_period ty ON ly.ly_cuid = ty.ty_cuid
ORDER BY ly.ly_cuid

Updated with your schema and approximate data:
Try this...
WITH periods AS (
SELECT cuid AS cuid
, COUNT(*) AS orders
, SUM(discounted_sales) AS demand
, EXTRACT(YEAR FROM order_date) AS yr
FROM demand2
WHERE EXTRACT(YEAR FROM order_date) IN (2020, 2021) AND EXTRACT(MONTH FROM order_date) = 6
AND financial_status <> 'credit'
AND discounted_sales > 0
AND channel = 'b2b'
GROUP BY cuid, EXTRACT(YEAR FROM order_date)
)
SELECT ly.cuid
, ly.orders AS ly_orders
, ly.demand AS ly_demand
, ty.orders AS ty_orders
, ty.demand AS ty_demand
FROM periods AS ly
JOIN periods AS ty
ON ly.cuid = ty.cuid
AND ly.yr = 2020
AND ty.yr = 2021
ORDER BY ly.cuid
;
The result:
+------+-----------+-----------+-----------+-----------+
| cuid | ly_orders | ly_demand | ty_orders | ty_demand |
+------+-----------+-----------+-----------+-----------+
| D | 1 | 5616.60 | 4 | 154596.24 |
| F | 2 | 2573.20 | 1 | 972.20 |
| G | 1 | 8033.60 | 4 | 1632.56 |
| I | 2 | 757.56 | 1 | 365.28 |
| J | 2 | 27429.50 | 2 | 11911.16 |
+------+-----------+-----------+-----------+-----------+
Here's a similar example with data, SQL and results to show both the incorrect result and the correct result.
The data:
SELECT * FROM demand ORDER BY account_id, period;
+----+------------+--------+--------+
| id | account_id | period | demand |
+----+------------+--------+--------+
| 1 | 1 | 202005 | 100 |
| 2 | 1 | 202005 | 120 |
| 3 | 1 | 202105 | 105 |
| 4 | 1 | 202105 | 125 |
| 5 | 1 | 202105 | 30 |
| 6 | 2 | 202005 | 200 |
| 7 | 2 | 202105 | 240 |
+----+------------+--------+--------+
The incorrect SQL, without SUMs to just show the join behavior:
SELECT t1.id, t1.account_id, t1.period, t1.demand AS demand1
, t2.id, t2.period, t2.demand AS demand2
FROM demand AS t1
LEFT JOIN demand AS t2
ON t1.account_id = t2.account_id
AND t2.period = 202105
WHERE t1.period = 202005
ORDER BY t1.account_id, t1.period, demand1, demand2
;
+----+------------+--------+---------+------+--------+---------+
| id | account_id | period | demand1 | id | period | demand2 |
+----+------------+--------+---------+------+--------+---------+
| 1 | 1 | 202005 | 100 | 5 | 202105 | 30 |
| 1 | 1 | 202005 | 100 | 3 | 202105 | 105 |
| 1 | 1 | 202005 | 100 | 4 | 202105 | 125 |
| 2 | 1 | 202005 | 120 | 5 | 202105 | 30 |
| 2 | 1 | 202005 | 120 | 3 | 202105 | 105 |
| 2 | 1 | 202005 | 120 | 4 | 202105 | 125 |
| 6 | 2 | 202005 | 200 | 7 | 202105 | 240 |
+----+------------+--------+---------+------+--------+---------+
Notice account 2 doesn't have a problem because only one demand row was found for last year and one for this year.
But account 1 found 2 demand rows for last year and 3 demand rows for this year, leading to (2 x 3) = 6 rows in the joined result. This is the source of the problem.
To correct this, we aggregate before the JOIN, so that we have at most one (1) row per account / period to be joined.
One form of the correct solution, with SUMs derived in the CTE term:
WITH demands AS (
SELECT account_id, period
, SUM(demand) AS demand
, COUNT(*) AS orders
FROM demand
GROUP BY account_id, period
)
SELECT ly.account_id, ly.period
, ly.orders AS ly_orders
, ly.demand AS ly_demand
, ty.orders AS ty_orders
, ty.demand AS ty_demand
FROM demands AS ly
LEFT JOIN demands AS ty
ON ly.account_id = ty.account_id
AND ty.period = 202105
WHERE ly.period = 202005
ORDER BY ly.account_id, ly.period, ly_demand, ty_demand
;
The result:
+------------+--------+-----------+-----------+-----------+-----------+
| account_id | period | ly_orders | ly_demand | ty_orders | ty_demand |
+------------+--------+-----------+-----------+-----------+-----------+
| 1 | 202005 | 2 | 220 | 3 | 260 |
| 2 | 202005 | 1 | 200 | 1 | 240 |
+------------+--------+-----------+-----------+-----------+-----------+
Since we performed aggregation in the CTE term (demands), the join found at most one row for each period for each account.
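If you want to reproduce the effect locally, here is a minimal sketch that loads the same sample rows into an in-memory SQLite database (SQLite is my assumption here, standing in for BigQuery; the two queries use only plain joins and GROUP BY). It sums once after the join and once after aggregating first, so you can see the inflated totals next to the correct ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demand (id INTEGER, account_id INTEGER, period INTEGER, demand INTEGER)"
)
conn.executemany(
    "INSERT INTO demand VALUES (?, ?, ?, ?)",
    [
        (1, 1, 202005, 100),
        (2, 1, 202005, 120),
        (3, 1, 202105, 105),
        (4, 1, 202105, 125),
        (5, 1, 202105, 30),
        (6, 2, 202005, 200),
        (7, 2, 202105, 240),
    ],
)

# Summing AFTER the join: every 2020 row is repeated once per matching
# 2021 row (and vice versa), so both totals are inflated for account 1.
naive = conn.execute(
    """
    SELECT t1.account_id, SUM(t1.demand) AS ly_demand, SUM(t2.demand) AS ty_demand
    FROM demand AS t1
    LEFT JOIN demand AS t2
      ON t1.account_id = t2.account_id
     AND t2.period = 202105
    WHERE t1.period = 202005
    GROUP BY t1.account_id
    ORDER BY t1.account_id
    """
).fetchall()

# Aggregating BEFORE the join: at most one row per account and period,
# so the join cannot multiply anything.
fixed = conn.execute(
    """
    WITH demands AS (
        SELECT account_id, period, SUM(demand) AS demand
        FROM demand
        GROUP BY account_id, period
    )
    SELECT ly.account_id, ly.demand AS ly_demand, ty.demand AS ty_demand
    FROM demands AS ly
    LEFT JOIN demands AS ty
      ON ly.account_id = ty.account_id
     AND ty.period = 202105
    WHERE ly.period = 202005
    ORDER BY ly.account_id
    """
).fetchall()

print(naive)  # [(1, 660, 520), (2, 200, 240)]
print(fixed)  # [(1, 220, 260), (2, 200, 240)]
```

For account 1 the naive totals are 3 x 220 = 660 and 2 x 260 = 520, which is the same kind of multiplication seen in the question.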

Related

Count item after joining two table

I have two tables,
First is Product table
+----+------+------+-------+------------+
| id | pnum | year | month | date       |
+----+------+------+-------+------------+
| 12 | S5   | 2021 | 2     | 2021-02-21 |
| 12 | S5   | 2021 | 2     | 2021-02-22 |
| 12 | S5   | 2021 | 2     | 2021-02-23 |
| 33 | A55  | 2021 | 3     | 2021-03-01 |
| 44 | B1   | 2021 | 6     | 2021-06-01 |
+----+------+------+-------+------------+
Second is Deal table
+----+------+------+-------+------------+
| id | pnum | year | month | date       |
+----+------+------+-------+------------+
| 12 | S5   | 2021 | 2     | 2021-02-28 |
| 12 | S5   | 2021 | 2     | 2021-02-01 |
| 33 | A55  | 2021 | 3     | 2021-03-01 |
+----+------+------+-------+------------+
I need a result that tells me how many products were launched in a given year-month, plus the count of deals in the first 15 days and the count of deals after the first 15 days:
+------+-------+------------+-----------------+--------------------+
| num  | count | year-month | deal_in_first15 | deal_after_first15 |
+------+-------+------------+-----------------+--------------------+
| S5   | 3     | 2021-02    | 1               | 1                  |
| A55  | 1     | 2021-03    | 1               | 0                  |
+------+-------+------------+-----------------+--------------------+
I tried it like this:
select * from Product p inner join Deal d on
p.num=d.num AND p.id=d.id AND p.month=d.month
but it does not give me the intended result.
I have some Java and Python background but am not an expert in SQL, so my attempts at applying COUNT and CASE statements are not working out.
You can use conditional aggregation in a subquery and then JOIN:
select p.pnum ,
COUNT(*) count,
FORMAT(p.[date],'yyyy-MM') 'year-month',
deal_in_first15 ,
deal_after_first15
from Product p
inner join (
SELECT id ,
pnum ,
month,
year ,
COUNT(CASE WHEN DATEPART(day,[date]) <= 15 THEN 1 END) deal_in_first15 ,
COUNT(CASE WHEN DATEPART(day,[date]) > 15 THEN 1 END) deal_after_first15
FROM Deal
GROUP BY id ,pnum ,month,year
) d on
p.pnum=d.pnum AND p.id=d.id AND p.month=d.month
group by FORMAT(p.[date],'yyyy-MM') ,
p.pnum,
deal_in_first15 ,
deal_after_first15
There is another way you might prefer: aggregate each table in its own subquery, then JOIN the two.
select p.pnum ,
p.cnt 'count',
CONCAT(p.year,'-',FORMAT(p.month,'0#')) 'year-month',
deal_in_first15 ,
deal_after_first15
from (
SELECT id ,
pnum ,
month,
year,
count(*) cnt
FROM Product
GROUP BY id ,
pnum ,
month,
year
) p
inner join (
SELECT id ,
pnum ,
month,
year ,
COUNT(CASE WHEN DATEPART(day,[date]) <= 15 THEN 1 END) deal_in_first15 ,
COUNT(CASE WHEN DATEPART(day,[date]) > 15 THEN 1 END) deal_after_first15
FROM Deal
GROUP BY id ,pnum ,month,year
) d on
p.pnum=d.pnum
AND p.id=d.id
AND p.month=d.month
AND p.year=d.year
Note
I would include the year column in the JOIN condition. Otherwise, the result will be wrong when the same month appears in more than one year.
This is it:
select pnum, cnt, sum(mn1), sum(mn2)
from (
select d."pnum",
(select count(*) from Product p where p."pnum" = d."pnum") cnt,
case when EXTRACT(DAY FROM d."date") <= 15 then sum(1) else sum(0) end mn1,
case when EXTRACT(DAY FROM d."date") > 15 then sum(1) else sum(0) end mn2,
d."year" || '-' || d."month"
from deal d
group by 1, d."year" || '-' || d."month", d."date"
) abc
group by 1, 2;
Please check it at: http://sqlfiddle.com/#!17/3bad9/18
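To try the "aggregate each table first, then join" idea without a database server, here is a small sketch using Python's built-in sqlite3 module. Several things are my assumptions rather than anything from the answers above: SQLite as the engine, dates stored as ISO strings, strftime/printf in place of DATEPART/FORMAT, and day 15 counted as part of the first 15 days.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product (id INTEGER, pnum TEXT, year INTEGER, month INTEGER, "date" TEXT);
    CREATE TABLE deal    (id INTEGER, pnum TEXT, year INTEGER, month INTEGER, "date" TEXT);
    INSERT INTO product VALUES
        (12, 'S5', 2021, 2, '2021-02-21'),
        (12, 'S5', 2021, 2, '2021-02-22'),
        (12, 'S5', 2021, 2, '2021-02-23'),
        (33, 'A55', 2021, 3, '2021-03-01'),
        (44, 'B1', 2021, 6, '2021-06-01');
    INSERT INTO deal VALUES
        (12, 'S5', 2021, 2, '2021-02-28'),
        (12, 'S5', 2021, 2, '2021-02-01'),
        (33, 'A55', 2021, 3, '2021-03-01');
    """
)

rows = conn.execute(
    """
    SELECT p.pnum,
           p.cnt,
           p.year || '-' || printf('%02d', p.month) AS year_month,
           d.deal_in_first15,
           d.deal_after_first15
    FROM (
        -- one row per product and month, with the launch count
        SELECT id, pnum, year, month, COUNT(*) AS cnt
        FROM product
        GROUP BY id, pnum, year, month
    ) AS p
    JOIN (
        -- one row per product and month, deals split by day of month;
        -- COUNT ignores the NULL produced when the CASE does not match
        SELECT id, pnum, year, month,
               COUNT(CASE WHEN CAST(strftime('%d', "date") AS INTEGER) <= 15 THEN 1 END) AS deal_in_first15,
               COUNT(CASE WHEN CAST(strftime('%d', "date") AS INTEGER) > 15 THEN 1 END) AS deal_after_first15
        FROM deal
        GROUP BY id, pnum, year, month
    ) AS d
      ON p.id = d.id AND p.pnum = d.pnum
     AND p.year = d.year AND p.month = d.month
    ORDER BY p.year, p.month
    """
).fetchall()

for row in rows:
    print(row)
# ('S5', 3, '2021-02', 1, 1)
# ('A55', 1, '2021-03', 1, 0)
```

Because both sides are reduced to one row per id, pnum, year and month before the join, the three S5 product rows cannot multiply the two S5 deal rows. Product B1 drops out because the inner join finds no deal for it, which matches the expected output in the question.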

SQL query grouping by range

Hi, I have a table A with the following data:
+------+-------+----+--------+
| YEAR | MONTH | PA | AMOUNT |
+------+-------+----+--------+
| 2020 | 1 | N | 100 |
+------+-------+----+--------+
| 2020 | 2 | N | 100 |
+------+-------+----+--------+
| 2020 | 3 | O | 100 |
+------+-------+----+--------+
| 2020 | 4 | N | 100 |
+------+-------+----+--------+
| 2020 | 5 | N | 100 |
+------+-------+----+--------+
| 2020 | 6 | O | 100 |
+------+-------+----+--------+
I'd like to have the following result:
+---------+---------+--------+
| FROM | TO | AMOUNT |
+---------+---------+--------+
| 2020-01 | 2020-02 | 200 |
+---------+---------+--------+
| 2020-03 | 2020-03 | 100 |
+---------+---------+--------+
| 2020-04 | 2020-05 | 200 |
+---------+---------+--------+
| 2020-06 | 2020-06 | 100 |
+---------+---------+--------+
My DB is DB2/400.
I have tried with ROW_NUMBER partitioning, subqueries but I can't figure out how to solve this.
I understand this as a gaps-and-islands problem, where you want to group together adjacent rows that have the same PA.
Here is an approach using the difference between row numbers to build the groups:
select min(year_month) year_month_start, max(year_month) year_month_end, sum(amount) amount
from (
select a.*, year * 100 + month year_month,
row_number() over(order by year, month) rn1,
row_number() over(partition by pa order by year, month) rn2
from a
) a
-- pa is part of the key: two different PA values can share the same rn1 - rn2
group by pa, rn1 - rn2
order by year_month_start
You can try the below -
select min(year)||'-'||min(month) as from_date,max(year)||'-'||max(month) as to_date,sum(amount) as amount from
(
select t1.*,row_number() over(order by year, month)-
row_number() over(partition by pa order by year, month) as grprn
from t1
)A group by grprn,pa order by grprn
This works in T-SQL; I guess you can adapt it to DB2/400?
SELECT MIN(Dte) [From]
, MAX(Dte) [To]
-- ,PA
, SUM(Amount)
FROM (
SELECT year * 100 + month Dte
, Pa
, Amount
, ROW_NUMBER() OVER (PARTITION BY pa ORDER BY year * 100 + month) +
10000- (YEar*100+Month) rn
FROM tabA a
) b
GROUP BY Pa
, rn
ORDER BY [From]
, [To]
The trick is the row number function, partitioned by PA and ordered by date. It counts up by one for each month within a PA. When that is added to the descending count of year and month, 10000 - (year * 100 + month), you get the same number for consecutive months with the same PA. You then group by PA and the grouping you made, rn, to get the groups, and then Bob's your uncle.
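To see why the row-number difference stays constant inside an island, here is a plain-Python sketch of the same idea on the sample rows from the question. It is only an illustration of the technique, not something to run on DB2; the names rn1 and rn2 mirror the two ROW_NUMBER() calls, and the island key is (PA, rn1 - rn2).

```python
from collections import defaultdict

# (year, month, pa, amount), the sample data from the question
rows = [
    (2020, 1, "N", 100),
    (2020, 2, "N", 100),
    (2020, 3, "O", 100),
    (2020, 4, "N", 100),
    (2020, 5, "N", 100),
    (2020, 6, "O", 100),
]
rows.sort(key=lambda r: (r[0], r[1]))  # ORDER BY year, month

seen_per_pa = defaultdict(int)  # running count per PA, i.e. rn2
islands = {}                    # (pa, rn1 - rn2) -> aggregated island

for rn1, (year, month, pa, amount) in enumerate(rows, start=1):
    seen_per_pa[pa] += 1
    rn2 = seen_per_pa[pa]
    # Within a run of adjacent rows with the same PA, rn1 and rn2 both grow
    # by one per row, so their difference does not change.
    key = (pa, rn1 - rn2)
    label = f"{year}-{month:02d}"
    if key not in islands:
        islands[key] = {"from": label, "to": label, "amount": 0}
    islands[key]["to"] = label       # rows arrive in date order
    islands[key]["amount"] += amount

result = sorted((g["from"], g["to"], g["amount"]) for g in islands.values())
for line in result:
    print(line)
# ('2020-01', '2020-02', 200)
# ('2020-03', '2020-03', 100)
# ('2020-04', '2020-05', 200)
# ('2020-06', '2020-06', 100)
```

Keeping PA in the key matters: for a sequence such as N, O, N the second and third rows both have a difference of 1, so grouping on the difference alone would merge them.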

eSQL multiple join but with conditions

I have 3 tables, as shown below:
MERCHANDISE
+-----------+-----------+---------------+
| MERCH_NUM | MERCH_DIV | MERCH_SUB_DIV |
+-----------+-----------+---------------+
| 1 | car | awd |
| 1 | car | awd |
| 2 | bike | 1kcc |
| 3 | cycle | hybrid |
| 3 | cycle | city |
| 4 | moped | fixie |
+-----------+-----------+---------------+
PRIORITY
+----------+-----------+---------+---------+------------+------------+---------------+
| CUST_NUM | SALES_NUM | DOC_NUM | BALANCE | PRIORITY_1 | PRIORITY_2 | PRIORITY_CODE |
+----------+-----------+---------+---------+------------+------------+---------------+
| 90 | 1000 | 10 | 23 | 1 | 6 | NO |
| 91 | 1001 | 20 | 32 | 3 | 7 | PRI |
| 92 | 1002 | 30 | 11 | 2 | 8 | LATE |
| 93 | 1003 | 40 | 22 | 5 | 9 | 1MON |
+----------+-----------+---------+---------+------------+------------+---------------+
ORDER
+----------+-----------+---------+---------+-----------+-----------+
| CUST_NUM | SALES_NUM | DOC_NUM | COUNTRY | MERCH_NUM | MERCH_DIV |
+----------+-----------+---------+---------+-----------+-----------+
| 90 | 1000 | 10 | INDIA | 1 | car |
| 91 | 1001 | 20 | CHINA | 2 | bike |
| 92 | 1002 | 30 | USA | 3 | cycle |
| 93 | 1003 | 40 | UK | 4 | moped |
+----------+-----------+---------+---------+-----------+-----------+
I want to left join the last two tables and then join that result with the first table, such that the MERCH_SUB_DIV 'awd' appears only once for each unique combination of MERCH_NUM and MERCH_DIV.
The code I came up with is below, but I'm not sure how to eliminate the duplicate row just for 'awd':
select
ROW#, MERCH.MERCH_NUMBER, ORDPRI.MERCH_NUMBER, ORDPRI.CUST_NUM,
BALANCE, SALES_NUM, ITEM_NUM, RANK, PRIORITY_1
from (
select
ROW_NUMBER() OVER(
PARTITION BY ORD.DOC_NUM, ORD.ITEM_NUM
ORDER BY ORD.DOC_NUM, ORD.ITEM_NUM ASC
) AS Row#,
ORD.CUST_NUM, PRI.CUST_NUM, ORD.MERCH_NUM, ORD.MERCH_DIV, PRI.BALANCE,
pri.DOC_NUM, pri.SALES_NUM, pri.PRIORITY_1, pri.PRIORITY_2
from ORDER as ORD
left join PRIORITY as PRI on ORD.DOC_NUM = PRI.DOC_NUM
and ORD.SALES_NUMBER = PRI.SALES_NUM
where country_name in ('USA', 'INDIA')
) as ORDPRI
left join MERCHANDISE as MERCH on ORDPRI.DIV = MERCH.DIV
and ORDPRI.MERCH_NUM = MERCH.MERCH_NUM
You have to use the DISTINCT keyword to get unique values. However, if your PRIORITY and ORDER tables contain different values for the same MERCH_NUM, the final result will still contain repeated MERCH_NUM values.
SELECT DISTINCT M.MERCH_NUMBER, O.MERCH_NUMBER, O.CUST_NUM, BALANCE, SALES_NUM,ITEM_NUM,RANK,PRIORITY_1
FROM priority_table P
LEFT JOIN order_table O ON P.CUST_NUM = O.CUST_NUM AND P.SALES_NUM=O.SALES_NUM AND P.DOC_NUM = O.DOC_NUM
LEFT JOIN merchandise_table M ON M.MERCH_NUM = O.MERCH_NUM
A workaround is to add a new ROW_NUMBER() in the outermost query, partitioned by MERCH_SUB_DIV plus all the columns in the final select list, and then filter the final results on that new row number. The following pseudocode might help:
select
-- All expected columns in final result except the newRow#
ROW#, MERCH_NUM, CUST_NUM,
BALANCE, SALES_NUM, PRIORITY_1
from (
select
ROW#,
-- the new row number includes all column you want to show in final result
row_number() over ( PARTITION BY MERCH.MERCH_SUB_DIV ,
MERCH.MERCH_NUM, ORDPRI.MERCH_NUM, ORDPRI.CUST_NUM,
BALANCE, SALES_NUM, PRIORITY_1
order by (select 1 )) as newRow# ,
MERCH.MERCH_NUM, ORDPRI.CUST_NUM,
BALANCE, SALES_NUM, PRIORITY_1
from (
-- main query goes here
select
ROW_NUMBER() OVER(
PARTITION BY ORD.DOC_NUM --, ORD.ITEM_NUM
ORDER BY ORD.DOC_NUM ASC --, ORD.ITEM_NUM
) AS Row#,
ORD.CUST_NUM, ORD.MERCH_NUM, ORD.MERCH_DIV as DIV, PRI.BALANCE,
pri.DOC_NUM, pri.SALES_NUM, pri.PRIORITY_1, pri.PRIORITY_2
from #ORDER as ORD
left join #PRIORITY as PRI on ORD.DOC_NUM = PRI.DOC_NUM
and ORD.SALES_NUMBER = PRI.SALES_NUM
where country_name in ('USA', 'INDIA')
) as ORDPRI
left join #MERCHANDISE as MERCH on ORDPRI.DIV = MERCH.DIV
and ORDPRI.MERCH_NUM = MERCH.MERCH_NUM
) as T
-- final filter to get distinct values
where newRow# = 1
Sample code here .. Hope this helps!!

Want to JOIN fourth table in query

I have four tables:
mls_category
points_matrix
mls_entry
bonus_points
My first table (mls_category) is like below:
*--------------------------------*
| cat_no | store_id | cat_value |
*--------------------------------*
| 10 | 101 | 1 |
| 11 | 101 | 4 |
*--------------------------------*
My second table (points_matrix) is like below:
*----------------------------------------------------*
| pm_no | store_id | value_per_point | maxpoint |
*----------------------------------------------------*
| 1 | 101 | 1 | 10 |
| 2 | 101 | 2 | 50 |
| 3 | 101 | 3 | 80 |
*----------------------------------------------------*
My third table (mls_entry) is like below:
*-------------------------------------------*
| user_id | category | distance | status |
*-------------------------------------------*
| 1 | 10 | 20 | approved |
| 1 | 10 | 30 | approved |
| 1 | 11 | 40 | approved |
*-------------------------------------------*
My fourth table (bonus_points) is like below:
*--------------------------------------------*
| user_id | store_id | bonus_points | type |
*--------------------------------------------*
| 1 | 101 | 200 | fixed |
| 2 | 102 | 300 | fixed |
| 1 | 103 | 4 | per |
*--------------------------------------------*
Now, I want to add bonus points value into the sum of total distance according to the store_id, user_id and type.
I am using the following code to get total distance:
SELECT MIN(b.value_per_point) * d.total_distance FROM points_matrix b
JOIN
(
SELECT store_id, sum(t1.totald/c.cat_value) as total_distance FROM mls_category c
JOIN
(
SELECT SUM(distance) totald, user_id, category FROM mls_entry
WHERE user_id= 1 AND status = 'approved' GROUP BY user_id, category
) t1 ON c.cat_no = t1.category
) d ON b.store_id = d.store_id AND b.maxpoint >= d.total_distance
The above code calculates the value correctly: it gives me 60 * 3 = 180 as the total value. Now I want to JOIN my fourth table so that I get (60 + 200) * 3 = 780 for user 1 and store_id 101, where the bonus type is 'fixed'.
I think your query should look like this:
SELECT Max(b.value_per_point)*( max(d.total_distance)+max(bonus_points)) FROM points_matrix b
JOIN
(
SELECT store_id, sum(t1.totald/c.cat_value) as total_distance FROM mls_category c
JOIN
(
SELECT SUM(distance) totald, user_id, category FROM mls_entry
WHERE user_id= 1 AND status = 'approved' GROUP BY user_id, category
) t1 ON c.cat_no = t1.category group by store_id
) d ON b.store_id = d.store_id inner join bonus_points bp on bp.store_id=d.store_id
DEMO fiddle
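To check the arithmetic, here is a hedged sketch that loads the four sample tables into an in-memory SQLite database and computes (60 + 200) * 3. It is a variation, not the query above: I kept the maxpoint condition from the question and added filters on user_id and type to the bonus_points join, since the question asks for the bonus "according to the store_id, user_id and type". SQLite and the hard-coded user 1 and type 'fixed' are assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE mls_category (cat_no INTEGER, store_id INTEGER, cat_value INTEGER);
    CREATE TABLE points_matrix (pm_no INTEGER, store_id INTEGER, value_per_point INTEGER, maxpoint INTEGER);
    CREATE TABLE mls_entry (user_id INTEGER, category INTEGER, distance INTEGER, status TEXT);
    CREATE TABLE bonus_points (user_id INTEGER, store_id INTEGER, bonus_points INTEGER, type TEXT);
    INSERT INTO mls_category VALUES (10, 101, 1), (11, 101, 4);
    INSERT INTO points_matrix VALUES (1, 101, 1, 10), (2, 101, 2, 50), (3, 101, 3, 80);
    INSERT INTO mls_entry VALUES (1, 10, 20, 'approved'), (1, 10, 30, 'approved'), (1, 11, 40, 'approved');
    INSERT INTO bonus_points VALUES (1, 101, 200, 'fixed'), (2, 102, 300, 'fixed'), (1, 103, 4, 'per');
    """
)

total = conn.execute(
    """
    SELECT MIN(b.value_per_point) * (d.total_distance + bp.bonus_points) AS total
    FROM points_matrix b
    JOIN (
        -- total distance per store: 50 / 1 + 40 / 4 = 60 for store 101
        SELECT c.store_id, SUM(t1.totald / c.cat_value) AS total_distance
        FROM mls_category c
        JOIN (
            SELECT SUM(distance) AS totald, user_id, category
            FROM mls_entry
            WHERE user_id = 1 AND status = 'approved'
            GROUP BY user_id, category
        ) t1 ON c.cat_no = t1.category
        GROUP BY c.store_id
    ) d ON b.store_id = d.store_id AND b.maxpoint >= d.total_distance
    -- exactly one bonus row survives: user 1, store 101, type 'fixed'
    JOIN bonus_points bp
      ON bp.store_id = d.store_id AND bp.user_id = 1 AND bp.type = 'fixed'
    GROUP BY d.total_distance, bp.bonus_points
    """
).fetchone()[0]

print(total)  # 780
```

Only the points_matrix row with maxpoint 80 passes the maxpoint >= 60 test, so MIN(value_per_point) is 3 and the result is 3 * (60 + 200) = 780.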

postgresql - cumul. sum active customers by month (removing churn)

I want to create a query to get the cumulative sum by month of our active customers. The tricky thing here is that (unfortunately) some customers churn and so I need to remove them from the cumulative sum on the month they leave us.
Here is a sample of my customers table :
customer_id | begin_date | end_date
-----------------------------------------
1 | 15/09/2017 |
2 | 15/09/2017 |
3 | 19/09/2017 |
4 | 23/09/2017 |
5 | 27/09/2017 |
6 | 28/09/2017 | 15/10/2017
7 | 29/09/2017 | 16/10/2017
8 | 04/10/2017 |
9 | 04/10/2017 |
10 | 05/10/2017 |
11 | 07/10/2017 |
12 | 09/10/2017 |
13 | 11/10/2017 |
14 | 12/10/2017 |
15 | 14/10/2017 |
Here is what I am looking to achieve :
month | active customers
-----------------------------------------
2017-09 | 7
2017-10 | 6
I've managed to achieve it with the following query. However, I'd like to know if there is a better way.
select
"begin_date" as "date",
sum((new_customers.new_customers-COALESCE(churn_customers.churn_customers,0))) OVER (ORDER BY new_customers."begin_date") as active_customers
FROM (
select
date_trunc('month',begin_date)::date as "begin_date",
count(id) as new_customers
from customers
group by 1
) as new_customers
LEFT JOIN(
select
date_trunc('month',end_date)::date as "end_date",
count(id) as churn_customers
from customers
where
end_date is not null
group by 1
) as churn_customers on new_customers."begin_date" = churn_customers."end_date"
order by 1
;
You may use a CTE to count the end dates per month and then subtract that from the count of begin dates by using a left join.
Query 1:
WITH edt
AS (
SELECT to_char(end_date, 'yyyy-mm') AS mon
,count(*) AS ct
FROM customers
WHERE end_date IS NOT NULL
GROUP BY to_char(end_date, 'yyyy-mm')
)
SELECT to_char(c.begin_date, 'yyyy-mm') as month
,COUNT(*) - MAX(COALESCE(ct, 0)) AS active_customers
FROM customers c
LEFT JOIN edt ON to_char(c.begin_date, 'yyyy-mm') = edt.mon
GROUP BY to_char(begin_date, 'yyyy-mm')
ORDER BY month;
Results:
| month | active_customers |
|---------|------------------|
| 2017-09 | 7 |
| 2017-10 | 6 |
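For anyone who wants to replay this on the sample data, here is a rough sketch with Python's sqlite3 module. The assumptions are mine: SQLite instead of PostgreSQL, dates stored as ISO strings, and strftime('%Y-%m', ...) standing in for to_char(..., 'yyyy-mm'). The query is otherwise the same shape as Query 1.

```python
import sqlite3
from itertools import accumulate

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, begin_date TEXT, end_date TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "2017-09-15", None), (2, "2017-09-15", None), (3, "2017-09-19", None),
        (4, "2017-09-23", None), (5, "2017-09-27", None),
        (6, "2017-09-28", "2017-10-15"), (7, "2017-09-29", "2017-10-16"),
        (8, "2017-10-04", None), (9, "2017-10-04", None), (10, "2017-10-05", None),
        (11, "2017-10-07", None), (12, "2017-10-09", None), (13, "2017-10-11", None),
        (14, "2017-10-12", None), (15, "2017-10-14", None),
    ],
)

monthly = conn.execute(
    """
    WITH edt AS (
        -- churned customers per month of end_date
        SELECT strftime('%Y-%m', end_date) AS mon, COUNT(*) AS ct
        FROM customers
        WHERE end_date IS NOT NULL
        GROUP BY strftime('%Y-%m', end_date)
    )
    SELECT strftime('%Y-%m', c.begin_date) AS month,
           COUNT(*) - MAX(COALESCE(edt.ct, 0)) AS active_customers
    FROM customers c
    LEFT JOIN edt ON strftime('%Y-%m', c.begin_date) = edt.mon
    GROUP BY strftime('%Y-%m', c.begin_date)
    ORDER BY month
    """
).fetchall()

print(monthly)  # [('2017-09', 7), ('2017-10', 6)]

# Running total of the monthly figures, in case a cumulative count is wanted.
running = list(accumulate(n for _, n in monthly))
print(running)  # [7, 13]
```

Note that the per-month figures are new customers minus churned customers for that month (8 - 2 = 6 in October), which is what the expected output in the question shows. Accumulating them gives 7 and then 13, so which of the two you want depends on how "active customers" is defined.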