SQL query grouping by range - sql

I have a table A with the following data:
+------+-------+----+--------+
| YEAR | MONTH | PA | AMOUNT |
+------+-------+----+--------+
| 2020 |     1 | N  |    100 |
| 2020 |     2 | N  |    100 |
| 2020 |     3 | O  |    100 |
| 2020 |     4 | N  |    100 |
| 2020 |     5 | N  |    100 |
| 2020 |     6 | O  |    100 |
+------+-------+----+--------+
I'd like to have the following result:
+---------+---------+--------+
| FROM    | TO      | AMOUNT |
+---------+---------+--------+
| 2020-01 | 2020-02 |    200 |
| 2020-03 | 2020-03 |    100 |
| 2020-04 | 2020-05 |    200 |
| 2020-06 | 2020-06 |    100 |
+---------+---------+--------+
My DB is DB2/400.
I have tried ROW_NUMBER with partitioning and subqueries, but I can't figure out how to solve this.

I understand this as a gaps-and-islands problem, where you want to group together adjacent rows that have the same PA.
Here is an approach using the difference between row numbers to build the groups:
select min(year_month) year_month_start, max(year_month) year_month_end, sum(amount) amount
from (
    select a.*, year * 100 + month year_month,
        row_number() over(order by year, month) rn1,
        row_number() over(partition by pa order by year, month) rn2
    from a
) a
group by rn1 - rn2
order by year_month_start

You can try the below (note the ordering includes year as well as month, so the grouping survives data that spans more than one year) -
select min(year)||'-'||min(month) as from_date, max(year)||'-'||max(month) as to_date, sum(amount) as amount
from (
    select *,
        row_number() over(order by year, month) -
        row_number() over(partition by pa order by year, month) as grprn
    from t1
) A
group by grprn, pa
order by grprn

This works in T-SQL; I guess you can adapt it to DB2/400?
SELECT MIN(Dte) [From]
    , MAX(Dte) [To]
    -- , PA
    , SUM(Amount)
FROM (
    SELECT year * 100 + month Dte
        , Pa
        , Amount
        , ROW_NUMBER() OVER (PARTITION BY pa ORDER BY year * 100 + month)
          + 10000 - (Year * 100 + Month) rn
    FROM tabA a
) b
GROUP BY Pa
    , rn
ORDER BY [From]
    , [To]
The trick is the row number function, partitioned by PA and ordered by date. It counts up by one for each month; when you add it to the descending quantity (10000 minus year*100+month), consecutive months with the same PA end up with the same number. You then group by PA and that derived value, rn, to get the groups, and Bob's your uncle.
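A quick Python sketch of that arithmetic, using the sample data from the question. One caveat worth noting: year * 100 + month is not gapless across a December-to-January boundary (202012 to 202101 jumps by 89), so this variant can split an island that spans a year end, whereas the pure row-number-difference approach does not:

```python
rows = [(2020, m, pa) for m, pa in
        zip(range(1, 7), ['N', 'N', 'O', 'N', 'N', 'O'])]

# Emulate ROW_NUMBER() OVER (PARTITION BY pa ORDER BY year, month) by hand.
counters = {}
groups = {}
for year, month, pa in rows:            # rows are already in date order
    rn = counters[pa] = counters.get(pa, 0) + 1
    ym = year * 100 + month
    grp = rn + 10000 - ym               # constant within each run of equal PA
    groups.setdefault((pa, grp), []).append(month)

print(groups)
# -> {('N', -192000): [1, 2], ('O', -192002): [3],
#     ('N', -192001): [4, 5], ('O', -192004): [6]}
```

Each (pa, grp) key is one island: months 1-2, month 3, months 4-5, month 6, matching the desired grouping.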


Getting duplication WITH partition by

I want to know which customers who ordered in June 2020 also ordered in June 2021. My code returns the correct DISTINCT orders, but discounted sales are incorrect for customers who placed more than one order in either year. For example, a customer who placed one order in 2020 and four orders in 2021 has 2020 discounted sales at 4x the actual amount: the four 2021 orders produce four rows, and the single 2020 order is repeated against each. I saw this by using ROW_NUMBER(), which exposed the underlying problem. I cannot use DISTINCT with discounted sales because customers do place multiple orders for identical dollar amounts. How do I get the exact discounted sales using standard SQL for BigQuery?
SELECT
DISTINCT ly.cuid AS cuid,
COUNT(DISTINCT ly.order_id) OVER (PARTITION BY ly.cuid) AS ly_orders,
SUM(ly.discounted_sales) OVER (PARTITION BY ly.cuid) AS ly_demand,
COUNT(DISTINCT ty.order_id) OVER (PARTITION BY ty.cuid) AS ty_orders,
SUM(ty.discounted_sales) OVER (PARTITION BY ly.cuid) AS ty_demand
FROM table ly
LEFT JOIN table ty
ON ly.cuid = ty.cuid
WHERE ly.order_date BETWEEN '2020-06-01' AND '2020-06-30'
AND ty.order_date BETWEEN '2021-06-01' AND '2021-06-30'
AND ly.financial_status <> 'credit'
AND ty.financial_status <> 'credit'
AND ly.discounted_sales >0
AND ty.discounted_sales >0
AND ly.channel = 'b2b'
AND ty.channel = 'b2b'
ORDER BY ly.cuid asc
[Results]
+------+-----------+-----------+-----------+------------+--------------------+
| cuid | ly_orders | ly_demand | ty_orders | ty_demand  | comments           |
+------+-----------+-----------+-----------+------------+--------------------+
| D    |         1 | 22,466.40 |         4 | 154,596.24 | ly is 4x actual    |
| F    |         2 |  2,573.20 |         1 |   1,944.40 | ty is 2x actual    |
| G    |         1 | 32,134.40 |         4 |   1,632.00 | ly is 4x actual    |
| I    |         2 |    757.56 |         1 |     730.56 | ty is 2x actual    |
| J    |         2 | 54,859.00 |         2 |  23,822.32 | both are 2x actual |
+------+-----------+-----------+-----------+------------+--------------------+
THIS WORKED:
WITH prior_period AS (
SELECT
DISTINCT cuid AS ly_cuid,
COUNT(DISTINCT order_id) OVER (PARTITION BY cuid) AS ly_orders,
SUM(discounted_sales) OVER (PARTITION BY cuid) AS ly_demand
FROM TABLE
WHERE EXTRACT(YEAR FROM order_date) = 2020 AND EXTRACT(MONTH FROM order_date) = 6
AND financial_status <> 'credit'
AND discounted_sales >0
AND channel = 'b2b'
GROUP BY cuid, order_id, discounted_sales
ORDER BY cuid asc),
this_period AS (
SELECT
DISTINCT cuid AS ty_cuid,
COUNT(DISTINCT order_id) OVER (PARTITION BY cuid) AS ty_orders,
SUM(discounted_sales) OVER (PARTITION BY cuid) AS ty_demand
FROM TABLE
WHERE EXTRACT(YEAR FROM order_date) = 2021 AND EXTRACT(MONTH FROM order_date) = 6
AND financial_status <> 'credit'
AND discounted_sales >0
AND channel = 'b2b'
GROUP BY cuid, order_id, discounted_sales
ORDER BY cuid asc)
SELECT *
FROM prior_period ly
JOIN this_period ty ON ly.ly_cuid = ty.ty_cuid
ORDER BY ly.ly_cuid
Updated with your schema and approximate data:
Try this...
WITH periods AS (
SELECT cuid AS cuid
, COUNT(*) AS orders
, SUM(discounted_sales) AS demand
, EXTRACT(YEAR FROM order_date) AS yr
FROM demand2
WHERE EXTRACT(YEAR FROM order_date) IN (2020, 2021) AND EXTRACT(MONTH FROM order_date) = 6
AND financial_status <> 'credit'
AND discounted_sales > 0
AND channel = 'b2b'
GROUP BY cuid, EXTRACT(YEAR FROM order_date)
)
SELECT ly.cuid
, ly.orders AS ly_orders
, ly.demand AS ly_demand
, ty.orders AS ty_orders
, ty.demand AS ty_demand
FROM periods AS ly
JOIN periods AS ty
ON ly.cuid = ty.cuid
AND ly.yr = 2020
AND ty.yr = 2021
ORDER BY ly.cuid
;
The result:
+------+-----------+-----------+-----------+-----------+
| cuid | ly_orders | ly_demand | ty_orders | ty_demand |
+------+-----------+-----------+-----------+-----------+
| D | 1 | 5616.60 | 4 | 154596.24 |
| F | 2 | 2573.20 | 1 | 972.20 |
| G | 1 | 8033.60 | 4 | 1632.56 |
| I | 2 | 757.56 | 1 | 365.28 |
| J | 2 | 27429.50 | 2 | 11911.16 |
+------+-----------+-----------+-----------+-----------+
Here's a similar example with data, SQL and results to show both the incorrect result and the correct result.
The data:
SELECT * FROM demand ORDER BY account_id, period;
+----+------------+--------+--------+
| id | account_id | period | demand |
+----+------------+--------+--------+
| 1 | 1 | 202005 | 100 |
| 2 | 1 | 202005 | 120 |
| 3 | 1 | 202105 | 105 |
| 4 | 1 | 202105 | 125 |
| 5 | 1 | 202105 | 30 |
| 6 | 2 | 202005 | 200 |
| 7 | 2 | 202105 | 240 |
+----+------------+--------+--------+
The incorrect SQL, without SUMs to just show the join behavior:
SELECT t1.id, t1.account_id, t1.period, t1.demand AS demand1
, t2.id, t2.period, t2.demand AS demand2
FROM demand AS t1
LEFT JOIN demand AS t2
ON t1.account_id = t2.account_id
AND t2.period = 202105
WHERE t1.period = 202005
ORDER BY t1.account_id, t1.period, demand1, demand2
;
+----+------------+--------+---------+------+--------+---------+
| id | account_id | period | demand1 | id | period | demand2 |
+----+------------+--------+---------+------+--------+---------+
| 1 | 1 | 202005 | 100 | 5 | 202105 | 30 |
| 1 | 1 | 202005 | 100 | 3 | 202105 | 105 |
| 1 | 1 | 202005 | 100 | 4 | 202105 | 125 |
| 2 | 1 | 202005 | 120 | 5 | 202105 | 30 |
| 2 | 1 | 202005 | 120 | 3 | 202105 | 105 |
| 2 | 1 | 202005 | 120 | 4 | 202105 | 125 |
| 6 | 2 | 202005 | 200 | 7 | 202105 | 240 |
+----+------------+--------+---------+------+--------+---------+
Notice account 2 doesn't have a problem because only one demand was found last year and this year.
But account 1 found 2 demand rows for last year and 3 demand rows for this year, leading to (2 x 3) = 6 rows in the joined result. This is the source of the problem.
To correct this, we aggregate before the JOIN, so that we have at most one (1) row per account / period to be joined.
One form of the correct solution, with SUMs derived in the CTE term:
WITH demands AS (
SELECT account_id, period
, SUM(demand) AS demand
, COUNT(*) AS orders
FROM demand
GROUP BY account_id, period
)
SELECT ly.account_id, ly.period
, ly.orders AS ly_orders
, ly.demand AS ly_demand
, ty.orders AS ty_orders
, ty.demand AS ty_demand
FROM demands AS ly
LEFT JOIN demands AS ty
ON ly.account_id = ty.account_id
AND ty.period = 202105
WHERE ly.period = 202005
ORDER BY ly.account_id, ly.period, ly_demand, ty_demand
;
The result:
+------------+--------+-----------+-----------+-----------+-----------+
| account_id | period | ly_orders | ly_demand | ty_orders | ty_demand |
+------------+--------+-----------+-----------+-----------+-----------+
| 1 | 202005 | 2 | 220 | 3 | 260 |
| 2 | 202005 | 1 | 200 | 1 | 240 |
+------------+--------+-----------+-----------+-----------+-----------+
Since we performed aggregation in the CTE term (demands), the join found at most one row for each period for each account.
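As a sanity check, here is that aggregate-before-join fix reproduced end to end in SQLite, with the same table and column names as above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE demand (id INT, account_id INT, period INT, demand INT);
INSERT INTO demand VALUES
  (1,1,202005,100),(2,1,202005,120),(3,1,202105,105),
  (4,1,202105,125),(5,1,202105,30),(6,2,202005,200),(7,2,202105,240);
""")

rows = con.execute("""
WITH demands AS (
  SELECT account_id, period, SUM(demand) AS demand, COUNT(*) AS orders
  FROM demand
  GROUP BY account_id, period        -- at most one row per account/period
)
SELECT ly.account_id, ly.orders, ly.demand, ty.orders, ty.demand
FROM demands AS ly
LEFT JOIN demands AS ty
  ON ly.account_id = ty.account_id AND ty.period = 202105
WHERE ly.period = 202005
ORDER BY ly.account_id
""").fetchall()
print(rows)
# -> [(1, 2, 220, 3, 260), (2, 1, 200, 1, 240)]
```

Because each side of the join is already one row per account per period, no fan-out (and no over-counting) can occur.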

How do I update a value based on the dense_rank result?

I have a query (SQL Server 2017) that finds two different discounts on the same date.
WITH CTE AS (SELECT [date_id], [good_id], [store_id], [name_promo_mech], [discount],
RN = DENSE_RANK() OVER (PARTITION BY [date_id], [good_id], [store_id], [name_promo_mech]
ORDER BY [discount]) +
DENSE_RANK() OVER (PARTITION BY [date_id], [good_id], [store_id], [name_promo_mech]
ORDER BY [discount] DESC) - 1
FROM [dbo].[ds_promo_list_by_day_new] AS PL
)
SELECT * FROM CTE
WHERE RN > 1;
GO
Query result:
+------------+----------+---------+-----------------+----------+----+
| date_id    | store_id | good_id | name_promo_mech | discount | RN |
+------------+----------+---------+-----------------+----------+----+
| 2017-01-01 |        3 |   98398 | January 2017    |       15 |  2 |
| 2017-01-01 |        3 |   98398 | January 2017    |       40 |  2 |
| 2017-01-01 |        5 |   98398 | January 2017    |       15 |  3 |
| 2017-01-01 |        5 |   98398 | January 2017    |       40 |  3 |
| 2017-01-01 |        5 |   98398 | January 2017    |       30 |  3 |
+------------+----------+---------+-----------------+----------+----+
Now I want to make the discounts the same for every unique good_id, store_id, name_promo_mech combination in the source table. There is a rule for this: for example, if for good_id = 98398, store_id = 3, name_promo_mech = N'January 2017' there were 10 entries with a 15 discount and 20 with a 40 discount, then the 15 discount should be replaced with 40. If the number of entries for each discount is the same, the maximum discount is set for all of them.
Can I do this? The number of rows in the source table is about 100 million.
What you want to do is set the value to the mode (the statistical term for the most common value) for each date and combination of the other columns. You can use window functions:
with toupdate as (
select pl.*,
first_value(discount) over (partition by date_id, good_id, store_id, name_promo_mech order by cnt desc, discount desc) as mode_discount
from (select pl.*,
count(*) over (partition by date_id, good_id, store_id, name_promo_mech, discount) as cnt
from ds_promo_list_by_day_new pl
) pl
)
update toupdate
set discount = mode_discount
where mode_discount <> discount;
The subquery counts the number of rows for each discount within each combination on each day. The outer query picks the discount with the largest count and, in the case of ties, the larger value.
The rest is a simple update.
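The updatable CTE is T-SQL-specific, but the mode-selection logic itself can be sketched in SQLite. The five sample rows below are made up (discount 40 occurring more often than 15), purely to verify the FIRST_VALUE-over-count trick:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE promo (date_id TEXT, good_id INT, store_id INT,
                    name_promo_mech TEXT, discount INT);
-- two entries at discount 15, three at discount 40 -> the mode is 40
INSERT INTO promo VALUES
  ('2017-01-01',98398,3,'January 2017',15),
  ('2017-01-01',98398,3,'January 2017',15),
  ('2017-01-01',98398,3,'January 2017',40),
  ('2017-01-01',98398,3,'January 2017',40),
  ('2017-01-01',98398,3,'January 2017',40);
""")

rows = con.execute("""
SELECT DISTINCT date_id, good_id, store_id, name_promo_mech,
       FIRST_VALUE(discount) OVER (
         PARTITION BY date_id, good_id, store_id, name_promo_mech
         ORDER BY cnt DESC, discount DESC) AS mode_discount
FROM (
  SELECT p.*,
         COUNT(*) OVER (PARTITION BY date_id, good_id, store_id,
                        name_promo_mech, discount) AS cnt
  FROM promo p
)
""").fetchall()
print(rows)
# -> [('2017-01-01', 98398, 3, 'January 2017', 40)]
```

Ordering by cnt DESC, discount DESC implements both halves of the stated rule: most common discount first, and the larger discount when counts tie.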

postgresql - cumul. sum active customers by month (removing churn)

I want to create a query to get the cumulative sum by month of our active customers. The tricky thing here is that (unfortunately) some customers churn and so I need to remove them from the cumulative sum on the month they leave us.
Here is a sample of my customers table :
customer_id | begin_date | end_date
-----------------------------------------
1 | 15/09/2017 |
2 | 15/09/2017 |
3 | 19/09/2017 |
4 | 23/09/2017 |
5 | 27/09/2017 |
6 | 28/09/2017 | 15/10/2017
7 | 29/09/2017 | 16/10/2017
8 | 04/10/2017 |
9 | 04/10/2017 |
10 | 05/10/2017 |
11 | 07/10/2017 |
12 | 09/10/2017 |
13 | 11/10/2017 |
14 | 12/10/2017 |
15 | 14/10/2017 |
Here is what I am looking to achieve :
month | active customers
-----------------------------------------
2017-09 | 7
2017-10 | 6
I've managed to achieve it with the following query... However, I'd like to know if there is a better way.
select
"begin_date" as "date",
sum((new_customers.new_customers-COALESCE(churn_customers.churn_customers,0))) OVER (ORDER BY new_customers."begin_date") as active_customers
FROM (
select
date_trunc('month',begin_date)::date as "begin_date",
count(id) as new_customers
from customers
group by 1
) as new_customers
LEFT JOIN(
select
date_trunc('month',end_date)::date as "end_date",
count(id) as churn_customers
from customers
where
end_date is not null
group by 1
) as churn_customers on new_customers."begin_date" = churn_customers."end_date"
order by 1
;
You may use a CTE to count the end_dates per month and then subtract that count from the count of begin_dates, using a left join:
SQL Fiddle
Query 1:
WITH edt
AS (
SELECT to_char(end_date, 'yyyy-mm') AS mon
,count(*) AS ct
FROM customers
WHERE end_date IS NOT NULL
GROUP BY to_char(end_date, 'yyyy-mm')
)
SELECT to_char(c.begin_date, 'yyyy-mm') as month
,COUNT(*) - MAX(COALESCE(ct, 0)) AS active_customers
FROM customers c
LEFT JOIN edt ON to_char(c.begin_date, 'yyyy-mm') = edt.mon
GROUP BY to_char(begin_date, 'yyyy-mm')
ORDER BY month;
Results:
| month | active_customers |
|---------|------------------|
| 2017-09 | 7 |
| 2017-10 | 6 |
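Here is the same approach verified in SQLite, with strftime('%Y-%m', ...) standing in for Postgres's to_char(..., 'yyyy-mm') and the sample dates rewritten in ISO form:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (customer_id INT, begin_date TEXT, end_date TEXT);
INSERT INTO customers VALUES
  (1,'2017-09-15',NULL),(2,'2017-09-15',NULL),(3,'2017-09-19',NULL),
  (4,'2017-09-23',NULL),(5,'2017-09-27',NULL),
  (6,'2017-09-28','2017-10-15'),(7,'2017-09-29','2017-10-16'),
  (8,'2017-10-04',NULL),(9,'2017-10-04',NULL),(10,'2017-10-05',NULL),
  (11,'2017-10-07',NULL),(12,'2017-10-09',NULL),(13,'2017-10-11',NULL),
  (14,'2017-10-12',NULL),(15,'2017-10-14',NULL);
""")

rows = con.execute("""
WITH edt AS (
  SELECT strftime('%Y-%m', end_date) AS mon, COUNT(*) AS ct
  FROM customers
  WHERE end_date IS NOT NULL
  GROUP BY 1
)
SELECT strftime('%Y-%m', c.begin_date) AS month,
       COUNT(*) - MAX(COALESCE(ct, 0)) AS active_customers
FROM customers c
LEFT JOIN edt ON strftime('%Y-%m', c.begin_date) = edt.mon
GROUP BY 1
ORDER BY month
""").fetchall()
print(rows)
# -> [('2017-09', 7), ('2017-10', 6)]
```

September has 7 sign-ups and no churn; October has 8 sign-ups minus the 2 customers whose end_date falls in October, giving 6.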

Redshift count with variable

Imagine I have a table on Redshift with this similar structure. Product_Bill_ID is the Primary Key of this table.
| Store_ID | Product_Bill_ID | Payment_Date
| 1 | 1 | 01/10/2016 11:49:33
| 1 | 2 | 01/10/2016 12:38:56
| 1 | 3 | 01/10/2016 12:55:02
| 2 | 4 | 01/10/2016 16:25:05
| 2 | 5 | 02/10/2016 08:02:28
| 3 | 6 | 03/10/2016 02:32:09
If I want to query the number of Product_Bill_ID that a store sold in the first hour after it sold its first Product_Bill_ID, how could I do this?
The example above should produce this output:
| Store_ID | First_Payment_Date | Sold_First_Hour
| 1 | 01/10/2016 11:49:33 | 2
| 2 | 01/10/2016 16:25:05 | 1
| 3 | 03/10/2016 02:32:09 | 1
You need to get the first hour. That is easy enough using window functions:
select s.*,
min(payment_date) over (partition by store_id) as first_payment_date
from sales s
Then, you need to do the date filtering and aggregation:
select store_id, count(*)
from (select s.*,
min(payment_date) over (partition by store_id) as first_payment_date
from sales s
) s
where payment_date <= first_payment_date + interval '1 hour'
group by store_id;
SELECT
store_id,
first_payment_date,
SUM(
CASE WHEN payment_date < DATEADD(hour, 1, first_payment_date) THEN 1 END
) AS sold_first_hour
FROM
(
SELECT
*,
MIN(payment_date) OVER (PARTITION BY store_id) AS first_payment_date
FROM
yourtable
)
parsed_table
GROUP BY
store_id,
first_payment_date
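Both answers hinge on MIN(payment_date) OVER (PARTITION BY store_id). Here is the first approach verified in SQLite, with datetime(..., '+1 hour') standing in for Redshift's interval arithmetic and the sample timestamps rewritten as ISO dates:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (store_id INT, product_bill_id INT, payment_date TEXT);
INSERT INTO sales VALUES
  (1,1,'2016-10-01 11:49:33'),(1,2,'2016-10-01 12:38:56'),
  (1,3,'2016-10-01 12:55:02'),(2,4,'2016-10-01 16:25:05'),
  (2,5,'2016-10-02 08:02:28'),(3,6,'2016-10-03 02:32:09');
""")

rows = con.execute("""
SELECT store_id, first_payment_date, COUNT(*) AS sold_first_hour
FROM (
  SELECT s.*,
         MIN(payment_date) OVER (PARTITION BY store_id) AS first_payment_date
  FROM sales s
) s
WHERE payment_date <= datetime(first_payment_date, '+1 hour')
GROUP BY store_id, first_payment_date
ORDER BY store_id
""").fetchall()
print(rows)
# -> [(1, '2016-10-01 11:49:33', 2), (2, '2016-10-01 16:25:05', 1),
#     (3, '2016-10-03 02:32:09', 1)]
```

Store 1's third bill at 12:55:02 falls outside the 11:49:33 + 1 hour window, so only 2 bills count, matching the expected outcome.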

SQL sum over partition for preceding period

I have the following table, which represents customers for each day:
+----------+-----------+
| Date | Customers |
+----------+-----------+
| 1/1/2014 | 4 |
| 1/2/2014 | 7 |
| 1/3/2014 | 5 |
| 1/4/2014 | 5 |
| 1/5/2014 | 10 |
| 2/1/2014 | 7 |
| 2/2/2014 | 4 |
| 2/3/2014 | 1 |
| 2/4/2014 | 5 |
+----------+-----------+
I would like to add 2 additional columns:
Summary of the customers for the current month
Summary of the customers for the preceding month
here's the desired outcome:
+----------+-----------+----------------------+------------------------+
| Date | Customers | Sum_of_Current_month | Sum_of_Preceding_month |
+----------+-----------+----------------------+------------------------+
| 1/1/2014 | 4 | 31 | 0 |
| 1/2/2014 | 7 | 31 | 0 |
| 1/3/2014 | 5 | 31 | 0 |
| 1/4/2014 | 5 | 31 | 0 |
| 1/5/2014 | 10 | 31 | 0 |
| 2/1/2014 | 7 | 17 | 31 |
| 2/2/2014 | 4 | 17 | 31 |
| 2/3/2014 | 1 | 17 | 31 |
| 2/4/2014 | 5 | 17 | 31 |
+----------+-----------+----------------------+------------------------+
I have managed to calculate the 3rd column by a simple sum over partition function:
Select
    Date,
    Customers,
    Sum(Customers) over (Partition by Month(Date)||Year(Date)) as Sum_of_Current_month
From table
However, I can't find a way to calculate the Sum_of_preceding_month column.
Appreciate your support.
Asaf
The previous month is a bit tricky. What's your Teradata release? TD14.10 supports LAST_VALUE:
SELECT
dt,
customers,
Sum_of_Current_month,
-- return the previous sum
COALESCE(LAST_VALUE(x ignore NULLS)
OVER (ORDER BY dt
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
,0) AS Sum_of_Preceding_month
FROM
(
SELECT
dt,
Customers,
SUM(Customers) OVER (PARTITION BY TRUNC(dt,'mon')) AS Sum_of_Current_month,
CASE -- keep the number only for the last day in month
WHEN ROW_NUMBER()
OVER (PARTITION BY TRUNC(dt,'mon')
ORDER BY dt)
= COUNT(*)
OVER (PARTITION BY TRUNC(dt,'mon'))
THEN Sum_of_Current_month
END AS x
FROM tab
) AS dt
I think this might be easier by using lag() and an aggregation sub-query. The ANSI Standard syntax is:
Select t.*, tt.sumCustomers, tt.prev_sumCustomers
From table t join
(select extract(year from date) as yyyy, extract(month from date) as mm,
sum(Customers) as sumCustomers,
lag(sum(Customers)) over (order by extract(year from date), extract(month from date)
) as prev_sumCustomers
from table t
group by extract(year from date), extract(month from date)
) tt
on extract(year from t.date) = tt.yyyy and extract(month from t.date) = tt.mm;
In Teradata, this would be written as:
Select t.*, tt.sumCustomers, tt.prev_sumCustomers
From table t join
(select extract(year from date) as yyyy, extract(month from date) as mm,
sum(Customers) as sumCustomers,
min(sum(Customers)) over (order by extract(year from date), extract(month from date)
rows between 1 preceding and 1 preceding
) as prev_sumCustomers
from table t
group by extract(year from date), extract(month from date)
) tt
on extract(year from t.date) = tt.yyyy and extract(month from t.date) = tt.mm;
Try this (the year is compared as well as the month, so months from different years are not mixed together):
SELECT
    [Date],
    [Customers],
    (SELECT SUM(customers) FROM table
     WHERE YEAR(dte) = YEAR(tbl.dte) AND MONTH(dte) = MONTH(tbl.dte)),
    ISNULL((SELECT SUM(customers) FROM table
     WHERE YEAR(dte) = YEAR(DATEADD(MONTH, -1, tbl.dte))
       AND MONTH(dte) = MONTH(DATEADD(MONTH, -1, tbl.dte))), 0)
FROM table tbl
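For completeness, the lag()-over-monthly-sums idea from the ANSI answer can be verified in SQLite, with strftime in place of extract and COALESCE turning the first month's NULL into the 0 shown in the desired output:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (dt TEXT, customers INT);
INSERT INTO t VALUES
  ('2014-01-01',4),('2014-01-02',7),('2014-01-03',5),('2014-01-04',5),
  ('2014-01-05',10),('2014-02-01',7),('2014-02-02',4),('2014-02-03',1),
  ('2014-02-04',5);
""")

rows = con.execute("""
SELECT t.dt, t.customers, tt.sum_month, tt.prev_sum_month
FROM t
JOIN (
  SELECT strftime('%Y-%m', dt) AS ym,
         SUM(customers) AS sum_month,
         COALESCE(LAG(SUM(customers)) OVER (ORDER BY strftime('%Y-%m', dt)), 0)
           AS prev_sum_month
  FROM t
  GROUP BY 1
) tt ON strftime('%Y-%m', t.dt) = tt.ym
ORDER BY t.dt
""").fetchall()
print(rows[0], rows[5])
# -> ('2014-01-01', 4, 31, 0) ('2014-02-01', 7, 17, 31)
```

January's total is 31 with no prior month (0), and every February row carries 17 with the preceding month's 31, matching the desired outcome.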