How do I update a value based on the dense_rank result? - sql

I have a query (SQL Server 2017) that finds two different discounts on the same date.
WITH CTE AS (
    SELECT [date_id], [good_id], [store_id], [name_promo_mech], [discount],
        RN = DENSE_RANK() OVER (PARTITION BY [date_id], [good_id], [store_id], [name_promo_mech]
                                ORDER BY [discount]) +
             DENSE_RANK() OVER (PARTITION BY [date_id], [good_id], [store_id], [name_promo_mech]
                                ORDER BY [discount] DESC) - 1
    FROM [dbo].[ds_promo_list_by_day_new] AS PL
)
SELECT * FROM CTE
WHERE RN > 1;
GO
Query result:
+------------+----------+---------+-----------------+----------+----+
| date_id    | store_id | good_id | name_promo_mech | discount | RN |
+------------+----------+---------+-----------------+----------+----+
| 2017-01-01 |        3 |   98398 | January 2017    |       15 |  2 |
| 2017-01-01 |        3 |   98398 | January 2017    |       40 |  2 |
| 2017-01-01 |        5 |   98398 | January 2017    |       15 |  3 |
| 2017-01-01 |        5 |   98398 | January 2017    |       40 |  3 |
| 2017-01-01 |        5 |   98398 | January 2017    |       30 |  3 |
+------------+----------+---------+-----------------+----------+----+
Now I want to make the discounts the same for every unique combination of good_id, store_id and name_promo_mech in the source table. There is a rule for this. For example, if for good_id = 98398, store_id = 3, name_promo_mech = N'january 2017' there were 10 entries with a 15 discount and 20 entries with a 40 discount, then the 15 discount should be replaced with 40. However, if the number of entries for each discount is the same, then the maximum discount is set for all of them.
Can I do this? The number of rows in the source table is about 100 million.

What you want to do is set the value to the mode (the statistical term for the most common value) for each date and combination of key columns. You can use window functions:
with toupdate as (
    select pl.*,
        first_value(discount) over (partition by date_id, good_id, store_id, name_promo_mech
                                    order by cnt desc, discount desc) as mode_discount
    from (select pl.*,
                 count(*) over (partition by date_id, good_id, store_id, name_promo_mech, discount) as cnt
          from ds_promo_list_by_day_new pl
         ) pl
)
update toupdate
    set discount = mode_discount
    where mode_discount <> discount;
The subquery counts the rows for each discount within each combination on each day. The outer query picks the discount with the largest count and, in the case of ties, the larger value.
The rest is a simple update.
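To sanity-check the logic, here is the same mode computation run from Python on an in-memory SQLite database (3.25+ for window functions). The sample data is invented (10 rows at a 15 discount, 20 at 40), and since SQLite cannot UPDATE through a CTE, the per-group mode is materialized into a temp table first:

```python
import sqlite3

# Sketch of the mode-update logic against SQLite; table name and row counts
# are made up for illustration.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE promo (date_id TEXT, good_id INT, store_id INT,
                                   name_promo_mech TEXT, discount INT)""")
con.executemany("INSERT INTO promo VALUES (?, ?, ?, ?, ?)",
                [("2017-01-01", 98398, 3, "January 2017", d)
                 for d in [15] * 10 + [40] * 20])

# The inner query counts rows per (group, discount); FIRST_VALUE ordered by
# that count (ties broken by the larger discount) picks each group's mode.
con.execute("""
    CREATE TEMP TABLE modes AS
    SELECT DISTINCT date_id, good_id, store_id, name_promo_mech,
           FIRST_VALUE(discount) OVER (
               PARTITION BY date_id, good_id, store_id, name_promo_mech
               ORDER BY cnt DESC, discount DESC) AS mode_discount
    FROM (SELECT p.*, COUNT(*) OVER (PARTITION BY date_id, good_id, store_id,
                                     name_promo_mech, discount) AS cnt
          FROM promo p)
""")
con.execute("""
    UPDATE promo
    SET discount = (SELECT mode_discount FROM modes m
                    WHERE m.date_id = promo.date_id AND m.good_id = promo.good_id
                      AND m.store_id = promo.store_id
                      AND m.name_promo_mech = promo.name_promo_mech)
""")
print(con.execute("SELECT DISTINCT discount FROM promo").fetchall())  # [(40,)]
```

On SQL Server the updatable CTE in the answer is the more direct route; the temp-table detour is only needed because of SQLite's restrictions.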

Related

Running average but reset when increasing streak stops

Original Dataset
+---------+--------+------------+
| Product | Date | Units Sold |
+---------+--------+------------+
| Prod A | 1/1/19 | 100 |
| Prod A | 1/2/19 | 200 |
| Prod A | 1/3/19 | 300 |
| Prod A | 1/4/19 | 136 |
| Prod A | 1/5/19 | 116 |
| Prod A | 1/6/19 | 120 |
| Prod A | 1/7/19 | 140 |
| Prod A | 1/8/19 | 160 |
+---------+--------+------------+
Desired Output (Last two columns)
+---------+--------+------------+---------------------+------+--------+--------------------+
| Product | Date | Units Sold | Previous Units Sold | Diff | Streak | Streak Running Avg |
+---------+--------+------------+---------------------+------+--------+--------------------+
| Prod A | 1/1/19 | 100 | | 100 | 1 | 100 |
| Prod A | 1/2/19 | 200 | 100 | 100 | 2 | 150 |
| Prod A | 1/3/19 | 300 | 200 | 100 | 3 | 200 |
| Prod A | 1/4/19 | 100 | 300 | -200 | 0 | 0 |
| Prod A | 1/5/19 | 200 | 100 | 100 | 1 | 200 |
| Prod A | 1/6/19 | 300 | 200 | 100 | 2 | 250 |
| Prod A | 1/7/19 | 200 | 300 | -100 | 0 | 0 |
| Prod A | 1/8/19 | 200 | 200 | 0 | 0 | 0 |
+---------+--------+------------+---------------------+------+--------+--------------------+
My goal is to calculate a running average only when the difference between the previous day's and the current day's sales is positive (i.e. when the streak > 0, calculate the running average).
For example, Day 1, 2, & 3 have diff positive i.e. 100
therefore the running average for
Day 1 is 100 (100/1)
Day 2 is 150 (100+200)/2
Day 3 is 200 (100+200+300)/3
 My Query
select *,
    case when flag = 1
         then sum(units) over (partition by item_name order by order_date)
         else 0 end as running_avg_sum
from
(select *,
    lag(units, 1) over (partition by product order by order_date) as previous_day_units,
    case when units > previous_day_units then 1 else 0 end as flag
 from (
    select product,
        order_date,
        sum(units_sold) units
    from product_table
    group by 1, 2
 ) tbl
) t
But above query throws
Invalid operation: Aggregate window functions with an ORDER BY clause require a frame clause;
I'm aware I can resolve the error by adding rows between unbounded preceding and current row to the window function, but that way it would average over all preceding rows. I'm not sure how I can achieve the desired output.
If there's a way to define boundaries, that'd be really helpful
Any help is appreciated. Thank you!
Insert commands in case you want to replicate:
CREATE TABLE product_table
(
    product varchar(200),
    order_date timestamp,
    units_sold bigint
);
INSERT INTO product_table VALUES ('Prod A', '2019-01-01', 100);
INSERT INTO product_table VALUES ('Prod A', '2019-01-02', 200);
INSERT INTO product_table VALUES ('Prod A', '2019-01-03', 300);
INSERT INTO product_table VALUES ('Prod A', '2019-01-04', 136);
INSERT INTO product_table VALUES ('Prod A', '2019-01-05', 116);
INSERT INTO product_table VALUES ('Prod A', '2019-01-06', 120);
INSERT INTO product_table VALUES ('Prod A', '2019-01-07', 140);
INSERT INTO product_table VALUES ('Prod A', '2019-01-08', 160);
This is a somewhat tricky gaps and islands problem. Using the difference in row numbers method, we can try:
WITH cte1 AS (
SELECT *, LAG(units_sold, 1, 0) OVER (PARTITION BY product ORDER BY order_date) AS prev_units_sold,
units_sold - LAG(units_sold, 1, 0) OVER (PARTITION BY product ORDER BY order_date) diff
FROM product_table
),
cte2 AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY product ORDER BY order_date) rn1,
ROW_NUMBER() OVER (PARTITION BY product, diff > 0 ORDER BY order_date) rn2
FROM cte1
)
SELECT product, order_date, units_sold, prev_units_sold, diff,
CASE WHEN diff > 0
THEN ROW_NUMBER() OVER (PARTITION BY product, rn1 - rn2 ORDER BY order_date)
ELSE 0 END AS streak,
CASE WHEN diff > 0
THEN AVG(units_sold) OVER (PARTITION BY product, rn1 - rn2 ORDER BY order_date)
ELSE 0 END AS running_avg
FROM cte2
ORDER BY product, order_date;
The first CTE computes the lagged units sold, per product, ordered by order date. The second CTE computes two row-number sequences: the first is simply ordered by order date, while the second numbers the positive-diff and non-positive-diff records separately. The difference between the two sequences identifies the "islands" over which we want to display the streak and streak average. For the non-positive diff records, we just report zero for the streak and streak average.
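For what it's worth, the query runs as-is on SQLite 3.25+ (which supports these window functions), so it can be checked from Python using the question's INSERT data. Note the streak averages below reflect the actual inserted units (136, 116, 120, ...), not the idealized desired-output table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE product_table (product TEXT, order_date TEXT, units_sold INT)")
con.executemany("INSERT INTO product_table VALUES (?, ?, ?)",
                [("Prod A", "2019-01-0%d" % i, u)
                 for i, u in enumerate([100, 200, 300, 136, 116, 120, 140, 160], 1)])

# rn1 - rn2 is constant within each run of consecutive positive diffs,
# which identifies the "islands" to average over.
sql = """
WITH cte1 AS (
    SELECT *, units_sold - LAG(units_sold, 1, 0)
                  OVER (PARTITION BY product ORDER BY order_date) AS diff
    FROM product_table
),
cte2 AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY product ORDER BY order_date) AS rn1,
              ROW_NUMBER() OVER (PARTITION BY product, diff > 0 ORDER BY order_date) AS rn2
    FROM cte1
)
SELECT order_date,
       CASE WHEN diff > 0
            THEN AVG(units_sold) OVER (PARTITION BY product, rn1 - rn2 ORDER BY order_date)
            ELSE 0 END AS streak_avg
FROM cte2
ORDER BY order_date
"""
result = con.execute(sql).fetchall()
for row in result:
    print(row)
# streak_avg comes out 100, 150, 200, 0, 0, 120, 130, 140 for the eight days
```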

SQL query grouping by range

Hi, I have a table A with the following data:
+------+-------+----+--------+
| YEAR | MONTH | PA | AMOUNT |
+------+-------+----+--------+
| 2020 |     1 | N  |    100 |
| 2020 |     2 | N  |    100 |
| 2020 |     3 | O  |    100 |
| 2020 |     4 | N  |    100 |
| 2020 |     5 | N  |    100 |
| 2020 |     6 | O  |    100 |
+------+-------+----+--------+
I'd like to have the following result:
+---------+---------+--------+
| FROM    | TO      | AMOUNT |
+---------+---------+--------+
| 2020-01 | 2020-02 |    200 |
| 2020-03 | 2020-03 |    100 |
| 2020-04 | 2020-05 |    200 |
| 2020-06 | 2020-06 |    100 |
+---------+---------+--------+
My DB is DB2/400.
I have tried with ROW_NUMBER partitioning, subqueries but I can't figure out how to solve this.
I understand this as a gaps-and-island problem, where you want to group together adjacent rows that have the same PA.
Here is an approach using the difference between row numbers to build the groups:
select min(year_month) year_month_start, max(year_month) year_month_end, sum(amount) amount
from (
    select a.*, year * 100 + month year_month,
        row_number() over(order by year, month) rn1,
        row_number() over(partition by pa order by year, month) rn2
    from a
) a
group by pa, rn1 - rn2
order by year_month_start
You can try the below -
select min(year)||'-'||min(month) as from_date,
       max(year)||'-'||max(month) as to_date,
       sum(amount) as amount
from (
    select *,
        row_number() over(order by month) -
        row_number() over(partition by pa order by month) as grprn
    from t1
) A
group by grprn, pa
order by grprn
This works in T-SQL; I guess you can adapt it to DB2/400?
SELECT MIN(Dte) [From]
    , MAX(Dte) [To]
    -- , PA
    , SUM(Amount)
FROM (
    SELECT Year * 100 + Month Dte
        , Pa
        , Amount
        , ROW_NUMBER() OVER (PARTITION BY Pa ORDER BY Year * 100 + Month) +
          10000 - (Year * 100 + Month) rn
    FROM tabA a
) b
GROUP BY Pa
    , rn
ORDER BY [From]
    , [To]
The trick is the row number function partitioned by PA and ordered by date. It counts up by one for each month, so when added to the descending count 10000 - (Year * 100 + Month) it yields the same number, rn, for consecutive months with the same PA. You then group by PA and the grouping you made, rn, to get the groups, and Bob's your uncle.
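These answers all hinge on the same difference-of-sequences trick. Here is a minimal check of the row-number variant from Python against an in-memory SQLite database (3.25+), with the question's six rows. Note that pa appears in the GROUP BY alongside rn1 - rn2: without it, runs of different PA values that happen to share the same difference would be merged.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE a (year INT, month INT, pa TEXT, amount INT)")
con.executemany("INSERT INTO a VALUES (2020, ?, ?, 100)",
                [(1, "N"), (2, "N"), (3, "O"), (4, "N"), (5, "N"), (6, "O")])

# rn1 - rn2 is constant within each run of consecutive months sharing a PA,
# so grouping by (pa, rn1 - rn2) merges exactly the adjacent same-PA rows.
sql = """
SELECT MIN(year * 100 + month) AS ym_from,
       MAX(year * 100 + month) AS ym_to,
       SUM(amount) AS amount
FROM (SELECT a.*,
             ROW_NUMBER() OVER (ORDER BY year, month) AS rn1,
             ROW_NUMBER() OVER (PARTITION BY pa ORDER BY year, month) AS rn2
      FROM a) t
GROUP BY pa, rn1 - rn2
ORDER BY ym_from
"""
result = con.execute(sql).fetchall()
print(result)
# [(202001, 202002, 200), (202003, 202003, 100), (202004, 202005, 200), (202006, 202006, 100)]
```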

Find groups in data by relative difference between records

I have some rows that are sorted by price:
| id | price |
|----|-------|
| 1 | 2.00 |
| 2 | 2.10 |
| 3 | 2.11 |
| 4 | 2.50 |
| 5 | 2.99 |
| 6 | 3.02 |
| 7 | 9.01 |
| 8 | 9.10 |
| 9 | 9.11 |
| 10 | 13.01 |
| 11 | 13.51 |
| 12 | 14.10 |
I need to group them in "price groups". An item belongs to a different group when difference in price between it and the previous item is greater than some fixed value, say 1.50.
So the expected result is something like this:
| MIN(price) | MAX(price) |
|------------|------------|
| 2.00 | 3.02 |
| 9.01 | 9.11 |
| 13.01 | 14.10 |
I'm not even sure how to call this type of grouping. Group by "rolling difference"? Not exactly...
Can this be done in SQL (or in Postgres in particular)?
Your results are consistent with looking at the previous value and saying a group starts when the difference is greater than 1.5. You can do this with lag(), a cumulative sum, and aggregation:
select min(price), max(price)
from (select t.*,
count(*) filter (where prev_price is null or prev_price < price - 1.5) over (order by price) as grp
from (select t.*,
lag(price) over (order by price) as prev_price
from t
) t
) t
group by grp
Thanks Gordon Linoff for his answer, it is exactly what I was after!
I ended up using this query here simply because I understand it better. I guess it is more noobish, but so am I.
Both queries sort a table of 1M rows into 34 groups in about a second. This query is a bit more performant on 11M rows, sorting them into 380 groups in 15 seconds, vs 23 seconds in Gordon's answer.
SELECT results.group_i, MIN(results.price), MAX(results.price), AVG(results.price)
FROM (
SELECT *,
SUM(new_group) OVER (ORDER BY price ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS group_i
FROM (
SELECT annotated.*,
CASE
WHEN prev_price IS NULL OR price - prev_price > 1.5 THEN 1
ELSE 0
END AS new_group
FROM (
SELECT *,
LAG(price) OVER (ORDER BY price) AS prev_price
FROM prices
) AS annotated
) AS grouppable
) AS results
GROUP BY results.group_i
ORDER BY results.group_i;
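The flag-then-running-sum idea is easy to check from Python against an in-memory SQLite database (3.25+), using the question's twelve prices and the same 1.50 threshold:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE prices (id INT, price REAL)")
con.executemany("INSERT INTO prices VALUES (?, ?)", enumerate(
    [2.00, 2.10, 2.11, 2.50, 2.99, 3.02, 9.01, 9.10, 9.11, 13.01, 13.51, 14.10], 1))

# Flag the first row of each group (gap from the previous price > 1.5, or no
# previous price at all), then a running sum of the flags numbers the groups.
sql = """
SELECT MIN(price), MAX(price)
FROM (SELECT price,
             SUM(CASE WHEN prev IS NULL OR price - prev > 1.5 THEN 1 ELSE 0 END)
                 OVER (ORDER BY price) AS grp
      FROM (SELECT price, LAG(price) OVER (ORDER BY price) AS prev
            FROM prices) t) g
GROUP BY grp
ORDER BY 1
"""
groups = con.execute(sql).fetchall()
print(groups)  # [(2.0, 3.02), (9.01, 9.11), (13.01, 14.1)]
```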

Remove duplicate rows SQL Server?

I have a table (SQL Server 2017) with sales data that contains duplicate rows, for example:
+---------+---------+---------+----------+---------+----------+
| year_id | week_id | good_id | store_id | ship_id | quantity |
+---------+---------+---------+----------+---------+----------+
|    2017 |      43 |  154876 |       19 |       6 |        2 |
|    2017 |      43 |  154876 |       19 |       6 |        0 |
|    2019 |      32 |  456123 |       67 |       4 |        6 |
|    2019 |      32 |  456123 |       67 |       4 |        4 |
|    2019 |      32 |  456123 |       67 |       4 |        0 |
|    2018 |      32 |  456123 |       67 |       4 |        0 |
+---------+---------+---------+----------+---------+----------+
I want to delete rows that have the same year_id, week_id, good_id, store_id and ship_id columns, but the quantity is 0. For example:
+---------+---------+---------+----------+---------+----------+
| year_id | week_id | good_id | store_id | ship_id | quantity |
+---------+---------+---------+----------+---------+----------+
|    2017 |      43 |  154876 |       19 |       6 |        2 |
|    2019 |      32 |  456123 |       67 |       4 |        6 |
+---------+---------+---------+----------+---------+----------+
I found a query that can do this, but I don’t understand how to indicate that I need to delete a row with a quantity equal to 0.
WITH CTE AS (
    SELECT year_id, week_id, good_id, store_id, ship_id,
        RN = ROW_NUMBER() OVER (PARTITION BY year_id ORDER BY year_id)
    FROM dbo.sales
)
DELETE FROM CTE WHERE RN > 1
A deletable CTE is on the right track. Here is one way:
WITH cte AS (
SELECT *, COUNT(*) OVER (PARTITION BY year_id, week_id, good_id, store_id, ship_id) cnt
FROM dbo.sales
)
DELETE
FROM cte
WHERE cnt = 2 AND quantity = 0;
This will delete every record that is a duplicate with regard to the five columns you mentioned and has a zero quantity. If you also want to cater for duplicates occurring in groups larger than pairs, just change the restriction on cnt (e.g. to cnt >= 2).
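The same rule can be checked quickly from Python with SQLite. SQLite has no deletable CTEs, so an EXISTS over the same key stands in for the windowed COUNT here (delete a zero-quantity row only when the key has at least one other, non-zero row):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE sales (year_id INT, week_id INT, good_id INT,
                                   store_id INT, ship_id INT, quantity INT)""")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)", [
    (2017, 43, 154876, 19, 6, 2),
    (2017, 43, 154876, 19, 6, 0),
    (2019, 32, 456123, 67, 4, 6),
    (2019, 32, 456123, 67, 4, 4),
    (2019, 32, 456123, 67, 4, 0),
    (2018, 32, 456123, 67, 4, 0),
])

# Delete zero-quantity rows only when another row with the same five-column
# key exists with a non-zero quantity.
con.execute("""
    DELETE FROM sales
    WHERE quantity = 0
      AND EXISTS (SELECT 1 FROM sales AS s2
                  WHERE s2.year_id = sales.year_id AND s2.week_id = sales.week_id
                    AND s2.good_id = sales.good_id AND s2.store_id = sales.store_id
                    AND s2.ship_id = sales.ship_id AND s2.quantity <> 0)
""")
print(con.execute("SELECT COUNT(*) FROM sales WHERE quantity = 0").fetchone()[0])  # 1
```

Note that the lone 2018 row keeps its zero quantity: it has no non-zero sibling, which matches the stated rule even though the question's desired output happens to omit it.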
WITH CTE AS
(
    SELECT year_id, week_id, good_id, store_id, ship_id, quantity,
        ROW_NUMBER() OVER (PARTITION BY year_id, week_id, good_id, store_id, ship_id
                           ORDER BY quantity DESC) RN
    FROM dbo.sales
)
DELETE FROM CTE WHERE RN > 1 AND quantity = 0
In your case the query will look like below (the count has to be a windowed COUNT(*) rather than a GROUP BY, so the CTE stays deletable):
WITH CTE AS (
    SELECT year_id, week_id, good_id, store_id, ship_id, quantity,
        RN = ROW_NUMBER() OVER (PARTITION BY year_id, week_id, good_id, store_id, ship_id
                                ORDER BY quantity),
        cnt = COUNT(*) OVER (PARTITION BY year_id, week_id, good_id, store_id, ship_id)
    FROM dbo.sales
)
DELETE FROM CTE WHERE RN = 1 AND quantity = 0 AND cnt > 1
If you only want to delete duplicates whose quantity is 0, you need the quantity = 0 condition in the WHERE clause; otherwise you can remove it.

SQL Server 2008 How can I SELECT rows with Value by GROUP(ing) other column?

Hi guys I'm a beginner at SQL so please bear with me.. :)
My question is as follows.
I got this table:
DateTime           | ID | Year | Month | Value | Cost |
-------------------|----|------|-------|-------|------|
1-1-2013 00:00:01  |  1 | 2013 |     1 |    30 |   90 |
1-1-2013 00:01:01  |  1 | 2013 |     1 |     0 |    0 |
1-1-2013 00:02:01  |  1 | 2013 |     1 |     1 |    3 |
1-2-2013 00:00:01  |  1 | 2013 |     2 |     2 |    6 |
1-2-2013 00:01:01  |  1 | 2013 |     2 |     3 |    9 |
1-2-2013 00:02:01  |  1 | 2013 |     2 |     4 |   12 |
1-3-2013 00:00:01  |  1 | 2013 |     3 |     5 |   15 |
1-3-2013 00:01:01  |  1 | 2013 |     3 |     6 |   18 |
1-3-2013 00:02:01  |  1 | 2013 |     3 |     7 |   21 |
Now what I'm trying to get is this result
  Year | Month | Value | Cost
|------|-------|-------|------|
| 2013 |     1 |     1 |    3 |
| 2013 |     2 |     4 |   12 |
| 2013 |     3 |     7 |   21 |
As you can see, I'm trying to GROUP BY the [Month] and the [Year] and get the last [Value] for every [Month].
As you can understand from the result, I'm not trying to get the MAX() of the [Value] column but the last value for every [Month], and that is my issue.
Thanks in advance
PS
I was able to GROUP BY the [Year] and the [Month], but as I understand it, when I add the [Value] column the GROUP BY no longer gives the result I want, as SQL needs more specification of which value it should pick.
Instead of using row_number(), you can also use rank(). Using rank() might give you multiple values within the same year and month, see this post.
Because of this, a group by is added.
SELECT
[Year],
[Month],
[Value],
[Cost]
FROM
(
SELECT
[Year],
[Month],
[Value],
[Cost],
Rank() OVER (PARTITION BY [Year], [Month] ORDER BY [DateTime] DESC) AS [Rank]
FROM [t1]
) AS [sub]
WHERE [Rank] = 1
GROUP BY
[Year],
[Month],
[Value],
[Cost]
ORDER BY
[Year] ASC,
[Month] ASC
As stated in the comments, this might still return multiple records for a single month. Therefore the ORDER BY clause can be extended, based on the desired behaviour:
Rank() OVER (PARTITION BY [Year], [Month] ORDER BY [DateTime] DESC, [Value] DESC, [Cost] ASC) AS [Rank]
Switching the order of [Value] and [Cost], or flipping ASC and DESC, will influence the rank and therefore the result.
Since you are using SQL Server 2008, you can use row_number() to get the result:
select year, month, value, cost
from
(
select year, month, value, cost,
row_number() over(partition by year, month order by datetime desc) rn
from yourtable
) src
where rn = 1
See SQL Fiddle with Demo
Or you can use a subquery to get this (note: with this version, if you have more than one record with the same max datetime in a month, you will get each of those records):
select t1.year, t1.month, t1.value, t1.cost
from yourtable t1
inner join
(
select max(datetime) datetime
from yourtable
group by year, month
) t2
on t1.datetime = t2.datetime
See SQL Fiddle with Demo
Both give the same result:
| YEAR | MONTH | VALUE | COST |
-------------------------------
| 2013 | 1 | 1 | 3 |
| 2013 | 2 | 4 | 12 |
| 2013 | 3 | 7 | 21 |
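Both variants are easy to verify from Python against an in-memory SQLite database (3.25+ for row_number()); here the DateTime column is renamed dt, the timestamps are written in ISO form, and the sample rows are typed in from the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE t1 (dt TEXT, id INT, year INT, month INT,
                                value INT, cost INT)""")
con.executemany("INSERT INTO t1 VALUES (?, ?, ?, ?, ?, ?)", [
    ("2013-01-01 00:00:01", 1, 2013, 1, 30, 90),
    ("2013-01-01 00:01:01", 1, 2013, 1, 0, 0),
    ("2013-01-01 00:02:01", 1, 2013, 1, 1, 3),
    ("2013-02-01 00:00:01", 1, 2013, 2, 2, 6),
    ("2013-02-01 00:01:01", 1, 2013, 2, 3, 9),
    ("2013-02-01 00:02:01", 1, 2013, 2, 4, 12),
    ("2013-03-01 00:00:01", 1, 2013, 3, 5, 15),
    ("2013-03-01 00:01:01", 1, 2013, 3, 6, 18),
    ("2013-03-01 00:02:01", 1, 2013, 3, 7, 21),
])

# Number the rows in each month from newest to oldest; rn = 1 is the
# latest reading for that month.
sql = """
SELECT year, month, value, cost
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY year, month
                                   ORDER BY dt DESC) AS rn
      FROM t1) s
WHERE rn = 1
ORDER BY year, month
"""
result = con.execute(sql).fetchall()
print(result)  # [(2013, 1, 1, 3), (2013, 2, 4, 12), (2013, 3, 7, 21)]
```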