Remove duplicate rows in SQL Server?

I have a table (SQL Server 2017) with sales data that contains duplicate rows, for example:
+---------+---------+---------+----------+---------+----------+
| year_id | week_id | good_id | store_id | ship_id | quantity |
+---------+---------+---------+----------+---------+----------+
|    2017 |      43 |  154876 |       19 |       6 |        2 |
|    2017 |      43 |  154876 |       19 |       6 |        0 |
|    2019 |      32 |  456123 |       67 |       4 |        6 |
|    2019 |      32 |  456123 |       67 |       4 |        4 |
|    2019 |      32 |  456123 |       67 |       4 |        0 |
|    2018 |      32 |  456123 |       67 |       4 |        0 |
+---------+---------+---------+----------+---------+----------+
I want to delete the rows that duplicate another row on the year_id, week_id, good_id, store_id and ship_id columns and have a quantity of 0. The result should look like this:
+---------+---------+---------+----------+---------+----------+
| year_id | week_id | good_id | store_id | ship_id | quantity |
+---------+---------+---------+----------+---------+----------+
|    2017 |      43 |  154876 |       19 |       6 |        2 |
|    2019 |      32 |  456123 |       67 |       4 |        6 |
+---------+---------+---------+----------+---------+----------+
I found a query that can delete duplicates, but I don't understand how to indicate that I need to delete the row whose quantity equals 0.
WITH CTE AS (
    SELECT year_id, week_id, good_id, store_id, ship_id,
           RN = ROW_NUMBER() OVER (PARTITION BY year_id ORDER BY year_id)
    FROM dbo.sales
)
DELETE FROM CTE WHERE RN > 1

A deletable CTE is on the right track. Here is one way:
WITH cte AS (
    SELECT *,
           COUNT(*) OVER (PARTITION BY year_id, week_id, good_id,
                          store_id, ship_id) AS cnt
    FROM dbo.sales
)
DELETE FROM cte
WHERE cnt = 2 AND quantity = 0;
This deletes every record that is a duplicate with regard to the five columns you mentioned and has a zero quantity. If you also want to handle groups larger than pairs, relax the restriction on cnt (for example, cnt > 1).
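As an illustration only, here is a minimal sketch of the same idea in Python with SQLite, using the sample data from the question. SQLite has no deletable CTE, so the sketch targets the duplicate zero rows through `rowid` instead; the `cnt > 1` form covers groups larger than pairs.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (year_id INT, week_id INT, good_id INT,
                        store_id INT, ship_id INT, quantity INT);
    INSERT INTO sales VALUES
        (2017, 43, 154876, 19, 6, 2),
        (2017, 43, 154876, 19, 6, 0),
        (2019, 32, 456123, 67, 4, 6),
        (2019, 32, 456123, 67, 4, 4),
        (2019, 32, 456123, 67, 4, 0),
        (2018, 32, 456123, 67, 4, 0);
""")

# Delete zero-quantity rows only when the five-column key occurs more
# than once; the lone 2018 zero row is not a duplicate and survives.
con.execute("""
    DELETE FROM sales WHERE rowid IN (
        SELECT rowid FROM (
            SELECT rowid, quantity,
                   COUNT(*) OVER (PARTITION BY year_id, week_id, good_id,
                                  store_id, ship_id) AS cnt
            FROM sales
        )
        WHERE cnt > 1 AND quantity = 0
    )
""")

rows = con.execute(
    "SELECT year_id, quantity FROM sales ORDER BY year_id, quantity"
).fetchall()
print(rows)   # [(2017, 2), (2018, 0), (2019, 4), (2019, 6)]
```

Note that the non-duplicate zero row (2018) is kept, which matches the stated rule even though the question's example output omits the 2019 quantity-4 row.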

WITH CTE AS (
    SELECT year_id, week_id, good_id, store_id, ship_id, quantity,
           ROW_NUMBER() OVER (PARTITION BY year_id, week_id, good_id,
                              store_id, ship_id
                              ORDER BY quantity DESC) AS RN
    FROM dbo.sales
)
DELETE FROM CTE WHERE RN > 1 AND quantity = 0

In your case the query will look like the one below. (Note: window functions must be used instead of GROUP BY here, because a grouped CTE is not updatable and quantity would not be available after grouping.)
WITH CTE AS (
    SELECT year_id, week_id, good_id, store_id, ship_id, quantity,
           RN = ROW_NUMBER() OVER (PARTITION BY year_id, week_id, good_id,
                                   store_id, ship_id
                                   ORDER BY quantity),
           cnt = COUNT(*) OVER (PARTITION BY year_id, week_id, good_id,
                                store_id, ship_id)
    FROM dbo.sales
)
DELETE FROM CTE WHERE RN = 1 AND quantity = 0 AND cnt > 1
If you only want to delete duplicates whose quantity is 0, keep quantity = 0 in the WHERE clause; otherwise you can remove that condition.


Getting duplication WITH partition by

I want to know which customers who ordered in June 2020 also ordered in June 2021. My code returns the correct DISTINCT orders, but discounted sales is incorrect for customers who placed more than one order in either year. For example, a customer who placed one order in 2020 and four orders in 2021 has 2020 discounted sales at 4x the actual amount. The four orders in 2021 have four rows, and the one 2020 order populates against each. I saw this by using ROW_NUMBER () which exposed the underlying problem. I cannot use DISTINCT with discounted sales because customers do place multiple orders for identical dollar amounts. How do I get the exact discounted sales using standard SQL for BQ?
SELECT
DISTINCT ly.cuid AS cuid,
COUNT(DISTINCT ly.order_id) OVER (PARTITION BY ly.cuid) AS ly_orders,
SUM(ly.discounted_sales) OVER (PARTITION BY ly.cuid) AS ly_demand,
COUNT(DISTINCT ty.order_id) OVER (PARTITION BY ty.cuid) AS ty_orders,
SUM(ty.discounted_sales) OVER (PARTITION BY ly.cuid) AS ty_demand
FROM table ly
LEFT JOIN table ty
ON ly.cuid = ty.cuid
WHERE ly.order_date BETWEEN '2020-06-01' AND '2020-06-30'
AND ty.order_date BETWEEN '2021-06-01' AND '2021-06-30'
AND ly.financial_status <> 'credit'
AND ty.financial_status <> 'credit'
AND ly.discounted_sales >0
AND ty.discounted_sales >0
AND ly.channel = 'b2b'
AND ty.channel = 'b2b'
ORDER BY ly.cuid asc
[Results]
+------+-----------+-----------+-----------+------------+--------------------+
| cuid | ly_orders | ly_demand | ty_orders | ty_demand  | comments           |
+------+-----------+-----------+-----------+------------+--------------------+
| D    |         1 | 22,466.40 |         4 | 154,596.24 | ly is 4x actual    |
| F    |         2 |  2,573.20 |         1 |   1,944.40 | ty is 2x actual    |
| G    |         1 | 32,134.40 |         4 |   1,632.00 | ly is 4x actual    |
| I    |         2 |    757.56 |         1 |     730.56 | ty is 2x actual    |
| J    |         2 | 54,859.00 |         2 |  23,822.32 | both are 2x actual |
+------+-----------+-----------+-----------+------------+--------------------+
THIS WORKED:
WITH prior_period AS (
SELECT
DISTINCT cuid AS ly_cuid,
COUNT(DISTINCT order_id) OVER (PARTITION BY cuid) AS ly_orders,
SUM(discounted_sales) OVER (PARTITION BY cuid) AS ly_demand
FROM TABLE
WHERE EXTRACT (YEAR FROM order_date) = 2020 AND EXTRACT(MONTH FROM order_date) = 6
AND financial_status <> 'credit'
AND discounted_sales >0
AND channel = 'b2b'
GROUP BY cuid, order_id, discounted_sales
ORDER BY cuid asc),
this_period AS (
SELECT
DISTINCT cuid AS ty_cuid,
COUNT(DISTINCT order_id) OVER (PARTITION BY cuid) AS ty_orders,
SUM(discounted_sales) OVER (PARTITION BY cuid) AS ty_demand
FROM TABLE
WHERE EXTRACT (YEAR FROM order_date) = 2021 AND EXTRACT(MONTH FROM order_date) = 6
AND financial_status <> 'credit'
AND discounted_sales >0
AND channel = 'b2b'
GROUP BY cuid, order_id, discounted_sales
ORDER BY cuid asc)
SELECT *
FROM prior_period ly
JOIN this_period ty ON ly.ly_cuid = ty.ty_cuid
ORDER BY ly.ly_cuid
Updated with your schema and approximate data:
Try this...
WITH periods AS (
SELECT cuid AS cuid
, COUNT(*) AS orders
, SUM(discounted_sales) AS demand
, EXTRACT(YEAR FROM order_date) AS yr
FROM demand2
WHERE EXTRACT(YEAR FROM order_date) IN (2020, 2021) AND EXTRACT(MONTH FROM order_date) = 6
AND financial_status <> 'credit'
AND discounted_sales > 0
AND channel = 'b2b'
GROUP BY cuid, EXTRACT(YEAR FROM order_date)
)
SELECT ly.cuid
, ly.orders AS ly_orders
, ly.demand AS ly_demand
, ty.orders AS ty_orders
, ty.demand AS ty_demand
FROM periods AS ly
JOIN periods AS ty
ON ly.cuid = ty.cuid
AND ly.yr = 2020
AND ty.yr = 2021
ORDER BY ly.cuid
;
The result:
+------+-----------+-----------+-----------+-----------+
| cuid | ly_orders | ly_demand | ty_orders | ty_demand |
+------+-----------+-----------+-----------+-----------+
| D | 1 | 5616.60 | 4 | 154596.24 |
| F | 2 | 2573.20 | 1 | 972.20 |
| G | 1 | 8033.60 | 4 | 1632.56 |
| I | 2 | 757.56 | 1 | 365.28 |
| J | 2 | 27429.50 | 2 | 11911.16 |
+------+-----------+-----------+-----------+-----------+
Here's a similar example with data, SQL and results to show both the incorrect result and the correct result.
The data:
SELECT * FROM demand ORDER BY account_id, period;
+----+------------+--------+--------+
| id | account_id | period | demand |
+----+------------+--------+--------+
| 1 | 1 | 202005 | 100 |
| 2 | 1 | 202005 | 120 |
| 3 | 1 | 202105 | 105 |
| 4 | 1 | 202105 | 125 |
| 5 | 1 | 202105 | 30 |
| 6 | 2 | 202005 | 200 |
| 7 | 2 | 202105 | 240 |
+----+------------+--------+--------+
The incorrect SQL, without SUMs to just show the join behavior:
SELECT t1.id, t1.account_id, t1.period, t1.demand AS demand1
, t2.id, t2.period, t2.demand AS demand2
FROM demand AS t1
LEFT JOIN demand AS t2
ON t1.account_id = t2.account_id
AND t2.period = 202105
WHERE t1.period = 202005
ORDER BY t1.account_id, t1.period, demand1, demand2
;
+----+------------+--------+---------+------+--------+---------+
| id | account_id | period | demand1 | id | period | demand2 |
+----+------------+--------+---------+------+--------+---------+
| 1 | 1 | 202005 | 100 | 5 | 202105 | 30 |
| 1 | 1 | 202005 | 100 | 3 | 202105 | 105 |
| 1 | 1 | 202005 | 100 | 4 | 202105 | 125 |
| 2 | 1 | 202005 | 120 | 5 | 202105 | 30 |
| 2 | 1 | 202005 | 120 | 3 | 202105 | 105 |
| 2 | 1 | 202005 | 120 | 4 | 202105 | 125 |
| 6 | 2 | 202005 | 200 | 7 | 202105 | 240 |
+----+------------+--------+---------+------+--------+---------+
Notice account 2 doesn't have a problem because only one demand was found last year and this year.
But account 1 found 2 demand rows for last year and 3 demand rows for this year, leading to (2 x 3) = 6 rows in the joined result. This is the source of the problem.
To correct this, we aggregate before the JOIN, so that we have at most one (1) row per account / period to be joined.
One form of the correct solution, with SUMs derived in the CTE term:
WITH demands AS (
SELECT account_id, period
, SUM(demand) AS demand
, COUNT(*) AS orders
FROM demand
GROUP BY account_id, period
)
SELECT ly.account_id, ly.period
, ly.orders AS ly_orders
, ly.demand AS ly_demand
, ty.orders AS ty_orders
, ty.demand AS ty_demand
FROM demands AS ly
LEFT JOIN demands AS ty
ON ly.account_id = ty.account_id
AND ty.period = 202105
WHERE ly.period = 202005
ORDER BY ly.account_id, ly.period, ly_demand, ty_demand
;
The result:
+------------+--------+-----------+-----------+-----------+-----------+
| account_id | period | ly_orders | ly_demand | ty_orders | ty_demand |
+------------+--------+-----------+-----------+-----------+-----------+
| 1 | 202005 | 2 | 220 | 3 | 260 |
| 2 | 202005 | 1 | 200 | 1 | 240 |
+------------+--------+-----------+-----------+-----------+-----------+
Since we performed aggregation in the CTE term (demands), the join found at most one row for each period for each account.
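The fan-out and its fix can be reproduced end to end. As an illustrative sketch (Python with SQLite, using the `demand` data above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE demand (id INT, account_id INT, period INT, demand INT);
    INSERT INTO demand VALUES
        (1, 1, 202005, 100), (2, 1, 202005, 120),
        (3, 1, 202105, 105), (4, 1, 202105, 125), (5, 1, 202105, 30),
        (6, 2, 202005, 200), (7, 2, 202105, 240);
""")

# Joining the raw rows fans out: 2 x 3 = 6 rows for account 1.
fanout = con.execute("""
    SELECT COUNT(*) FROM demand t1
    LEFT JOIN demand t2 ON t1.account_id = t2.account_id
                       AND t2.period = 202105
    WHERE t1.period = 202005 AND t1.account_id = 1
""").fetchone()[0]

# Aggregating first leaves at most one row per account/period to join.
fixed = con.execute("""
    WITH demands AS (
        SELECT account_id, period, SUM(demand) AS demand, COUNT(*) AS orders
        FROM demand GROUP BY account_id, period
    )
    SELECT ly.account_id, ly.orders, ly.demand, ty.orders, ty.demand
    FROM demands ly
    LEFT JOIN demands ty ON ly.account_id = ty.account_id
                        AND ty.period = 202105
    WHERE ly.period = 202005
    ORDER BY ly.account_id
""").fetchall()

print(fanout, fixed)   # 6 [(1, 2, 220, 3, 260), (2, 1, 200, 1, 240)]
```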

SQL query grouping by range

Hi, I have a table A with the following data:
+------+-------+----+--------+
| YEAR | MONTH | PA | AMOUNT |
+------+-------+----+--------+
| 2020 |     1 | N  |    100 |
| 2020 |     2 | N  |    100 |
| 2020 |     3 | O  |    100 |
| 2020 |     4 | N  |    100 |
| 2020 |     5 | N  |    100 |
| 2020 |     6 | O  |    100 |
+------+-------+----+--------+
I'd like to have the following result:
+---------+---------+--------+
| FROM    | TO      | AMOUNT |
+---------+---------+--------+
| 2020-01 | 2020-02 |    200 |
| 2020-03 | 2020-03 |    100 |
| 2020-04 | 2020-05 |    200 |
| 2020-06 | 2020-06 |    100 |
+---------+---------+--------+
My DB is DB2/400.
I have tried ROW_NUMBER partitioning and subqueries, but I can't figure out how to solve this.
I understand this as a gaps-and-islands problem, where you want to group together adjacent rows that have the same PA.
Here is an approach using the difference between row numbers to build the groups:
select min(year_month) year_month_start, max(year_month) year_month_end, sum(amount) amount
from (
    select a.*,
           year * 100 + month year_month,
           row_number() over(order by year, month) rn1,
           row_number() over(partition by pa order by year, month) rn2
    from a
) a
group by pa, rn1 - rn2
order by year_month_start
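As an illustration, the row-number-difference trick can be run end to end in Python with SQLite against the question's table A (grouping by pa as well as the difference, to keep streaks of different PA values apart):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (year INT, month INT, pa TEXT, amount INT);
    INSERT INTO a VALUES
        (2020, 1, 'N', 100), (2020, 2, 'N', 100), (2020, 3, 'O', 100),
        (2020, 4, 'N', 100), (2020, 5, 'N', 100), (2020, 6, 'O', 100);
""")

# rn1 - rn2 is constant within each run of equal pa values: rn1 counts
# all rows in date order, rn2 counts only this pa's rows in date order.
rows = con.execute("""
    SELECT MIN(year_month), MAX(year_month), SUM(amount)
    FROM (
        SELECT a.*, year * 100 + month AS year_month,
               ROW_NUMBER() OVER (ORDER BY year, month) AS rn1,
               ROW_NUMBER() OVER (PARTITION BY pa ORDER BY year, month) AS rn2
        FROM a
    )
    GROUP BY pa, rn1 - rn2
    ORDER BY 1
""").fetchall()
print(rows)   # [(202001, 202002, 200), (202003, 202003, 100),
              #  (202004, 202005, 200), (202006, 202006, 100)]
```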
You can try the below -
select min(year)||'-'||min(month) as from_date,
       max(year)||'-'||max(month) as to_date,
       sum(amount) as amount
from (
    select t1.*,
           row_number() over(order by year, month) -
           row_number() over(partition by pa order by year, month) as grprn
    from t1
) A
group by grprn, pa
order by grprn
This works in T-SQL; I guess you can adapt it to DB2/400?
SELECT MIN(Dte) AS [From]
     , MAX(Dte) AS [To]
  -- , PA
     , SUM(Amount)
FROM (
    SELECT year * 100 + month AS Dte
         , pa
         , amount
         , ROW_NUMBER() OVER (PARTITION BY pa ORDER BY year * 100 + month)
           + 10000 - (year * 100 + month) AS rn
    FROM tabA a
) b
GROUP BY pa
       , rn
ORDER BY [From]
       , [To]
The trick is the ROW_NUMBER() function partitioned by PA and ordered by date: it counts up by one for each month. When you add the descending month offset (10000 - (year * 100 + month)), which counts down by one per month, consecutive months with the same PA all land on the same number. You then group by PA and that computed value, rn, to get the groups, and Bob's your uncle.

How do I update a value based on the dense_rank result?

I have a query (SQL Server 2017) that finds two different discounts on the same date.
WITH CTE AS (SELECT [date_id], [good_id], [store_id], [name_promo_mech], [discount],
RN = DENSE_RANK() OVER (PARTITION BY [date_id], [good_id], [store_id], [name_promo_mech]
ORDER BY [discount]) +
DENSE_RANK() OVER (PARTITION BY [date_id], [good_id], [store_id], [name_promo_mech]
ORDER BY [discount] DESC) - 1
FROM [dbo].[ds_promo_list_by_day_new] AS PL
)
SELECT * FROM CTE
WHERE RN > 1;
GO
Query result:
+------------+----------+---------+-----------------+----------+----+
| date_id    | store_id | good_id | name_promo_mech | discount | RN |
+------------+----------+---------+-----------------+----------+----+
| 2017-01-01 |        3 |   98398 | January 2017    |       15 |  2 |
| 2017-01-01 |        3 |   98398 | January 2017    |       40 |  2 |
| 2017-01-01 |        5 |   98398 | January 2017    |       15 |  3 |
| 2017-01-01 |        5 |   98398 | January 2017    |       40 |  3 |
| 2017-01-01 |        5 |   98398 | January 2017    |       30 |  3 |
+------------+----------+---------+-----------------+----------+----+
Now I want to make the discounts the same for each unique combination of good_id, store_id and name_promo_mech in the source table. There is a rule for this. For example, if for the group good_id = 98398, store_id = 3, name_promo_mech = N'January 2017' there were 10 entries with a 15 discount and 20 with a 40 discount, then the 15 discount should be replaced with 40. However, if the number of entries for each discount is the same, then the maximum discount is set for all of them.
Can I do this? The number of rows in the source table is about 100 million.
What you want to do is set the value to the mode (a statistical term for the most common value) on each date and combination of whatever. You can use window functions:
with toupdate as (
select pl.*,
first_value(discount) over (partition by date_id, good_id, store_id, name_promo_mech order by cnt desc, discount desc) as mode_discount
from (select pl.*,
count(*) over (partition by date_id, good_id, store_id, name_promo_mech, discount) as cnt
from ds_promo_list_by_day_new pl
) pl
)
update toupdate
set discount = mode_discount
where mode_discount <> discount;
The subquery counts the number of values for each discount for each whatever on each day. The outer query gets the discount with the largest count, and in the case of ties, the larger value.
The rest is a simple update.
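As a minimal illustration of the mode-update idea (Python with SQLite, on a tiny made-up subset of the question's data): SQLite has no updatable CTE, so the sketch approximates it with a correlated subquery that picks the most common discount per group, breaking ties by the larger value.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE promo (date_id TEXT, store_id INT, good_id INT,
                        name_promo_mech TEXT, discount INT);
    INSERT INTO promo VALUES
        ('2017-01-01', 3, 98398, 'January 2017', 15),
        ('2017-01-01', 3, 98398, 'January 2017', 40),
        ('2017-01-01', 3, 98398, 'January 2017', 40);
""")

# For every row, look up the mode discount of its group (count desc,
# then discount desc for ties) and overwrite the row's discount with it.
con.execute("""
    UPDATE promo SET discount = (
        SELECT p2.discount FROM promo p2
        WHERE p2.date_id = promo.date_id
          AND p2.store_id = promo.store_id
          AND p2.good_id = promo.good_id
          AND p2.name_promo_mech = promo.name_promo_mech
        GROUP BY p2.discount
        ORDER BY COUNT(*) DESC, p2.discount DESC
        LIMIT 1
    )
""")
rows = con.execute("SELECT discount FROM promo").fetchall()
print(rows)   # [(40,), (40,), (40,)]
```

On 100 million rows the windowed update from the answer above should scale better than a per-row correlated subquery; this sketch is only to make the tie-breaking rule concrete.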

Get users who took ride for 3 or more consecutive dates

I have the table below; it shows user_id and ride_date.
+---------+------------+
| user_id | ride_date |
+---------+------------+
| 1 | 2019-11-01 |
| 1 | 2019-11-03 |
| 1 | 2019-11-05 |
| 2 | 2019-11-03 |
| 2 | 2019-11-04 |
| 2 | 2019-11-05 |
| 2 | 2019-11-06 |
| 3 | 2019-11-03 |
| 3 | 2019-11-04 |
| 3 | 2019-11-05 |
| 3 | 2019-11-06 |
| 4 | 2019-11-05 |
| 4 | 2019-11-07 |
| 4 | 2019-11-08 |
| 4 | 2019-11-09 |
| 5 | 2019-11-11 |
| 5 | 2019-11-13 |
+---------+------------+
I want the user_ids who took rides on 3 or more consecutive days, along with the days on which they took those consecutive rides.
The desired result is below:
+---------+-----------------------+
| user_id | consecutive_ride_date |
+---------+-----------------------+
| 2 | 2019-11-03 |
| 2 | 2019-11-04 |
| 2 | 2019-11-05 |
| 2 | 2019-11-06 |
| 3 | 2019-11-03 |
| 3 | 2019-11-04 |
| 3 | 2019-11-05 |
| 3 | 2019-11-06 |
| 4 | 2019-11-08 |
| 4 | 2019-11-09 |
| 4 | 2019-11-10 |
+---------+-----------------------+
With LAG() and LEAD() window functions:
with cte as (
select *,
datediff(
day,
lag([ride_date]) over (partition by [user_id] order by [ride_date]),
[ride_date]
) prev1,
datediff(
day,
lag([ride_date], 2) over (partition by [user_id] order by [ride_date]),
[ride_date]
) prev2,
datediff(
day,
[ride_date],
lead([ride_date]) over (partition by [user_id] order by [ride_date])
) next1,
datediff(
day,
[ride_date],
lead([ride_date], 2) over (partition by [user_id] order by [ride_date])
) next2
from Table1
)
select [user_id], [ride_date]
from cte
where
(prev1 = 1 and prev2 = 2) or
(prev1 = 1 and next1 = 1) or
(next1 = 1 and next2 = 2)
See the demo.
Results:
+---------+------------+
| user_id | ride_date  |
+---------+------------+
|       2 | 03/11/2019 |
|       2 | 04/11/2019 |
|       2 | 05/11/2019 |
|       2 | 06/11/2019 |
|       3 | 03/11/2019 |
|       3 | 04/11/2019 |
|       3 | 05/11/2019 |
|       3 | 06/11/2019 |
|       4 | 07/11/2019 |
|       4 | 08/11/2019 |
|       4 | 09/11/2019 |
+---------+------------+
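As an illustration of the LAG()/LEAD() neighbour check above, here is a runnable Python/SQLite sketch on a trimmed copy of the data (users 1, 2 and 4 only); `julianday()` stands in for SQL Server's DATEDIFF(day, ...):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE rides (user_id INT, ride_date TEXT);
    INSERT INTO rides VALUES
        (1,'2019-11-01'),(1,'2019-11-03'),(1,'2019-11-05'),
        (2,'2019-11-03'),(2,'2019-11-04'),(2,'2019-11-05'),(2,'2019-11-06'),
        (4,'2019-11-05'),(4,'2019-11-07'),(4,'2019-11-08'),(4,'2019-11-09');
""")

# A row belongs to a 3-day streak if the two rows before it, the rows
# around it, or the two rows after it sit exactly 1 and 2 days away.
rows = con.execute("""
    WITH cte AS (
        SELECT user_id, ride_date,
            julianday(ride_date) - julianday(LAG(ride_date)    OVER w) AS prev1,
            julianday(ride_date) - julianday(LAG(ride_date, 2) OVER w) AS prev2,
            julianday(LEAD(ride_date)    OVER w) - julianday(ride_date) AS next1,
            julianday(LEAD(ride_date, 2) OVER w) - julianday(ride_date) AS next2
        FROM rides
        WINDOW w AS (PARTITION BY user_id ORDER BY ride_date)
    )
    SELECT user_id, ride_date FROM cte
    WHERE (prev1 = 1 AND prev2 = 2)
       OR (prev1 = 1 AND next1 = 1)
       OR (next1 = 1 AND next2 = 2)
""").fetchall()
print(sorted(rows))
```

User 1 (every ride 2 days apart) drops out; user 4's isolated 2019-11-05 drops out while the 07/08/09 run is kept.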
Here is one way to address this gaps-and-islands problem:
first, assign a rank to each user ride with row_number(), and recover the previous ride_date (aliased lag_ride_date)
then, compare the date of the previous ride to the current one in a conditional sum that increases when the dates are consecutive; by comparing this with the rank of the user ride, you get groups (aliased grp) that represent runs of rides with 1-day spacing
use a window count of how many records belong to each group (aliased cnt)
filter on records whose window count is at least 3
Query:
select user_id, ride_date
from (
select
t.*,
count(*) over(partition by user_id, grp) cnt
from (
select
t.*,
rn1
- sum(case when ride_date = dateadd(day, 1, lag_ride_date) then 1 else 0 end)
over(partition by user_id order by ride_date) grp
from (
select
t.*,
row_number() over(partition by user_id order by ride_date) rn1,
lag(ride_date) over(partition by user_id order by ride_date) lag_ride_date
from Table1 t
) t
) t
) t
where cnt >= 3
Demo on DB Fiddle
This is a typical gaps-and-islands problem.
We can solve it as follows:
with data
as (
select user_id
,ride_date
,dateadd(day
,-row_number() over(partition by user_id order by ride_date asc)
,ride_date) as grp_field
from Table1
)
,consecutive_days
as(
select user_id
,ride_date
,count(*) over(partition by user_id,grp_field) as cnt
from data
)
select *
from consecutive_days
where cnt>=3
order by user_id,ride_date
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=7bb851d9a12966b54afb4d8b144f3d46
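The dateadd-minus-row-number trick above can be sketched in Python with SQLite (using `date(..., '-N days')` in place of DATEADD, on a trimmed copy of the data with users 1, 3 and 5):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE rides (user_id INT, ride_date TEXT);
    INSERT INTO rides VALUES
        (1,'2019-11-01'),(1,'2019-11-03'),(1,'2019-11-05'),
        (3,'2019-11-03'),(3,'2019-11-04'),(3,'2019-11-05'),(3,'2019-11-06'),
        (5,'2019-11-11'),(5,'2019-11-13');
""")

# Subtracting the row number from each date collapses every run of
# consecutive days onto one anchor date (grp_field).
rows = con.execute("""
    WITH data AS (
        SELECT user_id, ride_date,
               date(ride_date, '-' || ROW_NUMBER() OVER
                    (PARTITION BY user_id ORDER BY ride_date) || ' days')
                   AS grp_field
        FROM rides
    ),
    consecutive_days AS (
        SELECT user_id, ride_date,
               COUNT(*) OVER (PARTITION BY user_id, grp_field) AS cnt
        FROM data
    )
    SELECT user_id, ride_date FROM consecutive_days
    WHERE cnt >= 3 ORDER BY user_id, ride_date
""").fetchall()
print(rows)
```

Only user 3's four consecutive days survive the cnt >= 3 filter.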
There is no need to apply gaps-and-islands methodologies to this problem. The problem is much simpler to solve.
You can return the users and first date just by using LEAD():
SELECT t1.*
FROM (SELECT t1.*,
LEAD(ride_date, 2) OVER (PARTITION BY user_id ORDER BY ride_date) as ride_date_2
FROM table1 t1
) t1
WHERE ride_date_2 = DATEADD(day, 2, ride_date);
If you want the actual dates, you can unpivot the results:
SELECT DISTINCT t1.user_id, v.ride_date
FROM (SELECT t1.*,
LEAD(ride_date, 2) OVER (PARTITION BY user_id ORDER BY ride_date) as ride_date_2
FROM table1 t1
) t1 CROSS APPLY
(VALUES (t1.ride_date),
(DATEADD(day, 1, t1.ride_date)),
(DATEADD(day, 2, t1.ride_date))
) v(ride_date)
WHERE t1.ride_date_2 = DATEADD(day, 2, t1.ride_date)
ORDER BY t1.user_id, v.ride_date;
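As an illustration of this LEAD()-plus-unpivot approach in Python with SQLite (on a trimmed copy of the data with users 1 and 2): SQLite has no CROSS APPLY, so a cross join against a 3-row offsets derived table plays the role of the VALUES unpivot.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE rides (user_id INT, ride_date TEXT);
    INSERT INTO rides VALUES
        (2,'2019-11-03'),(2,'2019-11-04'),(2,'2019-11-05'),(2,'2019-11-06'),
        (1,'2019-11-01'),(1,'2019-11-03'),(1,'2019-11-05');
""")

# A streak start is a row whose second-next ride lands exactly 2 days
# ahead; expanding each start by offsets 0..2 yields the streak dates.
rows = con.execute("""
    WITH starts AS (
        SELECT user_id, ride_date,
               LEAD(ride_date, 2) OVER
                   (PARTITION BY user_id ORDER BY ride_date) AS ride_date_2
        FROM rides
    )
    SELECT DISTINCT s.user_id, date(s.ride_date, '+' || o.n || ' days')
    FROM starts s
    CROSS JOIN (SELECT 0 AS n UNION ALL SELECT 1 UNION ALL SELECT 2) o
    WHERE s.ride_date_2 = date(s.ride_date, '+2 days')
    ORDER BY 1, 2
""").fetchall()
print(rows)
```

User 1's rides are 2 days apart, so no start qualifies; user 2's overlapping starts (03 and 04) expand to the four distinct streak dates.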

Redshift count with variable

Imagine I have a table on Redshift with this similar structure. Product_Bill_ID is the Primary Key of this table.
+----------+-----------------+---------------------+
| Store_ID | Product_Bill_ID | Payment_Date        |
+----------+-----------------+---------------------+
|        1 |               1 | 01/10/2016 11:49:33 |
|        1 |               2 | 01/10/2016 12:38:56 |
|        1 |               3 | 01/10/2016 12:55:02 |
|        2 |               4 | 01/10/2016 16:25:05 |
|        2 |               5 | 02/10/2016 08:02:28 |
|        3 |               6 | 03/10/2016 02:32:09 |
+----------+-----------------+---------------------+
If I want to query the number of Product_Bill_IDs that a store sold in the first hour after it sold its first one, how could I do this?
This example should produce:
+----------+---------------------+-----------------+
| Store_ID | First_Payment_Date  | Sold_First_Hour |
+----------+---------------------+-----------------+
|        1 | 01/10/2016 11:49:33 |               2 |
|        2 | 01/10/2016 16:25:05 |               1 |
|        3 | 03/10/2016 02:32:09 |               1 |
+----------+---------------------+-----------------+
You need to get the first hour. That is easy enough using window functions:
select s.*,
min(payment_date) over (partition by store_id) as first_payment_date
from sales s
Then, you need to do the date filtering and aggregation:
select store_id, count(*)
from (select s.*,
min(payment_date) over (partition by store_id) as first_payment_date
from sales s
) s
where payment_date <= first_payment_date + interval '1 hour'
group by store_id;
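As a runnable illustration of this approach (Python with SQLite, ISO timestamps standing in for the sample dates; `datetime(..., '+1 hour')` plays the role of Redshift's interval arithmetic):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (store_id INT, product_bill_id INT, payment_date TEXT);
    INSERT INTO sales VALUES
        (1, 1, '2016-10-01 11:49:33'),
        (1, 2, '2016-10-01 12:38:56'),
        (1, 3, '2016-10-01 12:55:02'),
        (2, 4, '2016-10-01 16:25:05'),
        (2, 5, '2016-10-02 08:02:28'),
        (3, 6, '2016-10-03 02:32:09');
""")

# MIN() OVER finds each store's first sale; the outer query keeps only
# bills paid within one hour of it and counts them per store.
rows = con.execute("""
    SELECT store_id, first_payment_date, COUNT(*) AS sold_first_hour
    FROM (
        SELECT s.*,
               MIN(payment_date) OVER (PARTITION BY store_id)
                   AS first_payment_date
        FROM sales s
    )
    WHERE payment_date <= datetime(first_payment_date, '+1 hour')
    GROUP BY store_id, first_payment_date
    ORDER BY store_id
""").fetchall()
print(rows)
```

Store 1's third bill at 12:55:02 falls outside the 11:49:33 + 1 hour window, so its count is 2.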
SELECT
store_id,
first_payment_date,
SUM(
CASE WHEN payment_date < DATEADD(hour, 1, first_payment_date) THEN 1 END
) AS sold_first_hour
FROM
(
SELECT
*,
MIN(payment_date) OVER (PARTITION BY store_id) AS first_payment_date
FROM
yourtable
)
parsed_table
GROUP BY
store_id,
first_payment_date