SQL query - remove duplicated - sql

I have a table with the following columns that matter:
ID | commentid
1 | abs345
2 | abs345
3 | abs345
4 | poly234
5 | poly234
6 | qq1r4c
7 | abs345
8 | abs345
I want to delete the rows where a commentid reappears after its first consecutive run of IDs, that is, where the ID sequence for that commentid is no longer contiguous.
For this example, the lines with ID 7 and 8 would be eliminated.

Do you want to return all rows except for the last comment id when it is duplicated?
select t.*
from (select t.*,
count(*) over (partition by commentid) as commentid_cnt,
max(id) over (partition by commentid) as max_commentid_id,
max(id) over () as max_id
from t
) t
where max_id = max_commentid_id and commentid_cnt > 1;
EDIT:
Oh, I think I understand. You want to keep only the first "grouping" of commentid. Assuming that the ids are sequential with no gaps, then one approach is:
enumerate the rows for each commentid
subtract the value from id
If this is larger than the minimum id minus 1, then you are not in the "first" group.
This looks like:
select t.*
from (select t.*,
min(id) over (partition by commentid) as min_id,
row_number() over (partition by commentid order by id) as seqnum
from t
) t
where id - seqnum = min_id - 1
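To see the "first group" test at work, here is a minimal sketch that runs the same query against the sample rows. It assumes SQLite (3.25 or later, for window functions) through Python's sqlite3 module, and the table name t is just the placeholder from the answer. The final DELETE is an illustration of how the kept ids could be used, not part of the original answer.

```python
import sqlite3

# In-memory SQLite database loaded with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, commentid TEXT)")
conn.executemany(
    "INSERT INTO t (id, commentid) VALUES (?, ?)",
    [(1, "abs345"), (2, "abs345"), (3, "abs345"), (4, "poly234"),
     (5, "poly234"), (6, "qq1r4c"), (7, "abs345"), (8, "abs345")],
)

# Rows of the first consecutive run satisfy id - seqnum = min_id - 1.
FIRST_GROUP_SQL = """
SELECT id
FROM (SELECT t.*,
             MIN(id) OVER (PARTITION BY commentid) AS min_id,
             ROW_NUMBER() OVER (PARTITION BY commentid ORDER BY id) AS seqnum
      FROM t) AS s
WHERE id - seqnum = min_id - 1
"""

kept_ids = sorted(row[0] for row in conn.execute(FIRST_GROUP_SQL))

# Assumption: deleting everything outside the first run is what is wanted.
conn.execute("DELETE FROM t WHERE id NOT IN (" + FIRST_GROUP_SQL + ")")
remaining_ids = [row[0] for row in conn.execute("SELECT id FROM t ORDER BY id")]
print(kept_ids, remaining_ids)
```

For abs345 the differences id - seqnum are 0, 0, 0, 3, 3 and min_id - 1 is 0, so ids 7 and 8 fall outside the first run.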

Related

Is there a way to calculate average based on distinct rows without using a subquery?

If I have data like so:
+----+-------+
| id | value |
+----+-------+
| 1 | 10 |
| 1 | 10 |
| 2 | 20 |
| 3 | 30 |
| 2 | 20 |
+----+-------+
How do I calculate the average based on the distinct id WITHOUT using a subquery (i.e. querying the table directly)?
For the above example it would be (10+20+30)/3 = 20
I tried to do the following:
SELECT AVG(IF(id = LAG(id) OVER (ORDER BY id), NULL, value)) AS avg
FROM table
Basically I was thinking that if I order by id and check the previous row to see if it has the same id, the value should be NULL and thus it would not be counted into the calculation, but unfortunately I can't put analytical functions inside aggregate functions.
As far as I know, you can't do this without a subquery. I would use:
SELECT AVG(avg_value)
FROM
(
SELECT AVG(value) AS avg_value
FROM yourTable
GROUP BY id
) t;
WITH ranked AS (
SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rn
FROM yourTable t
QUALIFY rn = 1
)
SELECT
AVG(value)
FROM ranked;
The outer query will have other parameters that need to access all the data in the table
I interpret this comment as wanting an average on every row -- rather than doing an aggregation. If so, you can use window functions:
select t.*,
avg(case when seqnum = 1 then value end) over () as overall_avg
from (select t.*,
row_number() over (partition by id order by id) as seqnum
from t
) t;
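A small sketch of the window-function version on the sample data may help here. It assumes SQLite 3.25+ via Python's sqlite3 module; the table name t is the placeholder used in the answer. Every row keeps its own columns and also carries the average over one row per id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO t (id, value) VALUES (?, ?)",
    [(1, 10), (1, 10), (2, 20), (3, 30), (2, 20)],
)

# seqnum = 1 marks one row per id; the CASE yields NULL for the others,
# and AVG ignores NULLs, so only one value per id is averaged.
rows = conn.execute("""
SELECT s.id, s.value,
       AVG(CASE WHEN seqnum = 1 THEN value END) OVER () AS overall_avg
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS seqnum
      FROM t) AS s
ORDER BY s.id
""").fetchall()

for row in rows:
    print(row)
```

All five input rows come back, each with overall_avg = 20.0.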
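Yes, there is a way: use DISTINCT inside the AVG function, as below.
Note that this averages the distinct values, not the distinct ids, so it gives the expected result only when no two ids share the same value: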
select avg(distinct value) from tab;
http://sqlfiddle.com/#!4/9d156/2/0
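One thing worth checking with avg(distinct value) is what happens when two different ids carry the same value. The sketch below uses made-up data (not from the question) in which ids 1 and 2 both have value 10, and compares it with the per-id subquery approach. It assumes SQLite through Python's sqlite3 module and a hypothetical table name tab.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (id INTEGER, value INTEGER)")
# Hypothetical data: ids 1 and 2 share the value 10.
conn.executemany(
    "INSERT INTO tab (id, value) VALUES (?, ?)",
    [(1, 10), (1, 10), (2, 10), (3, 40)],
)

# Averages the distinct values 10 and 40.
avg_distinct_value = conn.execute(
    "SELECT AVG(DISTINCT value) FROM tab"
).fetchone()[0]

# Averages one value per id: 10, 10 and 40.
avg_per_id = conn.execute("""
SELECT AVG(avg_value)
FROM (SELECT AVG(value) AS avg_value FROM tab GROUP BY id) AS per_id
""").fetchone()[0]

print(avg_distinct_value, avg_per_id)
```

Here the two results differ (25.0 versus 20.0), so the DISTINCT shortcut is only safe if each value belongs to a single id, as it does in the question's sample.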

Removing duplicate values from sql server on condition of 2 columns

|Rownumber |OldIdassigned |commoncode |
------------------------------------------
| 1 |FLEX |Y2573F102 |
------------------------------------------
| 2 |RCL |Y2573F102 |
------------------------------------------
| 3 |FLEX |Y2573F102 |
------------------------------------------
| 4 |QGEN |N72482123 |
------------------------------------------
| 5 |QGEN |N72482123 |
------------------------------------------
| 6 |QGEN |N72482123 |
------------------------------------------
| 7 |RACE |N72482123 |
------------------------------------------
| 8 |CLB |N22717107 |
------------------------------------------
| 9 |CLB |N22717107 |
------------------------------------------
| 10 |CLB |N22717107 |
I need to delete the duplicate records based on commoncode, with one condition: delete a row only if its OldIdassigned is also the same; otherwise keep it.
For example, Y2573F102 has 3 duplicate records (rows 1, 2, 3). Rows 1 and 2 should not be deleted; only the 3rd row has to be deleted.
I like updatable CTEs and window functions for this purpose:
with todelete as (
select t.*,
row_number() over (partition by commoncode order by rownumber) as seqnum
from t
)
delete todelete
where seqnum > 1;
Use ROW_NUMBER() :
DELETE t
FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY OldIdassigned, commoncode ORDER BY rownumber) AS Seq
FROM table t
) t
WHERE t.seq > 1;
EDIT : If you want to check the duplication based on commoncode only then remove OldIdassigned from PARTITION clause :
DELETE t
FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY commoncode ORDER BY rownumber DESC) AS Seq
FROM table t
) t
WHERE t.seq > 1;
use window function row_number, according to your description and comments it seems you need change in partition clause
delete t
from
(select t1.*,row_number() over(partition by commoncode order by Rownumber) rn from table t1
)t where rn<>1
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=eacc0688efb534a0addee68678f323fe
Use Row_Number()
delete t from
(select *, row_number() over(partition by commoncode order by
rownumber) as rn from table) t
where rn<>1
Since all answers are similar (and correct), I will post one alternative way:
DELETE FROM TableA
WHERE EXISTS ( SELECT * FROM TableA AS A2
WHERE A2.commoncode = TableA.commoncode
AND A2.OldIdassigned = TableA.OldIdassigned
AND A2.Rownumber < TableA.Rownumber )
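As a rough check of the two-column variant (duplicates defined by commoncode plus OldIdassigned), the sketch below replays the sample rows. It assumes SQLite 3.25+ through Python's sqlite3 module and a hypothetical table name TableA. SQLite cannot delete through a derived table the way SQL Server can, so the row numbers to remove are selected in an IN subquery instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TableA (Rownumber INTEGER PRIMARY KEY,"
    " OldIdassigned TEXT, commoncode TEXT)"
)
conn.executemany(
    "INSERT INTO TableA VALUES (?, ?, ?)",
    [(1, "FLEX", "Y2573F102"), (2, "RCL", "Y2573F102"),
     (3, "FLEX", "Y2573F102"), (4, "QGEN", "N72482123"),
     (5, "QGEN", "N72482123"), (6, "QGEN", "N72482123"),
     (7, "RACE", "N72482123"), (8, "CLB", "N22717107"),
     (9, "CLB", "N22717107"), (10, "CLB", "N22717107")],
)

# Keep the first row of each (commoncode, OldIdassigned) pair, delete the rest.
conn.execute("""
DELETE FROM TableA
WHERE Rownumber IN (
    SELECT Rownumber
    FROM (SELECT Rownumber,
                 ROW_NUMBER() OVER (PARTITION BY commoncode, OldIdassigned
                                    ORDER BY Rownumber) AS seq
          FROM TableA) AS numbered
    WHERE seq > 1
)
""")

remaining = [r[0] for r in conn.execute(
    "SELECT Rownumber FROM TableA ORDER BY Rownumber")]
print(remaining)
```

Rows 1 and 2 survive because their OldIdassigned values differ, while row 3 repeats FLEX for the same commoncode and is removed.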

Need to sum all Most Recent Rows from each Store that have ItemID

I have a table with, among other things, these columns: DateTransferred, ComputedQuantity, StoreID, ItemID
I have two goals. My simpler goal is to write a query where I fill in the ItemID and it sums up the ComputedQuantity where it matches that ItemID, only using the most recent DateTransferred for each StoreID. So with the following example data:
DateTransferred | StoreID | ItemID | ComputedQuantity
11/10/17 | 1 | 1 | 3 <
10/10/17 | 1 | 1 | 4
09/10/17 | 2 | 1 | 9 <
08/10/17 | 3 | 1 | 1 <
07/10/17 | 3 | 1 | 10
I would want it to pull every row with < next to it, as that's the most recent Date for that StoreID, and sum up to 13
My more complicated goal is that I would like to include the above-calculated value into a 'join' where I'm dealing with the Item table, so that I can pull all the items and join them with a new column which has the summed up ComputedQuantity
This is on SQL Server 10 on Windows Server 2008, if that matters
One simple method uses a correlated subquery:
select t.*
from t
where t.DateTransferred = (select max(t2.DateTransferred)
from t t2
where t2.storeid = t.storeid
);
Another even simpler method uses window functions:
select t.*
from (select t.*,
row_number() over (partition by storeid order by DateTransferred desc) as seqnum
from t
) t
where seqnum = 1;
In either case, you can add a where clause to the subquery if you want the most recent date on or before some given date (say a year ago).
Also, both queries assume that your data has no future dates. If it does, then add where DateTransferred < getdate().
The final statement which sums the ComputedQuantities:
select ItemID, SUM(ComputedQuantity) Quantity
from (select t.*,
row_number() over (partition by StoreID, ItemID order by DateTransferred DESC) as seqnum
from [db].[dbo].[InventoryTransferLog] t
) t
where seqnum = 1 and ComputedQuantity > 0
GROUP BY ItemID
ORDER BY ItemID
I decided not to sum values < 0
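The final statement can be tried against the question's sample rows with a sketch like the following. It assumes SQLite 3.25+ via Python's sqlite3 module rather than SQL Server, and it stores the dates as ISO strings (an assumption, chosen so that text ordering matches date ordering) while keeping the order shown in the question, where the first row of each store is the most recent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE InventoryTransferLog (DateTransferred TEXT,"
    " StoreID INTEGER, ItemID INTEGER, ComputedQuantity INTEGER)"
)
conn.executemany(
    "INSERT INTO InventoryTransferLog VALUES (?, ?, ?, ?)",
    [("2017-10-11", 1, 1, 3), ("2017-10-10", 1, 1, 4),
     ("2017-10-09", 2, 1, 9), ("2017-10-08", 3, 1, 1),
     ("2017-10-07", 3, 1, 10)],
)

# seqnum = 1 is the most recent transfer for each (StoreID, ItemID) pair.
totals = conn.execute("""
SELECT ItemID, SUM(ComputedQuantity) AS Quantity
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY StoreID, ItemID
                                ORDER BY DateTransferred DESC) AS seqnum
      FROM InventoryTransferLog AS t) AS latest
WHERE seqnum = 1 AND ComputedQuantity > 0
GROUP BY ItemID
ORDER BY ItemID
""").fetchall()

print(totals)
```

The three most recent rows contribute 3 + 9 + 1, matching the total of 13 expected in the question.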

Compare different orders of the same table

I have this following scenario, a table with these columns:
table_id|user_id|os_number|inclusion_date
In the system, the os_number is sequential for the users, but due to a system bug some users inserted OSs in wrong order. Something like this:
table_id | user_id | os_number | inclusion_date
-----------------------------------------------
1 | 1 | 1 | 2015-11-01
2 | 1 | 2 | 2015-11-02
3 | 1 | 3 | 2015-11-01
Note the os number 3 inserted before the os number 2
What I need:
Recover the table_id of rows 2 and 3, which are out of order.
I have these two select that show me the table_id in two different orders:
select table_id from table order by user_id, os_number
select table_id from table order by user_id, inclusion_date
I can't figure out how to compare these two selects and see which users are affected by this system bug.
Your question is a bit difficult because there is no correct ordering (as presented) -- because dates can have ties. So, use the rank() or dense_rank() function to compare the two values and return the ones that are not in the correct order:
select t.*
from (select t.*,
rank() over (partition by user_id order by inclusion_date) as seqnum_d,
rank() over (partition by user_id order by os_number) as seqnum_o
from t
) t
where seqnum_d <> seqnum_o;
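Here is a minimal sketch of the rank() comparison on the three sample rows, assuming SQLite 3.25+ through Python's sqlite3 module and a hypothetical table name os_log (the question only calls it "table"). Because rows 1 and 3 tie on inclusion_date, they share rank 1 in the date ordering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE os_log (table_id INTEGER PRIMARY KEY, user_id INTEGER,"
    " os_number INTEGER, inclusion_date TEXT)"
)
conn.executemany(
    "INSERT INTO os_log VALUES (?, ?, ?, ?)",
    [(1, 1, 1, "2015-11-01"), (2, 1, 2, "2015-11-02"),
     (3, 1, 3, "2015-11-01")],
)

# A row is flagged when its rank by date differs from its rank by os_number.
mismatched = [r[0] for r in conn.execute("""
SELECT table_id
FROM (SELECT t.*,
             RANK() OVER (PARTITION BY user_id
                          ORDER BY inclusion_date) AS seqnum_d,
             RANK() OVER (PARTITION BY user_id
                          ORDER BY os_number) AS seqnum_o
      FROM os_log AS t) AS ranked
WHERE seqnum_d <> seqnum_o
ORDER BY table_id
""")]

print(mismatched)
```

Row 2 has ranks (3, 2) and row 3 has ranks (1, 3), so both are reported, which matches the rows the question asks to recover.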
Use row_number() over both orders:
select *
from (
select *,
row_number() over (order by os_number) rnn,
row_number() over (order by inclusion_date) rnd
from a_table
) s
where rnn <> rnd;
table_id | user_id | os_number | inclusion_date | rnn | rnd
----------+---------+-----------+----------------+-----+-----
3 | 1 | 3 | 2015-11-01 | 3 | 2
2 | 1 | 2 | 2015-11-02 | 2 | 3
(2 rows)
Not entirely sure about the performance on this but you could use a cross apply on the same table to get the results in one query. This will bring up the pairs of table_ids which are incorrect.
select
a.table_id as InsertedAfterTableId,
c.table_id as InsertedBeforeTableId
from table a
cross apply
(
select b.table_id
from table b
where b.inclusion_date < a.inclusion_date and b.os_number > a.os_number
) c
Both query examples given below simply check a mismatch between inclusion date and os_number:
This first query should return the offending row (the one whose os_number is off from its inclusion date)--in the case of the example row 3.
select table.table_id, table.user_id, table.os_number from table
where EXISTS(select * from table t
where t.user_id = table.user_id and
t.inclusion_date > table.inclusion_date and
t.os_number < table.os_number);
This second query will return the table numbers and users for two rows that are mismatched:
select first_table.table_id, second_table.table_id, first_table.user_id from
table first_table
JOIN table second_table
ON (first_table.user_id = second_table.user_id and
first_table.inclusion_date > second_table.inclusion_date and
first_table.os_number < second_table.os_number);
I would use WINDOW FUNCTIONS to get row numbers in orders in question and then compare them:
SELECT
sub.table_id,
sub.user_id,
sub.os_number,
sub.inclusion_date,
number_order_1, number_order_2
FROM (
SELECT
table_id,
user_id,
os_number,
inclusion_date,
row_number() OVER (PARTITION BY user_id
ORDER BY os_number
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING
) AS number_order_1,
row_number() OVER (PARTITION BY user_id
ORDER BY inclusion_date
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING
) AS number_order_2
FROM
table
) sub
WHERE
number_order_1 <> number_order_2
;
EDIT:
Because a_horse_with_no_name made a good point about my final answer, I've gone back to my first answer (see the edit history), which also works if os_number isn't gapless.
select *
from (
select a_table.*,
lag(inclusion_date) over (partition by user_id order by os_number) as last_date
from a_table
) result
where last_date is not null AND last_date>inclusion_date;
This should cover gaps as well as ties. Basically, I check the inclusion_date of the previous os_number and make sure it is not strictly greater than the current row's date (so two versions on the same date are fine).
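The lag() check can be exercised on the sample rows with a sketch along these lines. It assumes SQLite 3.25+ via Python's sqlite3 module, ISO date strings (so that text comparison matches date order), and the a_table name used in the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE a_table (table_id INTEGER PRIMARY KEY, user_id INTEGER,"
    " os_number INTEGER, inclusion_date TEXT)"
)
conn.executemany(
    "INSERT INTO a_table VALUES (?, ?, ?, ?)",
    [(1, 1, 1, "2015-11-01"), (2, 1, 2, "2015-11-02"),
     (3, 1, 3, "2015-11-01")],
)

# last_date is the inclusion_date of the previous os_number for the same user.
out_of_order = [r[0] for r in conn.execute("""
SELECT table_id
FROM (SELECT a_table.*,
             LAG(inclusion_date) OVER (PARTITION BY user_id
                                       ORDER BY os_number) AS last_date
      FROM a_table) AS result
WHERE last_date IS NOT NULL AND last_date > inclusion_date
ORDER BY table_id
""")]

print(out_of_order)
```

Unlike the rank comparison, this reports only the row whose date is earlier than its predecessor's (row 3 here), not both members of the mismatched pair.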

Group by groups of 3 by column

Say I have a table that looks like this:
| id | category_id | created_at |
| 1 | 3 | date... |
| 2 | 4 | date... |
| 3 | 1 | date... |
| 4 | 2 | date... |
| 5 | 5 | date... |
| 6 | 6 | date... |
And imagine there are a lot more entries. I'd like to grab these in a way that they are fresh, so ordering them by created_at DESC - but I'd also like to group them by category, in groups of 3!
So in pseudocode it looks something like this:
Go to category 1
-> Pick last 3
Go to category 2
-> Pick last 3
Go to category 3
-> Pick last 3
And so forth, starting over from category_id 1 when there's no other category to grab from. This will then be paginated as well so I need to make it work with offset & limit as well somehow.
I'm not at all sure where to start or what the keywords to google for are. I'd be happy with some nudges in the right direction so I can find the answer myself, or a full answer.
Another case for the window function row_number().
Just the latest 3 rows per category
SELECT id, category_id, created_at
FROM (
SELECT id, category_id, created_at
, row_number() OVER (PARTITION BY category_id
ORDER BY created_at DESC) AS rn
FROM tbl
) sub
WHERE rn < 4
ORDER BY category_id, rn;
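A quick way to try the "latest 3 per category" query is the sketch below. The rows are made up for illustration (category 1 has four rows, category 2 has two), and it assumes SQLite 3.25+ through Python's sqlite3 module with ISO date strings in created_at.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (id INTEGER PRIMARY KEY, category_id INTEGER,"
    " created_at TEXT)"
)
# Hypothetical sample rows, not taken from the question.
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [(1, 1, "2024-01-01"), (2, 1, "2024-01-02"), (3, 1, "2024-01-03"),
     (4, 1, "2024-01-04"), (5, 2, "2024-01-05"), (6, 2, "2024-01-06")],
)

# rn numbers rows within each category from newest to oldest.
latest_ids = [r[0] for r in conn.execute("""
SELECT id, category_id, created_at
FROM (SELECT id, category_id, created_at,
             ROW_NUMBER() OVER (PARTITION BY category_id
                                ORDER BY created_at DESC) AS rn
      FROM tbl) AS sub
WHERE rn < 4
ORDER BY category_id, rn
""")]

print(latest_ids)
```

Category 1 contributes its three newest rows (ids 4, 3, 2) and drops id 1; category 2 has only two rows, so both appear.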
The latest 3 rows per category, plus later rows
If you want to append the rest of the rows (your question gets fuzzy if and how):
SELECT *
FROM (
SELECT id, category_id, created_at
, row_number() OVER (PARTITION BY category_id
ORDER BY created_at DESC) AS rn
FROM tbl
) sub
ORDER BY (rn > 3), category_id, rn;
One can sort by the outcome of a boolean expression (rn > 3):
FALSE (0)
TRUE (1)
NULL (because default is NULLS LAST - not applicable here)
This way, the latest 3 rows per category come first and all the rest later.
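The boolean sort can be combined with LIMIT and OFFSET for the pagination mentioned in the question. The sketch below is one possible arrangement, not the answer's own code: it assumes SQLite 3.25+ via Python's sqlite3 module (where rn > 3 evaluates to 0 or 1), made-up rows, and a hypothetical helper named fetch_page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (id INTEGER PRIMARY KEY, category_id INTEGER,"
    " created_at TEXT)"
)
# Hypothetical sample rows, not taken from the question.
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [(1, 1, "2024-01-01"), (2, 1, "2024-01-02"), (3, 1, "2024-01-03"),
     (4, 1, "2024-01-04"), (5, 2, "2024-01-05"), (6, 2, "2024-01-06")],
)

PAGED_SQL = """
SELECT id
FROM (SELECT id, category_id, created_at,
             ROW_NUMBER() OVER (PARTITION BY category_id
                                ORDER BY created_at DESC) AS rn
      FROM tbl) AS sub
ORDER BY (rn > 3), category_id, rn
LIMIT ? OFFSET ?
"""


def fetch_page(page, page_size=3):
    """Return the ids on a 1-based page of the ordered result."""
    offset = (page - 1) * page_size
    return [r[0] for r in conn.execute(PAGED_SQL, (page_size, offset))]


page_one = fetch_page(1)
page_two = fetch_page(2)
print(page_one, page_two)
```

The top-3 rows of every category sort first (rn > 3 is false), so id 1, the fourth-newest row of category 1, lands at the end of the second page.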
Or use a CTE and UNION ALL:
WITH cte AS (
SELECT id, category_id, created_at
, row_number() OVER (PARTITION BY category_id
ORDER BY created_at DESC) AS rn
FROM tbl
)
(
SELECT id, category_id, created_at
FROM cte
WHERE rn < 4
ORDER BY category_id, rn
)
UNION ALL
(
SELECT id, category_id, created_at
FROM cte
WHERE rn >= 4
ORDER BY category_id, rn
);
Same result.
All parentheses are required to attach ORDER BY to the individual legs of a UNION query.