delete all duplicated rows in sql - sql

I have table like this
|-------------------------|
| A | B | C |
|-------------------------|
| 1 | 2 | 5 |
|-------------------------|
| 1 | 2 | 10 |
|-------------------------|
| 1 | 2 | 2 |
|-------------------------|
I need to delete all duplicated rows with equals A nad B value and lower C value
after running sql script i need to have only this row with top C Value for every equals A and B columns
|-------------------------|
| A | B | V |
|-------------------------|
| 1 | 2 | 10 |
|-------------------------|

One method is window functions:
select t.*
from (select t.*,
row_number() over (partition by a, b order by v desc) as seqnum
from t
) t
where seqnum = 1;
This returns the entire row, which can be handy if you want additional columns. If you really need just the three columns, then aggregation does what you want:
select a, b, max(v)
from t
group by a, b;
In standard SQL, you can keep only the maximum value using:
delete from t
where t.v < (select max(t2.v) from t t2 where t2.a = t.a and t2.b = t.b);

Related

SQL Query - Check for Two Distinct Values

Given the below data set I want to run a query to highlight any 'pairs' that do not consist of a 'left' and 'right'.
+---------+-----------+---------------+----------------------+
| Pair_Id | Pair_Name | Individual_Id | Individual_Direction |
+---------+-----------+---------------+----------------------+
| 1 | A | A1 | Left |
| 1 | A | A2 | Right |
| 2 | B | B1 | Right |
| 2 | B | B2 | Left |
| 3 | C | C1 | Left |
| 3 | C | C2 | Left |
| 4 | D | D1 | Right |
| 4 | D | D2 | Left |
| 5 | E | E1 | Left |
| 5 | E | E2 | Right |
+---------+-----------+---------------+----------------------+
In this instance Pair 3 'C' has two lefts. Therefore, I would look to display the following:
+---------+-----------+---------------+----------------------+
| Pair_Id | Pair_Name | Individual_Id | Individual_Direction |
+---------+-----------+---------------+----------------------+
| 3 | C | C1 | Left |
| 3 | C | C2 | Left |
+---------+-----------+---------------+----------------------+
You can simply use not exists:
select t.*
from t
where not exists (select 1
from t t2
where t2.pair_id = t.pair_id and
t2.Individual_Direction <> t.Individual_Direction
) ;
With an index on (pair_id, Individual_Direction), this should not only be the most concise solution but also the fastest.
If you want to be sure that there are pairs (the above returns singletons):
select t.*
from t
where not exists (select 1
from t t2
where t2.pair_id = t.pair_id and
t2.Individual_Direction <> t.Individual_Direction
) and
exists (select 1
from t t2
where t2.pair_id = t.pair_id and
t2.Individual_ID <> t.Individual_ID
);
You can also do this using window functions:
select t.*
from (select t.*,
count(*) over (partition by pair_id) as cnt,
min(status) over (partition by pair_id) as min_status,
max(status) over (partition by pair_id) as max_status
from t
) t
where cnt > 1 and min_status <> max_status;
One option uses aggregation:
WITH cte AS (
SELECT Pair_Name
FROM yourTable
WHERE Individual_Direction IN ('Left', 'Right')
GROUP BY Pair_Name
HAVING MIN(Individual_Direction) = MAX(Individual_Direction)
)
SELECT *
FROM yourTable
WHERE Pair_Name IN (SELECT Pair_Name FROM cte);
The HAVING clause used above asserts that a matching pair has both a minimum and maximum direction which are the same. This implies that such a pair only has one direction.
As is the case with Gordon's answer, an index on (Pair_Name, Individual_Direction) might help performance:
CREATE INDEX idx ON yourTable (Pair_Name, Individual_Direction);
There should be an elegant way of using window function than what I wrote:
WITH ranked AS
(
SELECT *, RANK() OVER(ORDER BY Pair_Id, Pair_Name, Individual_Direction) AS r
FROM pairs
),
counted AS
(
SELECT Pair_Id, Pair_Name, Individual_Direction,r, COUNT(r) as times FROM ranked
GROUP BY Pair_Id, Pair_Name, Individual_Direction, r
HAVING COUNT(r) > 1
)
SELECT ranked.Pair_Id, ranked.Pair_Name, ranked.Individual_Id, ranked.Individual_Direction FROM ranked
RIGHT JOIN counted
ON ranked.Pair_Id=counted.Pair_Id
AND ranked.Pair_Name=counted.Pair_Name
AND ranked.Individual_Direction=counted.Individual_Direction

Iterate over the rows of a second table to return resultset with cumulative sum

Yesterday, after the help of a SO user #
Iterate over the rows of a second table to return resultset
I was able to make a combination of rows with a selfjoin.
After some modifications, to adapt to my implementation, I faced a new challenge that I'm stuck: how to make an aggregate sum of a third column?
My issue is better explained in the image below:
Based on the code
SELECT
b1.table_a_id,
b1.label_x,
b2.label_y
FROM table_a a
INNER JOIN table_b b1
ON b1.table_a_id = a.table_a_id
INNER JOIN table_b b2
ON b2.table_a_id = b1.table_a_id AND
b2.label_y > b1.label_x
ORDER BY
b1.table_a_id,
b1.label_x,
b2.label_y;
I was able to acquire the combinations.
What should be the next step to get the cumulative sum based on a third column?
I couldn't think of a solution without using a second service, such as python with pandas, using a cumsum function.
To generate the expected resultset, you would need to join the table with itself with an inequality condition on the order column. Then, you can do a window sum:
select
t1.table_a_id,
t1.label_x,
t2.label_y,
sum(t2.value) over(
partition by t1.table_a_id, t1.label_x
order by t1."order", t2."order"
) agg_value
from
table_b t1
inner join table_b t2
on t1.table_a_id = t2.table_a_id
and t2."order" >= t1."order"
order by t1."order", t2."order"
Note: order is a reserved word, so it needs to be quoted; if you actual database column has a different name, you can remove the double quotes.
Demo on DB Fiddle:
TABLE_A_ID | LABEL_X | LABEL_Y | AGG_VALUE
---------: | :------ | :------ | --------:
1 | A | B | 1
1 | A | C | 3
1 | A | D | 6
1 | A | E | 10
1 | A | F | 15
1 | B | C | 2
1 | B | D | 5
1 | B | E | 9
1 | B | F | 14
1 | C | D | 3
1 | C | E | 7
1 | C | F | 12
1 | D | E | 4
1 | D | F | 9
1 | E | F | 5
You seem to want a cumulative sum:
SELECT b1.table_a_id, b1.label_x, b2.label_y,
SUM(b1.value) OVER (PARTITION BY b1.table_a_id, b1.label_x
ORDER BY b2.order
) as AGG_VALUE

SQL: Select Most Recent Sequentially Distinct Value w/ Grouping

I am having trouble writing a query that would select the last "new" sequentially distinct value (let's call this column Col A) grouped based on another column (Col B). Since this is a bit ambiguous/confusing, here is an example to explain (assume row number is indicative of sequence inside groups; in my issue the rows are ordered by date):
|--------|-------|-------|
| RowNum | Col A | Col B |
|--------|-------|-------|
| 1 | A | A |
| 2 | B | A |
| 3 | C | A |
| 4 | B | B |
| 5 | A | B |
| 6 | B | B |
Would select:
| 3 | C | A |
| 6 | B | B |
Note that although B also appears in row 4, the fact that row 5 contains A means that the B in row 6 is sequentially distinct. But if table looked like this:
|--------|-------|-------|
| RowNum | Col A | Col B |
|--------|-------|-------|
| 1 | A | A |
| 2 | B | A |
| 3 | C | A |
| 4 | B | B |
| 5 | A | B |
| 6 | A | B | <--
Then we would want to select:
| 3 | C | A |
| 5 | A | B |
I think that this would be an easier problem if I wasn't concerned with values being distinct but not sequential. I'm not really sure how to even consider sequence when making a query.
I have attempted to solve this by calculating the min/max row numbers where each value of Col A appears. That calculation (using the second sample table) would produce a result like this:
|--------|--------|--------|--------|
| ColA | ColB | MinRow | MaxRow |
|--------|--------|--------|--------|
| A | A | 1 | 1 |
| B | A | 2 | 2 |
| C | A | 3 | 3 |
| A | B | 5 | 6 |
| B | B | 4 | 4 |
A solution raised in a related post (SQL: Select Row with Last New Sequentially Distinct Value) went on a similar path, essentially taking the most recent RowNum which differs from the last ColA and then picks the next row. However, in that question I failed to address the need for the query to work for multiple groups, hence the new post.
Any help with this problem, if it is at all possible to do in SQL, would be greatly appreciated. I am running SQL 2008 SP4.
Hmmm . . . One method is to get the last value. Then choose all the last rows with that value and aggregate:
select min(rownum), colA, colB
from (select t.*,
first_value(colA) over (partition by colB order by rownum desc) as last_colA
from t
) t
where rownum > all (select t2.rownum
from t t2
where t2.colB = t.colB and t2.colA <> t.last_colA
)
group by colA, colB;
Or, without the aggregation:
select t.*
from (select t.*,
first_value(colA) over (partition by colB order by rownum desc) as last_colA,
lag(colA) over (partition by colB order by rownum) as prev_clA
from t
) t
where rownum > all (select t2.rownum
from t t2
where t2.colB = t.colB and t2.colA <> t.last_colA
) and
(prev_colA is null or prev_colA <> colA);
But in SQL Server 2008, let's treat this as a gaps-and-islands problem:
select t.*
from (select t.*,
min(rownum) over (partition by colB, colA, (seqnum_b - seqnum_ab) ) as min_rownum_group,
max(rownum) over (partition by colB, colA, (seqnum_b - seqnum_ab) ) as max_rownum_group
from (select t.*,
row_number() over (partition by colB order by rownum) as seqnum_b,
row_number() over (partition by colB, colA order by rownum) as seqnum_ab,
max(rownum) over (partition by colB order by rownum) as max_rownum
from t
) t
) t
where rownum = min_rownum_group and -- first row in the group defined by adjacent colA, colB
max_rownum_group = max_rownum -- last group for each colB;
This identifies each of the groups using a difference of row numbers. It calculates the maximum rownum for the group and overall in the data. These are the same for the last group.

How can I select each particular data up to a certain quantity?

How can I select each particular data upto a certain quantity. For example in the below table, there are 4 A, 4 B, 2 C and 1 D. Now I want to select all letters but not more than two each of it, Which will yield 2 A, 2 B, 2 C and 1 D.
+====+========+
| ID | Letter |
+====+========+
| 1 | A |
+----+--------+
| 2 | B |
+----+--------+
| 3 | B |
+----+--------+
| 4 | C |
+----+--------+
| 5 | A |
+----+--------+
| 6 | A |
+----+--------+
| 7 | C |
+----+--------+
| 8 | B |
+----+--------+
| 9 | B |
+----+--------+
| 10 | D |
+----+--------+
| 11 | A |
+----+--------+
Can anyone please help me for the above scenario?
I can think of a simple way:
select
case
when count(*) > 1
then 2
else count(*)
end,
second_column
from your_table
group by second_column;
This will give the result you want, but it won't really 'select ONLY two or less records' of each.
Using a ROW_NUMBER() function and a derived table:
CREATE TABLE myTable (id int, Letter varchar(1))
INSERT INTO myTable
VALUES (1,'A')
,(2,'B')
,(3,'B')
,(4,'C')
,(5,'A')
,(6,'A')
,(7,'C')
,(8,'B')
,(9,'B')
,(10,'D')
,(11,'A')
SELECT id, Letter
FROM
(SELECT *
,ROW_NUMBER() OVER(PARTITION BY Letter ORDER BY Letter) as rn
FROM myTable) myTable
WHERE rn = 1 or rn = 2
In essence, "cut" (PARTITION) the rows by Letters, and assign them each a number for its unique group, then pick the first two of each Letter.
Try it here:
http://rextester.com/WTKYCE51114
Use ROW_NUMBER() function to tag each record the row number and PARTITION it BY (grouping by) letter and ORDER it BY (id)
SELECT id,
letter
FROM (SELECT *,
ROW_NUMBER() OVER(PARTITION BY letter ORDER BY id) rnum
FROM myTable
) t
WHERE rnum <=2
Ordering it by id, you will have the first two instances of each letter in ascending order, thus you will have below result (note that id 1 and 5 are selected for A, 2 and 3 for B)
id letter
1 A
5 A
2 B
3 B
4 C
7 C
10 D

Select non-unique id where no row meets criteria

Say I have this table, and I want to select the IDs where all D is < 4. In this case it would only select ID 1 because 2's D>4, and 3 has a D>4
+----+---+------+
| ID | D | U-ID |
+----+---+------+
| 1 | 1 | a |
+----+---+------+
| 1 | 2 | b |
+----+---+------+
| 2 | 5 | c |
+----+---+------+
| 3 | 5 | d |
+----+---+------+
| 3 | 2 | e |
+----+---+------+
| 3 | 3 | f |
+----+---+------+
I really don't even know where to start making a query for this, and my sql isn't good enough yet to know what to google, so I'm sorry if this has been asked before.
I would simply do:
select id
from table
group by id
having max(d) < 4;
If you happened to want all the original rows, I would use a window function:
select t.*
from (select t.*, max(d) over (partition by id) as maxd
from t
) t
where maxd < 4;
Here's one option using conditional aggregation:
select id
from yourtable
group by id
having count(case when d >= 4 then 1 end) = 0
SQL Fiddle Demo
If you need all the data from the corresponding rows/columns, you can either join back to the table using the above, or alternatively you could use not exists:
select *
from yourtable t
where not exists (
select 1
from yourtable t2
where t.id = t2.id and
t2.d >= 4
)
use this query.
select ID from yourtablename where D < 4;