SQL - Deleting duplicate columns only if another column matches [duplicate] - sql

This question already has answers here:
SQL - Delete duplicate columns error [duplicate]
(4 answers)
How to delete duplicate rows in SQL Server?
(26 answers)
Closed 4 years ago.
I have the following table (TBL_VIDEO) with duplicate column entries in "TIMESTAMP", and I want to remove them only if the "CAMERA" number matches.
BEFORE:
ANALYSIS_ID | TIMESTAMP | EMOTION | CAMERA
-------------------------------------------
1 | 5 | HAPPY | 1
2 | 10 | SAD | 1
3 | 10 | SAD | 1
4 | 5 | HAPPY | 2
5 | 15 | ANGRY | 2
6 | 15 | HAPPY | 2
AFTER:
ANALYSIS_ID | TIMESTAMP | EMOTION | CAMERA
-------------------------------------------
1 | 5 | HAPPY | 1
2 | 10 | SAD | 1
4 | 5 | HAPPY | 2
5 | 15 | ANGRY | 2
I have attempted this statement but the columns wouldn't delete accordingly. I appreciate all the help to produce a correct SQL statement. Thanks in advance!
delete y
from TBL_VIDEO y
where exists (select 1 from TBL_VIDEO y2 where y.TIMESTAMP = y2.TIMESTAMP and y2.ANALYSIS_ID < y.ANALYSIS_ID, y.CAMERA = y.CAMERA, y2.CAMERA = y2.CAMERA);

try this:
delete f2 from (
select row_number() over(partition by TIMESTAMP, CAMERA order by ANALYSIS_ID) rang
from yourtable f1
) f2 where f2.rang>1

Other solution :
delete f1 from yourtable f1
where exists
(
select * from yourtable f2
where f2.TIMESTAMP=f1.TIMESTAMP and f2.CAMERA=f1.CAMERA and f1.ANALYSIS_ID>f2.ANALYSIS_ID
)

use row_number and find the duplicate and delete them
delete from
(select *,row_number() over(partition by TIMESTAMP,CAMERA order by ANALYSIS_ID) as rn from TBL_VIDEO
) t1 where rn>1

;WITH cte
AS
(
select ANALYSIS_ID,
ROW_NUMBER() over(partition by TIMESTAMP, CAMERA order by ANALYSIS_ID) rnk
)
DELETE FROM cte WHERE cte.rnk > 1

You can use subquery :
select v.*
from tbl_video v
where analysis_id = (select min(v1.analysis_id)
from tbl_video v1
where v1.timestamp = v.timestamp and
v1.camera = v.camera
);
However, analytical function with top (1) with ties clause also useful :
select top (1) with ties v.*
from tbl_video v
order by row_number() over (partition by v.timestamp, v.camera order by v.analysis_id);
So, your delete version would be :
delete v
from tbl_video v
where analysis_id = (select min(v1.analysis_id)
from tbl_video v1
where v1.timestamp = v.timestamp and
v1.camera = v.camera
);

Related

MSSQL: How to increment a int-column grouped by another column?

Given the following table:
UserId | Idx
1 | 0
1 | 1
1 | 3
1 | 5
2 | 1
2 | 2
2 | 3
2 | 5
And I want to update the Idx column that it is correctly incremented grouped by UserId column:
UserId | Idx
1 | 0
1 | 1
1 | 2
1 | 3
2 | 0
2 | 1
2 | 2
2 | 3
I know its possible with T-SQL (with Cursor), but is it also possible with a single statement?
Thank you
You can use correlated subquery :
update t
set idx = coalesce((select count(*)
from table as t1
where t1.userid = t.userid and t1.idx < t.idx
), 0
);
Use ROW_NUMBER() with Partition
update tablex set Idx=A.Idx
from tablex T
inner join
(
select UserID ,ID,ROW_NUMBER() OVER (PARTITION BY UserID ORDER By UserID)-1 Idx
from tablex
) A on T.ID=A.ID
Use an updatable CTE:
with toupdate as (
select t.*,
row_number() over (partition by user_id order by idx) - 1 as new_idx
from t
)
update toupdate
set idx = new_idx
where new_idx <> new_idx;
This should be the fastest method for solving this problem.

Partitioning function for continuous sequences

There is a table of the following structure:
CREATE TABLE history
(
pk serial NOT NULL,
"from" integer NOT NULL,
"to" integer NOT NULL,
entity_key text NOT NULL,
data text NOT NULL,
CONSTRAINT history_pkey PRIMARY KEY (pk)
);
The pk is a primary key, from and to define a position in the sequence and the sequence itself for a given entity identified by entity_key. So the entity has one sequence of 2 rows in case if the first row has the from = 1; to = 2 and the second one has from = 2; to = 3. So the point here is that the to of the previous row matches the from of the next one.
The order to determine "next"/"previous" row is defined by pk which grows monotonously (since it's a SERIAL).
The sequence does not have to start with 1 and the to - from does not necessary 1 always. So it can be from = 1; to = 10. What matters is that the "next" row in the sequence matches the to exactly.
Sample dataset:
pk | from | to | entity_key | data
----+--------+------+--------------+-------
1 | 1 | 2 | 42 | foo
2 | 2 | 3 | 42 | bar
3 | 3 | 4 | 42 | baz
4 | 10 | 11 | 42 | another foo
5 | 11 | 12 | 42 | another baz
6 | 1 | 2 | 111 | one one one
7 | 2 | 3 | 111 | one one one two
8 | 3 | 4 | 111 | one one one three
And what I cannot realize is how to partition by "sequences" here so that I could apply window functions to the group that represents a single "sequence".
Let's say I want to use the row_number() function and would like to get the following result:
pk | row_number | entity_key
----+-------------+------------
1 | 1 | 42
2 | 2 | 42
3 | 3 | 42
4 | 1 | 42
5 | 2 | 42
6 | 1 | 111
7 | 2 | 111
8 | 3 | 111
For convenience I created an SQLFiddle with initial seed: http://sqlfiddle.com/#!15/e7c1c
PS: It's not the "give me the codez" question, I made my own research and I just out of ideas how to partition.
It's obvious that I need to LEFT JOIN with the next.from = curr.to, but then it's still not clear how to reset the partition on next.from IS NULL.
PS: It will be a 100 points bounty for the most elegant query that provides the requested result
PPS: the desired solution should be an SQL query not pgsql due to some other limitations that are out of scope of this question.
I don’t know if it counts as “elegant,” but I think this will do what you want:
with Lagged as (
select
pk,
case when lag("to",1) over (order by pk) is distinct from "from" then 1 else 0 end as starts,
entity_key
from history
), LaggedGroups as (
select
pk,
sum(starts) over (order by pk) as groups,
entity_key
from Lagged
)
select
pk,
row_number() over (
partition by groups
order by pk
) as "row_number",
entity_key
from LaggedGroups
Just for fun & completeness: a recursive solution to reconstruct the (doubly) linked lists of records. [ this will not be the fastest solution ]
NOTE: I commented out the ascending pk condition(s) since they are not needed for the connection logic.
WITH RECURSIVE zzz AS (
SELECT h0.pk
, h0."to" AS next
, h0.entity_key AS ek
, 1::integer AS rnk
FROM history h0
WHERE NOT EXISTS (
SELECT * FROM history nx
WHERE nx.entity_key = h0.entity_key
AND nx."to" = h0."from"
-- AND nx.pk > h0.pk
)
UNION ALL
SELECT h1.pk
, h1."to" AS next
, h1.entity_key AS ek
, 1+zzz.rnk AS rnk
FROM zzz
JOIN history h1
ON h1.entity_key = zzz.ek
AND h1."from" = zzz.next
-- AND h1.pk > zzz.pk
)
SELECT * FROM zzz
ORDER BY ek,pk
;
You can use generate_series() to generate all the rows between the two values. Then you can use the difference of row numbers on that:
select pk, "from", "to",
row_number() over (partition by entity_key, min(grp) order by pk) as row_number
from (select h.*,
(row_number() over (partition by entity_key order by ind) -
ind) as grp
from (select h.*, generate_series("from", "to" - 1) as ind
from history h
) h
) h
group by pk, "from", "to", entity_key
Because you specify that the difference is between 1 and 10, this might actually not have such bad performance.
Unfortunately, your SQL Fiddle isn't working right now, so I can't test it.
Well,
this not exactly one SQL query but:
select a.pk as PK, a.entity_key as ENTITY_KEY, b.pk as BPK, 0 as Seq into #tmp
from history a left join history b on a."to" = b."from" and a.pk = b.pk-1
declare #seq int
select #seq = 1
update #tmp set Seq = case when (BPK is null) then #seq-1 else #seq end,
#seq = case when (BPK is null) then #seq+1 else #seq end
select pk, entity_key, ROW_NUMBER() over (PARTITION by entity_key, seq order by pk asc)
from #tmp order by pk
This is in SQL Server 2008

how to query range?

Raw Data
| ID | STATUS |
| 1 | A |
| 2 | A |
| 3 | B |
| 4 | B |
| 5 | B |
| 6 | A |
| 7 | A |
| 8 | A |
| 9 | C |
Result
| START | END |
| 1 | 2 |
| 6 | 8 |
Range of STATUS A
How to query ?
This should give you the correct ranges:
SELECT
STATUS,
MIN(ID),
max_id
FROM (
SELECT
t1.STATUS,
t1.ID,
COALESCE(MAX(t2.ID), t1.ID) max_id
FROM
yourtable t1 LEFT JOIN yourtable t2
ON t1.STATUS=t2.STATUS AND t1.ID<t2.ID
WHERE
NOT EXISTS (SELECT NULL
FROM yourtable t3
WHERE
t3.STATUS!=t1.STATUS
AND t3.ID>t1.ID AND t3.ID<t2.ID)
GROUP BY
t1.ID,
t1.STATUS
) s
WHERE
status = 'A'
GROUP BY
STATUS,
max_id
Please see fiddle here.
You are probably better off with a cursor-based solution or a client-side function.
However, if you were using Oracle - the following would work.
WITH LOWER_VALS AS
( -- All the Ids with no immediate predecessor
SELECT ROWNUM AS RN, STATUS, ID AS LOWER FROM
(
SELECT STATUS, ID
FROM RAWDATA RD1
WHERE RD1.ID -1 NOT IN
(SELECT ID FROM RAWDATA PRED_TABLE WHERE PRED_TABLE.STATUS = RD1.STATUS)
ORDER BY STATUS, ID
)
) ,
UPPER_VALS AS
( -- All the Ids with no immediate successor
SELECT ROWNUM AS RN, STATUS, ID AS UPPER FROM
(
SELECT STATUS, ID
FROM RAWDATA RD2
WHERE RD2.ID +1 NOT IN
(SELECT ID FROM RAWDATA SUCC_TABLE WHERE SUCC_TABLE.STATUS = RD2.STATUS)
ORDER BY STATUS, ID
)
)
SELECT
L.STATUS, L.LOWER, U.UPPER
FROM
LOWER_VALS L
JOIN UPPER_VALS U ON
U.RN = L.RN;
Results in the set
A 1 2
A 6 8
B 3 5
C 9 9
http://sqlfiddle.com/#!4/10184/2
There is not a lot to go on from what you put, but I think this might work. I am using T-SQL because I don't know what you are using?
SELECT
min(ID)
, max(ID)
FROM RawData
WHERE [Status] = 'A'

How to add row data to first occurrence of string field

I have found many examples of how to remove duplicate rows, but they all involve rows with a unique integer id.
Here is what I need to know. I want to merge all the duplicate stringIDs together and sum the values of the other columns.
I have this:
stringID | v1 | v2 | v3
a | 2 | 3 | 4
b | 5 | 4 | 1
a | 1 | 1 | 2
b | 2 | 1 | 1
I want this:
stringID | v1 | v2 | v3
a | 3 | 4 | 6
b | 7 | 5 | 2
Thank you for the help.
EDIT
I am using MySQL
I think just a simple GROUP BY and SUM() should give you the results you're looking for:
SELECT
StringID,
SUM(v1) AS v1,
SUM(v2) AS v2,
SUM(v3) AS v3
FROM YourTable
GROUP BY StringID
See it in action with Sql Fiddle.
(disclaimer: with SQL-Server >= 2005)
So you want to update first with the sum of all records, then you want to delete the dups:
Update:
WITH CTE AS
(
SELECT stringID,
v1_sum = SUM(v1) OVER (PARTITION BY stringID),
v2_sum = SUM(v2) OVER (PARTITION BY stringID),
v3_sum = SUM(v3) OVER (PARTITION BY stringID),
RN = ROW_NUMBER()OVER(PARTITION BY stringId Order By stringId)
FROM dbo.TableName
)
UPDATE tn SET v1 = v1_sum, v2 = v2_sum, v3 = v3_sum
FROM CTE c
INNER JOIN dbo.TableName tn ON c.stringId=tn.stringId
WHERE c.RN = 1;
Delete:
WITH CTE AS
(
SELECT
RN = ROW_NUMBER()OVER(PARTITION BY stringId Order By stringId)
FROM dbo.TableName
)
DELETE FROM CTE WHERE RN > 1;
DEMO
Try this...
select stringID, sum(v1), sum(v2), sum(v3)
from yourtable
group by stringID

Grouping SQL Results based on order

I have table with data something like this:
ID | RowNumber | Data
------------------------------
1 | 1 | Data
2 | 2 | Data
3 | 3 | Data
4 | 1 | Data
5 | 2 | Data
6 | 1 | Data
7 | 2 | Data
8 | 3 | Data
9 | 4 | Data
I want to group each set of RowNumbers So that my result is something like this:
ID | RowNumber | Group | Data
--------------------------------------
1 | 1 | a | Data
2 | 2 | a | Data
3 | 3 | a | Data
4 | 1 | b | Data
5 | 2 | b | Data
6 | 1 | c | Data
7 | 2 | c | Data
8 | 3 | c | Data
9 | 4 | c | Data
The only way I know where each group starts and stops is when the RowNumber starts over. How can I accomplish this? It also needs to be fairly efficient since the table I need to do this on has 52 Million Rows.
Additional Info
ID is truly sequential, but RowNumber may not be. I think RowNumber will always begin with 1 but for example the RowNumbers for group1 could be "1,1,2,2,3,4" and for group2 they could be "1,2,4,6", etc.
For the clarified requirements in the comments
The rownumbers for group1 could be "1,1,2,2,3,4" and for group2 they
could be "1,2,4,6" ... a higher number followed by a lower would be a
new group.
A SQL Server 2012 solution could be as follows.
Use LAG to access the previous row and set a flag to 1 if that row is the start of a new group or 0 otherwise.
Calculate a running sum of these flags to use as the grouping value.
Code
WITH T1 AS
(
SELECT *,
LAG(RowNumber) OVER (ORDER BY ID) AS PrevRowNumber
FROM YourTable
), T2 AS
(
SELECT *,
IIF(PrevRowNumber IS NULL OR PrevRowNumber > RowNumber, 1, 0) AS NewGroup
FROM T1
)
SELECT ID,
RowNumber,
Data,
SUM(NewGroup) OVER (ORDER BY ID
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Grp
FROM T2
SQL Fiddle
Assuming ID is the clustered index the plan for this has one scan against YourTable and avoids any sort operations.
If the ids are truly sequential, you can do:
select t.*,
(id - rowNumber) as grp
from t
Also you can use recursive CTE
;WITH cte AS
(
SELECT ID, RowNumber, Data, 1 AS [Group]
FROM dbo.test1
WHERE ID = 1
UNION ALL
SELECT t.ID, t.RowNumber, t.Data,
CASE WHEN t.RowNumber != 1 THEN c.[Group] ELSE c.[Group] + 1 END
FROM dbo.test1 t JOIN cte c ON t.ID = c.ID + 1
)
SELECT *
FROM cte
Demo on SQLFiddle
How about:
select ID, RowNumber, Data, dense_rank() over (order by grp) as Grp
from (
select *, (select min(ID) from [Your Table] where ID > t.ID and RowNumber = 1) as grp
from [Your Table] t
) t
order by ID
This should work on SQL 2005. You could also use rank() instead if you don't care about consecutive numbers.