How do I write a query to delete duplicates in a table? - sql

Given a table resembling this one, called VehicleUser:
VehicleUserId | VehicleId | UserId
1 | 1001 | 2
2 | 1001 | 2
3 | 1001 | 2
4 | 1001 | 3
5 | 1001 | 3
6 | 1001 | 3
How do I write a query that can delete the duplicates? Rows 2 and 3 are identical to row 1 except for the VehicleUserId, and rows 5 and 6 are identical to row 4 except for the VehicleUserId.

;with cte as (
    select row_number() over
        (partition by VehicleId, UserId order by VehicleUserId) as rn
    from VehicleUser
)
delete from cte
where rn > 1;
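SQL Server lets you DELETE through the CTE directly; engines without that feature can nest the numbered rows in a subquery instead. A minimal sketch of the same keep-the-first-row idea, using Python's sqlite3 (table and sample rows taken from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE VehicleUser (VehicleUserId INTEGER PRIMARY KEY, VehicleId INT, UserId INT);
INSERT INTO VehicleUser VALUES (1,1001,2),(2,1001,2),(3,1001,2),
                               (4,1001,3),(5,1001,3),(6,1001,3);
""")
# Number the rows within each (VehicleId, UserId) group and delete
# everything after the first -- the lowest VehicleUserId survives.
conn.execute("""
DELETE FROM VehicleUser
WHERE VehicleUserId IN (
    SELECT VehicleUserId FROM (
        SELECT VehicleUserId,
               ROW_NUMBER() OVER (PARTITION BY VehicleId, UserId
                                  ORDER BY VehicleUserId) AS rn
        FROM VehicleUser)
    WHERE rn > 1)
""")
remaining = [r[0] for r in conn.execute(
    "SELECT VehicleUserId FROM VehicleUser ORDER BY VehicleUserId")]
print(remaining)  # [1, 4]
```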

You could filter the duplicates with an EXISTS clause, like:
delete v1
from VehicleUser v1
where exists
(
select *
from VehicleUser v2
where v1.VehicleId = v2.VehicleId
and v1.UserId = v2.UserId
and v1.VehicleUserId > v2.VehicleUserId
)
Before you run this, check if it works by replacing the delete with a select:
select *
from VehicleUser v1
where exists
(
...
The rows that show up will be deleted.
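The EXISTS delete ports to most engines. A sketch in Python's sqlite3, with the question's sample rows (SQLite doesn't allow an alias on the delete target, so the correlated subquery references the outer table by name):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE VehicleUser (VehicleUserId INTEGER PRIMARY KEY, VehicleId INT, UserId INT);
INSERT INTO VehicleUser VALUES (1,1001,2),(2,1001,2),(3,1001,2),
                               (4,1001,3),(5,1001,3),(6,1001,3);
""")
# Delete every row for which a matching (VehicleId, UserId) row with a
# smaller VehicleUserId exists -- only the lowest id per pair survives.
conn.execute("""
DELETE FROM VehicleUser
WHERE EXISTS (
    SELECT 1 FROM VehicleUser v2
    WHERE v2.VehicleId = VehicleUser.VehicleId
      AND v2.UserId = VehicleUser.UserId
      AND v2.VehicleUserId < VehicleUser.VehicleUserId)
""")
survivors = [r[0] for r in conn.execute(
    "SELECT VehicleUserId FROM VehicleUser ORDER BY 1")]
print(survivors)  # [1, 4]
```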

Here's a query that gives you the unique values:
select vehicleid, userid, min(vehicleuserid) as min_id
from vehicleuser
group by vehicleid, userid
You can put them in a new table before deleting anything, to make sure you have what you want; then empty vehicleUser and reload it from the new table, or use an outer join to delete the rows from vehicleUser that aren't in the new table. Verifying before deleting rows is safer.
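A sketch of that keep-the-minimums route in Python's sqlite3: copy the unique combinations into a new table, inspect it, then delete every row whose id isn't recorded there (the table name `keepers` is made up for the illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vehicleuser (vehicleuserid INTEGER PRIMARY KEY, vehicleid INT, userid INT);
INSERT INTO vehicleuser VALUES (1,1001,2),(2,1001,2),(3,1001,2),
                               (4,1001,3),(5,1001,3),(6,1001,3);

-- one row per (vehicleid, userid), remembering the lowest id
CREATE TABLE keepers AS
    SELECT vehicleid, userid, MIN(vehicleuserid) AS min_id
    FROM vehicleuser
    GROUP BY vehicleid, userid;

-- anything not recorded as a keeper is a duplicate
DELETE FROM vehicleuser
WHERE vehicleuserid NOT IN (SELECT min_id FROM keepers);
""")
kept = conn.execute("SELECT vehicleuserid FROM vehicleuser ORDER BY 1").fetchall()
print(kept)  # [(1,), (4,)]
```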

I don't think you can do this purely in a single query.
I'd do a grouped query to find the duplicates, then iterate the results, deleting all but the first VehicleUserId row.
select VehicleId, UserId
from VehicleUser
group by VehicleId, UserId
having count(*) > 1
This will get you the VehicleId/UserId combinations for which there are duplicates.

Related

Finding SQL duplicates - two methods different results

I have a table in which duplicates may appear. A duplicate is considered when:
sector_id, departament_id, numer_id are the same (I will add that these are foreign keys to other tables, in case that is important)
and valid_to is null
I did this with two queries:
1.
select count(*) from(
select sector_id, departament_id,numer_id, count(*) from tables.workspace
where valid_to is null
group by 1,2,3
having count(*) >1 ) as r
--results : 650
2.
with duplicate_rows as
(
select *, count(id) over (partition by sector_id, departament_id, numer_id) duplicate_count from tables.workspace where valid_to is null
)
select count(*) from
(
select * from duplicate_rows where duplicate_count >1
) as t
--results : 3655
Please explain what I'm doing wrong, why these two queries return different values, and which of them is correct.
Your second query is the wrong one.
You're using a window function and selecting everything in your CTE, which means that every record will have the total COUNT for each combination of your partition by fields.
For example, if there are 3 records with sector_id = 'A', departament_id = 'RED', numer_id = 1, your CTE will look like this:
sector_id | departament_id | numer_id | duplicate_count
------------+----------------+----------+-----------------
A | RED | 1 | 3
A | RED | 1 | 3
A | RED | 1 | 3
Which means that your second query will return 3 instead of 1.
Try adding a DISTINCT to the query that selects from the CTE and it should give you the same results as your first query.
select distinct * from duplicate_rows where duplicate_count >1
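The gap between the two counts can be reproduced in a few lines with Python's sqlite3 (one combination duplicated three times plus one unique combination; note the DISTINCT only collapses the rows when the select list leaves out per-row columns such as id):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE workspace (id INTEGER PRIMARY KEY, sector_id TEXT,
                        departament_id TEXT, numer_id INT, valid_to TEXT);
INSERT INTO workspace VALUES
    (1,'A','RED',1,NULL), (2,'A','RED',1,NULL), (3,'A','RED',1,NULL),
    (4,'B','BLUE',2,NULL);
""")
# Query 1: one row per duplicated combination.
grouped = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT sector_id, departament_id, numer_id
        FROM workspace WHERE valid_to IS NULL
        GROUP BY 1,2,3 HAVING COUNT(*) > 1)
""").fetchone()[0]
# Query 2: the window count tags every member of the combination.
windowed = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT *, COUNT(id) OVER (PARTITION BY sector_id, departament_id,
                                  numer_id) AS duplicate_count
        FROM workspace WHERE valid_to IS NULL)
    WHERE duplicate_count > 1
""").fetchone()[0]
print(grouped, windowed)  # 1 3
```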

Find the count of IDs that have the same value

I'd like to get a count of all of the IDs that have the same value (Drops) as other IDs. For instance, the illustration below shows that IDs 1 and 3 have A drops, so the query would count them. Similarly, IDs 7 and 18 have B drops, so that's another two IDs the query would count, totalling 4 IDs that share the same values, which is what my query would return.
+------+-------+
| ID | Drops |
+------+-------+
| 1 | A |
| 2 | C |
| 3 | A |
| 7 | B |
| 18 | B |
+------+-------+
I've tried several approaches; the following query was my last attempt.
With cte1 (Id1, D1) as
(
select Id, Drops
from Posts
),
cte2 (Id2, D2) as
(
select Id, Drops
from Posts
)
Select count(distinct c1.Id1) newcnt, c1.D1
from cte1 c1
left outer join cte2 c2 on c1.D1 = c2.D2
group by c1.D1
The result if written out in full would be a single value output but the records that the query should be choosing should look as follows:
+------+-------+
| ID | Drops |
+------+-------+
| 1 | A |
| 3 | A |
| 7 | B |
| 18 | B |
+------+-------+
Any advice would be great. Thanks
You can use a CTE to generate a list of Drops values that have more than one corresponding ID value, and then JOIN that to Posts to find all rows which have a Drops value that has more than one Post:
WITH CTE AS (
SELECT Drops
FROM Posts
GROUP BY Drops
HAVING COUNT(*) > 1
)
SELECT P.*
FROM Posts P
JOIN CTE ON P.Drops = CTE.Drops
Output:
ID Drops
1 A
3 A
7 B
18 B
If desired you can then count those posts in total (or grouped by Drops value):
WITH CTE AS (
SELECT Drops
FROM Posts
GROUP BY Drops
HAVING COUNT(*) > 1
)
SELECT COUNT(*) AS newcnt
FROM Posts P
JOIN CTE ON P.Drops = CTE.Drops
Output
newcnt
4
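Both statements above run unchanged on SQLite; a quick harness in Python's sqlite3 with the question's five rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Posts (ID INT, Drops TEXT);
INSERT INTO Posts VALUES (1,'A'),(2,'C'),(3,'A'),(7,'B'),(18,'B');
""")
# CTE of Drops values shared by more than one post, joined back to Posts.
rows = conn.execute("""
    WITH CTE AS (
        SELECT Drops FROM Posts GROUP BY Drops HAVING COUNT(*) > 1)
    SELECT P.ID, P.Drops
    FROM Posts P JOIN CTE ON P.Drops = CTE.Drops
    ORDER BY P.ID
""").fetchall()
print(rows)       # [(1, 'A'), (3, 'A'), (7, 'B'), (18, 'B')]
print(len(rows))  # 4, the newcnt from the second query
```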
You may use dense_rank() to resolve your problem: within each drops partition, dense_rank() assigns each distinct ID its own rank, so counting distinct ranks counts the IDs that share that drops value.
with cte as
(
select
drops,
count(distinct rnk) as newCnt
from
( select
*,
dense_rank() over (partition by drops order by id) as rnk
from myTable
) t
group by
drops
having count(distinct rnk) > 1
)
select
sum(newCnt) as newCnt
from cte
Output:
|newcnt |
|------ |
| 4 |
First group the count of the ids for your drops and then sum the values greater than 1.
select sum(countdrops) as total from
(select drops , count(id) as countdrops from yourtable group by drops) as temp
where countdrops > 1;
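This group-then-sum version also runs as-is on SQLite; a sketch in Python's sqlite3 with the question's rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Posts (ID INT, Drops TEXT);
INSERT INTO Posts VALUES (1,'A'),(2,'C'),(3,'A'),(7,'B'),(18,'B');
""")
# Count IDs per Drops value, then sum only the groups larger than one.
total = conn.execute("""
    SELECT SUM(countdrops) FROM
        (SELECT Drops, COUNT(ID) AS countdrops
         FROM Posts GROUP BY Drops) AS temp
    WHERE countdrops > 1
""").fetchone()[0]
print(total)  # 4
```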

Query to get all distinct lines adding a column indicating a sum of each duplicate

What I'm Looking for:
I need to get a list from SQL Server of all IDs, but each ID has multiple lines.
Some lines for each ID are system updates, so the query doesn't need to take them into account.
In other words:
I need to get the whole list, counting for each ID all the lines that are not from the system.
The Database its looks like below:
ID | linenumber| data, data, ... data|Requesto| data, data
1 | 1 |.....................|JUAN |...........
1 | 2 |.....................|SYSTEM |...........
2 | 1 |.....................|Matias |...........
2 | 2 |.....................|Matias |...........
2 | 3 |.....................|Matias |...........
And I need to get:
ID | CantRoWs |.....................|WHO is |...........
1 | 1 |.....................|JUAN |...........
2 | 3 |.....................|Matias |...........
I was thinking about using a temp query like below but it does not work.
with temp as
(
SELECT OVER (PARTITION BY szCID ORDER BY gdReceived desc) as RowNum,*
FROM TABLE1;
)
SELECT *, (Select count(szCID) from TABLE1 where szAccount <> 'system') AS Hits From temp
WHERE RowNum = 1
Any ideas?
I would suggest you start by using row_number() and count() inside the common table expression:
WITH temp
AS (
SELECT
*
, ROW_NUMBER() OVER (PARTITION BY szCID ORDER BY gdReceived DESC) AS RowNum
, COUNT(*) OVER (PARTITION BY szCID) as hits
FROM TABLE1
WHERE szAccount <> 'system'
)
SELECT
*
FROM temp
WHERE RowNum = 1
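SQLite's window functions accept the same pattern; a sketch with Python's sqlite3 (the column names szCID, gdReceived, szAccount follow the question, while the sample rows are invented to mirror its illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TABLE1 (szCID INT, gdReceived TEXT, szAccount TEXT);
INSERT INTO TABLE1 VALUES
    (1,'2020-01-01','JUAN'),   (1,'2020-01-02','system'),
    (2,'2020-01-01','Matias'), (2,'2020-01-02','Matias'),
    (2,'2020-01-03','Matias');
""")
# Filter out system rows first, then number and count per szCID.
rows = conn.execute("""
    WITH temp AS (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY szCID
                                  ORDER BY gdReceived DESC) AS RowNum,
               COUNT(*) OVER (PARTITION BY szCID) AS hits
        FROM TABLE1
        WHERE szAccount <> 'system')
    SELECT szCID, hits FROM temp WHERE RowNum = 1 ORDER BY szCID
""").fetchall()
print(rows)  # [(1, 1), (2, 3)]
```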

Delete rows except for one for every id

I have a dataset with multiple ids. For every id there are multiple entries. Like this:
--------------
| ID | Value |
--------------
| 1 | 3 |
| 1 | 4 |
| 1 | 2 |
| 2 | 1 |
| 2 | 2 |
| 3 | 3 |
| 3 | 5 |
--------------
Is there a SQL DELETE query to delete (random) rows for every id, except for one (random rows would be nice but is not essential)? The resulting table should look like this:
--------------
| ID | Value |
--------------
| 1 | 2 |
| 2 | 1 |
| 3 | 5 |
--------------
Thanks!
It doesn't look like HSQLDB fully supports OLAP functions (in this case row_number() over (partition by ...)), so you'll need to use a derived table to identify the one value you want to keep for each ID. It certainly won't be random, but I don't think anything else will be either. Something like so:
This query will give you the first part:
select
    id,
    min(value) as minval
from
    <your table>
group by id
Then you can delete from your table where you don't match:
delete t1
from
    <your table> t1
    inner join
    (
        select
            id,
            min(value) as minval
        from
            <your table>
        group by id
    ) t2
        on t1.id = t2.id
        and t1.value <> t2.value
Try this:
alter ignore table a add unique(id);
Here a is the table name
This should do what you want:
SELECT ID, Value
FROM (SELECT ID, Value, ROW_NUMBER() OVER(PARTITION BY ID ORDER BY NEWID()) AS RN
FROM #Table) AS A
WHERE A.RN = 1
I tried the given answers with HSQLDB, but it refused to execute those queries for different reasons (a join is not allowed in a delete query; ignore is not allowed in an alter query). Thanks to Andrew I came up with this solution, which is a little more roundabout but allows deleting random rows:
Add a new column for random values:
ALTER TABLE <table> ADD COLUMN rand INT
Fill this column with random data:
UPDATE <table> SET rand = RAND() * 1000000
Delete all rows which don't have the minimum random value for their id:
DELETE FROM <table> WHERE rand NOT IN (SELECT MIN(rand) FROM <table> GROUP BY id)
Drop the random column:
ALTER TABLE <table> DROP rand
For larger tables you probably should ensure that the random values are unique, but this worked perfectly for me.
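The same four steps can be sketched against SQLite with Python's sqlite3 (SQLite spells the random function random() and yields 64-bit integers, so collisions are already unlikely without the * 1000000 scaling; sample rows from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INT, value INT);
INSERT INTO t VALUES (1,3),(1,4),(1,2),(2,1),(2,2),(3,3),(3,5);
""")
conn.executescript("""
ALTER TABLE t ADD COLUMN rand INT;                -- step 1: new column
UPDATE t SET rand = abs(random());                -- step 2: random fill
DELETE FROM t WHERE rand NOT IN                   -- step 3: keep one per id
    (SELECT MIN(rand) FROM t GROUP BY id);
-- step 4 (drop the column) needs SQLite 3.35+, so it is skipped here
""")
ids = sorted(r[0] for r in conn.execute("SELECT id FROM t"))
print(ids)  # [1, 2, 3]
```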

Updating Single Row per Group

The Background
I have a temporary table containing information including a unique rowID, OrderNumber, and guestCount. RowID and OrderNumber already exist in this table, and I am running a new query to fill in the missing guestCount for each orderNumber. I would like to then update the temp table with this information.
Example
What I currently have looks something like this, with only RowID being unique, meaning that there can be multiple items having the same OrderNumber.
RowID | OrderNumber | guestCount
1 | 30001 | 0
2 | 30002 | 0
3 | 30002 | 0
4 | 30003 | 0
My query returns the following table, only returning one total number of guests per orderNumber:
OrderNumber | guestCount
30001 | 3
30002 | 10
30003 | 5
The final table should look like:
RowID | OrderNumber | guestCount
1 | 30001 | 3
2 | 30002 | 10
3 | 30002 | 0
4 | 30003 | 5
I'm only interested in updating one (doesn't matter which) entry per orderNumber, but my current logic is resulting in errors:
UPDATE temp
SET temp.guestCount = cc.guestCount
FROM( SELECT OrderNumber, guestCount
FROM (SELECT OrderNumber, guestCount, RowID = MIN(RowID)
FROM #tempTable
GROUP BY RowID, OrderNumber, guestCount) t)temp
INNER JOIN queryTable q ON temp.OrderNumber = q.OrderNumber
I'm not sure if this logic is even a valid way of doing this, but I do know that I'm getting errors in my update due to the fact that I'm using an aggregate function, as well as a GROUP function. Is there any way to go about this operation differently?
You can define the row to update by using row_number() in a CTE. This identifies the first row in the group for the update:
with toupdate as (
    select tt.*,
           row_number() over (partition by OrderNumber order by RowID) as seqnum
    from #tempTable tt
)
UPDATE toupdate
SET guestCount = q.guestCount
FROM toupdate
INNER JOIN queryTable q
    ON toupdate.OrderNumber = q.OrderNumber
WHERE toupdate.seqnum = 1;
The problem with your query is that temp is based on an aggregation subquery. Such a subquery is not updatable, because it does not have a 1-1 relationship with the rows of the original table. A CTE using row_number() is updatable. In addition, your SET statement uses the table alias cc, which is not defined in the query.
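On engines without UPDATE ... FROM, the same one-row-per-group update can be written with a correlated subquery plus a MIN(RowID) filter. A sketch in Python's sqlite3, using the question's sample data (temp_orders stands in for the #tempTable, which is SQL Server syntax):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE temp_orders (RowID INT, OrderNumber INT, guestCount INT);
INSERT INTO temp_orders VALUES (1,30001,0),(2,30002,0),(3,30002,0),(4,30003,0);
CREATE TABLE queryTable (OrderNumber INT, guestCount INT);
INSERT INTO queryTable VALUES (30001,3),(30002,10),(30003,5);
""")
# Update only the lowest RowID of each OrderNumber group.
conn.execute("""
    UPDATE temp_orders
    SET guestCount = (SELECT q.guestCount FROM queryTable q
                      WHERE q.OrderNumber = temp_orders.OrderNumber)
    WHERE RowID IN (SELECT MIN(RowID) FROM temp_orders GROUP BY OrderNumber)
""")
rows = conn.execute("SELECT * FROM temp_orders ORDER BY RowID").fetchall()
print(rows)  # [(1, 30001, 3), (2, 30002, 10), (3, 30002, 0), (4, 30003, 5)]
```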