I have an sql server database, that I pre-loaded with a ton of rows of data.
Unfortunately, there is no primary key in the database, and there is now duplicate information in the table. I'm not concerned about there not being a primary key, but i am concerned about there being duplicates in the database...
Any thoughts? (Forgive me for being an sql server newb)
Well, this is one reason why you should have a primary key on the table. What version of SQL Server? For SQL Server 2005 and above:
;WITH r AS
(
SELECT col1, col2, col3, -- whatever columns make a "unique" row
rn = ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY col1)
FROM dbo.SomeTable
)
DELETE r WHERE rn > 1;
Then, so you don't have to do this again tomorrow, and the next day, and the day after that, declare a primary key on the table.
Let's say your table is unique by COL1 and COL2.
Here is a way to do it:
SELECT *
FROM (SELECT COL1, COL2, ROW_NUMBER() OVER (PARTITION BY COL1, COL2 ORDER BY COL1, COL2 ASC) AS ROWID
FROM TABLE_NAME )T
WHERE T.ROWID > 1
The ROWID > 1 will enable you to select only the duplicated rows.
Related
I need to remove the duplicates from a table with 139 columns based on 2 columns and load the unique rows with 139 columns into another table.
eg :
col1 col2 col3 .....col139
a b .............
b c .............
a b .............
o/p:
col1 col2 col3 .....col139
a b .............
b c .............
need a SQL query for DB2?
If the "other table" does not exist yet you can create it like this
CREATE TABLE othertable LIKE originaltable
And the insert the requested row with this statement:
INSERT INTO othertable
SELECT col1,...,coln
FROM (SELECT
t.*,
ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col1) AS num
FROM t) t
WHERE num = 1
There are numerous tools out there that generate queries and column lists - so if you do not want to write it by hand you could generate it with these tools or use another SQL statement to select it from the Db2 catalog table (syscat.columns).
You might be better just deleting the duplicates in place. This can be done without specifying a column list.
DELETE FROM
( SELECT
ROW_NUMBER() OVER (PARTITION BY col1, col2) AS DUP
FROM t
)
WHERE
DUP > 1
You can use row_number():
select t.*
from (select t.*,
row_number() over (partition by a, b order by a) as seqnum
from t
) t;
If you don't want seqnum in the result set, though, you need to list out all the columns.
To find duplicate values in col1 or any column, you can run the following query:
SELECT col1 FROM your_table GROUP BY col1 HAVING COUNT(*) > 1;
And if you want to delete those duplicate rows using the value of col1, you can run the following query:
DELETE FROM your_table WHERE col1 IN (SELECT col1 FROM your_table GROUP BY col1 HAVING COUNT(*) > 1);
You can use the same approach to delete duplicate rows from the table using col2 values.
How to Delete Duplicate records from Table in SQL Server ?
WITH CTE AS(
SELECT [col1], [col2], [col3], [col4], [col5], [col6], [col7],
RN = ROW_NUMBER()OVER(PARTITION BY col1 ORDER BY col1)
FROM dbo.Table1
)
DELETE FROM CTE WHERE RN > 1
Example
To delete rows where the combination of columns col_1, col_2, ... col_n are duplicates, you can use a common table expression;
WITH cte AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY col1, ..., col_n ORDER BY col_1) AS rn
FROM table_1
)
DELETE FROM cte WHERE rn<>1;
Since the rows are classified by the contents of the listed columns, if the rows are identical in all ways, you'll still need to list all columns in the query.
As always, test and/or backup before running deletes from random people on the Internet on your production data.
The following questions and the answer given there could be the best help for you
Remove Duplicate Records
You can select the min and max of the rowId (if there is and identity field otherwise add one)
DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
SELECT MIN(RowId) as RowId, Col1, Col2, Col3
FROM MyTable
GROUP BY Col1, Col2, Col3
) as KeepRows ON
MyTable.RowId = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL
Use
Source
Add an identity column to your table:
Alter table tbl_name add Id int identity(1,1)
Then run following query for deleting records from table:
Delete from tbl_Name where Id not in(select min(Id) from tbl_Name group by RowId)
//duplicate_data_table contains duplicate values
create temp as (select distinct * from duplicate_data_table);
drop duplicate_data_table;
create duplicate_data_table as (select * from temp);
drop temp;
Sometimes I wish to perform a join whereby I take the largest value of one column. Doing this I have to perform a max() and a groupby- which prevents me from retrieving the other columns from the row which was the max (beause they were not contained in a GROUP BY or aggregate function).
To fix this, I join the max value back on the original data source, to get the other columns. However, my problem is that this sometimes returns more than one row.
So, so far I have something like:
SELECT * FROM
(SELECT Col1, Max(Col2) FROM Table GROUP BY Col1) tab1
JOIN
(SELECT Col1, Col2 FROM Table) tab2
ON tab1.Col2 = tab2.Col2
If the above query now returns three rows (which match the largest value for column2) I have a bit of a headache.
If there was an extra column- col3 and for the rows returned by the above query, I only wanted to return the one which was, say the minimum Col3 value- how would I do this?
If you are using SQL Server 2005+. Then you can do it like this:
CTE way
;WITH CTE
AS
(
SELECT
ROW_NUMBER() OVER(PARTITION BY Col1 ORDER BY Col2 DESC) AS RowNbr,
table.*
FROM
table
)
SELECT
*
FROM
CTE
WHERE
CTE.RowNbr=1
Subquery way
SELECT
*
FROM
(
SELECT
ROW_NUMBER() OVER(PARTITION BY Col1 ORDER BY Col2 DESC) AS RowNbr,
table.*
FROM
table
) AS T
WHERE
T.RowNbr=1
As I got it can be something like this
SELECT * FROM
(SELECT Col1, Max(Col2) FROM Table GROUP BY Col1) tab1
JOIN
(SELECT Col1, Col2 FROM Table) tab2
ON tab1.Col2 = tab2.Col2 and Col3 = (select min(Col3) from table )
Assuming you are using SQL-Server 2005 or later You can make use of Window functions here. I have chosen ROW_NUMBER() but it is not hte only option.
;WITH T AS
( SELECT *,
ROW_NUMBER() OVER(PARTITION BY Col1 ORDER BY Col2 DESC) [RowNumber]
FROM Table
)
SELECT *
FROM T
WHERE RowNumber = 1
The PARTITION BY within the OVER clause is equivalent to your group by in your subquery, then your ORDER BY determines the order in which to start numbering the rows. In this case Col2 DESC to start with the highest value of col2 (Equivalent to your MAX statement).
This is my view:
Create View [MyView] as
(
Select col1, col2, col3 From Table1
UnionAll
Select col1, col2, col3 From Table2
)
I need to add a new column named Id and I need to this column be unique so I think to add new column as identity. I must mention this view returned a large of data so I need a way with good performance, And also I use two select query with union all I think this might be some complicated so what is your suggestion?
Use the ROW_NUMBER() function in SQL Server 2008.
Create View [MyView] as
SELECT ROW_NUMBER() OVER( ORDER BY col1 ) AS id, col1, col2, col3
FROM(
Select col1, col2, col3 From Table1
Union All
Select col1, col2, col3 From Table2 ) AS MyResults
GO
The view is just a stored query that does not contain the data itself so you can add a stable ID. If you need an id for other purposes like paging for example, you can do something like this:
create view MyView as
(
select row_number() over ( order by col1) as ID, col1 from (
Select col1 From Table1
Union All
Select col1 From Table2
) a
)
There is no guarantee that the rows returned by a query using ROW_NUMBER() will be ordered exactly the same with each execution unless the following conditions are true:
Values of the partitioned column are unique. [partitions are parent-child, like a boss has 3 employees][ignore]
Values of the ORDER BY columns are unique. [if column 1 is unique, row_number should be stable]
Combinations of values of the partition column and ORDER BY columns are unique. [if you need 10 columns in your order by to get unique... go for it to make row_number stable]"
There is a secondary issue here, with this being a view. Order By's don't always work in views (long-time sql bug). Ignoring the row_number() for a second:
create view MyView as
(
select top 10000000 [or top 99.9999999 Percent] col1
from (
Select col1 From Table1
Union All
Select col1 From Table2
) a order by col1
)
Using "row_number() over ( order by col1) as ID" is very expensive.
This way is much more efficient in cost:
Create View [MyView] as
(
Select ID = isnull(cast(newid() as varchar(40)), '')
, col1
, col2
, col3
From Table1
UnionAll
Select ID = isnull(cast(newid() as varchar(40)), '')
, col1
, col2
, col3
From Table2
)
use ROW_NUMBER() with "order by (select null)" this will be less expensive and will get your result.
Create View [MyView] as
SELECT ROW_NUMBER() over (order by (select null)) as id, *
FROM(
Select col1, col2, col3 From Table1
Union All
Select col1, col2, col3 From Table2 ) R
GO
Strange question, I know. I don't want to delete all the rows and start again, but we have a development database table where some of the rows have duplicate IDs, but different values.
I want to delete all records with duplicate IDs, so I can force data integrity on the table for the new version and build relationships. At the moment it's an ID that is inserted and generated by code (legacy).
From another question I got this:
delete
t1
from
tTable t1, tTable t2
where
t1.locationName = t2.locationName and
t1.id > t2.id
But this won't work as the IDs are the same!
How can I delete all but one record where IDs are the same? That is, delete where the count of records with the same ID > 1? If that's not possible, then deleting all records with duplicate IDs would be fine.
In SQL Server 2005 and above:
WITH q AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY locationName ORDER BY id) rn
FROM tTable
)
DELETE
FROM q
WHERE rn > 1
Depends on your DB server, but you can associate DELETE and LIMIT (mysql) or TOP (sql server).
You could also move the first (not duplicate) of each record to a temp table, delete the original table and copy the temp table back to the original one.
Not sure for mysql but for a MSServer database you could use the following
SET IDENTITY_INSERT [tablename] ON
SELECT DISTINCT col1, col2, col3 INTO temp_[tablename] FROM [tablename]
ALTER TABLE temp_[tablename] ADD IDcol INT IDENTITY
TRUNCATE TABLE [tablename]
INSERT INTO [tablename](IDcol, col1, col2, col3) SELECT IDcol, col1, col2, col3 FROM temp_[tablename]
DROP TABLE temp_[tablename]
Hope this helps.