Inserting unique rows into a table where they do not exist - SQL

I am using postgres 8.4.
I am merging several tables into one. There are duplicates both within and across tables. The new table will have a unique constraint. I have inserted the first table into the new big table without trouble, but when trying to add the second table I get an error. I have tried:
INSERT INTO big_table(id, col1, col2)
SELECT DISTINCT ON (id)
       id,
       col1,
       col2
FROM table2
WHERE NOT EXISTS (
    SELECT id, col1, col2
    FROM big_table
    WHERE (big_table.id = table2.id))
I get the following error:
invalid reference to FROM-clause entry for table "big_table"
LINE 13: ...big_table WHERE (big_table.id = table2.id))
HINT: There is an entry for table "big_table", but it cannot be
referenced from this part of the query.
I think it might have something to do with the fact that big_table changes, but I'm not sure how else to exclude rows that already exist in the table.

Not directly related to your question, but you can instead UNION all the tables before creating the big table, which removes the duplicates:
CREATE TABLE big_table AS
SELECT id, col1, col2 FROM Table1
UNION
SELECT id, col1, col2 FROM Table2
....
UNION
SELECT id, col1, col2 FROM TableN
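Note that UNION only removes rows that are identical in every selected column. If two source rows share an id but differ in col1 or col2, both will survive the UNION and then violate the unique constraint. A hedged sketch of a stricter variant, assuming id is the dedup key and PostgreSQL's DISTINCT ON is acceptable:
CREATE TABLE big_table AS
SELECT DISTINCT ON (id) id, col1, col2
FROM (
    SELECT id, col1, col2 FROM Table1
    UNION ALL
    SELECT id, col1, col2 FROM Table2
    -- ... remaining source tables ...
) AS all_rows
ORDER BY id;   -- which duplicate row is kept per id is arbitrary unless more ORDER BY columns are added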
You can also use a CTE to solve the self-reference problem:
WITH cte AS (
    SELECT DISTINCT ON (id)
           id,
           col1,
           col2
    FROM table2
    WHERE NOT EXISTS (
        SELECT id, col1, col2
        FROM big_table
        WHERE (big_table.id = table2.id))
)
INSERT INTO big_table
SELECT *
FROM cte
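If the correlated NOT EXISTS keeps giving trouble, an equivalent anti-join formulation (just a sketch, using the same assumed columns) is:
INSERT INTO big_table (id, col1, col2)
SELECT DISTINCT ON (t2.id) t2.id, t2.col1, t2.col2
FROM table2 t2
LEFT JOIN big_table b ON b.id = t2.id
WHERE b.id IS NULL;   -- keeps one arbitrary row per id that is not yet in big_table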

Related

Create a view of a table with a column that has multiple values

I have a table (Table1) like the following:
Col1      Col2
First     Code1,Code2,Code3
Second    Code2
So Col2 can contain multiple comma-separated values. I have another table (Table2) that contains this:
ColA      ColB
Code1     Value1
Code2     Value2
Code3     Value3
I need to create a view that joins the two tables (Table1 and Table2) and returns something like this:
Col1      Col2
First     Value1,Value2,Value3
Second    Value2
Is that possible? (I'm on Oracle DB if that helps.)
It's a violation of first normal form to have a list in a column value like that. It causes a lot of difficulties in a relational database, like the one you are encountering now.
However, you can get what you want by using the LIKE operator to find colA values that are substrings of the Col2 column. Add delimiters before and after to catch the first and last ones. Then aggregate back up to a single list using LISTAGG.
SELECT table1.col1,
       LISTAGG(table2.colB, ',') WITHIN GROUP (ORDER BY table2.colB) value_list
FROM   table1,
       table2
WHERE  ',' || table1.col2 || ',' LIKE '%,' || table2.colA || ',%'
GROUP BY table1.col1
This will not perform well on large volumes: without an equijoin the optimizer will use nested loops, and an index cannot be used for a LIKE predicate with a leading %. The combination of nested loops and full table scans (FTS) is not pleasant with large volumes of data. Therefore, if this is your situation, you will need to fix the 1NF problem by transforming table1 into normal relational format, and then join it to table2 with an equijoin, which enables a hash join instead. So:
SELECT table1.col1,
       LISTAGG(table2.colB, ',') WITHIN GROUP (ORDER BY table2.colB) value_list
FROM   (SELECT t.col1,
               SUBSTR(t.col2,
                      INSTR(t.col2, ',', 1, seq) + 1,
                      INSTR(t.col2, ',', 1, seq + 1) - (INSTR(t.col2, ',', 1, seq) + 1)) col2_piece
        FROM   (SELECT col1,
                       ',' || col2 || ',' col2
                FROM   table1) t,
               (SELECT ROWNUM seq FROM dual CONNECT BY LEVEL < 10) x) table1,
       table2
WHERE  table1.col2_piece IS NOT NULL
AND    table1.col2_piece = table2.colA
GROUP BY table1.col1
If you want the values in the list to appear in the same order as the corresponding codes in Col2, then you can use:
SELECT t1.col1,
       LISTAGG(t2.colB, ',') WITHIN GROUP (
           ORDER BY INSTR(',' || t1.col2 || ',', ',' || t2.colA || ',')
       ) AS value2
FROM   table1 t1
       INNER JOIN table2 t2
       ON INSTR(',' || t1.col2 || ',', ',' || t2.colA || ',') > 0
GROUP BY t1.col1
Which, for the sample data:
CREATE TABLE Table1 (Col1, Col2) AS
SELECT 'First', 'Code1,Code2,Code3' FROM DUAL UNION ALL
SELECT 'Second', 'Code2' FROM DUAL;
CREATE TABLE Table2 (ColA, ColB) AS
SELECT 'Code1', 'XXXX' FROM DUAL UNION ALL
SELECT 'Code2', 'ZZZZ' FROM DUAL UNION ALL
SELECT 'Code3', 'YYYY' FROM DUAL;
Outputs:
COL1      VALUE2
First     XXXX,ZZZZ,YYYY
Second    ZZZZ

Weird results from: Create table .. as select from

Could it be that the following query gives weird results (without errors):
CREATE TABLE MY_TABLE AS (
    SELECT COL_1, COL2
    FROM EXISTING_TABLE_1
    UNION
    SELECT COL_1, COL2
    FROM EXISTING_TABLE_2
    WHERE key_id NOT IN (
        SELECT key_id
        FROM (
            SELECT COL1, COL2
            FROM EXISTING_TABLE_3
            UNION
            SELECT COL1, COL2
            FROM EXISTING_TABLE_4
        ) A
    )
) WITH DATA
When I run similar code, but with real table names and data, the resulting table has, for example, 250K records, while when I run just the SELECT part (everything between the brackets) I get 300K+ records.
Is CREATE TABLE ... AS (SELECT ...) WITH DATA known for problems like this?
FYI: I don't get any errors; I only noticed this a little late, when doing analysis.

Delete duplicate rows in Teradata

I'm using this query to get all the duplicate rows:
SELECT count(*), col1, col2 FROM table GROUP BY col1, col2 HAVING count(*) > 1
I tried this query:
DELETE FROM table WHERE (col1, col2) IN (SELECT count(*), col1, col2 FROM table GROUP BY col1, col2
HAVING count(*) > 1)
but it doesn't work because of the count(*) in the SELECT list.
How can I delete all the duplicate rows found by this query?
Thanks
You can do this using the query below, if ROWID usage is enabled:
delete from table
where rowid not in (
    select max(rowid) from table
    group by col1, col2)
OR
You can copy all data to a new SET table (which gets rid of the duplicates),
remove all records from the main table,
and re-insert all records from the newly created SET table into the main table (a rough sketch follows below).
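A sketch of that SET-table route (the table name mytable is assumed; in Teradata an INSERT ... SELECT into a SET table silently discards exact duplicate rows):
CREATE SET TABLE mytable_dedup AS (SELECT * FROM mytable) WITH NO DATA;

INSERT INTO mytable_dedup
SELECT * FROM mytable;            -- exact duplicate rows are dropped here

DELETE FROM mytable;

INSERT INTO mytable
SELECT * FROM mytable_dedup;

DROP TABLE mytable_dedup;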

Add Identity column to a view in SQL Server 2008

This is my view:
Create View [MyView] as
(
    Select col1, col2, col3 From Table1
    Union All
    Select col1, col2, col3 From Table2
)
I need to add a new column named Id, and this column must be unique, so I thought of adding the new column as an identity. I should mention that this view returns a large amount of data, so I need an approach with good performance. Also, since I use two SELECT queries with UNION ALL, I think this might be somewhat complicated. What is your suggestion?
Use the ROW_NUMBER() function in SQL Server 2008.
Create View [MyView] as
SELECT ROW_NUMBER() OVER (ORDER BY col1) AS id, col1, col2, col3
FROM (
    Select col1, col2, col3 From Table1
    Union All
    Select col1, col2, col3 From Table2 ) AS MyResults
GO
A view is just a stored query and does not contain the data itself, so you cannot attach a persistent identity column to it. If you need an id for other purposes, like paging for example, you can do something like this:
create view MyView as
(
    select row_number() over (order by col1) as ID, col1 from (
        Select col1 From Table1
        Union All
        Select col1 From Table2
    ) a
)
There is no guarantee that the rows returned by a query using ROW_NUMBER() will be ordered exactly the same with each execution unless the following conditions are true:
- Values of the partitioned column are unique. [partitions are parent-child, like a boss has 3 employees] [ignore]
- Values of the ORDER BY columns are unique. [if column 1 is unique, row_number should be stable; see the sketch after this list]
- Combinations of values of the partition column and ORDER BY columns are unique. [if you need 10 columns in your ORDER BY to get uniqueness, go for it to make row_number stable]
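For example, a hedged tweak of the view above, assuming the combination (col1, col2, col3) is unique across both tables, makes the numbering deterministic by using all three columns as the tiebreaking ORDER BY:
Create View [MyView] as
SELECT ROW_NUMBER() OVER (ORDER BY col1, col2, col3) AS id, col1, col2, col3
FROM (
    Select col1, col2, col3 From Table1
    Union All
    Select col1, col2, col3 From Table2 ) AS MyResults
GO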
There is a secondary issue here, with this being a view: ORDER BY doesn't always work in views (a long-standing SQL Server quirk). Ignoring the row_number() for a second:
create view MyView as
(
    select top 10000000 col1   -- or: top 99.9999999 percent
    from (
        Select col1 From Table1
        Union All
        Select col1 From Table2
    ) a order by col1
)
Using "row_number() over ( order by col1) as ID" is very expensive.
This way is much more efficient in cost:
Create View [MyView] as
(
    Select ID = isnull(cast(newid() as varchar(40)), '')
         , col1
         , col2
         , col3
    From Table1
    Union All
    Select ID = isnull(cast(newid() as varchar(40)), '')
         , col1
         , col2
         , col3
    From Table2
)
Use ROW_NUMBER() with ORDER BY (SELECT NULL); this will be less expensive and will still get you the result:
Create View [MyView] as
SELECT ROW_NUMBER() over (order by (select null)) as id, *
FROM (
    Select col1, col2, col3 From Table1
    Union All
    Select col1, col2, col3 From Table2 ) R
GO

How can I delete duplicate rows in a table

I have a table with, say, 3 columns. There's no primary key, so there can be duplicate rows. I need to keep just one and delete the others. Any idea how to do this in SQL Server?
I'd SELECT DISTINCT the rows and throw them into a temporary table, then empty the source table and copy the data back from the temp table.
EDIT: now with code snippet!
INSERT INTO TABLE_2
SELECT DISTINCT * FROM TABLE_1
GO
DELETE FROM TABLE_1
GO
INSERT INTO TABLE_1
SELECT * FROM TABLE_2
GO
Add an identity column to act as a surrogate primary key, and use this to identify two of the three rows to be deleted.
I would consider leaving the identity column in place afterwards, or if this is some kind of link table, create a compound primary key on the other columns.
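A minimal sketch of that idea (the table and column names MyTable, col1, col2, col3 are assumed), keeping the lowest surrogate id in each duplicate group:
-- Add a surrogate key, then delete every row that is not the first of its group.
ALTER TABLE MyTable ADD RowId INT IDENTITY(1,1);

DELETE FROM MyTable
WHERE RowId NOT IN (
    SELECT MIN(RowId)
    FROM MyTable
    GROUP BY col1, col2, col3
);

-- Optionally drop the column again: ALTER TABLE MyTable DROP COLUMN RowId;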
The following example also works when your PK is just a subset of all the table columns.
(Note: I like the approach of adding another surrogate id column more, but maybe this solution comes in handy as well.)
First find the duplicate rows:
SELECT col1, col2, count(*)
FROM t1
GROUP BY col1, col2
HAVING count(*) > 1
If there are only a few, you can delete them manually:
set rowcount 1
delete from t1
where col1=1 and col2=1
The value of "rowcount" should be n-1 times the number of duplicates. In this example there are 2 dulpicates, therefore rowcount is 1. If you get several duplicate rows, you have to do this for every unique primary key.
If you have many duplicates, then copy every key once into another table:
SELECT col1, col2, col3=count(*)
INTO holdkey
FROM t1
GROUP BY col1, col2
HAVING count(*) > 1
Then copy the rows with duplicated keys into another holding table, eliminating the duplicates in the process:
SELECT DISTINCT t1.*
INTO holddups
FROM t1, holdkey
WHERE t1.col1 = holdkey.col1
AND t1.col2 = holdkey.col2
The holddups table should now have unique keys. Check that the following query returns no group with a count greater than 1:
SELECT col1, col2, count(*)
FROM holddups
GROUP BY col1, col2
Delete the duplicates from the original table:
DELETE t1
FROM t1, holdkey
WHERE t1.col1 = holdkey.col1
AND t1.col2 = holdkey.col2
Put the unique rows back into the original table:
INSERT t1 SELECT * FROM holddups
By the way, and for completeness: in Oracle there is a hidden field you could use (rowid):
DELETE FROM our_table
WHERE rowid NOT IN
    (SELECT MIN(rowid)
     FROM our_table
     GROUP BY column1, column2, column3...);
see: Microsoft Knowledge Site
Here's the method I used when I asked this question -
DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
SELECT MIN(RowId) as RowId, Col1, Col2, Col3
FROM MyTable
GROUP BY Col1, Col2, Col3
) as KeepRows ON
MyTable.RowId = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL
This is a way to do it with Common Table Expressions, CTE. It involves no loops, no new columns or anything and won't cause any unwanted triggers to fire (due to deletes+inserts).
Inspired by this article.
CREATE TABLE #temp (i INT)
INSERT INTO #temp VALUES (1)
INSERT INTO #temp VALUES (1)
INSERT INTO #temp VALUES (2)
INSERT INTO #temp VALUES (3)
INSERT INTO #temp VALUES (3)
INSERT INTO #temp VALUES (4)
SELECT * FROM #temp
;
WITH [#temp+rowid] AS
(SELECT ROW_NUMBER() OVER (ORDER BY i ASC) AS ROWID, * FROM #temp)
DELETE FROM [#temp+rowid] WHERE rowid IN
(SELECT MIN(rowid) FROM [#temp+rowid] GROUP BY i HAVING COUNT(*) > 1)
SELECT * FROM #temp
DROP TABLE #temp
This is a tough situation to be in. Without knowing your particular situation (table size etc.), I think your best shot is to add an identity column, populate it, and then delete according to it. You may remove the column later, but I would suggest you keep it, as it is really a good thing to have in the table.
After you clean up the current mess, you could add a primary key that includes all the fields in the table. That will keep you from getting into the mess again.
Of course this solution could very well break existing code. That will have to be handled as well.
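If all the columns together really are unique after the cleanup, that safeguard might look like this (constraint and column names assumed; the columns must also be NOT NULL):
ALTER TABLE MyTable
    ADD CONSTRAINT PK_MyTable PRIMARY KEY (col1, col2, col3);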
Can you add a primary key identity field to the table?
Manrico Corazzi - I specialize in Oracle, not MS SQL, so you'll have to tell me if this is possible as a performance boost (a rough T-SQL sketch follows below):
1. Leave your first step the same: insert the distinct values into TABLE2 from TABLE1.
2. Drop TABLE1. (A drop should be faster than a delete, I assume, much as a truncate is faster than a delete.)
3. Rename TABLE2 to TABLE1 (this saves time, as you're renaming an object rather than copying data from one table to another).
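In T-SQL that plan might look roughly like this (object names assumed, not tested against your schema):
SELECT DISTINCT * INTO TABLE2 FROM TABLE1;   -- step 1: de-duplicated copy
DROP TABLE TABLE1;                           -- step 2: drop the original
EXEC sp_rename 'TABLE2', 'TABLE1';           -- step 3: rename the copy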
Here's another way, with test data
create table #table1 (colWithDupes1 int, colWithDupes2 int)

insert into #table1 (colWithDupes1, colWithDupes2)
Select 1, 2 union all
Select 1, 2 union all
Select 2, 2 union all
Select 3, 4 union all
Select 3, 4 union all
Select 3, 4 union all
Select 4, 2 union all
Select 4, 2

select * from #table1

set rowcount 1
select 1
while @@rowcount > 0
    delete #table1 where 1 < (select count(*) from #table1 a2
        where #table1.colWithDupes1 = a2.colWithDupes1
        and #table1.colWithDupes2 = a2.colWithDupes2
    )
set rowcount 0

select * from #table1
What about this solution?
First you execute the following query:
select 'set rowcount ' + convert(varchar,COUNT(*)-1) + ' delete from MyTable where field=''' + field +'''' + ' set rowcount 0' from mytable group by field having COUNT(*)>1
And then you just have to execute the returned result set
set rowcount 3 delete from Mytable where field='foo' set rowcount 0
....
....
set rowcount 5 delete from Mytable where field='bar' set rowcount 0
I've handled the case where you've got only one column, but it's pretty easy to adapt the same approach to more than one column. Let me know if you want me to post the code.
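For what it's worth, a hedged sketch of how the same generator might be extended to two character-typed columns (col1 and col2 are assumed names, not the original author's code):
select 'set rowcount ' + convert(varchar, count(*) - 1)
     + ' delete from MyTable where col1 = ''' + col1 + ''' and col2 = ''' + col2 + ''''
     + ' set rowcount 0'
from MyTable
group by col1, col2
having count(*) > 1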
How about:
select distinct * into #t from duplicates_tbl
truncate duplicates_tbl
insert duplicates_tbl select * from #t
drop table #t
I'm not sure if this works with DELETE statements, but this is a way to find duplicate rows:
SELECT *
FROM myTable t1, myTable t2
WHERE t1.field = t2.field AND t1.id > t2.id
I'm not sure if you can just change the "SELECT" to a "DELETE" (someone wanna let me know?), but even if you can't, you could just make it into a subquery.
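For the record, a hedged sketch of the subquery variant (assuming the id column is a unique key, as in the SELECT above):
DELETE FROM myTable
WHERE id IN (
    SELECT t1.id
    FROM myTable t1, myTable t2
    WHERE t1.field = t2.field
      AND t1.id > t2.id
);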