Most efficient way to find distinct records, retaining unique ID - sql

I have a large dataset stored in a SQL Server table, with a unique ID column and many attribute columns. I need to select the distinct attribute combinations, along with one of the unique IDs associated with each combination.
Example dataset:
ID|Col1|Col2|Col3...
1|big|blue|ball
2|big|red|ball
3|big|blue|ball
4|small|red|ball
Example Goal (2, 3, 4 would also have been acceptable):
ID|Col1|Col2|Col3...
1|big|blue|ball
2|big|red|ball
4|small|red|ball
I have tried a few different methods, but all of them seem to be taking very long (hours), so I was wondering if there was a more efficient approach. Failing this, my next idea is to partition the table.
I have tried:
Using Where exists, e.g.
SELECT * FROM Table AS T1
WHERE EXISTS (SELECT *
              FROM Table AS T2
              WHERE ISNULL(T1.ID, '') <> ISNULL(T2.ID, '')
                AND ISNULL([T1].[Col1], '') = ISNULL([T2].[Col1], '')
                AND ISNULL([T1].[Col2], '') = ISNULL([T2].[Col2], '')
             )
MAX(ID) and Group By Attributes.
GROUP BY Attributes, having count > 1.
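In sketch form, those last two attempts were along these lines (simplified to the example columns; the real queries list every attribute column, and Table stands in for the actual table name):
-- MAX(ID) with GROUP BY on the attributes
SELECT MAX(ID) AS ID, Col1, Col2, Col3
FROM Table
GROUP BY Col1, Col2, Col3;

-- GROUP BY on the attributes, HAVING COUNT(*) > 1 (finds the duplicated combinations)
SELECT Col1, Col2, Col3, COUNT(*) AS cnt
FROM Table
GROUP BY Col1, Col2, Col3
HAVING COUNT(*) > 1;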

How about just using group by?
select min(id), col1, col2, col3
from t
group by col1, col2, col3;
This will probably take a while. This might be more efficient:
select t.*
from t
where t.id = (select min(t2.id)
              from t t2
              where t.col1 = t2.col1 and t.col2 = t2.col2 and . . .
             );
This requires an index on t(col1, col2, col3, . . ., id). Given your request, that is on all columns.
In addition, this will not work for columns that are NULL. Some databases support the ANSI-standard IS NOT DISTINCT FROM for null-safe comparisons. If yours does, then it should use the index for this construct as well.
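For illustration, the supporting index and a null-safe variant of the correlated subquery might look like this (a sketch only: the index name is made up, only three columns are shown, and IS NOT DISTINCT FROM requires a database that supports it):
-- Covering index; extend the column list to every attribute column, with id last
create index ix_t_dedup on t (col1, col2, col3, id);

-- Null-safe variant, where IS NOT DISTINCT FROM is supported
select t.*
from t
where t.id = (select min(t2.id)
              from t t2
              where t.col1 is not distinct from t2.col1
                and t.col2 is not distinct from t2.col2
                and t.col3 is not distinct from t2.col3
             );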

SELECT Id, Col1, Col2, Col3
FROM (
    SELECT Id, Col1, Col2, Col3,
           ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3 ORDER BY ID, Col1, Col2, Col3) AS valid
    FROM Table AS T1
) t
WHERE valid = 1
Hope this helps...

Related

How to express "either the single resulting record or NULL", without an inner-query LIMIT?

Consider the following query:
SELECT (SELECT MIN(col1) FROM table1) = 7;
Assuming col1 is non-NULLable, this will yield either true or false - or possibly NULL when table1 is empty.
But now suppose I have:
SELECT (
    SELECT
        FIRST_VALUE(col2) OVER (
            ORDER BY col1
        ) AS col2_for_first_col1
    FROM table1
) = 7;
(and assume col2 is also non-NULLable for simplicity.)
If there is a unique col2 value for the lowest col1 value, or the table is empty, then this works just like before. But if there are multiple col2 values for the lowest col1, I'm going to get a query runtime error.
My question: What is a short, elegant way to get NULL from this last query also in the case of multiple inner-query results? I could of course duplicate it and check the count, but I would rather avoid that.
Important caveat: I'm using MonetDB, and it doesn't seem to support ORDER BY ... LIMIT 1 on inner queries.
Without the MonetDB limitation, you would seem to want:
SELECT (SELECT col2
        FROM table1
        ORDER BY col1
        LIMIT 1
       ) = 7;
With the limitation, you can use window functions differently:
SELECT (SELECT col2
        FROM (SELECT col2, ROW_NUMBER() OVER (ORDER BY col1) as seqnum
              FROM table1
             ) t
        WHERE seqnum = 1
       ) = 7;

SQL: How to combine count results from multiple tables into multiple columns

I have two tables that are the same in their structure, but all I really want out of them is a distinct count of a single column from each, returned as a multi-column result set.
I keep getting syntax errors, and so far I haven't been able to write something that delivers the data I want and makes it through the parser. I'm trying to figure out if this is a SQL problem (possible, given that I'm using a website's implementation rather than native MySQL/SQL/Oracle) or a me problem (much more likely).
So what I want is for two unrelated (and un-primary-keyed) tables to each return a COUNT(DISTINCT column) into a single result. I have tried a couple of different approaches:
select 1,2 FROM
(SELECT COUNT(DISTINCT col1) as 1 from table1),
(SELECT COUNT(DISTINCT col2) as 2 from table2)
SELECT *
FROM (
SELECT COUNT(DISTINCT col1) AS 1
FROM table1
)
CROSS JOIN (
SELECT COUNT(DISTINCT col2) AS 2
FROM table2
)
I have also messed around with 4-5 different uses of union || union all to no avail. Curious what thoughts someone more schooled in the arts of SQL might have. Thanks.
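For comparison, a UNION ALL version along these lines does parse on most platforms, but it returns the two counts as two rows rather than the single two-column row wanted here, which may be why that route was a dead end (sketch only):
SELECT 'table1' AS source_table, COUNT(DISTINCT col1) AS distinct_cnt FROM table1
UNION ALL
SELECT 'table2', COUNT(DISTINCT col2) FROM table2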
Depending on the RDBMS you are using, a few things could change in syntax. In ANSI SQL, your query should be:
select col1, col2 FROM
(SELECT COUNT(DISTINCT col1) as col1 from table1) as tab1,
(SELECT COUNT(DISTINCT col2) as col2 from table2) as tab2
You have to add an alias for every subquery. Also, name your columns with words, not numbers; it is easier to understand. (Though I don't recall whether a bare number is disallowed as an alias in ANSI SQL.)
Without derived tables in the FROM clause, you can write it like this:
-- For MySql, PostgreSql, SQL Server (not sure though)
select (SELECT COUNT(DISTINCT col1)
        from table1) as col1,
       (SELECT COUNT(DISTINCT col2)
        from table2) as col2

-- For Oracle
select (SELECT COUNT(DISTINCT col1)
        from table1) as col1,
       (SELECT COUNT(DISTINCT col2)
        from table2) as col2
from dual

-- For DB2
select (SELECT COUNT(DISTINCT col1)
        from table1) as col1,
       (SELECT COUNT(DISTINCT col2)
        from table2) as col2
from sysibm.sysdummy1
Side note: you can use a number as an alias if you surround it with double quotes (this is ANSI SQL and will work everywhere), like this:
select "1", "2" FROM
(SELECT COUNT(DISTINCT col1) as "1" from table1) a, --don't forget the table alias
(SELECT COUNT(DISTINCT col2) as "2" from table2) b
MySQL also allows you to use backticks:
select `1`, `2` FROM
(SELECT COUNT(DISTINCT col1) as `1` from table1) a,
(SELECT COUNT(DISTINCT col2) as `2` from table2) b

Efficiently duplicate some rows in PostgreSQL table

I have a PostgreSQL 9 database that uses auto-incrementing integers as primary keys. I want to duplicate some of the rows in a table (based on some filter criteria) while changing one or two values, i.e. copy all column values except for the ID (which is auto-generated) and possibly one other column.
However, I also want to get the mapping from old to new IDs. Is there a better way to do it than just querying for the rows to copy first and then inserting the new rows one at a time?
Essentially I want to do something like this:
INSERT INTO my_table (col1, col2, col3)
SELECT col1, 'new col2 value', col3
FROM my_table old
WHERE old.some_criteria = 'something'
RETURNING old.id, id;
However, this fails with ERROR: missing FROM-clause entry for table "old", and I can see why: Postgres must be doing the SELECT first and then inserting it, and the RETURNING clause only has access to the newly inserted rows.
RETURNING can only refer to the columns in the final, inserted row. You cannot refer to the "OLD" id this way unless there is a column in the table to hold both it and the new id.
Try running this which should work and will show all the possible values that you can get via RETURNING:
INSERT INTO my_table (col1, col2, col3)
SELECT col1, 'new col2 value', col3
FROM my_table AS old
WHERE old.some_criteria = 'something'
RETURNING *;
It won't get you the behavior you want, but should illustrate better how RETURNING is designed to work.
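For what it's worth, here is a minimal sketch of the "column in the table to hold both" workaround mentioned above, assuming you are free to alter the table (the column name old_id is made up):
-- Hypothetical extra column to carry the source row's id
ALTER TABLE my_table ADD COLUMN old_id integer;

INSERT INTO my_table (col1, col2, col3, old_id)
SELECT col1, 'new col2 value', col3, id
FROM my_table
WHERE some_criteria = 'something'
RETURNING id, old_id;  -- new id and old id, both taken from the inserted row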
This can be done with the help of data-modifying CTEs (Postgres 9.1+):
WITH sel AS (
   SELECT id, col1, col3
        , row_number() OVER (ORDER BY id) AS rn  -- order any way you like
   FROM   my_table
   WHERE  some_criteria = 'something'
   ORDER  BY id  -- match order of row_number()
   )
, ins AS (
   INSERT INTO my_table (col1, col2, col3)
   SELECT col1, 'new col2 value', col3
   FROM   sel
   ORDER  BY id  -- redundant, to be sure
   RETURNING id
   )
SELECT s.id AS old_id, i.id AS new_id
FROM  (SELECT id, row_number() OVER (ORDER BY id) AS rn FROM ins) i
JOIN   sel s USING (rn);
SQL Fiddle demonstration.
This relies on the undocumented implementation detail that rows from a SELECT are inserted in the order provided (and returned in the order provided). It works in all current versions of Postgres and is not going to break. Related:
Does Postgres preserve insertion order of records?
Window functions are not allowed in the RETURNING clause, so I apply row_number() in another subquery.
More explanation in this related later answer:
INSERT INTO ... FROM SELECT ... RETURNING id mappings
Good! I tested this code, but I changed (FROM my_table AS old) to (FROM my_table) and (WHERE old.some_criteria = 'something') to (WHERE some_criteria = 'something').
This is the final code that I use:
INSERT INTO my_table (col1, col2, col3)
SELECT col1, 'new col2 value', col3
FROM my_table
WHERE some_criteria = 'something'
RETURNING *;
Thanks!
-- Copy the row(s) to duplicate into a temporary table
DROP TABLE IF EXISTS tmptable;
CREATE TEMPORARY TABLE tmptable AS SELECT * FROM products WHERE id = 100;
-- Give the copy a new id (here simply max(id) + 1)
UPDATE tmptable SET id = sbq.id FROM (SELECT max(id) + 1 AS id FROM products) AS sbq;
-- Insert the modified copy back into the source table, then clean up
INSERT INTO products (SELECT * FROM tmptable);
DROP TABLE IF EXISTS tmptable;
Add another UPDATE before the INSERT to modify another field:
UPDATE tmptable SET another = 'data';
'old' is a reserved word, used by the rule rewrite system.
[ I presume this query fragment is not part of a rule; in that case you would have phrased the question differently ]

sql insert into table from select without duplicates (need more than a DISTINCT)

I am selecting multiple rows and inserting them into another table. I want to make sure that it doesn't already exists in the table I am inserting multiple rows into.
DISTINCT works when there are duplicate rows in the SELECT, but not when comparing against the data already in the table you're inserting into.
If I selected one row at a time I could do an IF EXISTS check, but since it's multiple rows (sometimes 10+) that doesn't seem workable.
INSERT INTO target_table (col1, col2, col3)
SELECT DISTINCT st.col1, st.col2, st.col3
FROM source_table st
WHERE NOT EXISTS (SELECT 1
                  FROM target_table t2
                  WHERE t2.col1 = st.col1
                    AND t2.col2 = st.col2
                    AND t2.col3 = st.col3)
If the distinct should only be on certain columns (e.g. col1, col2) but you need to insert all columns, you will probably need a derived table (ANSI SQL):
INSERT INTO target_table (col1, col2, col3)
SELECT st.col1, st.col2, st.col3
FROM (
    SELECT col1,
           col2,
           col3,
           row_number() over (partition by col1, col2 order by col1, col2) as rn
    FROM source_table
) st
WHERE st.rn = 1
  AND NOT EXISTS (SELECT 1
                  FROM target_table t2
                  WHERE t2.col1 = st.col1
                    AND t2.col2 = st.col2)
If you already have a unique index on whatever fields need to be unique in the destination table, you can just use INSERT IGNORE (here's the official documentation - the relevant bit is toward the end), and have MySQL throw away the duplicates for you.
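A sketch of that approach, assuming MySQL and that target_table already has a unique index covering the relevant columns (the index name uq_target is made up):
-- Prerequisite, if not already present:
-- ALTER TABLE target_table ADD UNIQUE KEY uq_target (col1, col2, col3);
INSERT IGNORE INTO target_table (col1, col2, col3)
SELECT DISTINCT st.col1, st.col2, st.col3
FROM source_table st;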
Hope this helps!
So you're looking to retrieve all unique rows from source table which do not already exist in target table?
SELECT DISTINCT * FROM source
WHERE primaryKey NOT IN (SELECT primaryKey FROM target)
That's assuming you have a primary key which you can base the uniqueness on... otherwise, you'll have to check each column for uniqueness.
Pseudo code for what might work:
insert into <target_table>
select col1 etc
from <source_table>
where <source_table>.keycol not in
      (select <target_table>.keycol from <target_table>)

sql delete rows with 1 column duplicated

I have a Microsoft SQL 2005 database table where the entire row is not duplicated, but one column is.
1 aaa
1 bbb
1 ccc
2 abc
2 def
How can I delete all but one of the rows that have the first column duplicated?
For clarification, I need to get rid of the second, third and fifth rows.
Try the following query in SQL Server 2005:
WITH T AS (SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rnum, * FROM dbo.Table_1)
DELETE FROM T WHERE rnum > 1
Let's call these the id and the Col1 columns.
DELETE FROM myTable
WHERE EXISTS
  (SELECT * FROM myTable T2
   WHERE T2.id = myTable.id AND T2.Col1 > myTable.Col1)
Edit: As pointed out by Andomar, the above doesn't get rid of exact duplicate cases, where both id and Col1 are the same in different rows.
These can be handled as follow:
(Note: whereas the above query is generic SQL, the following applies to MSSQL 2005 and above.)
It uses the Common Table Expression (CTE) feature, along with the ROW_NUMBER() function, to produce a distinctive row value. It is essentially the same construct as the one above, except that it now works against a "table" (CTEs behave mostly like tables) which has a truly distinct identifier key.
Note that by removing "AND T2.Col1 = T1.Col1" we get a query which can handle both types of duplicates (id-only duplicates, and duplicates on both id and Col1) in a single statement, as sketched after the query below, i.e. in a similar fashion to Hamadri's solution (the PARTITION in that CTE serves the same purpose as the subquery here; essentially the same amount of work is done). Depending on the situation it may be preferable, performance-wise or otherwise, to handle things in two steps.
WITH T AS
  (SELECT ROW_NUMBER() OVER (ORDER BY id, Col1) AS rn, id, Col1 FROM MyTable)
DELETE T1
FROM T AS T1
WHERE EXISTS
  (SELECT *
   FROM T AS T2
   WHERE T2.id = T1.id AND T2.Col1 = T1.Col1
     AND T2.rn > T1.rn
  )
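A sketch of the single-statement variant described above, with the Col1 comparison removed so that id-only duplicates and exact duplicates are handled together:
WITH T AS
  (SELECT ROW_NUMBER() OVER (ORDER BY id, Col1) AS rn, id, Col1 FROM MyTable)
DELETE T1
FROM T AS T1
WHERE EXISTS
  (SELECT *
   FROM T AS T2
   WHERE T2.id = T1.id   -- no Col1 comparison: keep only one row per id
     AND T2.rn > T1.rn
  )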
DELETE FROM tableName
WHERE col2 NOT IN (SELECT MIN(col2) FROM tableName AS t2 GROUP BY col1)
Make sure the sub select returns the rows you want to keep.
Try this.
DELETE FROM <TABLE_NAME_HERE> WHERE <SECOND_COLUMN_NAME_HERE> IN ('bbb', 'ccc', 'def');
SQL Server is not my native SQL database, but maybe something like this? The idea is to get the duplicates and delete the ones with the larger ROW_NUMBER. This should leave only the first one. I don't know if this is what you want or if it will work, but the logic seems sound.
WITH T AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY Col1 ORDER BY Col1) AS rn
    FROM MyTable
)
DELETE FROM T WHERE rn > 1
Please feel free to correct me if SQL Server can't handle that kind of treatment :)
--Another idea using ROW_NUMBER()
DELETE FROM MyTable
WHERE Id IN
(
    SELECT T.Id FROM
    (
        SELECT Id, ROW_NUMBER() OVER (PARTITION BY UniqueColumn ORDER BY Id) AS RowNumber
        FROM MyTable
    ) T
    WHERE T.RowNumber > 1
)