Check for complete duplicate rows in a large table


My original question with all the relevant context can be found here:
Adding a multi-column primary key to a table with 40 million records
I have a table with 40 million rows and no primary key. Before I add the primary key, I would like to check if the table has any duplicate entries. When I say duplicate entries, I don't just mean duplicate on particular columns. I mean duplicates on entire rows.
I was told in my last question that I can do an EXISTS query to determine duplicates. How would I do that?
I am running PostgreSQL 8.1.22. (Got this info by running select version()).

To find whether any full duplicate exists (identical on all columns), this is probably the fastest way:
SELECT EXISTS (
SELECT 1
FROM tbl t
NATURAL JOIN tbl t1
WHERE t.ctid <> t1.ctid
)
NATURAL JOIN is a very convenient shorthand for the case because (quoting the manual here):
NATURAL is shorthand for a USING list that mentions all columns in the
two tables that have the same names.
EXISTS is probably fastest, because Postgres stops searching as soon as the first duplicate is found. Since you most probably don't have an index covering the whole row and your table is huge, this will save you a lot of time.
Be aware that NULL is never considered identical to another NULL. If you have NULL values and consider them identical, you'd have to do more.
ctid is a system column that can be (ab-)used as ad-hoc primary key, but cannot replace an actual user-defined primary key in the long run.
The outdated version 8.1 seems to have no <> operator defined for a ctid. Try casting to text:
SELECT EXISTS (
SELECT 1
FROM tbl t
NATURAL JOIN tbl t1
WHERE t.ctid::text <> t1.ctid::text
)
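As a rough, runnable illustration of the EXISTS approach and of the NULL caveat, here is a minimal sketch that uses SQLite through Python's sqlite3 module instead of PostgreSQL. That substitution is an assumption made only so the example is self-contained: rowid stands in for ctid, and the table tbl with columns a and b is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical two-column table; the real table would have many more columns.
conn.execute("CREATE TABLE tbl (a INTEGER, b TEXT)")


def has_full_duplicate(conn):
    """Return True if any two distinct rows match on every column."""
    # rowid plays the role that ctid plays in the PostgreSQL query.
    sql = """
        SELECT EXISTS (
            SELECT 1
            FROM tbl t
            NATURAL JOIN tbl t1
            WHERE t.rowid <> t1.rowid
        )
    """
    return bool(conn.execute(sql).fetchone()[0])


conn.executemany("INSERT INTO tbl VALUES (?, ?)", [(1, "x"), (2, "y")])
no_dup = has_full_duplicate(conn)      # no duplicates yet

conn.execute("INSERT INTO tbl VALUES (1, 'x')")
with_dup = has_full_duplicate(conn)    # (1, 'x') now appears twice

# NULL caveat: two rows that look identical but contain NULL do not match,
# because the join compares with "=" and NULL = NULL is not true.
conn.execute("DELETE FROM tbl")
conn.executemany("INSERT INTO tbl VALUES (?, ?)", [(3, None), (3, None)])
null_dup = has_full_duplicate(conn)

print(no_dup, with_dup, null_dup)
```

The last call returns False even though both rows read (3, NULL), which is the "you'd have to do more" case mentioned above.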

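Shouldn't something like this do the job?
SELECT ALL_COLUMNS[except unique ID],
count(*) AS Dupl
FROM table
GROUP BY ALL_COLUMNS[except unique ID]
HAVING count(*) > 1;
I'm not sure it's the most efficient way, but a count greater than 1 means you have at least two identical rows. Note that the filter on the count has to go in HAVING, not WHERE, because it applies to the groups.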
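A minimal sketch of the grouping approach, again using SQLite via Python's sqlite3 as a stand-in engine (an assumption; the table tbl and its columns a and b are hypothetical). Unlike a join on equality, GROUP BY puts NULLs in the same group, so rows that differ only by "NULL versus NULL" are reported as duplicates here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (a INTEGER, b TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?)",
    [(1, "x"), (1, "x"), (2, "y"), (3, None), (3, None), (3, None)],
)

# List every column in both SELECT and GROUP BY; HAVING filters the groups.
duplicates = conn.execute(
    """
    SELECT a, b, COUNT(*) AS dupl
    FROM tbl
    GROUP BY a, b
    HAVING COUNT(*) > 1
    ORDER BY a
    """
).fetchall()

for row in duplicates:
    print(row)
```

This has to aggregate the whole table, so on 40 million rows it will likely be slower than an EXISTS check that can stop at the first hit, but it also tells you which rows are duplicated and how often.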

Related

redshift select distinct returns repeated values

I have a database where each object property is stored in a separate row. The attached query does not return distinct values in a redshift database but works as expected when testing in any mysql compatible database.
SELECT DISTINCT distinct_value
FROM
(
SELECT
uri,
( SELECT DISTINCT value_string
FROM `test_organization__app__testsegment` AS X
WHERE X.uri = parent.uri AND name = 'hasTestString' AND parent.value_string IS NOT NULL ) AS distinct_value
FROM `test_organization__app__testsegment` AS parent
WHERE
uri IN ( SELECT uri
FROM `test_organization__app__testsegment`
WHERE name = 'types' AND value_uri_multivalue = 'Document'
)
) AS T
WHERE distinct_value IS NOT NULL
ORDER BY distinct_value ASC
LIMIT 10000 OFFSET 0

This is not a bug and behavior is intentional, though not straightforward.
In Redshift, you can declare constraints on the tables but Redshift doesn't enforce them, i.e. it allows duplicate values if you insert them. The only difference here is that when you run SELECT DISTINCT query against a column that doesn't have a primary key declared it will scan the whole column and get unique values, and if you run the same on a column that has primary key constraint it will just return the output without performing unique list filtering. This is how you can get duplicate entries if you insert them.
Why is this done? Redshift is optimized for large datasets and it's much faster to copy data if you don't need to check constraint validity for every row that you copy or insert. If you want you can declare a primary key constraint as a part of your data model but you will need to explicitly support it by removing duplicates or designing ETL in a way there are no such.
More information with specific examples in this Heap blog post Redshift Pitfalls And How To Avoid Them

Perhaps you can solve this by using appropriate joins.
For example, I have duplicate values in table 1, and I want the values of table 1 by joining it to table 2 according to whatever join logic your conditions require.
So I can do something like this:
select distinct table1.col1 from table1 left outer join table2 on table1.col1 = table2.col1
This worked very well for me: I got unique values from table1 and could remove the duplicates.

What's the best way to union two queries with distinct values in a single column, prioritizing the first query?

Needs to be database-agnostic between Oracle and SQL server, although I wouldn't mind hearing SQL server-specific examples as well.
I'm sure the title isn't clear at all, so let me explain what I'm thinking. I'm thinking of two queries. The first might pull in a bunch of data from a given table, including primary keys. The second would just pull in every primary key and leave all other columns blank.
Then I'd want to union them together in such a way that whenever a primary key is missing in the first query, the row from the second query gets pulled in. Otherwise, if the primary key exists in the first query, the row from the second query is ignored.
Quick example:
First query pulls in two columns (first is primary key):
1 1
2 1
Second query pulls in :
1 NULL
2 NULL
3 NULL
So I would want the whole query to pull in:
1 1
2 1
3 NULL
What's the best way to pull this off, performance-wise? Consider an example where there might be a very large number of rows and columns, and the first query might be pretty performance-intensive (although the second of course should always be straightforward, just pulling in primary keys from a list and filling the rest of the columns out with either NULLs or static values).

It sounds to me that you want to use a FULL OUTER JOIN on the two tables or queries:
select
coalesce(q1.col1, q2.col1) col1,
coalesce(q1.col2, q2.col2) col2
from query1 q1
full outer join query2 q2
on q1.col1 = q2.col1;
See SQL Fiddle with Demo.
This will join the two queries on your primary key column (col1 in the sample query), then you can use COALESCE on the columns to return the first non-null value for col1, col2, etc.

You can't use a plain union, since SQL will consider 1, 2 and 1, NULL to be distinct.
Not knowing your schema, I would try the following in pseudocode:
select *
from query_1
union all
select PK, null
from query_2
where query_2.PK not in (select PK from query_1)
The second select pads the remaining columns with NULL so that both sides of the union have the same number of columns.
This will only return the primary keys in query_2 that are not in query_1 and get you a clean union where the query_1 results are prioritized over query_2 results. Selecting just the primary keys for the first query should be quick and easy, but if that isn't the case let me know and I can try to come up with a more complicated query given your schema.
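To make the prioritized union concrete, here is a small sketch run against the example data from the question. It uses SQLite through Python's sqlite3 purely for illustration (the question targets Oracle and SQL Server), and the table names query_1 and query_2 and the columns pk and val are hypothetical stand-ins for the two queries. It uses NOT EXISTS rather than NOT IN, which behaves the same here and is not affected by NULLs in the subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE query_1 (pk INTEGER PRIMARY KEY, val INTEGER);
    CREATE TABLE query_2 (pk INTEGER PRIMARY KEY, val INTEGER);
    INSERT INTO query_1 VALUES (1, 1), (2, 1);
    INSERT INTO query_2 VALUES (1, NULL), (2, NULL), (3, NULL);
    """
)

# Rows from query_1 win; query_2 only contributes keys missing from query_1.
rows = conn.execute(
    """
    SELECT pk, val FROM query_1
    UNION ALL
    SELECT q2.pk, q2.val
    FROM query_2 q2
    WHERE NOT EXISTS (SELECT 1 FROM query_1 q1 WHERE q1.pk = q2.pk)
    ORDER BY pk
    """
).fetchall()

for row in rows:
    print(row)
```

The output matches the desired result in the question: keys 1 and 2 come from the first query, and key 3 is filled in from the second.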

SQL: Remove rows whose associations are broken (orphaned data)

I have a table called "downloads" with two foreign key columns -- "user_id" and "item_id". I need to select all rows from that table and remove the rows where the User or the Item in question no longer exists. (Look up the User and if it's not found, delete the row in "downloads", then look up the Item and if it's not found, delete the row in "downloads").
It's 3.4 million rows, so all my scripted solutions have been taking 6+ hours. I'm hoping there's a faster, SQL-only way to do this?

use two anti joins and or them together:
delete from your_table
where user_id not in (select id from users_table)
or item_id not in (select id from items_table)
once that's done, consider adding two foreign keys, each with an on delete cascade clause. it'll do this for you automatically.
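A minimal sketch of the single-statement cleanup, run on toy data in SQLite via Python's sqlite3 (an assumption for the sake of a runnable example; the table names users, items, and downloads follow the question, and the sample ids are made up). It uses NOT EXISTS, which expresses the same anti-join as NOT IN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY);
    CREATE TABLE items (id INTEGER PRIMARY KEY);
    CREATE TABLE downloads (user_id INTEGER, item_id INTEGER);
    INSERT INTO users VALUES (1), (2);
    INSERT INTO items VALUES (10), (20);
    -- Only (1, 10) references both an existing user and an existing item.
    INSERT INTO downloads VALUES (1, 10), (2, 99), (7, 20), (8, 98);
    """
)

# Delete every download whose user or item no longer exists.
cur = conn.execute(
    """
    DELETE FROM downloads
    WHERE NOT EXISTS (SELECT 1 FROM users u WHERE u.id = downloads.user_id)
       OR NOT EXISTS (SELECT 1 FROM items i WHERE i.id = downloads.item_id)
    """
)
deleted = cur.rowcount
remaining = conn.execute("SELECT user_id, item_id FROM downloads").fetchall()

print(deleted, remaining)
```

The database does the lookups set-wise in one pass instead of issuing two queries per row from a script, which is where most of the six hours is likely going.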

delete from your_table where user_id not in (select id from users_table) or item_id not in (select id from items_table)

I think there is no faster solution when there are that many rows.
On your server that works out to about 157 rows per second.
Check the user id:
if mysql num rows = 0, then delete the row in downloads, and check the item_id the same way.
There was also a similar question about the performance of mysql num rows:
MySQL: Fastest way to count number of rows
Edit: I think the best approach is to create some triggers so the database server does the job for you.
For the initial cleanup I would currently use a cron job.

For future reference: for this kind of long operation, it is possible to optimise the server independently of the SQL. For example, detach the SQL service, defragment the system disk, and, if you can, ensure the SQL log files are on a separate disk drive from the one that holds the database.
This will at least reduce the pain of this kind of long operation.

I've found in SQL 2008 R2 that if the subquery of your "not in" clause returns a null value (perhaps from a table that has a nullable reference to this key), no records will be returned! To correct this, just add a "is not null" clause to each select in the union:
delete from SomeTable where Key not in (
select SomeTableKey from TableB where SomeTableKey is not null
union
select SomeTableKey from TableC where SomeTableKey is not null
)
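The NULL behaviour described above is standard three-valued logic rather than something specific to one product, so it can be reproduced in SQLite. The sketch below uses Python's sqlite3 and hypothetical tables some_table and table_b (both assumptions for illustration) to show the empty result and the effect of the IS NOT NULL filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE some_table (id INTEGER);
    CREATE TABLE table_b (some_table_key INTEGER);
    INSERT INTO some_table VALUES (1), (2), (3);
    INSERT INTO table_b VALUES (1), (NULL);
    """
)

# With a NULL in the subquery, "x NOT IN (...)" is never true:
# it is false for a match and unknown (NULL) for everything else.
naive = conn.execute(
    """
    SELECT id FROM some_table
    WHERE id NOT IN (SELECT some_table_key FROM table_b)
    """
).fetchall()

# Filtering the NULLs out of the subquery restores the expected anti-join.
filtered = conn.execute(
    """
    SELECT id FROM some_table
    WHERE id NOT IN (
        SELECT some_table_key FROM table_b
        WHERE some_table_key IS NOT NULL
    )
    ORDER BY id
    """
).fetchall()

print(naive, filtered)
```

Used in a DELETE, the naive form would silently delete nothing, which is easy to mistake for "there were no orphans".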

What is the most efficient way to count rows in a table in SQLite?

I've always just used "SELECT COUNT(1) FROM X" but perhaps this is not the most efficient. Any thoughts? Other options include SELECT COUNT(*) or perhaps getting the last inserted id if it is auto-incremented (and never deleted).
How about if I just want to know if there is anything in the table at all? (e.g., count > 0?)

The best way is to make sure that you run SELECT COUNT on a single column (SELECT COUNT(*) is slower) - but SELECT COUNT will always be the fastest way to get a count of things (the database optimizes the query internally).
If you check out the comments below, you can see arguments for why SELECT COUNT(1) is probably your best option.

To follow up on girasquid's answer, as a data point: I have a sqlite table with 2.3 million rows. Using select count(*) from table, it took over 3 seconds to count the rows. I also tried using SELECT rowid FROM table (thinking that rowid is a default primary indexed key), but that was no faster. Then I made an index on one of the fields in the database (just an arbitrary field, but I chose an integer field because I knew from past experience that indexes on short fields can be very fast, I think because a copy of the value is stored in the index itself). SELECT my_short_field FROM table brought the time down to less than a second.

If you are sure (really sure) that you've never deleted any row from that table and your table has not been defined with the WITHOUT ROWID optimization you can have the number of rows by calling:
select max(RowId) from table;
Or if your table is a circular queue you could use something like
select MaxRowId - MinRowId + 1 from
(select max(RowId) as MaxRowId from table) JOIN
(select min(RowId) as MinRowId from table);
This is really really fast (milliseconds), but you must pay attention because sqlite says that row id is unique among all rows in the same table. SQLite does not declare that the row ids are and will be always consecutive numbers.
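A short sketch of why the "really sure" condition matters, using Python's sqlite3 with a hypothetical table t (the table and its contents are made up for illustration). After a single delete, max(rowid) no longer equals the row count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v TEXT)")
conn.executemany(
    "INSERT INTO t (v) VALUES (?)", [("a",), ("b",), ("c",), ("d",), ("e",)]
)

# Delete one row from the middle; the remaining rowids are 1, 3, 4, 5.
conn.execute("DELETE FROM t WHERE rowid = 2")

actual = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
max_rowid = conn.execute("SELECT MAX(rowid) FROM t").fetchone()[0]

print(actual, max_rowid)
```

Here the shortcut reports 5 while the table holds 4 rows, so it is only usable as an exact count on tables that are strictly append-only.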

The fastest way to get row counts is directly from the table metadata, if any. Unfortunately, I can't find a reference for this kind of data being available in SQLite.
Failing that, any query of the type
SELECT COUNT(non-NULL constant value) FROM table
should optimize to avoid the need for a table, or even an index, scan. Ideally the engine will simply return the current number of rows known to be in the table from internal metadata. Failing that, it simply needs to know the number of entries in the index of any non-NULL column (the primary key index being the first place to look).
As soon as you introduce a column into the SELECT COUNT you are asking the engine to perform at least an index scan and possibly a table scan, and that will be slower.
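Independently of speed, the variants of COUNT do not all mean the same thing, and the "is there anything at all?" question does not need a count. A minimal sketch using Python's sqlite3 and a hypothetical table t with a nullable column note (both assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, note TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, None), (3, "c")]
)

# COUNT(*) and COUNT(1) count rows; COUNT(column) skips NULLs in that column.
count_star = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
count_one = conn.execute("SELECT COUNT(1) FROM t").fetchone()[0]
count_note = conn.execute("SELECT COUNT(note) FROM t").fetchone()[0]

# To test for "at least one row", EXISTS can stop after the first row found.
non_empty = bool(conn.execute("SELECT EXISTS (SELECT 1 FROM t)").fetchone()[0])

print(count_star, count_one, count_note, non_empty)
```

So counting a specific column is only a valid row count when that column is declared NOT NULL; which form is fastest on a given SQLite version is best checked by timing it.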

I do not believe you will find a special method for this. However, you could do your select count on the primary key to be a little bit faster.

sp_spaceused 'table_name'
This returns the number of rows in the given table, and it is the most efficient way I have come across yet. Note that sp_spaceused is a SQL Server system stored procedure, so it applies to SQL Server rather than SQLite.
It's more efficient than select Count(1) from table_name.
sp_spaceused can be used for any table. It's very helpful when the table is exceptionally big (hundreds of millions of rows): it returns the number of rows right away, whereas select Count(1) might take more than 10 seconds. Moreover, it does not need any column names or key fields to be specified.

Optimizing "ORDER BY" when the result set is very large and it can't be ordered by an index

How can I make an ORDER BY clause with a small LIMIT (ie 20 rows at a time) return quickly, when I can't use an index to satisfy the ordering of rows?
Let's say I would like to retrieve a certain number of titles from a table 'node' (simplified below). I'm using MySQL by the way.
node_ID INT(11) NOT NULL auto_increment,
node_title VARCHAR(127) NOT NULL,
node_lastupdated INT(11) NOT NULL,
node_created INT(11) NOT NULL
But I need to limit the rows returned to only those a particular user has access to. Many users have access to large numbers of nodes. I have this information pre-calculated in a big lookup table (an attempt to make things easier) where the primary key covers both columns and the presence of a row means that usergroup has access to that node:
viewpermission_nodeID INT(11) NOT NULL,
viewpermission_usergroupID INT(11) NOT NULL
My query therefore contains something like
FROM
node
INNER JOIN viewpermission ON
viewpermission_nodeID=node_ID
AND viewpermission_usergroupID IN (<...usergroups of current user...>)
... and I also use a GROUP BY or a DISTINCT so that a node is only returned once even if two of the user's 'usergroups' both have access to that node.
My problem is that there seems to be no way for an ORDER BY clause which sorts results by created or last updated date to use an index, because the rows being returned depend on values in the other viewpermission table.
Therefore MySQL would need to find all rows which match the criteria, then sort them all itself. If there are one million rows for a particular user, and we want to view, say, the latest 100 or rows 100-200 when ordered by last update, the DB would need to figure out which one million rows the user can see, sort this whole result set itself, before it can return those 100 rows, right?
Is there any creative way to get around this? I've been thinking along the lines of:
Somehow add dates into the viewpermission lookup table so that I can build an index containing the dates as well as the permissions. It's a possibility I guess.
Edit: Simplified question
Perhaps I can simplify the question by rewriting it like this:
Is there any way to rewrite this query or create an index for the following such that an index can be used to do the ordering (not just to select the rows)?
SELECT nodeid
FROM lookup
WHERE
usergroup IN (2, 3)
GROUP BY
nodeid
An index on (usergroup) allows the WHERE part to be satisfied by an index, but the GROUP BY forces a temporary table and filesort on those rows. An index on (nodeid) does nothing for me, because the WHERE clause needs an index with usergroup as its first column. An index on (usergroup, nodeid) forces a temporary table and filesort because the GROUP BY is not the first column of the index that can vary.
Any solutions?

Can I answer my own question?
I believe I have found that the only way to do what I describe is for my lookup table to have rows for every possible combination of usergroups a person may want to be a member of.
To pick a simplified example, instead of doing this:
SELECT id FROM ids WHERE groups IN(1,2) ORDER BY id
If you need to use the index both to select rows and to order them, you have to abstract that IN(1,2) so that it is constant rather than a range, ie:
SELECT id FROM ids WHERE grouplist='1,2' ORDER BY id
Of course instead of using the string '1,2' you could have a foreign key there, etc. The point being that you'd have to have a row not just for each group but for each combination of multiple groups.
So, there is my answer.
Anyway, for my application, I feel that maintaining a lookup for all possible combinations of usergroups for each node is not worth it. For my purposes, I predict that most nodes are visible to most users, so I feel that it is acceptable simply to make the GROUP BY use the index, as the filtering doesn't need it so badly.
In other words, the approach I'll take for my original query may be something like:
SELECT
<fields>
FROM
node FORCE INDEX (node_created_and_node_ID)
INNER JOIN viewpermission ON
viewpermission_nodeID=node_ID
AND viewpermission_usergroupID IN (<...usergroups of current user...>)
GROUP BY
node_created, node_ID
(In MySQL the index hint has to follow the name of the table it applies to, here node.)
GROUP BY can use an index if it starts at the left most column of the index and it is in the first non-const non-system table to be processed. The join then deals with the entire list (which is already ordered), and only those not visible to the current user (which will be a small proportion) are removed by the INNER JOIN.

Copy the value you are going to order by into the viewpermission table and add it to your index.
You could use a trigger to maintain that value from the other table.

select * from
(
select *
FROM node
INNER JOIN viewpermission
ON viewpermission_nodeID=node_ID
AND viewpermission_usergroupID IN (<...usergroups of current user...>)
) a
order by a.node_lastupdated desc
The inner query gives you the filtered subset, which I understand is substantially smaller than the whole set. Only the smaller has to be sorted.

MySQL has problems when you use GROUP BY and ORDER BY in the same query. That causes a filesort, and that's probably the biggest penalty for performance.
You can eliminate the need for a DISTINCT (or GROUP BY) by using a non-correlated subquery instead of a JOIN.
SELECT * FROM node
WHERE node_id IN (
SELECT viewpermission_nodeID
FROM viewpermission
WHERE viewpermission_usergroupID IN ( <...usergroups...> )
)
ORDER BY node_lastupdated DESC
LIMIT 100;
There's no need to sort or do a DISTINCT on the subquery, since IN (1, 1, 2, 3) is the same as IN (1, 3, 2).
Note that MySQL can use only one index per table in a given query, so it'll try to make the best choice between an index on node_id and an index on node_lastupdated. It can't use both, and even if you made a compound index it wouldn't help in this case.
Remember to analyze different solutions with EXPLAIN.
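The point about avoiding DISTINCT can be seen on a few rows of toy data. The sketch below uses SQLite via Python's sqlite3 rather than MySQL (an assumption, so it says nothing about MySQL's query plans); the column names follow the question, while the sample rows and the usergroup ids 2 and 3 are made up. Node 1 is visible to two of the user's groups, so the join returns it twice, while the IN subquery returns each node once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE node (
        node_ID INTEGER PRIMARY KEY,
        node_title TEXT NOT NULL,
        node_lastupdated INTEGER NOT NULL
    );
    CREATE TABLE viewpermission (
        viewpermission_nodeID INTEGER NOT NULL,
        viewpermission_usergroupID INTEGER NOT NULL,
        PRIMARY KEY (viewpermission_nodeID, viewpermission_usergroupID)
    );
    INSERT INTO node VALUES (1, 'first', 100), (2, 'second', 300), (3, 'third', 200);
    -- Node 1 is visible to groups 2 and 3; node 3 only to an unrelated group.
    INSERT INTO viewpermission VALUES (1, 2), (1, 3), (2, 3), (3, 9);
    """
)

# Plain join: one output row per matching permission row, so node 1 repeats.
joined = conn.execute(
    """
    SELECT node_ID FROM node
    INNER JOIN viewpermission
        ON viewpermission_nodeID = node_ID
       AND viewpermission_usergroupID IN (2, 3)
    ORDER BY node_lastupdated DESC
    """
).fetchall()

# IN subquery: each node is tested once, so no DISTINCT or GROUP BY is needed.
via_in = conn.execute(
    """
    SELECT node_ID FROM node
    WHERE node_ID IN (
        SELECT viewpermission_nodeID
        FROM viewpermission
        WHERE viewpermission_usergroupID IN (2, 3)
    )
    ORDER BY node_lastupdated DESC
    LIMIT 100
    """
).fetchall()

print(joined, via_in)
```

Whether the second form is actually faster depends on the engine and version, so it is still worth comparing both with EXPLAIN on the real data.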
