Fast way to eyeball possible duplicate rows in a table? - sql

Similar: How can I delete duplicate rows in a table
I have a feeling this is impossible and I'm going to have to do it the tedious way, but I'll see what you guys have to say.
I have a pretty big table, about 4 million rows, and 50-odd columns. It has a column that is supposed to be unique, Episode. Unfortunately, Episode is not unique - the logic behind this was that occasionally other fields in the row change, despite Episode being repeated. However, there is an actually unique column, Sequence.
I want to try and identify rows that have the same episode number, but something different between them (aside from sequence), so I can pick out how often this occurs, and whether it's worth allowing for or I should just nuke the rows and ignore possible mild discrepancies.
My hope is to create a table that shows the Episode number, and a column for each table column, identifying the value on both sides, where they are different:
SELECT Episode,
CASE WHEN a.Value1<>b.Value1
THEN a.Value1 + ',' + b.Value1
ELSE '' END AS Value1,
CASE WHEN a.Value2<>b.Value2
THEN a.Value2 + ',' + b.Value2
ELSE '' END AS Value2
FROM Table1 a INNER JOIN Table1 b ON a.Episode = b.Episode
WHERE a.Value1<>b.Value1
OR a.Value2<>b.Value2
(That is probably full of holes, but the idea of highlighting changed values comes through, I hope.)
Unfortunately, making a query like that for fifty columns is pretty painful. Obviously, it doesn't exactly have to be rock-solid if it will only be used the once, but at the same time, the more copy-pasta the code, the more likely something will be missed. As far as I know, I can't just do a search for DISTINCT, since Sequence is distinct and the same row will pop up as different.
Does anyone have a query or function that might help? Either something that will output a query result similar to the above, or a different solution? As I said, right now I'm not really looking to remove the duplicates, just identify them.

Use:
SELECT DISTINCT t.*
FROM TABLE t
ORDER BY t.episode --, and whatever other columns
DISTINCT is just shorthand for writing a GROUP BY with all the columns involved. Grouping by all the columns will show you all the unique groups of records associated with the episode column in this case. So there's a risk of not having an accurate count of duplicates, but you will have the values so you can decide what to remove when you get to that point.
50 columns is a lot, but setting the ORDER BY will allow you to eyeball the list. Another alternative would be to export the data to Excel if you don't want to construct the ORDER BY, and use Excel's sorting.
UPDATE
I didn't catch that the sequence column would be a unique value. In that case you'd have to provide a list of all the columns you want to see, i.e.:
SELECT DISTINCT t.episode, t.column1, t.column2 --etc.
FROM TABLE t
ORDER BY t.episode --, and whatever other columns
There's no notation that will let you use t.* but not this one column. Once the sequence column is omitted from the output, the duplicates will become apparent.
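To avoid typing the column list by hand, it can be built from the table definition. This is only a minimal sketch, assuming SQLite (via Python's sqlite3) rather than the asker's actual database; the table layout and sample rows are made up for illustration, and PRAGMA table_info is SQLite-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature of the asker's table: Sequence is unique, Episode is not.
conn.execute(
    "CREATE TABLE Table1 (Sequence INTEGER PRIMARY KEY, Episode INTEGER, Value1 TEXT, Value2 TEXT)"
)
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?, ?)",
    [
        (1, 100, "a", "x"),
        (2, 100, "a", "x"),  # exact repeat of row 1 apart from Sequence
        (3, 200, "b", "y"),
        (4, 200, "b", "z"),  # same Episode, different Value2
        (5, 300, "c", "w"),
    ],
)

# Column names come from the table definition; Sequence is left out.
cols = [row[1] for row in conn.execute("PRAGMA table_info(Table1)") if row[1] != "Sequence"]
col_list = ", ".join(cols)

query = f"SELECT DISTINCT {col_list} FROM Table1 ORDER BY Episode"
distinct_rows = conn.execute(query).fetchall()
for row in distinct_rows:
    print(row)
```

Episodes that still appear more than once in this output are the ones whose rows really differ in some other column.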

Instead of typing out all 50 columns, you could do this:
select column_name from information_schema.columns where table_name = 'your table name'
then paste them into a query that groups by all of the columns EXCEPT sequence, and filters by count > 1:
select
count(episode)
, col1
, col2
, col3
, ...
from YourTable
group by
col1
, col2
, col3
, ...
having count(episode) > 1
This should give you every combination of column values that occurs more than once (but neither the sequence nor the episode numbers themselves). Here's the rub: you will need to join this result set to YourTable on ALL the columns except sequence and episode, since you don't have those columns here.
Here's where I like to use SQL to generate more SQL. This should get you started:
select 't1.' + column_name + ' = t2.' + column_name
from information_schema.columns where table_name = 'YourTable'
You'll plug in those join parameters to this query:
select * from YourTable t1
inner join (
select
count(episode) 'epcount'
, col1
, col2
, col3
, ...
from YourTable
group by
col1
, col2
, col3
, ...
having count(episode) > 1
) t2 on
...plug in all those join parameters here...
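Put together, the generate-then-join idea might look like the sketch below. It assumes SQLite through Python's sqlite3, so the column list comes from PRAGMA table_info instead of information_schema; the sample table is hypothetical. The join uses SQLite's IS operator so that two NULLs compare as equal, which plain = would not do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data; only "sequence" is unique.
conn.execute(
    "CREATE TABLE YourTable (sequence INTEGER PRIMARY KEY, episode INTEGER, col1 TEXT, col2 TEXT)"
)
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?, ?)",
    [(1, 100, "a", "x"), (2, 100, "a", "x"), (3, 200, "b", "y"), (4, 200, "b", "z")],
)

# Every column except sequence takes part in the grouping and the join.
cols = [r[1] for r in conn.execute("PRAGMA table_info(YourTable)") if r[1] != "sequence"]
group_cols = ", ".join(cols)
join_cond = " AND ".join(f"t1.{c} IS t2.{c}" for c in cols)

query = f"""
SELECT t1.*, t2.epcount
FROM YourTable t1
INNER JOIN (
    SELECT COUNT(*) AS epcount, {group_cols}
    FROM YourTable
    GROUP BY {group_cols}
    HAVING COUNT(*) > 1
) t2 ON {join_cond}
ORDER BY t1.sequence
"""
exact_duplicates = conn.execute(query).fetchall()
for row in exact_duplicates:
    print(row)
```

Rows 3 and 4 share an episode but differ in col2, so they do not show up here; only rows that match on every non-sequence column are returned.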

select count distinct ....
Should show you without having to guess. You can get your columns by viewing your table definition so you can copy/paste your non-sequence columns.

I think something like this is what you want:
select *
from t
where t.episode in (select episode from t group by episode having count(episode) > 1)
order by episode
This will give all rows that have episodes that are duplicated. Non-duplicate rows should stick out fairly obviously.
Of course, if you have access to some sort of scripting, you could just write a script to generate your query for you. It seems pretty straight-forward. (i.e. describe t and iterate over all the fields).
Also, your query should have some sort of ordering, like FROM Table1 a INNER JOIN Table1 b ON a.Episode = b.Episode AND a.Sequence < b.Sequence, otherwise you'll get duplicate non-duplicates.
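A script along those lines could look roughly like this. It is a sketch only, assuming SQLite via Python's sqlite3 (so || replaces + for concatenation and PRAGMA table_info stands in for describe); the table contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature of the 50-column table.
conn.execute(
    "CREATE TABLE Table1 (Sequence INTEGER PRIMARY KEY, Episode INTEGER, Value1 TEXT, Value2 TEXT)"
)
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?, ?)",
    [(1, 100, "a", "x"), (2, 100, "a", "x"), (3, 200, "b", "y"), (4, 200, "b", "z")],
)

skip = {"Sequence", "Episode"}
cols = [r[1] for r in conn.execute("PRAGMA table_info(Table1)") if r[1] not in skip]

# One CASE expression per column; IS NOT also catches NULL versus non-NULL.
# If either side is NULL the concatenation itself comes back as NULL.
case_exprs = ",\n    ".join(
    f"CASE WHEN a.{c} IS NOT b.{c} THEN a.{c} || ',' || b.{c} ELSE '' END AS {c}"
    for c in cols
)
any_diff = " OR ".join(f"a.{c} IS NOT b.{c}" for c in cols)

query = f"""
SELECT a.Episode,
    {case_exprs}
FROM Table1 a
INNER JOIN Table1 b
    ON a.Episode = b.Episode AND a.Sequence < b.Sequence
WHERE {any_diff}
"""
diffs = conn.execute(query).fetchall()
print(query)
print(diffs)
```

Episode 100 is repeated but identical, so it is filtered out; episode 200 comes back with only the changed column filled in.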

A relatively simple solution that Ponies sparked:
SELECT t.*
FROM Table t
INNER JOIN ( SELECT episode
FROM Table
GROUP BY Episode
HAVING COUNT(*) > 1
) AS x ON t.episode = x.episode
And then, copy-paste into Excel, and use this as conditional highlighting for the entire result set:
=AND($C2=$C1,A2<>A1)
Column C is Episode. This way, you get a visual highlight when the data's different from the row above (as long as both rows have the same value for episode).
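If Excel isn't handy, the same compare-with-the-row-above check can be sketched in a few lines of Python. This assumes SQLite and an invented sample table; unlike the Excel formula it skips the Sequence column, since that one differs on every row anyway.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data.
conn.execute(
    "CREATE TABLE Table1 (Sequence INTEGER PRIMARY KEY, Episode INTEGER, Value1 TEXT, Value2 TEXT)"
)
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?, ?)",
    [(1, 100, "a", "x"), (2, 100, "a", "x"), (3, 200, "b", "y"), (4, 200, "b", "z"), (5, 300, "c", "w")],
)

cur = conn.execute(
    """
    SELECT t.*
    FROM Table1 t
    INNER JOIN (SELECT Episode FROM Table1 GROUP BY Episode HAVING COUNT(*) > 1) AS x
        ON t.Episode = x.Episode
    ORDER BY t.Episode, t.Sequence
    """
)
names = [d[0] for d in cur.description]
rows = cur.fetchall()

ep = names.index("Episode")
seq = names.index("Sequence")

# For each row, list the columns that differ from the row above within the same episode.
highlights = []
for prev, row in zip(rows, rows[1:]):
    if row[ep] != prev[ep]:
        continue  # first row of a new episode: nothing above to compare with
    changed = [names[i] for i in range(len(names)) if i != seq and row[i] != prev[i]]
    if changed:
        highlights.append((row[seq], changed))

print(highlights)
```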

Generate and store a hash key for each row, designed so the hash values mirror your
definition of sameness. Depending on the complexity of your rows, updating the
hash might be a simple trigger on modifying the row.
Query for duplicates of the hash key, which are your "very probably" identical rows.
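As a rough illustration of the hashing idea: the sketch below computes the hash in Python at query time instead of storing it through a trigger, which is a simplification of what is described above. It assumes "sameness" means every column except Sequence matches; the table and the row_hash helper are hypothetical.

```python
import hashlib
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
# Hypothetical sample data.
conn.execute(
    "CREATE TABLE Table1 (Sequence INTEGER PRIMARY KEY, Episode INTEGER, Value1 TEXT, Value2 TEXT)"
)
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?, ?)",
    [(1, 100, "a", "x"), (2, 100, "a", "x"), (3, 200, "b", "y"), (4, 200, "b", "z")],
)


def row_hash(values):
    """Hash the columns that define sameness (hypothetical helper)."""
    # repr() keeps None and '' distinct; the unit separator keeps fields apart.
    joined = "\x1f".join(repr(v) for v in values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


groups = defaultdict(list)
for seq, *rest in conn.execute("SELECT Sequence, Episode, Value1, Value2 FROM Table1"):
    groups[row_hash(rest)].append(seq)

# Sequences sharing a hash are "very probably" identical rows.
probable_duplicates = sorted(seqs for seqs in groups.values() if len(seqs) > 1)
print(probable_duplicates)
```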

Related

How to find duplicate rows in Hive?

I want to find duplicate rows in one of my Hive tables, and I was given two approaches.
The first approach is to use the following two queries:
select count(*) from mytable; -- this will give the total row count
The second query, below, gives the count of distinct rows:
select count(distinct primary_key1, primary_key2) from mytable;
With this approach, for one of my tables the total row count from the first query is 3500 and the second query gives 2700. So it tells us that 3500 - 2700 = 800 rows are duplicates. But this query doesn't tell me which rows are duplicated.
My second approach to find duplicate is:
select primary_key1, primary_key2, count(*)
from mytable
group by primary_key1, primary_key2
having count(*) > 1;
The above query should list the rows that are duplicated and how many times each one is duplicated, but it returns zero rows, which would mean there are no duplicate rows in that table.
So I would like to know:
Is my first approach correct? If yes, how do I find which rows are duplicated?
Why is the second approach not providing the list of rows that are duplicated?
Is there any other way to find the duplicates?
Hive does not validate primary and foreign key constraints.
Since these constraints are not validated, an upstream system needs to
ensure data integrity before it is loaded into Hive.
That means that Hive allows duplicates in Primary Keys.
To solve your issue, you should do something like this:
select [every column], count(*)
from mytable
group by [every column]
having count(*) > 1;
This way you will get list of duplicated rows.
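Here is a small runnable illustration of the difference between grouping by the key columns and grouping by every column. It is only a sketch using SQLite through Python's sqlite3, not Hive, and the payload column and sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with no enforced key, as in Hive.
conn.execute("CREATE TABLE mytable (primary_key1 INTEGER, primary_key2 TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(1, "a", "p"), (1, "a", "p"), (1, "a", "q"), (2, "b", "r")],
)

# Keys that occur more than once, whatever the other columns hold.
dup_keys = conn.execute(
    """
    SELECT primary_key1, primary_key2, COUNT(*)
    FROM mytable
    GROUP BY primary_key1, primary_key2
    HAVING COUNT(*) > 1
    """
).fetchall()

# Rows that are repeated in full, column for column.
dup_rows = conn.execute(
    """
    SELECT primary_key1, primary_key2, payload, COUNT(*)
    FROM mytable
    GROUP BY primary_key1, primary_key2, payload
    HAVING COUNT(*) > 1
    """
).fetchall()

print(dup_keys)
print(dup_rows)
```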
The analytic window function row_number() is quite useful and can identify duplicates based on the columns specified in the partition by clause. A simple in-line view and an exists clause will then pinpoint which sets of records in the original table contain these duplicates. In some databases (like Teradata) you can forgo the inline view by using a QUALIFY clause.
SQL1 & SQL2 can be combined. SQL2: If you want to deal with NULLs and not simply dismiss, then a coalesce and concatenation might be better in the
SELECT count(1) , count(distinct concat(coalesce(keypart1 ,''), '|', coalesce(keypart2 ,'')) )
FROM srcTable s
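To see why the NULL handling matters, here is a minimal sketch in SQLite (via Python's sqlite3), not Hive: SQLite concatenates with || where Hive would use concat(), and the sample rows and the '|' separator are my own assumptions. A key built from a NULL part is itself NULL, and count(distinct ...) silently skips it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical key columns, some of them NULL.
conn.execute("CREATE TABLE srcTable (keypart1 TEXT, keypart2 TEXT)")
conn.executemany(
    "INSERT INTO srcTable VALUES (?, ?)",
    [("a", "x"), ("a", "x"), (None, "y"), ("b", None)],
)

total, distinct_plain, distinct_coalesced = conn.execute(
    """
    SELECT count(1),
           count(DISTINCT keypart1 || '|' || keypart2),
           count(DISTINCT coalesce(keypart1, '') || '|' || coalesce(keypart2, ''))
    FROM srcTable s
    """
).fetchone()

print(total, distinct_plain, distinct_coalesced)
```

Subtracting the plain distinct count from the total would overstate the number of duplicates here, because the two rows with a NULL key part are not counted at all.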
3) Finds all records, not just the > 1 records. This provides all context data as well as the keys so it can be useful when analyzing why you have dups and not just the keys.
select * from srcTable s
where exists
( select 1 from (
SELECT
keypart1,
keypart2,
row_number() over( partition by keypart1, keypart2 ) seq
FROM srcTable t
WHERE 1 = 1
-- AND (whatever additional filtering you want)
) t
where seq > 1
AND t.keypart1 = s.keypart1
AND t.keypart2 = s.keypart2
)
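The same pattern can be tried out locally. This sketch runs it on SQLite through Python's sqlite3 instead of Hive, and assumes an SQLite build with window functions (3.25 or later); the id and payload columns and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical source table: id and payload are context columns.
conn.execute("CREATE TABLE srcTable (id INTEGER, keypart1 TEXT, keypart2 TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO srcTable VALUES (?, ?, ?, ?)",
    [(1, "a", "x", "first"), (2, "a", "x", "second"), (3, "b", "y", "only")],
)

dup_rows = conn.execute(
    """
    SELECT * FROM srcTable s
    WHERE EXISTS (
        SELECT 1 FROM (
            SELECT keypart1,
                   keypart2,
                   row_number() OVER (PARTITION BY keypart1, keypart2) AS seq
            FROM srcTable
        ) t
        WHERE t.seq > 1
          AND t.keypart1 = s.keypart1
          AND t.keypart2 = s.keypart2
    )
    ORDER BY s.id
    """
).fetchall()

for row in dup_rows:
    print(row)
```

Both rows for the repeated key come back with all their columns, which is the context the plain GROUP BY version does not give you.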
Suppose you want to get duplicate rows based on a particular column, ID here. The query below will give you all the IDs that are duplicated in the Hive table.
SELECT ID
FROM TABLE
GROUP BY ID
HAVING count(ID) > 1

Turning multiple rows into single row based on ID, and keeping null values

I have tried some of the various solutions posted on Stack for this issue but none of them keep null values (and it seems like the entire query is built off that assumption).
I have a table with 1 million rows. There are 10 columns. The first column is the id. Each id is unique to "item" (in my case a sales order) but has multiple rows. Each row is either completely null or has a single value in one of the columns. No two rows with the same ID have data for the same column. I need to merge these multiple rows into a single row based on the ID. However, I need to keep the null values. If the first column is null in all rows I need to keep that in the final data.
Can someone please help me with this query? I've been stuck on it for 2 hours now.
id  Age   firstname  lastname
1   13    null       null
1   null  chris      null
should output
1   13    chris      null
It sounds like you want an aggregation query:
select id, max(col1) as col1, max(col2) as col2, . . .
from t
group by id;
If all values are NULL, then this will produce NULL. If one of the rows (for an id) has a value, then this will produce that value.
select id, max(col1), max(col2).. etc
from mytable
group by id
As some others have mentioned, you should use an aggregation query to achieve this.
select t1.id, max(t1.col1), max(t1.col2)
from tableone t1
group by t1.id
This should return nulls. If you're having issues handling your nulls, maybe implement some logic using ISNULL(). Make sure your data fields really are nulls and not empty strings.
If nulls aren't being returned, check to make sure that EVERY single row that has a particular ID has ONLY nulls. If one of them returns an empty string, then yes, it will drop the null and return anything else over the null.
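A quick way to convince yourself of both points (NULLs survive when every row is NULL, and an empty string wins over NULL) is a small test like the one below. It is a sketch on SQLite via Python's sqlite3 with invented sample rows; id 2 deliberately carries an empty string in firstname.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, age INTEGER, firstname TEXT, lastname TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (1, 13, None, None),
        (1, None, "chris", None),
        (2, None, None, None),
        (2, None, "", None),  # empty string, not NULL
    ],
)

merged = conn.execute(
    """
    SELECT id, max(age) AS age, max(firstname) AS firstname, max(lastname) AS lastname
    FROM t
    GROUP BY id
    ORDER BY id
    """
).fetchall()

for row in merged:
    print(row)
```

lastname stays NULL (None in Python) for both ids because no row supplies a value, while the empty string for id 2 replaces the NULL in firstname.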

SQL to find why PK candidate has duplicates on unkeyed table

If my title hurts your head... I'm with you. I don't want to get into why this table exists except that it is part of a legacy system, also the system does "record level access"(RLA) and this I know will be an issue for many tables, anyways the RLA is mentioned because adding a column will change the table format and then many very old programs will no longer work...
Apparently adding a PK has been shown not to change the table format. So I've been told that a certain set of keys is guaranteed to be unique. Well, what do you know... it isn't. And now I need to show where they aren't.
All I can think of is:
Get the cross product where the table matches on its primary key.
Somehow get a count column onto the result set for the number of entries where the PK matches itself.
Filter that result set for values where the count is greater than 2.
I'm going to see whether, if I expand the PK sufficiently, I'll actually find something unique.
Remove the constraints / unique indexes, insert the data, and then run this query:
SELECT col1, col2, ..., coln, COUNT(*)
FROM your_table
GROUP BY col1, col2, ..., coln
HAVING COUNT(*) > 1
where col1, col2, ..., coln is the list of columns in your key (one or more columns). The result will be the list of keys that occur more than once together with a count showing how often they occur.
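Since the asker also wants to try wider key candidates, the query can be wrapped in a small helper and called with different column lists. This is a sketch on SQLite via Python's sqlite3; find_key_violations is a hypothetical name, the sample data is invented, and the identifiers are interpolated straight into the SQL, so it should only be fed trusted column names.

```python
import sqlite3


def find_key_violations(conn, table, key_cols):
    """Return (key values..., count) for every key that occurs more than once."""
    cols = ", ".join(key_cols)
    sql = f"SELECT {cols}, COUNT(*) FROM {table} GROUP BY {cols} HAVING COUNT(*) > 1"
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
# Hypothetical unkeyed legacy table.
conn.execute("CREATE TABLE your_table (col1 INTEGER, col2 TEXT, col3 TEXT)")
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?, ?)",
    [(1, "a", "x"), (1, "a", "y"), (2, "b", "x")],
)

narrow = find_key_violations(conn, "your_table", ["col1", "col2"])
wide = find_key_violations(conn, "your_table", ["col1", "col2", "col3"])
print(narrow)
print(wide)
```

An empty result for the wider list means that combination is unique in the current data, which is not the same as it being guaranteed unique in the future.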
select col1, ... from tab group by col1, ... having count(*)>1;
SELECT * FROM (SELECT ID, COUNT(*) CNT FROM MY_TABLE GROUP BY ID) WHERE CNT > 1

Most efficient way to select 1st and last element, SQLite?

What is the most efficient way to select the first and last element only, from a column in SQLite?
The first and last element from a row?
SELECT column1, columnN
FROM mytable;
I think you must mean the first and last element from a column:
SELECT MIN(column1) AS First,
MAX(column1) AS Last
FROM mytable;
See http://www.sqlite.org/lang_aggfunc.html for MIN() and MAX().
I'm using First and Last as column aliases.
if it's just one column:
SELECT min(column) as first, max(column) as last FROM table
if you want to select whole row:
SELECT * FROM (SELECT 'first',* FROM table ORDER BY column ASC LIMIT 1)
UNION ALL
SELECT * FROM (SELECT 'last',* FROM table ORDER BY column DESC LIMIT 1)
The most efficient way would be to know what those fields were called and simply select them.
SELECT `first_field`, `last_field` FROM `table`;
Probably like this:
SELECT dbo.Table.FirstCol, dbo.Table.LastCol FROM Table
You get minor efficiency enhancements from specifying the table name and schema.
First: MIN() and MAX() on a text column gives AAAA and TTTT results which are not the first and last entries in my test table. They are the minimum and maximum values as mentioned.
I tried this (with .stats on) on my table which has over 94 million records:
select * from
(select col1 from mitable limit 1)
union
select * from
(select col1 from mitable limit 1 offset
(select count(0) from mitable) -1);
But it uses up a lot of virtual machine steps (281,624,718).
Then this which is much more straightforward (which works if the table was created without WITHOUT ROWID) [sql keywords are in capitals]:
SELECT col1 FROM mitable
WHERE ROWID = (SELECT MIN(ROWID) FROM mitable)
OR ROWID = (SELECT MAX(ROWID) FROM mitable);
That ran with 55 virtual machine steps on the same table and produced the same answer.
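The difference between the two approaches is easy to reproduce on a tiny table. This sketch uses Python's sqlite3 with made-up values inserted in a deliberately unsorted order; it assumes an ordinary rowid table with no deletions, so rowids follow insertion order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mitable (col1 TEXT)")
# Insertion order is MMMM, AAAA, TTTT, CCCC.
conn.executemany("INSERT INTO mitable VALUES (?)", [("MMMM",), ("AAAA",), ("TTTT",), ("CCCC",)])

# Smallest and largest values, regardless of position.
min_max = conn.execute("SELECT MIN(col1), MAX(col1) FROM mitable").fetchone()

# First and last rows by rowid.
first_last = [
    r[0]
    for r in conn.execute(
        """
        SELECT col1 FROM mitable
        WHERE ROWID = (SELECT MIN(ROWID) FROM mitable)
           OR ROWID = (SELECT MAX(ROWID) FROM mitable)
        ORDER BY ROWID
        """
    )
]

print(min_max)
print(first_last)
```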
The min()/max() approach is wrong. It is only correct if the values are strictly ascending. I needed something like this for currency rates, which rise and fall randomly.
This is my solution:
select st.*
from stats_ticker st,
(
select min(rowid) as first, max(rowid) as last --here is magic part 1
from stats_ticker
-- next line is just a filter I need in my case.
-- if you want first/last of the whole table leave it out.
where timeutc between datetime('now', '-1 days') and datetime('now')
) firstlast
WHERE
st.rowid = firstlast.first --and these two rows do magic part 2
OR st.rowid = firstlast.last
ORDER BY st.rowid;
magic part 1: the subselect results in a single row with the columns first,last containing rowid's.
magic part 2 easy to filter on those two rowid's.
This is the best solution I've come up with so far. Hope you like it.
We can do that with the SQL aggregate functions MAX and MIN. These two aggregate functions help you get the last and first elements from a table.
SELECT MAX(column_name), MIN(column_name) FROM table_name
MAX gives you the largest value and MIN the smallest, so they correspond to the last and first values only when the column is stored in ascending order.

Returning more than one value from a sql statement

I was looking at sql inner queries (bit like the sql equivalent of a C# anon method), and was wondering, can I return more than one value from a query?
For example, return the number of rows in a table as one output value, and also, as another output value, return the distinct number of rows?
Also, how does distinct work? Is this based on whether one field may be the same as another (thus classified as "distinct")?
I am using Sql Server 2005. Would there be a performance penalty if I return one value from one query, rather than two from one query?
Thanks
You could do your first question by doing this:
SELECT
COUNT(field1),
COUNT(DISTINCT field2)
FROM table
(For the first field you could do * if needed to count null values.)
Distinct means the definition of the word. It eliminates duplicate returned rows.
Returning 2 values instead of 1 would depend on what the values were, if they were indexed or not and other undetermined possible variables.
If you are meaning subqueries within the select statement, no you can only return 1 value. If you want more than 1 value you will have to use the subquery as a join.
If the inner query is inline in the SELECT, you may struggle to select multiple values. However, it is often possible to JOIN to a sub-query instead; that way, the sub-query can be named and you can get multiple results
SELECT a.Foo, a.Bar, x.[Count], x.[Avg]
FROM a
INNER JOIN (SELECT Something, COUNT(1) AS [Count], AVG(SomeValue) AS [Avg]
FROM b -- b and SomeValue are placeholder names
GROUP BY Something) x
ON x.Something = a.Something
Which might help.
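For a version that actually runs, here is a minimal sketch of the join-to-a-subquery idea. It uses SQLite through Python's sqlite3 rather than SQL Server 2005, and the second table b, its SomeValue column, the aliases and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (Foo TEXT, Bar TEXT, Something INTEGER)")
conn.execute("CREATE TABLE b (Something INTEGER, SomeValue REAL)")
conn.executemany("INSERT INTO a VALUES (?, ?, ?)", [("f1", "b1", 1), ("f2", "b2", 2)])
conn.executemany("INSERT INTO b VALUES (?, ?)", [(1, 10.0), (1, 20.0), (2, 5.0)])

# The derived table x returns two computed values per key, something a
# scalar subquery in the SELECT list could not do in one go.
rows = conn.execute(
    """
    SELECT a.Foo, a.Bar, x.cnt, x.avg_value
    FROM a
    INNER JOIN (
        SELECT Something, COUNT(*) AS cnt, AVG(SomeValue) AS avg_value
        FROM b
        GROUP BY Something
    ) x ON x.Something = a.Something
    ORDER BY a.Foo
    """
).fetchall()

for row in rows:
    print(row)
```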
DISTINCT does what it says. IIRC, you can SELECT COUNT(DISTINCT Foo) etc to query distinct data.
you can return multiple results in 3 ways (off the top of my head)
By having a select with multiple values eg: select col1, col2, col3
With multiple queries eg: select 1 ; select '2' ; select colA. You would get to them in a datareader by calling .NextResult()
Using output parameters: declare the parameters before executing the query, then get the values from them afterwards. eg: set @param1 = '2' . string myparam2 = sqlcommand.Parameters["@param1"].Value.ToString()
Distinct, filters resulting rows to be unique.
Inner queries in the form:
SELECT * FROM tbl WHERE fld in (SELECT fld2 FROM tbl2 WHERE tbl.fld = tbl2.fld2)
cannot return multiple columns. When you need multiple columns from a secondary query, you usually need to do an inner join on the other query.
rows:
SELECT count(*), (SELECT count(*) FROM (SELECT DISTINCT * FROM table) d) FROM table
will return a dataset with one row containing two columns. Column 1 is the total number of rows in the table. Column 2 counts only distinct rows.
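As a runnable illustration of getting several counts back in one row: the sketch below uses SQLite through Python's sqlite3, not SQL Server 2005, with an invented table. Because a DISTINCT over all columns inside COUNT is not accepted by every engine, the distinct-row count here goes through a SELECT DISTINCT * subquery instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (a TEXT, b TEXT, c TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [("a1", "b1", "c1"), ("a1", "b1", "c1"), ("a1", "b1", "c2"), ("a1", "b2", "c2")],
)

total_rows, distinct_rows, distinct_a = conn.execute(
    """
    SELECT COUNT(*),
           (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM tbl)),
           COUNT(DISTINCT a)
    FROM tbl
    """
).fetchone()

print(total_rows, distinct_rows, distinct_a)
```

One query, one result row, three values: the total row count, the number of distinct rows, and the number of distinct values in a single column.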
Distinct means the returned dataset will not have any duplicate rows. Distinct can only appear once usually directly after the select. Thus a query such as:
SELECT distinct a, b, c FROM table
might have this result:
a1 b1 c1
a1 b1 c2
a1 b2 c2
a1 b3 c2
Note that values are duplicated across the whole result set but each row is unique.
I'm not sure what your last question means. You should return from a query all the data relevant to the query. As for faster, only benchmarking can tell you which approach is faster.