In SQL, find duplicates in one column with unique values for another column - sql

So I have a table of aliases linked to record ids. I need to find duplicate aliases with unique record ids. To explain better:
ID | Alias  | Record ID
1  | 000123 | 4
2  | 000123 | 4
3  | 000234 | 4
4  | 000123 | 6
5  | 000345 | 6
6  | 000345 | 7
The result of a query on this table should be something to the effect of
000123 4 6
000345 6 7
Indicating that both record 4 and 6 have an alias of 000123 and both record 6 and 7 have an alias of 000345.
I looked into using GROUP BY, but if I group by alias I can't select the record id, and if I group by both alias and record id, I only get back the first two rows in this example, where both columns are duplicated. The only solution I've found, and it's a terrible one that crashed my server, is to select all the data twice and then join the two result sets
ON [T_1].[ALIAS] = [T_2].[ALIAS] AND NOT [T_1].[RECORD_ID] = [T_2].[RECORD_ID]
Are there any solutions out there that would work better? As in, not crash my server when run on a few hundred thousand records?

It looks as if you have two requirements:
Identify all aliases that have more than one record id, and
List the record ids for these aliases horizontally.
The first is a lot easier to do than the second. Here's some SQL that ought to get you where you want with the first:
WITH A -- Get a list of unique combinations of Alias and [Record ID]
AS (
SELECT Distinct
Alias
, [Record ID]
FROM T1
)
, B -- Get a list of all those Alias values that have more than one [Record ID] associated
AS (
SELECT Alias
FROM A
GROUP BY
Alias
HAVING COUNT(*) > 1
)
SELECT A.Alias
, A.[Record ID]
FROM A
JOIN B
ON A.Alias = B.Alias
Now, as for the second. If you're satisfied with the data in this form:
Alias Record ID
000123 4
000123 6
000345 6
000345 7
... you can stop there. Otherwise, things get tricky.
The PIVOT command will not necessarily help you, because it's trying to solve a different problem than the one you have.
I am assuming that you can't necessarily predict how many duplicate Record ID values you have per Alias, and thus don't know how many columns you'll need.
If you have only two, then displaying each of them in a column becomes a relatively trivial exercise. If you have more, I'd urge you to consider whether the destination for these records (a report? A web page? Excel?) might be able to do a better job of displaying them horizontally than SQL Server can do in returning them arranged horizontally.
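As a rough sketch of that last suggestion, the snippet below runs the query above against the sample data and lets the consuming code build the horizontal layout. It uses Python's sqlite3 module purely as an assumed stand-in for the real database and client; the table name T1 and the column names come from the query above.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T1 (ID INTEGER PRIMARY KEY, Alias TEXT, [Record ID] INTEGER)")
conn.executemany(
    "INSERT INTO T1 (Alias, [Record ID]) VALUES (?, ?)",
    [("000123", 4), ("000123", 4), ("000234", 4),
     ("000123", 6), ("000345", 6), ("000345", 7)],
)

# Vertical result: distinct (Alias, Record ID) pairs for aliases
# that are attached to more than one record id.
rows = conn.execute("""
    WITH A AS (SELECT DISTINCT Alias, [Record ID] FROM T1),
         B AS (SELECT Alias FROM A GROUP BY Alias HAVING COUNT(*) > 1)
    SELECT A.Alias, A.[Record ID]
    FROM A JOIN B ON A.Alias = B.Alias
    ORDER BY A.Alias, A.[Record ID]
""").fetchall()

# The consumer builds the horizontal layout, so the number of
# record ids per alias does not have to be known in advance.
by_alias = defaultdict(list)
for alias, record_id in rows:
    by_alias[alias].append(record_id)

for alias, record_ids in by_alias.items():
    print(alias, *record_ids)
# 000123 4 6
# 000345 6 7
```

Because the grouping happens outside SQL, an alias with three or more record ids simply produces a longer line.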

Perhaps what you want is just the min() and max() of RecordId:
select Alias, min(RecordID), max(RecordId)
from yourTable t
group by Alias
having min(RecordId) <> max(RecordId)
You can also count the number of distinct values, using count(distinct):
select Alias, count(distinct RecordId) as NumRecordIds, min(RecordID), max(RecordId)
from yourTable t
group by Alias
having count(DISTINCT RecordID) > 1;
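To check the count(distinct) version against the sample data from the question, here is a small sketch using Python's sqlite3 module (an assumption for illustration; the query itself is plain SQL and the names yourTable and RecordId are the placeholders used above).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (Alias TEXT, RecordId INTEGER)")
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?)",
    [("000123", 4), ("000123", 4), ("000234", 4),
     ("000123", 6), ("000345", 6), ("000345", 7)],
)

# One row per alias that is linked to more than one distinct record id.
result = conn.execute("""
    SELECT Alias,
           COUNT(DISTINCT RecordId) AS NumRecordIds,
           MIN(RecordId),
           MAX(RecordId)
    FROM yourTable
    GROUP BY Alias
    HAVING COUNT(DISTINCT RecordId) > 1
    ORDER BY Alias
""").fetchall()

print(result)  # [('000123', 2, 4, 6), ('000345', 2, 6, 7)]
```

Note that min() and max() only show the two extremes, so an alias with three or more record ids would still need the vertical listing to see all of them.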

This will give all aliases where some record id is repeated:
select Alias, count(RecordId) as NumRecordIds
from yourTable t
group by Alias
having count(RecordId) <> count(distinct RecordId);
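It may help to see what this query returns on the question's sample data, since it answers a slightly different question than the one asked. The sketch below assumes Python's sqlite3 module as a test bed and omits the stray comma after NumRecordIds in the select list, which is a syntax error as written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (Alias TEXT, RecordId INTEGER)")
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?)",
    [("000123", 4), ("000123", 4), ("000234", 4),
     ("000123", 6), ("000345", 6), ("000345", 7)],
)

# Aliases where the row count differs from the distinct record id count,
# i.e. at least one (Alias, RecordId) pair appears more than once.
repeated = conn.execute("""
    SELECT Alias, COUNT(RecordId) AS NumRecordIds
    FROM yourTable
    GROUP BY Alias
    HAVING COUNT(RecordId) <> COUNT(DISTINCT RecordId)
    ORDER BY Alias
""").fetchall()

print(repeated)  # [('000123', 3)]
```

Alias 000345 is not returned here even though it belongs to records 6 and 7, because none of its pairs is repeated.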

I agree with Ann L's answer, but would like to show how you can use window functions with CTEs, as you may prefer the readability.
(Re: how to pivot horizontally, I again agree with Ann)
create temporary table things (
id serial primary key,
alias varchar,
record_id int
);
insert into things (alias, record_id) values
('000123', 4),
('000123', 4),
('000234', 4),
('000123', 6),
('000345', 6),
('000345', 7);
with
things_with_distinct_aliases_and_record_ids as (
select distinct on (alias, record_id)
id,
alias,
record_id
from things
),
things_with_unique_record_id_counts_per_alias as (
select *,
COUNT(*) OVER(PARTITION BY alias) as unique_record_ids_count
from things_with_distinct_aliases_and_record_ids
)
select * from things_with_unique_record_id_counts_per_alias
where unique_record_ids_count > 1
The first CTE gets all the unique alias/record id combinations. E.g.
id | alias | record_id
----+--------+-----------
1 | 000123 | 4
4 | 000123 | 6
3 | 000234 | 4
5 | 000345 | 6
6 | 000345 | 7
The second CTE simply creates a new column for the above and adds the count of record ids for each alias. This allows you to filter only those aliases which have more than one record id associated with them.
id | alias | record_id | unique_record_ids_count
----+--------+-----------+-------------------------
1 | 000123 | 4 | 2
4 | 000123 | 6 | 2
3 | 000234 | 4 | 1
5 | 000345 | 6 | 2
6 | 000345 | 7 | 2
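The DISTINCT ON syntax above is PostgreSQL-specific. As a hedged sketch of the same idea in a more portable form, the snippet below uses plain SELECT DISTINCT (dropping the id column) and runs it through Python's sqlite3 module. It assumes the bundled SQLite is version 3.25 or later, which is when window functions became available there; the CTE names are shortened for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE things (id INTEGER PRIMARY KEY, alias TEXT, record_id INTEGER)")
conn.executemany(
    "INSERT INTO things (alias, record_id) VALUES (?, ?)",
    [("000123", 4), ("000123", 4), ("000234", 4),
     ("000123", 6), ("000345", 6), ("000345", 7)],
)

rows = conn.execute("""
    WITH distinct_pairs AS (
        -- unique alias/record id combinations
        SELECT DISTINCT alias, record_id FROM things
    ),
    counted AS (
        -- attach the number of distinct record ids to every row of an alias
        SELECT alias, record_id,
               COUNT(*) OVER (PARTITION BY alias) AS unique_record_ids_count
        FROM distinct_pairs
    )
    SELECT alias, record_id, unique_record_ids_count
    FROM counted
    WHERE unique_record_ids_count > 1
    ORDER BY alias, record_id
""").fetchall()

for row in rows:
    print(row)
# ('000123', 4, 2)
# ('000123', 6, 2)
# ('000345', 6, 2)
# ('000345', 7, 2)
```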

SELECT A.CitationId,B.CitationId, A.CitationName, A.LoaderID, A.PrimaryReferenceLoaderID,B.SecondaryReference1LoaderID, A.SecondaryReference1LoaderID, A.SecondaryReference2LoaderID,
A.SecondaryReference3LoaderID, A.SecondaryReference4LoaderID, A.CreatedOn, A.LastUpdatedOn
FROM CitationMaster A, CitationMaster B
WHERE A.PrimaryReferenceLoaderID= B.SecondaryReference1LoaderID and Isnull(A.PrimaryReferenceLoaderID,'') != '' and Isnull(B.SecondaryReference1LoaderID,'') !=''

Related

In sequelize, how do I select records that match all values that i am searching for?

As an example, I have the following table:
T | S
------
1 | 5
1 | 6
1 | 7
2 | 6
2 | 7
3 | 6
Query: array [1,2]
I want to select all values in S that have the value 1 AND 2 in the T Column.
So in the above example I should get as a result (6,7) because only 6 and 7 have for column T the values 1 and 2.
But I do not want 5 in my results, as 5 does not have 2 in the T column.
How would I do this in sequelize?
How do I pass (1,2) as an array?
Either you insert the array, joined into a comma-separated list of literals, into the query text (variant 1), or you join the array into a single string literal and pass it into the query as a parameter (variant 2).
Variant 1
SELECT s
FROM sourcetable
WHERE t IN (1,2) -- separate filter values
GROUP BY s
HAVING COUNT(DISTINCT t) = 2 -- unique values count
Variant 2
SELECT s
FROM sourcetable
WHERE FIND_IN_SET(t, '1,2') -- separate filter values
GROUP BY s
HAVING COUNT(DISTINCT t) = 2 -- unique values count
If (s,t) is unique then DISTINCT keyword may be removed.
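A third possibility, sketched here under stated assumptions rather than taken from the answer, is to generate one bound placeholder per array element and pass the element count as a parameter too. The example uses Python's sqlite3 module instead of Sequelize, and the helper name s_values_matching_all is hypothetical; the same query text could be sent through any client that supports bound parameters.

```python
import sqlite3

def s_values_matching_all(conn, wanted):
    """Return the s values that are paired with every t in `wanted`."""
    wanted = sorted(set(wanted))  # duplicates in the input must not inflate the target count
    if not wanted:
        return []
    placeholders = ",".join("?" for _ in wanted)  # e.g. "?,?" for two values
    sql = (
        f"SELECT s FROM sourcetable WHERE t IN ({placeholders}) "
        "GROUP BY s HAVING COUNT(DISTINCT t) = ? ORDER BY s"
    )
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sourcetable (t INTEGER, s INTEGER)")
conn.executemany(
    "INSERT INTO sourcetable VALUES (?, ?)",
    [(1, 5), (1, 6), (1, 7), (2, 6), (2, 7), (3, 6)],
)

print(s_values_matching_all(conn, [1, 2]))     # [6, 7]
print(s_values_matching_all(conn, [1, 2, 3]))  # [6]
```

Only the placeholders are interpolated into the query text; the values themselves stay bound parameters, and the HAVING target follows the array length instead of being hard-coded as 2.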

How to sort a column in one table based on the rank in another table

I have a table, Table 1, with Customer_id and Item_List columns, where the items are in arbitrary order
Customer_id Item_List
22 1,4,3,2
24 6,3,2,1
23 4,5,7,8
Table 2 has the ranks of the item according to the highest value
Item_Id Item_Rank
1 8
2 5
3 3
4 4
5 2
6 7
7 1
8 6
I want to produce a Table that has Customer_id with the corresponding Item List ranked according to the Item Rank in Table 2
Customer_id Ranked_Item_List
22 3,4,2,1
24 3,2,6,1
23 7,5,4,8
I don't know any efficient method to do it in hive. Any suggestions?
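I can think of two different ways: write your own UDF to avoid the explode, or explode the list, join it to the ranks, and collect it again in rank order:
select customer_id, concat_ws(',', collect_list(item_id)) as ranked_item_list from (
select t1.customer_id, x.item_id, t2.item_rank from
table1 t1 lateral view explode(split(t1.item_list, ',')) x as item_id join
table2 t2 on cast(x.item_id as int) = t2.item_id --this should be done as a mapjoin if your rank table is not big
distribute by t1.customer_id sort by t1.customer_id, t2.item_rank
) s
group by customer_id;
Note that collect_list usually keeps the sorted order in this pattern, but Hive does not guarantee it. As I said before, depending on the size of your data, you could create a UDF to apply the sort at the mapper level based on your lookup table.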
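The per-row logic such a UDF would need is small. Below is a minimal sketch of it in Python rather than as an actual Hive UDF; the function name rank_item_list and the in-memory dictionary standing in for Table 2 are assumptions made for illustration.

```python
def rank_item_list(item_list, rank_by_item):
    """Reorder a comma-separated item list by ascending rank from the lookup."""
    items = [int(x) for x in item_list.split(",")]
    items.sort(key=lambda item: rank_by_item[item])
    return ",".join(str(item) for item in items)

# Table 2 from the question as a lookup: Item_Id -> Item_Rank
rank_by_item = {1: 8, 2: 5, 3: 3, 4: 4, 5: 2, 6: 7, 7: 1, 8: 6}

# Table 1 from the question: Customer_id -> Item_List
customers = {22: "1,4,3,2", 24: "6,3,2,1", 23: "4,5,7,8"}

ranked = {cid: rank_item_list(items, rank_by_item) for cid, items in customers.items()}
for cid, items in ranked.items():
    print(cid, items)
# 22 3,4,2,1
# 24 3,2,6,1
# 23 7,5,4,8
```

This only works when the rank table is small enough to be held in memory by each mapper, which is the same condition under which the map join applies.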

Compare column entry to every other entry in the same column

I have a Column of values in SQLite.
value
-----
1
2
3
4
5
For each value I would like to know how many of the other values are larger and display the result. E.g. For value 1 there are 4 entries that have higher values.
value | Count
-------------
1 | 4
2 | 3
3 | 2
4 | 1
5 | 0
I have tried nested select statements and using the Count(*) function but I do not seem to be able to extract the correct levels. Any suggestions would be much appreciated.
Many Thanks
You can do this with a correlated subquery in SQLite:
select value,
(select count(*) from t t2 where t2.value > t.value) as "count"
from t;
In most other databases, you would use a ranking function such as rank() or dense_rank(). SQLite supports these only from version 3.25.0 onward, so the correlated subquery is the form that works on any SQLite version.
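For a quick check, the correlated subquery can be run against the five sample values. This sketch uses Python's sqlite3 module with an in-memory database; the table name t follows the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (value INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in range(1, 6)])

# For every row, count the rows whose value is strictly larger.
counts = conn.execute("""
    SELECT value,
           (SELECT COUNT(*) FROM t AS t2 WHERE t2.value > t.value) AS "count"
    FROM t
    ORDER BY value
""").fetchall()

print(counts)  # [(1, 4), (2, 3), (3, 2), (4, 1), (5, 0)]
```

The subquery is evaluated once per row, so on large tables an index on value is likely to matter.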

SQL - Order by amount of occurrences

It's my first question here so I hope I can explain it well enough,
I want to order my data by amount of occurrences in the table.
My table is like this:
id Daynr
1 2
1 4
2 4
2 5
2 6
3 1
4 2
4 5
And I want it to sort it like this:
id Daynr
3 1
1 2
1 4
4 2
4 5
2 4
2 5
2 6
Player #3 has one day in the table, and Player #1 has 2.
My table is named "dayid"
Both id and Daynr are foreign keys, together making it a primary key
I hope this explains my problem well enough. Please ask if you need more information; it's my first time here.
Thanks in advance
You can do this by counting the number of times that things occur for each id. Most databases support window functions, so you can do this as:
select id, daynr
from (select t.*, count(*) over (partition by id) as cnt
from dayid t
) t
order by cnt, id;
You can also express this as a join:
select t.id, t.daynr
from dayid as t inner join
(select id, count(*) as cnt
from dayid
group by id
) as tg
on t.id = tg.id
order by tg.cnt, t.id;
Note that both of these include the id in the order by. That way, if two ids have the same count, all rows for the id will appear together.
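Here is the join version run against the sample data, as a sketch using Python's sqlite3 module. The table name dayid comes from the question; daynr is added as a final sort key in this example (an addition, not part of the answer) so that the rows within each id come back in a fixed order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dayid (id INTEGER, daynr INTEGER, PRIMARY KEY (id, daynr))")
conn.executemany(
    "INSERT INTO dayid VALUES (?, ?)",
    [(1, 2), (1, 4), (2, 4), (2, 5), (2, 6), (3, 1), (4, 2), (4, 5)],
)

# Sort by how often each id occurs, then by id so equal counts stay grouped.
ordered = conn.execute("""
    SELECT t.id, t.daynr
    FROM dayid AS t
    INNER JOIN (SELECT id, COUNT(*) AS cnt FROM dayid GROUP BY id) AS tg
        ON t.id = tg.id
    ORDER BY tg.cnt, t.id, t.daynr
""").fetchall()

print(ordered)
# [(3, 1), (1, 2), (1, 4), (4, 2), (4, 5), (2, 4), (2, 5), (2, 6)]
```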

Remove duplicate rows #2

I have a (large ~1 000 000 rows) table that potentially contains duplicate rows (possible NULL values).
What I want to do is this:
Select only distinct rows.
Remove rows with duplicate 'id' field.
Let's have a table:
id | a | b
1 | 2 | 3
2 | 8 | 7
3 | 9 | 10
2 | 8 | 7
3 | 20| 12
What I want to get is:
id | a | b
1 | 2 | 3
2 | 8 | 7
Row with id 2 is preserved in one copy, while rows with id 3 were removed.
I was thinking about:
1. SELECT DISTINCT id, a, b FROM table; to get only distinct rows.
2. Somehow filter the result of (1) to remove duplicate ids.
What would be the best way to approach this?
Third Answer now that the question is slightly clearer:
SELECT id, min(a) as a, min(b) as b
FROM (SELECT DISTINCT id, a, b FROM table) t
GROUP BY id
HAVING count(*) =1
Petr, it looks like per the comments, you want a COMBINATION...
Include:
All rows where the ID occurs ONLY ONCE
All rows where the ID occurs MORE than once -- AND all the other fields on the record are the same
EXCLUDE:
Any row where the ID occurs more than once -- AND the other fields do not exactly match.
select ID, min(a) a, min(b) b
from YourTable
group by ID
having min(a) = max(a)
and min(b) = max(b)
If you have more columns aside from a and b to compare, just add the respective values to the select field list and the corresponding having. From the data sample you've provided, the values return from the query would be
ID | MIN(A) | MIN(B) | HAVING: MIN(A) | MAX(A) | MIN(B) | MAX(B)
1  | 2      | 3      | 2              | 2      | 3      | 3
2  | 8      | 7      | 8              | 8      | 7      | 7
3  | 9      | 10     | 9              | 20     | 10     | 12
So the row with ID = 3 gets tossed, because the HAVING clause fails: min() and max() differ for both columns a and b. Then you can copy the result into a new table. This takes only one pass through the table...
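A small sketch of the min()/max() query on the sample rows, again using Python's sqlite3 module as an assumed test bed (YourTable is the placeholder name from the query above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (id INTEGER, a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    [(1, 2, 3), (2, 8, 7), (3, 9, 10), (2, 8, 7), (3, 20, 12)],
)

# Keep an id only when all of its rows agree on both a and b.
kept = conn.execute("""
    SELECT id, MIN(a) AS a, MIN(b) AS b
    FROM YourTable
    GROUP BY id
    HAVING MIN(a) = MAX(a) AND MIN(b) = MAX(b)
    ORDER BY id
""").fetchall()

print(kept)  # [(1, 2, 3), (2, 8, 7)]
```

Since the question mentions possible NULL values: if a column is NULL in every row of an id, MIN and MAX are both NULL and the equality is not true, so that id would be dropped as well.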
Can you rebuild the database, or if not build a new one from the original, with id as a primary key? SQL can take care of the rest.