Find and group duplicates

I hope I can explain what I'm trying to achieve; I think it's a bit complicated.
I have two tables like this:

ID | Names
--------------
A  | Name1
B  | Name2
C  | Name3

ID | md5s
--------------
A  | a
A  | b
B  | c
C  | a
C  | c

I'm trying to achieve this: In the end I want to have a list of all "Names" that have duplicate MD5 values and in which other "Names" these MD5 values were found.
So I want to get something like this:
Name1 has 5 duplicate entries in "md5s" with Name8, 4 with Name10 ...
I need a list for all "Names" like described above.
Hopefully that makes sense to someone. :)
I already tried it with this SQL statement:
SELECT names,COUNT(names) AS Num FROM tablename GROUP BY names HAVING(Num > 1);
But that gives me only the md5s that are duplicates. The relation to the rest is totally missing.
*edit:fixed typo

I feel like there must be a better solution than this, but here's what I've thrown together for you:
SELECT a.names NAME,
       b.names DUPE_NAME,
       COUNT(*) NUM_DUPES
FROM names_tbl a, names_tbl b, md5_tbl md5a, md5_tbl md5b
WHERE a.id < b.id
  AND a.id = md5a.id
  AND b.id = md5b.id
  AND md5a.md5s = md5b.md5s
GROUP BY a.names, b.names
ORDER BY a.names
A rule of thumb for finding duplicates is that you probably need a self join. This would be simpler if the names and their associated MD5s were in the same record, but because they're in separate tables, I think you need two copies of each table in the query.
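To see what the self join produces, here is a minimal sketch that loads the question's sample rows into an in-memory SQLite database and runs the same join logic. The table names names_tbl and md5_tbl come from the answer; the column names id, names and md5s are assumed from the question's table headers, and SQLite stands in for whatever database the asker actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE names_tbl (id TEXT, names TEXT);
    CREATE TABLE md5_tbl   (id TEXT, md5s  TEXT);
    INSERT INTO names_tbl VALUES ('A', 'Name1'), ('B', 'Name2'), ('C', 'Name3');
    INSERT INTO md5_tbl VALUES ('A', 'a'), ('A', 'b'), ('B', 'c'), ('C', 'a'), ('C', 'c');
""")

# Two copies of each table: one side for the name, one for the name it shares md5s with.
# a.id < b.id keeps each pair once and stops a name from matching itself.
pairs = conn.execute("""
    SELECT a.names, b.names, COUNT(*)
    FROM names_tbl a
    JOIN md5_tbl md5a ON md5a.id = a.id
    JOIN md5_tbl md5b ON md5b.md5s = md5a.md5s
    JOIN names_tbl b  ON b.id = md5b.id
    WHERE a.id < b.id
    GROUP BY a.names, b.names
    ORDER BY a.names, b.names
""").fetchall()

for name, dupe_name, num_dupes in pairs:
    print(f"{name} has {num_dupes} duplicate entries in md5s with {dupe_name}")
```

On the sample data, Name1 and Name3 share the value a, and Name2 and Name3 share the value c, so two pairs come back with a count of 1 each.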

Related

Aggregating or Bundle a Many to Many Relationship in SQL Developer

So I have 1 single table with 2 columns : Sales_Order called ccso, Arrangement called arrmap
The table has distinct values for this combination and both these fields have a Many to Many relationship
1 ccso can have Multiple arrmap
1 arrmap can have Multiple ccso
All such combinations should be considered as one single bundle
Objective :
Assign a final map to each of the Sales Order as the Largest Arrangement in that Bundle
Example:
ccso : 100-10015 has 3 arrangements --> Now each of those arrangements have a set of Sales Orders --> Now those sales orders will also have a list of other arrangements and so on
(Image : 1)
Therefore the answer definitely points to some kind of recursive check. I've managed to write the code below, and it works as long as I hard-code a ccso in the WHERE clause, but I don't know how to proceed from here. (I'm an accountant by profession but have recently been finding more passion in coding.) I've searched the forums and the web for things like
Recursive CTEs,
many to many aggregation
cartesian product etc
and I'm sure there must be a term for this which I don't know yet.
I have to use sqldeveloper or googlesheet query and filter formulas
sqldeveloper has restrictions on some CTEs. If recursion is the way to go, I'd like to know how, and whether I can limit the depth to, say, 4 or 5 iterations
Ideally I'd want to update a third column with the final map if possible, but if not, then a select query result is just fine
Codes I've tried
Code 1: As per Screenshot
WITH a1(ccso, amap) AS
(SELECT distinct a.ccso, a.arrmap
FROM rg_consol_map2 A
WHERE a.ccso = '100-10115' -- this condition defines the ultimate ancestors in your chain, change it as appropriate
UNION ALL
SELECT m.ccso, m.arrmap
FROM rg_consol_map2 m
JOIN a1
ON M.arrmap = a1.amap -- or m.ccso=a1.ccso
) /*if*/ CYCLE amap SET nemap TO 1 /*else*/ DEFAULT 0
SELECT DISTINCT amap FROM (SELECT ccso, amap FROM a1 ORDER BY 1 DESC) WHERE ROWNUM = 1
In this the main challenge is how to remove the hardcoded ccso and do a join for each of the ccso
Code 2 : Manual CTEs for depth
Here again the join outside the CTE gives me an error and sqldeveloper does not allow WITH clause with UPDATE statement - only works for select and cannot be enclosed within brackets as subtable
SELECT distinct ccso FROM
(
WITH ar1 AS
(SELECT distinct arrmap
FROM rg_consol_map
WHERE ccso = a.ccso
)
,so1 AS
(SELECT DISTINCT ccso
FROM rg_consol_map
WHERE arrmap IN (SELECT arrmap FROM ar1)
)
,ar2 AS
(SELECT DISTINCT ccso FROM rg_consol_map
where arrmap IN (select distinct arrmap FROM rg_consol_map
WHERE ccso IN (SELECT ccso FROM so1)
))
SELECT ar1.arrmap, NULL ccso FROM ar1
union all
SELECT null, ar2.ccso FROM ar2
UNION ALL
SELECT NULL arrmap, so1.ccso FROM so1
)
Am I Missing something here or is there an easier way to do this? I read something about MERGE and PROC SQL JOIN but was unable to get them to work but if that's the way to go ahead I will try further if someone can point me in the direction
(Image : 2)
(CSV File : [3])
Edit : Fixing CSV file link
https://github.com/karan360note/karanstackoverflow.git
I suppose can be downloaded from here IC mapping many to many.csv
Oracle 11g version is being used

Apologies in advance for the wall of text.
Your problem is a complex, multi-layered many-to-many query; there is no "easy" solution to this, because that is not an ideal design choice. The safest bet is to use multiple layers of CTEs or subqueries to reach all the depths you want, as the only ways I know to do this recursively rely on an anchor column (like "parentID") to direct the recursion in a linear fashion. We don't have that option here; we'd go in circles without a way to track our path.
Therefore, I went basic, and with several subqueries. Every level checks for a) All orders containing a particular ARRMAP item, and then b) All additional items on those orders. It's clear enough for you to see the logic and modify to your needs. It will generate a new table that contains the original CCSO, the linking ARRMAP, and the related CCSO. Link: https://pastebin.com/un70JnpA
This should enable you to go back and perform the desired updates you want, based on order # or order date, etc... in a much more straightforward fashion. Once you have an anchor column, a CTE in the future is much more trivial (just search for "CTE recursion tree hierarchy").
SELECT DISTINCT
CCSO, RELATEDORDER
FROM myTempTable
WHERE CCSO = '100-10115'; /* to find all orders by CCSO, query SELECT DISTINCT RELATEDORDER */
--WHERE ARRMAP = 'ARR10524'; /* to find all orders by ARRMAP, query SELECT DISTINCT CCSO */
EDIT:
To better explain what this table generates, let me simplify the problem.
If you have order
A with arrangements 1 and 2;
B with arrangement 2, 3; and
C with arrangement 3;
then, by your initial inquiry and image, order A should be related to orders B and C, right? The query generates the following table when you SELECT DISTINCT ccso, relatedOrder:
+------+--------------+
| CCSO | RelatedOrder |
+------+--------------+
| A    | B            |
| A    | C            |
+------+--------------+
| B    | C            |
| B    | A            |
+------+--------------+
| C    | A            |
| C    | B            |
+------+--------------+
You can see here if you query WHERE CCSO = 'A' OR RelatedOrder = 'A', you'll get the same relationships, just flipped between the two columns.
+------+--------------+
| CCSO | RelatedOrder |
+------+--------------+
| A    | B            |
| A    | C            |
+------+--------------+
| B    | A            |
+------+--------------+
| C    | A            |
+------+--------------+
So query only CCSO or RelatedOrder.
As for the results of WHERE CCSO = '100-10115', see image here, which includes all the links you showed in your Image #1, as well as additional depths of relations.
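The simplified A/B/C example can also be read as a connected-components problem: every order and every arrangement is a node, and each (ccso, arrmap) row is an edge. The sketch below is not the answer's query. It is an alternative outside the database, using a small union-find in Python, under the assumption that the rows can be exported (for example from the CSV). The function name largest_arrangement_per_order and the extra order D with arrangement 9 are hypothetical, added only to show a second, separate bundle.

```python
def largest_arrangement_per_order(pairs):
    """Map each order to the largest arrangement in its bundle.

    pairs is an iterable of (order, arrangement) rows. Orders and
    arrangements are tagged so equal values of different kinds never collide.
    """
    pairs = list(pairs)
    parent = {}

    def find(node):
        # Walk up to the root, halving the path as we go.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(left, right):
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_left] = root_right

    # Each row links an order to an arrangement, merging their bundles.
    for order, arrangement in pairs:
        union(("order", order), ("arr", arrangement))

    # Track the largest arrangement seen in each bundle.
    largest = {}
    for order, arrangement in pairs:
        root = find(("order", order))
        if root not in largest or arrangement > largest[root]:
            largest[root] = arrangement

    return {order: largest[find(("order", order))] for order, _ in pairs}


rows = [("A", 1), ("A", 2), ("B", 2), ("B", 3), ("C", 3), ("D", 9)]
result = largest_arrangement_per_order(rows)
print(result)
```

Because the merge step does not care about direction or depth, cycles in the data are harmless and no depth limit is needed. The resulting mapping could then be written back to a third column with ordinary UPDATE statements.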

pass results between queries and display joint results ( google bigquery )

I want to make a query q1, and use the result of q1 on a second query q2.
I want to display all columns of q1 and q2, so that results are based on a common column.
(Please let me know if title is not so clear)
The example below should display columns [id, publisher, author] in the q1.
I want to pass them to q2, retrieve properties [id, cited_id, category] for all items within the id column of q1.
As results, for each id I want to display all cited_ids and their properties (of both ids and cited_ids).
Alternatively, for better clarity, it is also ok to retrieve an array of cited_ids for each ids, and in a separate query I will decorate my ids and cited_ids with their properties.
Please also advise on performance (I'm using BigQuery, so if you could explain why one solution is more efficient, that would help in saving computational resources!).
I came up with this, but cannot display all columns of q1.
WITH q1 AS (
SELECT id, publisher, a.name
FROM `db.publications`,
UNNEST (publisher) as h,
UNNEST (author) as a
WHERE h Like '%penguin%'
)
SELECT p.id, c.id AS Cited, c.Category AS Cat
FROM `db.publications` AS p, UNNEST(citation) AS c
WHERE p.id IN (SELECT id from q1)
Sample Data:
# result of q1
Row | Id | Publisher | Author
1 | item0 | penguin | Bob
2 | item0 | penguin | Alice
3 | item1 | penguin | Charlie
I want to find other items that are cited by each unique item in q1 (item0, item1).
I would like the results in a handy format that could be used in this way:
# Citations: books mentioned by item0, item1 ...
item0 : [item10, item15, item100]
item1 : [item23, item0, item101, item15]
..
# Decorators : information about each book:
Row | Id | Publisher | Author(s) |
My question is: can I achieve both in a single query?
If so, is that convenient, or is it better to split it into two separate queries to use fewer computational resources?
My approach is first query a set of books and their decorators, and then use a list of ids to look for their citations. I could not carry decorators along with above example.

Regarding the first part of your question, instead of using where p.id in(select id from q1), use a join to bring in q1 fields.
WITH q1 AS (
SELECT id, publisher, a.name
FROM `db.publications`,
UNNEST (publisher) as h,
UNNEST (author) as a
WHERE h Like '%penguin%'
),
joined as (
select id, p.citation, q1.publisher, q1.name
from `db.publications` p
inner join q1 using(id)
)
select id, publisher, name, c.id as Cited, c.Category as Cat
from joined
left join unnest(citation) c
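The effect of joining instead of filtering with IN can be sketched on the question's sample rows. This is only an illustration: it uses SQLite with flat tables rather than BigQuery's nested arrays and UNNEST, and the table name citations and its columns are assumptions. The point it shows is that the join carries the q1 columns along, at the cost of repeating each citation once per author row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- q1 holds the sample result from the question (one row per id/author).
    CREATE TABLE q1 (id TEXT, publisher TEXT, author TEXT);
    INSERT INTO q1 VALUES
        ('item0', 'penguin', 'Bob'),
        ('item0', 'penguin', 'Alice'),
        ('item1', 'penguin', 'Charlie');
    -- Hypothetical flat stand-in for the nested citation field.
    CREATE TABLE citations (id TEXT, cited_id TEXT);
    INSERT INTO citations VALUES
        ('item0', 'item10'), ('item0', 'item15'), ('item0', 'item100'),
        ('item1', 'item23'), ('item1', 'item0'), ('item1', 'item101'),
        ('item1', 'item15');
""")

# The join keeps publisher and author next to every cited id.
rows = conn.execute("""
    SELECT q1.id, q1.publisher, q1.author, c.cited_id
    FROM q1
    JOIN citations c ON c.id = q1.id
    ORDER BY q1.id, q1.author, c.rowid
""").fetchall()

# Collapse the repeated rows into one list of cited ids per item.
citations_by_id = {}
for item_id, _publisher, _author, cited_id in rows:
    cited = citations_by_id.setdefault(item_id, [])
    if cited_id not in cited:
        cited.append(cited_id)

print(len(rows), citations_by_id)
```

item0 has two authors and three citations, so it contributes six joined rows. If that repetition matters, keeping the citations and the decorators in two separate result sets, as the question suggests, avoids it.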

SQL Two SELECT vs. JOIN best performance?

I wonder which has better performance in this case. First of all, I want to show to the user his medical information. I have two tables
user
-----
id_user | type_blood | number | ...
1 O 123
2 A+ 442
user_allergies
-----------
id_user | name
1 name1
1 name2
I want to return:
JSON {id_user=1, type_blood=O, allergies=(name1,name2)}
So, is it better to do a JOIN of user and user_allergies and iterate over the rows, or to run two SELECTs?
But if then I have another table like user_allergies, that the result can be:
user_another_table
-----------
id_user | name
1 namet1
1 namet2
1 namet3
JSON {id_user=1, type_blood=O, allergies=(name1,name2), table=(namet1,namet2,namet3)}
Is it better to run three SELECTs or a JOIN? With a JOIN I have to iterate over the results, and I can't imagine an easy way to do that. A JOIN can give me a result like:
id_user | type_blood | allergy_name | another_table_name
1 O name1 namet1
1 O name1 namet2
1 O name1 namet3
1 O name2 namet1
1 O name2 namet2
1 O name2 namet3
Is there any way to extract:
id_user | type_blood | allergy_name | another_table_name
1 O name1 namet1
1 O name2 namet2
1 O namet3
Thanks, community. I'm a newbie in SQL.

There is no way to get the second set of results you've shown from that join. The second set throws data away: in this case, the pairing of allergy 'name2' with another_table_name 'namet3'. This is why you get many rows back with repeated data.
You can use the GROUP BY clause to restrict this in some cases, but again, it won't let you throw away data like that.
You could try using the COALESCE function, if your DB supports it.
If not, I think you're going to have to construct your JSON in some business logic, in which case it's fine to read the data in a 3-way join. You order by the user id and either create or append the row data to the JSON document, depending on whether a user record is already present (if you order by user id, you only need to keep track of when the user id value changes).
Alternatively, you can read a list of users and single-item data in one query, and then hit the DB again for the repeating data.
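The last option, one query for the single-valued user data and one more per repeating table, might look like the sketch below. It uses SQLite and the sample rows from the question. The helper names names_for and medical_info are hypothetical, and the JSON keys simply mirror the shape the question asks for.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "user" (id_user INTEGER, type_blood TEXT, number INTEGER);
    CREATE TABLE user_allergies (id_user INTEGER, name TEXT);
    CREATE TABLE user_another_table (id_user INTEGER, name TEXT);
    INSERT INTO "user" VALUES (1, 'O', 123), (2, 'A+', 442);
    INSERT INTO user_allergies VALUES (1, 'name1'), (1, 'name2');
    INSERT INTO user_another_table VALUES (1, 'namet1'), (1, 'namet2'), (1, 'namet3');
""")


def names_for(table, id_user):
    # table is one of the two fixed names used below, never user input.
    query = f"SELECT name FROM {table} WHERE id_user = ? ORDER BY name"
    return [row[0] for row in conn.execute(query, (id_user,))]


def medical_info(id_user):
    # One query for the single-valued columns...
    found_id, type_blood = conn.execute(
        'SELECT id_user, type_blood FROM "user" WHERE id_user = ?', (id_user,)
    ).fetchone()
    # ...and one query per repeating table, so no row is multiplied.
    return {
        "id_user": found_id,
        "type_blood": type_blood,
        "allergies": names_for("user_allergies", id_user),
        "table": names_for("user_another_table", id_user),
    }


print(json.dumps(medical_info(1)))
```

Each list is read exactly once, so the two-allergies-times-three-rows product from the 3-way join never appears. The trade-off is three round trips to the database instead of one.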

sybase - values from one table that aren't on another, on opposite ends of a 3-table join

Hypothetical situation: I work for a custom sign-making company, and some of our clients have submitted more sign designs than they're currently using. I want to know what signs have never been used.
3 tables involved:
table A - signs for a company
sign_pk(unique) | company_pk | sign_description
1 --------------------1 ---------------- small
2 --------------------1 ---------------- large
3 --------------------2 ---------------- medium
4 --------------------2 ---------------- jumbo
5 --------------------3 ---------------- banner
table B - company locations
company_pk | company_location(unique)
1 ------|------ 987
1 ------|------ 876
2 ------|------ 456
2 ------|------ 123
table C - signs at locations (it's a bit of a stretch, but each row can have 2 signs, and it's a one to many relationship from company location to signs at locations)
company_location | front_sign | back_sign
987 ------------ 1 ------------ 2
987 ------------ 2 ------------ 1
876 ------------ 2 ------------ 1
456 ------------ 3 ------------ 4
123 ------------ 4 ------------ 3
So, a.company_pk = b.company_pk and b.company_location = c.company_location. What I want to try and find is how to query and get back that sign_pk 5 isn't at any location. Querying each sign_pk against all of the front_sign and back_sign values is a little impractical, since all the tables have millions of rows. Table a is indexed on sign_pk and company_pk, table b on both fields, and table c only on company locations. The way I'm trying to write it is along the lines of "each sign belongs to a company, so find the signs that are not the front or back sign at any of the locations that belong to the company tied to that sign."
My original plan was:
Select a.sign_pk
from a, b, c
where a.company_pk = b.company_pk
and b.company_location = c.company_location
and a.sign_pk *= c.front_sign
group by a.sign_pk having count(c.front_sign) = 0
just to do the front sign, and then repeat for the back, but that won't run because c is an inner member of an outer join, and also in an inner join.
This whole thing is fairly convoluted, but if anyone can make sense of it, I'll be your best friend.

How about something like this:
SELECT DISTINCT sign_pk
FROM table_a
WHERE sign_pk NOT IN
(
    SELECT DISTINCT front_sign sign
    FROM table_c
    UNION
    SELECT DISTINCT back_sign sign
    FROM table_c
)
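As a quick check of the NOT IN idea, here is a minimal sketch on the question's sample rows, using SQLite rather than Sybase and the question's column names (front_sign, back_sign). The short table names a and c follow the asker's own query. The IS NOT NULL filters are an added precaution: if the subquery returned a NULL, NOT IN would match no rows at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (sign_pk INTEGER PRIMARY KEY, company_pk INTEGER, sign_description TEXT);
    CREATE TABLE c (company_location INTEGER, front_sign INTEGER, back_sign INTEGER);
    INSERT INTO a VALUES
        (1, 1, 'small'), (2, 1, 'large'), (3, 2, 'medium'),
        (4, 2, 'jumbo'), (5, 3, 'banner');
    INSERT INTO c VALUES
        (987, 1, 2), (987, 2, 1), (876, 2, 1), (456, 3, 4), (123, 4, 3);
""")

# Every sign that appears as a front or back sign anywhere is "used";
# UNION already removes duplicates between the two lists.
unused = [
    row[0]
    for row in conn.execute("""
        SELECT sign_pk
        FROM a
        WHERE sign_pk NOT IN (
            SELECT front_sign FROM c WHERE front_sign IS NOT NULL
            UNION
            SELECT back_sign FROM c WHERE back_sign IS NOT NULL
        )
        ORDER BY sign_pk
    """)
]
print(unused)
```

This version ignores which company a location belongs to, which is fine as long as sign_pk is unique across companies, as the question states.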

ANSI outer join is your friend here. *= has dodgy semantics and should be avoided.
select distinct a.sign_pk, a.company_pk
from a join b on a.company_pk = b.company_pk
left outer join c on b.company_location = c.company_location
and (a.sign_pk = c.front_sign or a.sign_pk = c.back_sign)
where c.company_location is null
Note that the where clause is a filter on the rows returned by the join, so it says "do the joins, but give me only the rows that didn't join to c".
In my experience an outer join is usually faster than NOT EXISTS or NOT IN, though this depends on the optimizer and the indexes available.
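One detail is worth checking against the sample rows: company 3 has no row in table B, so an inner join from a to b drops sign 5 before the outer join to c is ever evaluated. The sketch below runs the query above in SQLite and then a variant that outer-joins b as well and counts matches per sign. The variant is an assumption about what the asker wants (signs with no match at any location of their company), not a tested Sybase query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (sign_pk INTEGER PRIMARY KEY, company_pk INTEGER, sign_description TEXT);
    CREATE TABLE b (company_pk INTEGER, company_location INTEGER);
    CREATE TABLE c (company_location INTEGER, front_sign INTEGER, back_sign INTEGER);
    INSERT INTO a VALUES
        (1, 1, 'small'), (2, 1, 'large'), (3, 2, 'medium'),
        (4, 2, 'jumbo'), (5, 3, 'banner');
    INSERT INTO b VALUES (1, 987), (1, 876), (2, 456), (2, 123);
    INSERT INTO c VALUES
        (987, 1, 2), (987, 2, 1), (876, 2, 1), (456, 3, 4), (123, 4, 3);
""")

# The query as posted: inner join to b, outer join to c.
inner_version = conn.execute("""
    SELECT DISTINCT a.sign_pk
    FROM a
    JOIN b ON a.company_pk = b.company_pk
    LEFT OUTER JOIN c ON b.company_location = c.company_location
                     AND (a.sign_pk = c.front_sign OR a.sign_pk = c.back_sign)
    WHERE c.company_location IS NULL
""").fetchall()

# Variant: outer join both tables, then keep signs with zero matching c rows.
# COUNT(c.company_location) ignores the NULLs produced by the outer joins.
left_version = conn.execute("""
    SELECT a.sign_pk
    FROM a
    LEFT OUTER JOIN b ON a.company_pk = b.company_pk
    LEFT OUTER JOIN c ON b.company_location = c.company_location
                     AND (a.sign_pk = c.front_sign OR a.sign_pk = c.back_sign)
    GROUP BY a.sign_pk
    HAVING COUNT(c.company_location) = 0
    ORDER BY a.sign_pk
""").fetchall()

print(inner_version, left_version)
```

Grouping per sign also avoids reporting a sign that is used at one of its company's locations but not at another, which the row-level IS NULL filter would flag.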

I would be tempted to create a Temp table for the inner join and then outer join that.
But it really depends on the size of your data sets.
Yes, the schema design is flawed, but we can't always fix that!

SQL: Looking up the same field in one table for multiple values in another table?

(Not sure if the name of this question really makes sense, but what I want to do is pretty straightforward)
Given tables that looks something like this:
Table Foo
---------------------------------
| bar1_id | bar2_id | other_val |
---------------------------------
Table Bar
--------------------
| bar_id | bar_desc|
--------------------
How would I create a select that would return a table that would look like the following:
---------------------------------------------------------
| bar1_id | bar1_desc | bar2_id | bar2_desc | other_val |
---------------------------------------------------------
i.e. I want to grab every row from Foo and add in a column containing the description of that bar_id from Bar. So there might be some rows from Bar that don't end up in the result set, but every row of Foo should be in it.
Also, this is postgres, if that makes a difference.

SELECT F.bar1_id,
       (SELECT bar_desc FROM Bar B WHERE (F.bar1_id = B.bar_id)) AS bar1_desc,
       F.bar2_id,
       (SELECT bar_desc FROM Bar B WHERE (F.bar2_id = B.bar_id)) AS bar2_desc,
       F.other_val
FROM FOO F;

This doesn't directly answer your question (but that's ok, the people above already have), but...
This is considered very bad design. What happens in the future when your foo can be associated with 3 bars? Or more? (Don't say it will never happen. I've lost count of the number of "that'll never happen" things I've implemented over the years).
The generally correct way to do this is to do a one-to-many relationship (either with each bar pointing back to a foo, or an intermediate foo-to-bar table, see many-to-many relationships). Now you correctly format output on the front end, and just fetch a list of bars per foo to pass up to it (easy to do in SQL). Reports are a special case, but it's still relatively easily accomplished with pivoting or CrossTab queries.

SELECT
foo.bar1_id, bar1.bar_desc AS bar1_desc,
foo.bar2_id, bar2.bar_desc AS bar2_desc,
foo.other_val
FROM
foo
INNER JOIN bar bar1 ON bar1.bar_id = foo.bar1_id
INNER JOIN bar bar2 ON bar2.bar_id = foo.bar2_id
This assumes you'll always have both a bar1_id and a bar2_id in foo. If these can be null then change INNER JOIN to LEFT OUTER JOIN.
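A minimal sketch of the double join, using the column names from the question (bar_id, bar_desc) and the LEFT OUTER JOIN form so that every row of foo survives. It runs on SQLite rather than Postgres, and the sample rows are hypothetical; the second foo row has a NULL bar2_id to show the difference from an inner join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bar (bar_id INTEGER PRIMARY KEY, bar_desc TEXT);
    CREATE TABLE foo (bar1_id INTEGER, bar2_id INTEGER, other_val TEXT);
    -- Hypothetical sample rows.
    INSERT INTO bar VALUES (1, 'one'), (2, 'two'), (3, 'three');
    INSERT INTO foo VALUES (1, 2, 'x'), (2, NULL, 'y');
""")

# bar is joined twice under two aliases, once per foreign-key column.
rows = conn.execute("""
    SELECT foo.bar1_id, bar1.bar_desc AS bar1_desc,
           foo.bar2_id, bar2.bar_desc AS bar2_desc,
           foo.other_val
    FROM foo
    LEFT OUTER JOIN bar bar1 ON bar1.bar_id = foo.bar1_id
    LEFT OUTER JOIN bar bar2 ON bar2.bar_id = foo.bar2_id
    ORDER BY foo.other_val
""").fetchall()

for row in rows:
    print(row)
```

The row with the NULL bar2_id is kept with an empty bar2_desc, and bar row 3, which no foo row references, does not appear, matching what the question describes.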

select f.bar1_id, b1.bar_desc, f.bar2_id, b2.bar_desc, f.other_val
from foo as f, bar as b1, bar as b2
where f.bar1_id = b1.bar_id
and f.bar2_id = b2.bar_id

I would try the information_schema.columns table.
SELECT concat(table_name, '_', column_name)
INTO new_table
FROM information_schema.columns
WHERE table_name = 'bar1' OR table_name = 'bar2'
Then you can populate it.
with foo as select * from bar1
or select into
