SQL comparing two sets of complex data - sql

Sorry, probably the title is not the best one but I hope you will understand what problem I have.
I need to compare and analyse two sets of data and I'm using MS-Access for that. My data is organized in two tables. Following is not the real data I'm working with but will serve ok as example:
TABLE 1
ID Name
1 Zoie
2 Rohan
2 Simon
3 Jerome
4 Jakob
4 Mathew
4 Cora
6 Keely
7 Aiyana
7 Jake
8 Reid
9 Emerson
TABLE 2
ID Name
1 Michael
2 Rohan
2 Simon
3 Jill
4 Jakob
4 Cora
5 Charlie
7 John
8 Reid
9 Yadiel
9 Emerson
9 Paris
So, I need to select only those IDs which fully corresponds (all names under specific IDs are the same) in both tables and those are: 2 and 8
I would also like to have separate select statement which will result with IDs 2 and 8 but also IDs with names from table 1 which also appears in table 2 (all from table 1 plus possible some extra in table 2 under the same ID). So that would be: 2, 8, 9
I would also like to have separate select statement which will result with IDs 2 and 8 but also IDs with names from table 2 which also appears in table 1 (all from table 2 plus possible some extra in table 1 under the same ID). So that would be: 2, 4, 8
I would also like to have separate select statement which would be a combination of last two.
So result would be: 2, 4, 8, 9
I would appreciate any suggestions.
Thanks in advance!
Best regards,
Mario

Q#1:
select id
from table1
group by id
having count(*) =
(
select count(*)
from table2
group by table2.id
having table2.id = table1.id
)
and count(*) =
(
select count(*)
from table1 table1_1
inner join table2 on table1_1.id = table2.id and table1_1.name = table2.name
group by table1_1.id
having table1_1.id = table1.id
)
Explanation of this query:
It is grouping table1 by ID
For each group (for each ID), it is counting the number of rows in table1 that have this ID.
For each group, it is counting the number of rows in table2 that have this ID.
For each group, it is counting the number of rows where the name appears in both tables for this ID (it does that by inner joining table1 and table2 on the ID and Name which means only rows where both ID and Name match in both tables will be counted, for each ID).
It then returns IDs (from table1) where each of the above counts are equal. This is what results in returning IDs where all names are in both tables (no more, no less).
Q#2 - In this case you don't care that table2 has the same number of names per ID. So remove the first sub-query (that counts matching rows in table2).
select id
from table1
group by id
having count(*) =
(
select count(*)
from table1 table1_1
inner join table2 on table1_1.id = table2.id and table1_1.name = table2.name
group by table1_1.id
having table1_1.id = table1.id
)
Although the above is easy enough to understand following the same logic as Q#1, it is probably more efficient to do the following, and more straightforward. It only matters if you find it running too slow for your data (which is subjective and context dependent).
select table1.id
from table1
left join table2 on table1.id = table2.id and table1.name = table2.name
group by table1.id
having count(table1.id) = count(table2.id)
Here, the two tables are LEFT (outer) joined which means all records from table1 are gathered and records in table2 that match by ID and Name are also included alongside. Then, we group them by ID and we compare the count of each group in table1 with those that had matching names in table2.
Q#3 - This case is the same as Q#2 except table1 and table2 are swapped.
Q#4 - In this case you only care about IDs that have at least one name that appears in both tables. So join the tables and return the distinct IDs:
select distinct id
from table1
inner join table2 on table1.id = table2.id and table1.name = table2.name
Here is a SQLFiddle to play with containing the four queries: http://www.sqlfiddle.com/#!18/3fc71/22

Related

SQL Subquery Join and Sum

I have Table 1 & 2 with common Column name ID in both.
Table 1 has duplicate entries of rows which I was able to trim using:
SELECT DISTINCT
Table 2 has duplicate numeric entries(dollarspent) for ID's which I needed and was able to sum up:
Table 1 Table 2
------------ ------------------
ID spec ID Dol1 Dol2
54 A 54 1 0
54 A 54 2 1
55 B 55 0 2
56 C 55 3 0
-I need to join these two queries into one so I get a resultant JOIN of Table 1 & Table 2 ON column ID, (a) without duplicates in Table 1 & (b) Summed $ values from Table 2
For eg:
NewTable
----------------------------------------
ID Spec Dol1 Dol2
54 A 3 1
55 B 3 2
Notes : No. of rows in Table 1 and 2 are not the same.
Thanks
Use a derived table to get the distinct values from table1 and simply join to table 2 and use aggregation.
The issue you have is you have a M:M relationship between table1 and table2. You need it to be a 1:M for the summations to be accurate. Thus we derive t1 from table1 by using a select distinct to give us the unique records in the 1:M relationship (assuming specs are same for each ID)
SELECT T1.ID, T1.Spec, Sum(T2.Dol1) as Dol1, sum(T2.Dol2) as Dol2
FROM (SELECT distinct ID, spec
FROM table1) T1
INNER JOIN table2 T2
on t2.ID = T1.ID
GROUP BY T1.ID, T1.Spec
This does assume you only want records that exist in both. Otherwise we may need to use an (LEFT, RIGHT, or FULL) outer join; depending on desired results.
I can't really see your data, but you might want to try:
SELECT DISTINCT ID
FROM TblOne
UNION ALL
SELECT DISTINCT ID, SUM(Dol)
FROM TblTwo
GROUP BY ID
Pre-aggregate table 2 and then join:
select t1.id, t1.spec, t2.dol1, t2.dol2
from (select t2.id, sum(dol1) as dol1, sum(dol2) as dol2
from table2 t2
group by t2.id
) t2 join
(select distinct t1.id, t1.spec
from table1 t1
) t1
on t1.id = t2.id;
For your data examples, you don't need to pre-aggregate table 2. This gives the correct sums -- albeit in multiple rows -- if table1 has multiple specs for a given id.

Remove Duplicate Data Based on Three Fields and Two tables

I have two tables that have more than three fields each. There is a group of records that are on both files, the below is a mock example:
Table 1:
ID Name Town State
1 Dave Chicago IL
2 Mark Tea MD
Table 2:
ID Name State Job Married
1 Dave IL Manager Yes
2 Mark MD Driver No
For my purpose duplicates exist if ID, Name, and State are the same. So the above data are duplicates. How do I delete them from one table (I have over 900 duplicates so deleting one by one is not possible)?
delete table1
where ID in(select distinct ID from table1 where ID in (Select ID from table2))
i dont understand which table has duplicates, if you want to delete duplicate data from one table1 then you can use this query
This query will produce a de-duplicated result set:
SELECT Table1.ID,
Table1.NAME,
Table1.Town,
Table1.STATE,
NULL AS Job,
NULL AS Married
FROM Table1
WHERE Table1.ID NOT IN (
SELECT Table1.ID
FROM Table1
INNER JOIN Table2 ON (Table2.STATE = Table1.STATE)
AND (Table2.NAME = Table1.NAME)
AND (Table1.ID = Table2.ID)
)
UNION
SELECT Table2.ID,
Table2.NAME,
NULL AS Town,
Table2.STATE,
Table2.Job,
Table2.Married
FROM Table2
This is the most straightforward way to do it, assuming that you want to delete from Table1. I'm a little rusty with Access SQL syntax, but I believe this works:
DELETE FROM [Table1]
WHERE EXISTS (
SELECT 1
FROM [Table2]
WHERE [Table2].[ID] = [Table1].[ID]
AND [Table2].[Name] = [Table1].[Name]
AND [Table2].[State] = [Table1].[State]
)

DISTINCT based on ID between Two table and SUM of the other exiting column in SQL

SELECT
SUM(amount), (DISTINCT table1.id)
FROM
table1
INNER JOIN
table2 ON table1.id = table2.id IS NULL;
Table1 id amount Table2 id product
1 40 5 10
2 364.25 2 20
3 704.5 8 30
4 404.5 3 40
5 580.5 2 20
The id is not unique or primary ------------------first i need to ignore all double entry ID from table2 then match id from table2 to table1 after that those ids will be match i need total of amount figure amount is a column name i will not calculate single the data will be more than 20000. please help me if you can
First compare match table2 id with table1 id if found any id match then those id amount need to be SUM i mean total. here match id is 2 and 3 according to table2 and then we will add this 2 and 3 id amount so result will be 364.25+704.5=1068.75 i am looking the result how can i do it using mysql.
I am trying to DISTINCT based on ID between two tables and SUM of the other existing column we have in table1. Can somebody help me how to do it?
Based on your latest comment try this
SELECT
SUM(amount)
FROM
table1
INNER JOIN
(Select Distinct ID From table2) T2 ON table1.id = T2.id
Lets leave a matter of performance beyond the topic:)
So according to this:
First compare match table2 id with table1 id if found any id match
then those id amount need to be SUM i mean total
it is possible to use correlated subquery:
select sum(t1.amount)
from table1 t1
where exists (select 1 from table2 t2 where t2.id = t1.id)
From what I understood, you need common IDs from both table and corresponding sum of amount.
SELECT
SUM(amount)
FROM
table1
WHERE
id IN (SELECT id
FROM table2);

Join table on Count

I have two tables in Access, one containing IDs (not unique) and some Name and one containing IDs (not unique) and Location. I would like to return a third table that contains only the IDs of the elements that appear more than 1 time in either Names or Location.
Table 1
ID Name
1 Max
1 Bob
2 Jack
Table 2
ID Location
1 A
2 B
Basically in this setup it should return only ID 1 because 1 appears twice in Table 1 :
ID
1
I have tried to do a JOIN on the tables and then apply a COUNT but nothing came out.
Thanks in advance!
Here is one method that I think will work in MS Access:
(select id
from table1
group by id
having count(*) > 1
) union -- note: NOT union all
(select id
from table2
group by id
having count(*) > 1
);
MS Access does not allow union/union all in the from clause. Nor does it support full outer join. Note that the union will remove duplicates.
Simple Group By and Having clause should help you
select ID
From Table1
Group by ID
having count(1)>1
union
select ID
From Table2
Group by ID
having count(1)>1
Based on your description, you do not need to join tables to find duplicate records, if your table is what you gave above, simply use:
With A
as
(
select ID,count(*) as Times From table group by ID
)
select * From A where A.Times>1
Not sure I understand what query you already tried, but this should work:
select table1.ID
from table1 inner join table2 on table1.id = table2.id
group by table1.ID
having count(*) > 1
Or if you have ID's in one table but not the other
select table1.ID
from table1 full outer join table2 on table1.id = table2.id
group by table1.ID
having count(*) > 1

Query three non-bisecting sets of data

I am retrieving three different sets of data (or what should be "unique" rows). In total, I expect 3 different unique sets of rows because I have to complete different operations on each set of data. I am, however, retrieving more rows than there are in total in the table, meaning that I must be retrieving duplicate rows somewhere. Here is an example of my three sets of queries:
SELECT DISTINCT t1.*
FROM table1 t1
INNER JOIN table2 t2
ON t2.ID = t1.ID
AND t2.NAME = t1.NAME
AND t2.ADDRESS <> t1.ADDRESS
SELECT DISTINCT t1.*
FROM table1 t1
INNER JOIN table2 t2
ON t2.ID = t1.ID
AND t2.NAME <> t1.NAME
AND t2.ADDRESS <> t1.ADDRESS
SELECT DISTINCT t1.*
FROM table1 t1
INNER JOIN table2 t2
ON t2.ID <> t1.ID
AND t2.NAME = t1.NAME
AND t2.ADDRESS <> t1.ADDRESS
As you can see, I am selecting (in order of queries)
Set of data where the id AND name match
Set of data where the id matches but the name does NOT
Set of data where the id does not match but name DOES
I am retrieving MORE rows than exist in T1 when adding up the number of results returned from all three queries which I don't think is logically possible, plus this means I must be duplicating rows (if it is logically possible) somewhere which prevents me from executing different commands against each set (since a row would have another command executed on it).
Can someone find where I'm going wrong here?
Consider if Name is not unique. If you have the following data:
Table 1 Table 2
ID Name Address ID Name Address
0 Jim Smith 1111 A St 0 Jim Smith 2222 A St
1 Jim Smith 2222 B St 1 Jim Smith 3333 C St
Then Query 1 gives you:
0 Jim Smith 1111 A St
1 Jim Smith 2222 B St
Because rows 1 & 2 in Table 1 match rows 1 & 2, respectively in Table 2.
Query 2 gives you nothing.
Query 3 gives you
0 Jim Smith 1111 A St
1 Jim Smith 2222 B St
Because row 1 in Table 1 matches row 2 in Table 2 and row 2 in Table 1 matches row 1 in Table 2. Thus you get 4 rows out of Table 1 when there are only 2 rows in it.
Are you sure that NAME and ID are unique in both tables?
If not, you could have a situation, for example, where table 1 has this:
NAME: Fred
ID: 1
and table2 has this:
NAME: Fred
ID: 1
NAME: Fred
ID: 2
In this case, the record in table1 will be returned by two of your queries: ID and NAME both match, and NAME matches but ID doesn't.
You might be able to narrow down the problem by intersecting each combination of two queries to find out what the duplicates are, e.g.:
SELECT DISTINCT t1.*
FROM table1 t1
INNER JOIN table2 t2
ON t2.ID = t1.ID
AND t2.NAME = t1.NAME
AND t2.ADDRESS <> t1.ADDRESS
INTERSECT
SELECT DISTINCT t1.*
FROM table1 t1
INNER JOIN table2 t2
ON t2.ID = t1.ID
AND t2.NAME <> t1.NAME
AND t2.ADDRESS <> t1.ADDRESS
Assuming that T2.ID has a unique constraint, it is still quite logically possible for this scenario to occur.If for every record in T1, there are two corresponding records in T2:
Same name, same id, different address
Same name, different id, different address
Then the same record for T1 can come up in the first and third query for example.
It is also possible to simultaneously get the same row in the second and third query.
If T2.ID is not guaranteed to be unique, then you could get the same row from T1 in all three queries.
I think the last query could be the one fetching extra set of rows.
i.e. It is relying on Name matching in both tables (and not on ID)
To find the offending data (and help find your logic hole) I would recommend:
(caution pseudo-code)
Limit the results to just SELECT id FROM ....
UNION the result sets
COUNT(id)
GROUP BY id
HAVING count(id) > 1
This will show the records that match more than one sub-query.