Identify duplicates based on multiple columns - sql

I want to identify duplicates in a database based on multiple columns from several tables. In the example below, IDs 1 & 5 and 2 & 4 are duplicates, as all four columns have the same values. How do I identify such records using SQL? I have used GROUP BY ... HAVING COUNT(*) > 1 to identify duplicates based on a single column, but I am unsure how to do it for multiple columns. When I group by all four columns with HAVING COUNT(*) > 1, IDs 3 and 6 also show up, although they are not duplicates per my requirement.
T1
ID | Col1 | Col2
---| --- | ---
1 | A | US
2 | B | FR
3 | C | AU
4 | B | FR
5 | A | US
6 | D | UK
T2
ID | Col1
---| ---
1 | Apple
1 | Kiwi
2 | Pear
3 | Banana
3 | Banana
4 | Pear
5 | Apple
T3
ID | Col1
---| ---
1 | Spinach
1 | Beets
2 | Celery
3 | Radish
4 | Celery
5 | Spinach
6 | Celery
6 | Celery
My expected result would be:
1 A US Apple Spinach
5 A US Apple Spinach
2 B FR Pear Celery
4 B FR Pear Celery

For your sample data, you can get the desired result by inner joining all three tables and filtering with a WHERE ... IN sub-query that uses GROUP BY tA.Col1 HAVING count(tA.Col1) > 1, as below.
SELECT t1.ID,
t1.Col1,
t1.Col2,
t2.Col1,
t3.Col1
FROM table1 t1
JOIN table2 t2 ON t1.ID = t2.ID
JOIN table3 t3 ON t1.ID = t3.ID
WHERE t1.Col1 IN
( SELECT tA.Col1
FROM table1 tA
GROUP BY tA.Col1
HAVING count(tA.Col1)>1)
ORDER BY t1.ID;
Result
ID Col1 Col2 Col1 Col1
-----------------------------------
1 A US Apple Spinach
2 B FR Pear Celery
4 B FR Pear Celery
5 A US Apple Spinach
Hope this will help.

The problem is your result set needs to include the ID column which is unique. So a straightforward GROUP BY ... HAVING won't cut it. This would work.
with cte as
( select t1.id
, t1.col1 as t1_col1
, t1.col2 as t1_col2
, t2.col1 as t2_col1
, t3.col1 as t3_col1
from t1
join t2 on t1.id = t2.id
join t3 on t1.id = t3.id
)
select cte.*
from cte
where (t1_col1, t1_col2, t2_col1, t3_col1) in
( select t1_col1, t1_col2, t2_col1, t3_col1
from cte
group by t1_col1, t1_col2, t2_col1, t3_col1 having count(*) > 1)
/
The use of the sub-query factoring syntax is optional, but I find it useful to signal that the subquery is used more than once in the query.
"I have encountered another scenario in the data, some of the IDs have same values in T2 and T3 and they are showing up as dups."
The duplicated IDs in the child tables cause Cartesian products in the joined subquery, which causes false positives in the main result set. Ideally you should be able to handle this by introducing additional filters on those tables to remove the unwanted rows. However, if the data quality is so poor that there are no valid rules you will have to fall back on distinct:
with cte as (
select t1.id
, t1.col1 as t1_col1
, t1.col2 as t1_col2
, t2.col1 as t2_col1
, t3.col1 as t3_col1
from t1
join ( select distinct id, col1 from t2) t2 on t1.id = t2.id
join ( select distinct id, col1 from t3) t3 on t1.id = t3.id
) ...
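As a quick way to check the DISTINCT variant against the sample data from the question, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for the asker's database (an assumption; the trailing / above suggests Oracle), and the sketch relies on SQLite accepting row values in IN, which it does from version 3.15 on. The ORDER BY is added only to make the output deterministic.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real one (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (id INTEGER, col1 TEXT, col2 TEXT);
CREATE TABLE t2 (id INTEGER, col1 TEXT);
CREATE TABLE t3 (id INTEGER, col1 TEXT);
INSERT INTO t1 VALUES (1,'A','US'),(2,'B','FR'),(3,'C','AU'),
                      (4,'B','FR'),(5,'A','US'),(6,'D','UK');
INSERT INTO t2 VALUES (1,'Apple'),(1,'Kiwi'),(2,'Pear'),(3,'Banana'),
                      (3,'Banana'),(4,'Pear'),(5,'Apple');
INSERT INTO t3 VALUES (1,'Spinach'),(1,'Beets'),(2,'Celery'),(3,'Radish'),
                      (4,'Celery'),(5,'Spinach'),(6,'Celery'),(6,'Celery');
""")

# Same shape as the query above: de-duplicate the child tables first,
# then keep only column combinations that occur more than once.
QUERY = """
WITH cte AS (
    SELECT t1.id,
           t1.col1 AS t1_col1,
           t1.col2 AS t1_col2,
           t2.col1 AS t2_col1,
           t3.col1 AS t3_col1
    FROM t1
    JOIN (SELECT DISTINCT id, col1 FROM t2) t2 ON t1.id = t2.id
    JOIN (SELECT DISTINCT id, col1 FROM t3) t3 ON t1.id = t3.id
)
SELECT cte.*
FROM cte
WHERE (t1_col1, t1_col2, t2_col1, t3_col1) IN (
    SELECT t1_col1, t1_col2, t2_col1, t3_col1
    FROM cte
    GROUP BY t1_col1, t1_col2, t2_col1, t3_col1
    HAVING COUNT(*) > 1)
ORDER BY id
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

On the sample data this returns only IDs 1, 2, 4 and 5. ID 3 drops out because its two identical Banana rows collapse to one under DISTINCT.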

You can add all the columns you want to check for duplicates to the GROUP BY clause and then put the count condition in the HAVING clause:
select t1.id, t1.col1, t2.col2, t2.col3, t3.col4
from t1 join t2 on t1.id=t2.id join t3 on t3.id=t1.id
where (t1.col1,t2.col2,t2.col3,t3.col4) in (
select t1.col1,t2.col2,t2.col3,t3.col4
from t1 join t2 on t1.id=t2.id join t3 on t3.id=t1.id
group by t1.col1,t2.col2,t2.col3,t3.col4
having count(*) >1 )

Related

How to group total amount spend based on both ID and name

I have a table like this:
patientId | Units | Amount | PatientName
1234 | 1 | 20 |lisa
1111 | 5 | 10 |john
1234 | 10 | 200 |lisa
345 | 2 | 30 | xyz
I want to get the ID in one column, then the patient name, then the total amount spent by that patient across different items.
Please note that I got the patient name in the column above by joining two tables, using ID as the key.
This is the query I use to get this table:
select t1.*,t2.name from table1 as t1 inner join table2 as t2
on t1.id = t2.id
Then, to add up the amounts, I am trying to use a GROUP BY clause, but that gives an error.
Please note that I cannot use a temp table for this; it has to be done with a subquery. How do I do it?
Are you looking for group by?
select t1.patientid, t2.patientname, sum(t1.amount)
from table1 t1 join
table2 t2
on t1.id = t2.id
group by t1.patientid, t2.patientname;
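The GROUP BY answer above can be tried on the sample rows with a small sketch using Python's built-in sqlite3 module. The schema is an assumption, since the question does not show it: table1 is taken to hold (patientid, units, amount), table2 to hold (id, patientname), and the two are joined on the patient ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema; the question does not show the real one.
CREATE TABLE table1 (patientid INTEGER, units INTEGER, amount INTEGER);
CREATE TABLE table2 (id INTEGER, patientname TEXT);
INSERT INTO table1 VALUES (1234, 1, 20), (1111, 5, 10),
                          (1234, 10, 200), (345, 2, 30);
INSERT INTO table2 VALUES (1234, 'lisa'), (1111, 'john'), (345, 'xyz');
""")

# Every non-aggregated column in the SELECT list also appears in GROUP BY.
totals = conn.execute("""
SELECT t1.patientid, t2.patientname, SUM(t1.amount) AS total_amount
FROM table1 t1
JOIN table2 t2 ON t1.patientid = t2.id
GROUP BY t1.patientid, t2.patientname
ORDER BY t1.patientid
""").fetchall()

for row in totals:
    print(row)
```

Selecting t1.* together with GROUP BY is the usual source of the error mentioned in the question on stricter databases, because columns such as units are neither grouped nor aggregated.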
select t1.*,
t2.name
from table1 t1
inner join table2 t2
on t1.id = t2.id
group by t1.id, t2.name
What are table1 and table2 like? What's the error message?

Microsoft SQL Server Conditional Joining based on 2 columns

I am looking to join 3 tables, all with the same data except that one column has a different name (a different date for each of the 3 tables). The three tables look like the following. The goal is: if a condition exists in table 1 AND/OR table 2, determine whether it does or does not exist in table 3 for each individual id/condition. I'm currently left joining table 2 to table 1, but I'm aware that this does not account for a condition that exists in table 2 but not in table 1. Any help with this would be useful.
Table 1
id place Condition_2018
123 ABC flu
456 ABC heart attack
Table 2
id place Condition_2019
123 ABC flu
789 def copd
Table 3
id place Condition_2020
456 ABC heart attack
789 def copd
123 ABC flu
OUTPUT:
Table 2
id place Condition_2018 Condition_2019 Condition_2020
123 ABC flu flu flu
456 ABC heart attack null heart attack
789 def NULL copd copd
Thank you!
How about this (SQL Server syntax)...
SELECT
x.id
, x.place
, x.Condition_2018
, x.Condition_2019
, t3.Condition_2020
FROM (
SELECT
COALESCE(t1.id, t2.id) AS id
, COALESCE(t1.place, t2.place) AS place
, t1.Condition_2018
, t2.Condition_2019
FROM Table1 AS t1
FULL OUTER JOIN Table2 AS t2 ON t1.id = t2.id AND t1.place = t2.place
) AS x LEFT JOIN Table3 AS t3 ON x.id = t3.id AND x.place = t3.place
If your database supports full join, you can just do:
select
id,
place,
t1.condition_2018,
t2.condition_2019,
t3.condition_2020
from table1 t1
full join table2 t2 using(id, place)
full join table3 t3 using(id, place)
Otherwise, it is a bit more complicated: union all and aggregation is one method:
select
id,
place,
max(condition_2018) condition_2018,
max(condition_2019) condition_2019,
max(condition_2020) condition_2020
from (
select id, place, condition_2018, null condition_2019, null condition_2020 from table1
union all
select id, place, null, condition_2019, null from table2
union all
select id, place, null, null, condition_2020 from table3
) t
group by id, place
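The UNION ALL plus aggregation approach can be checked against the question's sample rows with a minimal sketch using Python's built-in sqlite3 module. SQLite is an assumption here, chosen only because it ships with Python; the asker is on SQL Server. The ORDER BY is added for a stable output order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER, place TEXT, condition_2018 TEXT);
CREATE TABLE table2 (id INTEGER, place TEXT, condition_2019 TEXT);
CREATE TABLE table3 (id INTEGER, place TEXT, condition_2020 TEXT);
INSERT INTO table1 VALUES (123, 'ABC', 'flu'), (456, 'ABC', 'heart attack');
INSERT INTO table2 VALUES (123, 'ABC', 'flu'), (789, 'def', 'copd');
INSERT INTO table3 VALUES (456, 'ABC', 'heart attack'), (789, 'def', 'copd'),
                          (123, 'ABC', 'flu');
""")

# Stack the three tables into one five-column shape, padding with NULL,
# then collapse to one row per (id, place). MAX ignores NULLs, so each
# year column keeps the value from the table that supplied it.
combined = conn.execute("""
SELECT id, place,
       MAX(condition_2018) AS condition_2018,
       MAX(condition_2019) AS condition_2019,
       MAX(condition_2020) AS condition_2020
FROM (
    SELECT id, place, condition_2018,
           NULL AS condition_2019, NULL AS condition_2020 FROM table1
    UNION ALL
    SELECT id, place, NULL, condition_2019, NULL FROM table2
    UNION ALL
    SELECT id, place, NULL, NULL, condition_2020 FROM table3
) t
GROUP BY id, place
ORDER BY id
""").fetchall()

for row in combined:
    print(row)
```

Unlike the left-join variants, this keeps an id that appears in any one of the three tables, which may or may not be what the asker wants.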
You seem to want everything in Table3 and matches in the other two tables. That is just left joins:
select t3.id, t3.place,
t1.condition_2018, t2.condition_2019,
t3.condition_2020
from table3 t3 left join
table2 t2
on t3.id = t2.id and t3.place = t2.place left join
table1 t1
on t3.id = t1.id and t3.place = t1.place;
You need a full outer join of table1 and table2 and a left join to table3:
select
coalesce(t1.id, t2.id) id,
coalesce(t1.place, t2.place) place,
t1.Condition_2018,
t2.Condition_2019,
t3.Condition_2020
from table1 t1 full outer join table2 t2
on t2.id = t1.id
left join table3 t3
on t3.id = coalesce(t1.id, t2.id)
Results:
> id | place | Condition_2018 | Condition_2019 | Condition_2020
> --: | :---- | :------------- | :------------- | :-------------
> 123 | ABC | flu | flu | flu
> 456 | ABC | heart attack | null | heart attack
> 789 | def | null | copd | copd

Joining two sql tables with a one to many relationship, but want the max of the second table

I am trying to join two tables: one holds unique features, and the second holds readings taken on several dates that relate to those features. I want all of the records in the first table plus the most recent reading. I was able to get the results I was looking for before adding the shape field, using this code:
SELECT
Table1.Name, Table1.ID, Table1.Shape,
Max(Table2.DATE) as Date
FROM
Table1
LEFT OUTER JOIN
Table2 ON Table1.ID = table2.ID
GROUP BY
Table1.Name, Table1.ID, Table1.Shape
The shape field is a geometry type and I get the error
'The type "Geometry" is not comparable. It can not be use in the Group By Clause'
So I need to go about it a different way, but not sure how.
Below is a sample of the two tables and the desired results.
Table1
Name| ID |Shape
AA1 | 1 | X
BA2 | 2 | Y
CA1 | 3 | Z
CA2 | 4 | Q
Table2
ID | Date
1 | 5/27/2013
1 | 6/27/2014
2 | 5/27/2013
2 | 6/27/2014
3 | 5/27/2013
3 | 6/27/2014
My Desired Result is
Name| ID |Shape |Date
AA1 | 1 | X | 6/27/2014
BA2 | 2 | Y | 6/27/2014
CA1 | 3 | Z | 6/27/2014
CA2 | 4 | Q | Null
You can do the aggregation on Table2 in a CTE, finding the MAX(DATE) for each ID, and then join that result to Table1:
WITH AggregatedTable2(ID, MaxDate) AS
(
SELECT
ID, MAX(DATE)
FROM
Table2
GROUP BY
ID
)
SELECT
t1.ID, t1.Name, t1.Shape, t2.MaxDate
FROM
Table1 t1
LEFT JOIN
AggregatedTable2 t2 ON t1.ID = t2.ID
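The aggregate-first pattern above can be sketched with Python's built-in sqlite3 module. Several things here are assumptions made for the sketch: SQLite has no geometry type, so Shape is stored as plain text; the date column is renamed ReadingDate; and the dates are written in ISO format so that MAX compares them correctly as text. The point it illustrates is that Shape never appears in a GROUP BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shape is TEXT here only because SQLite lacks a geometry type (assumption).
CREATE TABLE Table1 (Name TEXT, ID INTEGER, Shape TEXT);
CREATE TABLE Table2 (ID INTEGER, ReadingDate TEXT);
INSERT INTO Table1 VALUES ('AA1', 1, 'X'), ('BA2', 2, 'Y'),
                          ('CA1', 3, 'Z'), ('CA2', 4, 'Q');
INSERT INTO Table2 VALUES (1, '2013-05-27'), (1, '2014-06-27'),
                          (2, '2013-05-27'), (2, '2014-06-27'),
                          (3, '2013-05-27'), (3, '2014-06-27');
""")

# Aggregate Table2 on its own, then join the one-row-per-ID result to Table1.
latest = conn.execute("""
WITH AggregatedTable2(ID, MaxDate) AS (
    SELECT ID, MAX(ReadingDate)
    FROM Table2
    GROUP BY ID
)
SELECT t1.ID, t1.Name, t1.Shape, t2.MaxDate
FROM Table1 t1
LEFT JOIN AggregatedTable2 t2 ON t1.ID = t2.ID
ORDER BY t1.ID
""").fetchall()

for row in latest:
    print(row)
```

ID 4 has no readings, so the LEFT JOIN keeps it with a NULL date (None in Python).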
Try casting geometry as a varchar.
Select Table1.Name, Table1.ID, cast(Table1.Shape as varchar(1)) AS Shape, Max(Table2.DATE) as Date
FROM Table1 LEFT OUTER JOIN
Table2 ON Table1.ID = table2.ID
Group By Table1.Name, Table1.ID, cast(Table1.Shape as varchar(1))
Try this:
SELECT t1.Name
, t1.ID
, t1.Shape
, MAX(t2.Date) As Date
FROM Table1 AS t1
LEFT JOIN Table2 AS t2
ON t2.ID = t1.ID
GROUP
BY t1.Name
, t1.ID
, t1.Shape

PostgreSQL LEFT OUTER JOIN query syntax

Let's say I have a table1:
id name
-------------
1 "one"
2 "two"
3 "three"
And a table2 with a foreign key to the first:
id tbl1_fk option value
-------------------------------
1 1 1 1
2 2 1 1
3 1 2 1
4 3 2 1
Now I want to have as a query result:
table1.id | table1.name | option | value
-------------------------------------
1 "one" 1 1
2 "two" 1 1
3 "three"
1 "one" 2 1
2 "two"
3 "three" 2 1
How do I achieve that?
I already tried:
SELECT
table1.id,
table1.name,
table2.option,
table2.value
FROM table1 AS table1
LEFT OUTER JOIN table2 AS table2 ON table1.id = table2.tbl1_fk
but the result seems to omit the null values:
1 "one" 1 1
2 "two" 1 1
1 "one" 2 1
3 "three" 2 1
SOLVED: thanks to Mahmoud Gamal: (plus the GROUP BY)
Solved with this query
SELECT
t1.id,
t1.name,
t2.option,
t2.value
FROM
(
SELECT t1.id, t1.name, t2.option
FROM table1 AS t1
CROSS JOIN table2 AS t2
) AS t1
LEFT JOIN table2 AS t2 ON t1.id = t2.tbl1_fk
AND t1.option = t2.option
group by t1.id, t1.name, t2.option, t2.value
ORDER BY t1.id, t1.name
You have to use CROSS JOIN to get every possible combination of name from the first table with option from the second table. Then LEFT JOIN these combinations with the second table. Something like:
SELECT
t1.id,
t1.name,
t2.option,
t2.value
FROM
(
SELECT t1.id, t1.name, t2.option
FROM table1 AS t1
CROSS JOIN table2 AS t2
) AS t1
LEFT JOIN table2 AS t2 ON t1.id = t2.tbl1_fk
AND t1.option = t2.option
Simple version: option = group
It's not specified in the Q, but it seems like option is supposed to define a group somehow. In this case, the query can simply be:
SELECT t1.id, t1.name, t2.option, t2.value
FROM (SELECT generate_series(1, max(option)) AS option FROM table2) o
CROSS JOIN table1 t1
LEFT JOIN table2 t2 ON t2.option = o.option AND t2.tbl1_fk = t1.id
ORDER BY o.option, t1.id;
Or, if options are not numbered in sequence, starting with 1:
...
FROM (SELECT DISTINCT option FROM table2) o
...
Returns:
id | name | option | value
----+-------+--------+-------
1 | one | 1 | 1
2 | two | 1 | 1
3 | three | |
1 | one | 2 | 1
2 | two | |
3 | three | 2 | 1
Faster and cleaner, avoiding the big CROSS JOIN and the big GROUP BY.
You get distinct rows with a group number (grp) per set.
Requires Postgres 8.4+.
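The simple version can also be tried without a Postgres server. The sketch below uses Python's built-in sqlite3 module as a stand-in (an assumption, not the asker's setup) and takes the SELECT DISTINCT variant, since generate_series() is not assumed to be available there. The option and value identifiers are double-quoted purely as a precaution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER, name TEXT);
CREATE TABLE table2 (id INTEGER, tbl1_fk INTEGER, "option" INTEGER, "value" INTEGER);
INSERT INTO table1 VALUES (1, 'one'), (2, 'two'), (3, 'three');
INSERT INTO table2 VALUES (1, 1, 1, 1), (2, 2, 1, 1), (3, 1, 2, 1), (4, 3, 2, 1);
""")

# One row per distinct option, crossed with every table1 row, then
# LEFT JOINed back to table2 so that missing pairs come out as NULL.
grid = conn.execute("""
SELECT t1.id, t1.name, t2."option", t2."value"
FROM (SELECT DISTINCT "option" FROM table2) o
CROSS JOIN table1 t1
LEFT JOIN table2 t2 ON t2."option" = o."option" AND t2.tbl1_fk = t1.id
ORDER BY o."option", t1.id
""").fetchall()

for row in grid:
    print(row)
```

The six rows match the result listed above, with None standing in for the blank cells.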
More complex: group indicated by sequence of rows
WITH t2 AS (
SELECT *, count(step OR NULL) OVER (ORDER BY id) AS grp
FROM (
SELECT *, lag(tbl1_fk, 1, 2147483647) OVER (ORDER BY id) >= tbl1_fk AS step
FROM table2
) x
)
SELECT g.grp, t1.id, t1.name, t2.option, t2.value
FROM (SELECT generate_series(1, max(grp)) AS grp FROM t2) g
CROSS JOIN table1 t1
LEFT JOIN t2 ON t2.grp = g.grp AND t2.tbl1_fk = t1.id
ORDER BY g.grp, t1.id;
Result:
grp | id | name | option | value
-----+----+-------+--------+-------
1 | 1 | one | 1 | 1
1 | 2 | two | 1 | 1
1 | 3 | three | |
2 | 1 | one | 2 | 1
2 | 2 | two | |
2 | 3 | three | 2 | 1
How?
Explaining the complex version ...
Every set starts with a tbl1_fk <= the previous one. I check for this with the window function lag(). To cover the corner case of the first row (no preceding row), I provide the biggest possible integer, 2147483647, as the default for lag().
With count() as aggregate window function I add the running count to each row, effectively forming the group number grp.
I could get a single instance for every group with:
(SELECT DISTINCT grp FROM t2) g
But it's faster to just get the maximum and employ the nifty generate_series() for the reduced CROSS JOIN.
This CROSS JOIN produces exactly the rows we need without any surplus. Avoids the need for a later GROUP BY.
LEFT JOIN t2 to that, using grp in addition to tbl1_fk to make it distinct.
Sort any way you like - which is possible now with a group number.
Try this:
SELECT
table1.id, table1.name, table2.option, table2.value FROM table1 AS table1
JOIN table2 AS table2 ON table1.id = table2.tbl1_fk
This is enough:
select * from table1 left join table2 on table1.id=table2.tbl1_fk ;

Is there a way to have one to many + one to many query in one result in group by?

I know the title is confusing. I have an example below. I know how to do it with inline SELECTs, but I am trying to avoid that:
T1
id | title
1 a
T2
t1_id | title
1 a1
1 a2
1 a3
T3
t1_id | amount
1 10
The result set should be: t1.id, group_concat(t2.title) , sum(t3.amount)
1 | a1,a2,a3 | 10
How about this?
select t2.t1_id, group_concat(t2.title), t3.amount
from t2
join t3 on t2.t1_id = t3.t1_id
group by t2.t1_id, t3.amount
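To see what the query above returns on the sample rows, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is an assumption (the question's group_concat suggests MySQL), but it provides a group_concat aggregate with the same default comma separator. The order of the concatenated titles is not guaranteed, so the sketch sorts them before printing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (id INTEGER, title TEXT);
CREATE TABLE t2 (t1_id INTEGER, title TEXT);
CREATE TABLE t3 (t1_id INTEGER, amount INTEGER);
INSERT INTO t1 VALUES (1, 'a');
INSERT INTO t2 VALUES (1, 'a1'), (1, 'a2'), (1, 'a3');
INSERT INTO t3 VALUES (1, 10);
""")

# The answer's query, unchanged apart from layout.
result = conn.execute("""
SELECT t2.t1_id, group_concat(t2.title), t3.amount
FROM t2
JOIN t3 ON t2.t1_id = t3.t1_id
GROUP BY t2.t1_id, t3.amount
""").fetchall()

t1_id, titles, amount = result[0]
print(t1_id, sorted(titles.split(",")), amount)
```

This reproduces the expected row for the sample data, where T3 holds a single row per t1_id. With several T3 rows per t1_id, grouping by t3.amount would yield one row per distinct amount rather than a sum, so the query would need adjusting for that case.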