SQL find identical group

Given a table like:
id key val
---- ---- -----
bob hair red
bob eyes green
And another table like:
id key val
---- ---- -----
fred hair red
fred eyes green
fred shoe 42
joe hair red
joe eyes green
greg eyes blue
greg hair brown
I'd like to find people in table b who match people in table a exactly, in this case Bob and Joe. Fred doesn't count because he also has a shoe size. This is in Sybase so there's no full outer join. I've come up with a select of a select with a union that returns people who definitely aren't the same, but I'm not sure how to efficiently select people who are.
Alternatively, if it's simpler, how can I check which groups in a occur in b more than once?

Try this
select a.id,b.id
from a
join b on a.[key] = b.[key] and a.val = b.val -- match all rows
join (select id,count(*) total from a group by id) a2 on a.id = a2.id -- get the total keys for table a per id
join (select id,count(*) total from b group by id) b2 on b.id = b2.id -- get the total keys for table b per id
group by a.id,b.id,a2.total,b2.total
having count(*) = a2.total AND count(*) = b2.total -- the matching row's total should be equal with each tables keys per id
After @t-clausen.dk's comments I revised the original SQL.
In this version I count the distinct key/value pairs that match in both tables and compare that count with each table's own distinct key/value pair count per id.
select td.aid,td.bid
from (
select a.id as aid,b.id as bid, count(distinct a.[key]+' '+a.val) total
from a
join b on a.[key] = b.[key] and a.val = b.val
group by a.id,b.id
) td -- match all distinct attribute rows
join (select id,count(distinct [key]+' '+val) total from a group by id) a2 on td.aid = a2.id -- get the total distinct keys for table a per id
join (select id,count(distinct [key]+' '+val) total from b group by id) b2 on td.bid = b2.id -- get the total keys for table b per id
where td.total = a2.total AND td.total = b2.total -- the matching distinct attribute total should be equal with each tables distinct key-val pair
Tested on
Table a
bob hair red
bob eyes green
nick hair red
nick eyes green
nick shoe 45
Table b
fred hair red
fred eyes green
joe hair red
joe eyes green
fred shoe 42
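The counting idea above can be tried end to end with a small sketch. It uses Python's built-in sqlite3 module rather than Sybase, so the quoting of "key" and the in-memory database are assumptions made for illustration; the data is the sample from the question.

```python
import sqlite3

# In-memory SQLite database standing in for the real server (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id TEXT, "key" TEXT, val TEXT);
CREATE TABLE b (id TEXT, "key" TEXT, val TEXT);
INSERT INTO a VALUES ('bob', 'hair', 'red'), ('bob', 'eyes', 'green');
INSERT INTO b VALUES
    ('fred', 'hair', 'red'), ('fred', 'eyes', 'green'), ('fred', 'shoe', '42'),
    ('joe', 'hair', 'red'), ('joe', 'eyes', 'green'),
    ('greg', 'eyes', 'blue'), ('greg', 'hair', 'brown');
""")

# A pair (a.id, b.id) is an exact match when the number of matching rows
# equals the row count of BOTH groups.
query = """
SELECT a.id, b.id
FROM a
JOIN b ON a."key" = b."key" AND a.val = b.val
JOIN (SELECT id, COUNT(*) AS total FROM a GROUP BY id) a2 ON a.id = a2.id
JOIN (SELECT id, COUNT(*) AS total FROM b GROUP BY id) b2 ON b.id = b2.id
GROUP BY a.id, b.id, a2.total, b2.total
HAVING COUNT(*) = a2.total AND COUNT(*) = b2.total
"""
matches = conn.execute(query).fetchall()
print(matches)  # [('bob', 'joe')]
```

Fred is rejected because his group has three rows while only two of them match Bob's.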

You can emulate a full outer join by grabbing all ids in a subquery, and then left joining them in two directions:
select ids.id
from (
select distinct id
from #a
union
select id
from #b
) as ids
left join
#a a1
on a1.id = ids.id
left join
#b b1
on a1.id = b1.id
and a1.[key] = b1.[key]
and a1.val = b1.val
left join
#b b2
on b2.id = ids.id
left join
#a a2
on b2.id = a2.id
and b2.[key] = a2.[key]
and b2.val = a2.val
group by
ids.id
having sum(case when b1.id is null or a2.id is null then 1 else 0 end) = 0
Example at SE DATA.
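As a rough check of the two-directional left join, here is a sketch using Python's sqlite3 module. The tables a and b and the ids x, y and z are made up for illustration. Note that this query compares rows that share the same id in both tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id TEXT, "key" TEXT, val TEXT);
CREATE TABLE b (id TEXT, "key" TEXT, val TEXT);
-- x is identical in both tables, y has an extra row in b, z exists only in b
INSERT INTO a VALUES ('x', 'hair', 'red'), ('x', 'eyes', 'green'), ('y', 'hair', 'red');
INSERT INTO b VALUES ('x', 'hair', 'red'), ('x', 'eyes', 'green'),
                     ('y', 'hair', 'red'), ('y', 'shoe', '42'),
                     ('z', 'hair', 'red');
""")

# Left join a -> b and b -> a from the combined id list; any NULL on either
# side means a row exists in one table without a partner in the other.
query = """
SELECT ids.id
FROM (SELECT id FROM a UNION SELECT id FROM b) AS ids
LEFT JOIN a a1 ON a1.id = ids.id
LEFT JOIN b b1 ON a1.id = b1.id AND a1."key" = b1."key" AND a1.val = b1.val
LEFT JOIN b b2 ON b2.id = ids.id
LEFT JOIN a a2 ON b2.id = a2.id AND b2."key" = a2."key" AND b2.val = a2.val
GROUP BY ids.id
HAVING SUM(CASE WHEN b1.id IS NULL OR a2.id IS NULL THEN 1 ELSE 0 END) = 0
ORDER BY ids.id
"""
identical_ids = [row[0] for row in conn.execute(query)]
print(identical_ids)  # ['x']
```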

This query finds the exact matches between differently named ids in #t1 and #t2. I apologize that it is written for MSSQL; I hope it can be converted to Sybase. After playing with it all day I want to share this beauty. I know long scripts like this rarely earn many points, but I hope someone will appreciate it anyway.
This select makes an exact match of #t2 within #t1.
I have populated the tables at this link: https://data.stackexchange.com/stackoverflow/q/108035/
CREATE TABLE #t1 (id varchar(10), [key] varchar(10), val varchar(10))
CREATE TABLE #t2 (id varchar(10), [key] varchar(10), val varchar(10))
;WITH t1 AS (
SELECT t1.id, t1.[key], t1.val, count(*) count1, sum(count(*)) OVER(PARTITION BY t1.id) sum1 FROM #t1 t1
GROUP BY t1.id, t1.[key], t1.val
), t2 as (
SELECT t2.id, t2.[key], t2.val, count(*) count1, sum(count(*)) OVER(PARTITION BY t2.id) sum1 FROM #t2 t2
GROUP BY t2.id, t2.[key], t2.val
), t3 AS (
SELECT t1.*, sum(t1.count1) OVER(PARTITION BY t1.id) sum2
FROM t1
JOIN t2 on t1.val = t2.val AND t1.[key]=t2.[key]
AND t1.count1 = t2.count1 AND t1.sum1 = t2.sum1
)
SELECT t3.id, t3.[key], t3.val FROM t3
JOIN #t2 t ON t3.[key] = t.[key] AND t3.val = t.val
WHERE t3.sum2 = t3.sum1
Don't run the script as posted, because its tables contain no data. Use the link above, where the tables are populated.

Related

How do I update a table that references duplicate records?

I have two SQL tables. One gets a reference value from another table which stores a list of Modules and their ID. But these descriptions are not unique. I am trying to remove the duplicates of Table A but I'm not sure how to update Table B to only reference the single values.
Example:
Table A:
ID Description RefID
-- ----------- -----
1 Test 1 2
2 Test 2 1
Table B:
ID Name
-- ------------
1 QuickReports
2 QuickReports
I want the results to be the following:
Table A:
ID Description RefID
-- ----------- -----
1 Test 1 1
2 Test 2 1
Table B:
ID Name
-- ------------
1 QuickReports
I managed to delete duplicates from Table B using the code below, but I haven't been able to update the records in Table A. Each table has over 500 records.
WITH cte AS(
SELECT
Name,
ROW_NUMBER() OVER (
PARTITION BY
Name
ORDER BY
Name
)row_num
FROM ReportmodulesTest
)
DELETE FROM cte
WHERE row_num > 1;
You would need to update table A first, before deleting from table B.
You tagged your question MySQL but that database would not support the delete statement that you are showing. I suspect that you are running SQL Server, so here is how to do it in that database:
update a
set refid = b.minid
from tablea a
inner join (select name, id, min(id) over(partition by name) minid from tableb) b
on b.id = a.refid and b.minid <> a.refid
In MySQL, you would phrase the same query as:
update tablea a
inner join (select name, id, min(id) over(partition by name) minid from tableb) b on b.id = a.refid
set a.refid = b.minid
where b.minid <> a.refid
You can update the first table using :
update a join
(select b.*,
min(id) over (partition by name) as min_id
from b
) b
on a.refid = b.id
set a.refid = b.min_id
where a.refid <> b.min_id;
Then, you can delete rows in the second table with a similar logic :
delete b
from b join
(select b.*,
min(id) over (partition by name) as min_id
from b
) bb
on bb.id = b.id
where b.id <> bb.min_id;
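The order of operations (repoint the references first, then delete the duplicates) can be sketched with Python's sqlite3 module. This is only an illustration: it uses correlated subqueries instead of the UPDATE ... JOIN syntax above so that it runs on SQLite, and the third row in each table is invented to show that non-duplicates are left alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tablea (id INTEGER, description TEXT, refid INTEGER);
CREATE TABLE tableb (id INTEGER, name TEXT);
INSERT INTO tablea VALUES (1, 'Test 1', 2), (2, 'Test 2', 1), (3, 'Test 3', 3);
INSERT INTO tableb VALUES (1, 'QuickReports'), (2, 'QuickReports'), (3, 'Charts');

-- Step 1: point every reference at the smallest id that shares the same name.
UPDATE tablea
SET refid = (SELECT MIN(keep.id)
             FROM tableb AS dup
             JOIN tableb AS keep ON keep.name = dup.name
             WHERE dup.id = tablea.refid)
WHERE refid IN (SELECT id FROM tableb);

-- Step 2: only now is it safe to remove the duplicate rows.
DELETE FROM tableb
WHERE id NOT IN (SELECT MIN(id) FROM tableb GROUP BY name);
""")

rows_a = conn.execute("SELECT id, description, refid FROM tablea ORDER BY id").fetchall()
rows_b = conn.execute("SELECT id, name FROM tableb ORDER BY id").fetchall()
print(rows_a)  # [(1, 'Test 1', 1), (2, 'Test 2', 1), (3, 'Test 3', 3)]
print(rows_b)  # [(1, 'QuickReports'), (3, 'Charts')]
```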
I found a solution that has made this process easier. I first use Row_Number to find duplicates in Table A and SELECT INTO a temporary table.
SELECT
a.Id
, a.Name
, ROW_NUMBER() OVER(PARTITION BY Name ORDER BY Id DESC) RN
INTO
#TestTable
FROM
TableA a WITH(NOLOCK)
I then JOIN Table A and Table B to see where the ID's match and identify which ID I need to keep and which ID's I need to delete:
SELECT
b.Id
, b.Name
, b.RefId
, ToKeep.Id KeepId
, ToDelete.Id DeleteId
FROM
#TestTable ToDelete
JOIN TableB b WITH(NOLOCK)
ON b.RefId = ToDelete.Id
JOIN #TestTable ToKeep
ON ToDelete.Name = ToKeep.Name
AND ToKeep.RN = 1
WHERE ToDelete.RN > 1
Then using a similar statement, I just update the records:
UPDATE b
SET
b.RefId = ToKeep.Id
FROM #TestTable ToDelete
JOIN TableB b WITH(NOLOCK)
ON b.RefId = ToDelete.Id
JOIN #TestTable ToKeep
ON ToDelete.Name = ToKeep.Name
AND ToKeep.RN = 1
WHERE
ToDelete.RN > 1
Lastly, I can now delete the duplicate records:
DELETE a
FROM #TestTable b
INNER JOIN TableA a
ON b.Id = a.Id
WHERE
b.RN > 1
After this, you can rerun the first SELECT statement to confirm that all duplicates are gone. Just remove the INTO #TestTable clause.
Thanks to an anonymous colleague of mine for this solution and hope this helps someone out there.

Using Min in Update statement to get oldest record

I want to update my tableA from tableB, but using only the oldest entry in tableB for each ID.
TableA:
name ID
nick 15
john 12
tableB:
ID sportsname createddate
12 tennis 15march2019
14 baseball 15march2019
15 basketball 16march2019
15 cricket 20march2020
15 football 17may2020
My query:
update a
set a.sportsname=b.sportsname
from tablea a join tableb b
on a.id=b.id where b.createddate=( select min(createddate) from tableb )
But this is not giving the correct result.
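You can use a subquery that finds the oldest createddate per ID and join back to it:
update a
set a.sportsname = b.sportsname
from tablea a
join tableb b on b.ID = a.ID
join (select ID, min(createddate) as min_createddate
from tableb
group by ID) m on m.ID = b.ID and m.min_createddate = b.createddate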
I suspect that the problem with your query is that you are using the minimum create date over the entire tableb rather than per id. Although you could fix that using a correlated subquery, I would recommend apply:
update a
set a.sportsname = b.sportsname
from tablea a cross apply
(select top (1) b.*
from tableb b
where a.id = b.id
order by b.createddate asc
) b;
For performance, you want an index on tableb(id, createddate, sportsname).
You can use FIRST_VALUE() window function:
UPDATE a
SET a.sportsname=b.sportsname
FROM TableA a INNER JOIN (
SELECT DISTINCT ID,
FIRST_VALUE(sportsname) OVER (PARTITION BY ID ORDER BY createddate) sportsname
FROM TableB
) b ON b.ID = a.ID
See the demo.
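The "oldest row per id" logic can be checked with a small sketch in Python's sqlite3 module. Several things here are assumptions for illustration: tablea is given a sportsname column (the question's query sets one), the dates are rewritten as ISO strings so they sort correctly as text, and a correlated subquery with LIMIT 1 stands in for the SQL Server TOP (1) / CROSS APPLY form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tablea (name TEXT, id INTEGER, sportsname TEXT);
CREATE TABLE tableb (id INTEGER, sportsname TEXT, createddate TEXT);
INSERT INTO tablea (name, id) VALUES ('nick', 15), ('john', 12);
INSERT INTO tableb VALUES
    (12, 'tennis',     '2019-03-15'),
    (14, 'baseball',   '2019-03-15'),
    (15, 'basketball', '2019-03-16'),
    (15, 'cricket',    '2020-03-20'),
    (15, 'football',   '2020-05-17');

-- The minimum date is taken per id (correlated on tablea.id),
-- not over the whole of tableb.
UPDATE tablea
SET sportsname = (SELECT b.sportsname
                  FROM tableb AS b
                  WHERE b.id = tablea.id
                  ORDER BY b.createddate ASC
                  LIMIT 1);
""")

oldest = conn.execute("SELECT name, sportsname FROM tablea ORDER BY name").fetchall()
print(oldest)  # [('john', 'tennis'), ('nick', 'basketball')]
```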

SQL Server - Setting multiple columns from another table

I have two tables.
Table 1
ID Code1 Code2 Code3
10 1.1 1.2 1.3
Table 2
Code Group Category
1.1 a cat1
1.2 b cat1
1.3 c cat2
1.4 d cat3
Now I need to get the outputs in two different forms from these two tables
Output 1
ID Group1 Group2 Group3
10 a b c
Output 2
ID cat1 cat2 cat3
10 1 1 0
Here the cat1, cat2, cat3 columns are Boolean in nature: since Table 1 has no code corresponding to cat3, the value for cat3 is 0.
I was thinking of doing this with case statements but there are about 1000 codes mapped to about 50 categories. Is there a way to do this? I am struggling to come up with a query for this.
First off, I strongly suggest you look into an alternative. This will get messy very fast, as you're essentially treating rows as columns. It doesn't help much that Table1 is already denormalized, though if it really only has 3 code columns, it's not that big of a deal to normalize it again:
CREATE VIEW v_Table1 AS
SELECT Id, Code1 as Code FROM Table1
UNION SELECT Id, Code2 as Code FROM Table1
UNION SELECT Id, Code3 as Code FROM Table1
If we take your second output, it appears you want all possible combinations of ID and Category, and a boolean indicating whether that combination appears in Table2 (using Code to get back to ID in Table1).
Since there doesn't appear to be a canonical list of ID and Category, we'll generate it:
CREATE VIEW v_AllCategories AS
SELECT DISTINCT ID, Category FROM v_Table1 CROSS JOIN Table2
Getting the list of represented ID and Category is pretty straightforward:
CREATE VIEW v_ReportedCategories AS
SELECT DISTINCT ID, Category FROM Table2
JOIN v_Table1 ON Table2.Code = v_Table1.Code
Put those together, and we can then get the bool to tell us which exists:
CREATE VIEW v_CategoryReports AS
SELECT
T1.ID, T1.Category, CASE WHEN T2.ID IS NULL THEN 0 ELSE 1 END as Reported
FROM v_AllCategories as T1
LEFT OUTER JOIN v_ReportedCategories as T2 ON
T1.ID = T2.ID
AND T1.Category = T2.Category
That gets you your answer in a normalized form:
ID | Category | Reported
10 | cat1 | 1
10 | cat2 | 1
10 | cat3 | 0
From there, you'd need to do a PIVOT to get your Category values as columns:
SELECT
ID,
cat1,
cat2,
cat3
FROM v_CategoryReports
PIVOT (
MAX([Reported]) FOR Category IN ([cat1], [cat2], [cat3])
) p
Since you mentioned over 50 'Categories', I'll assume they're not really 'cat1' - 'cat50'. In which case, you'll need to code gen the pivot operation.
SqlFiddle with a self-contained example.
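One way to "code gen" the pivot is to read the category list at run time and build one flag column per category. The sketch below does this with Python's sqlite3 module and conditional aggregation instead of SQL Server's PIVOT. The helper quote_ident and the lower-case table names are illustrative assumptions, not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER, code1 TEXT, code2 TEXT, code3 TEXT);
CREATE TABLE table2 (code TEXT, "group" TEXT, category TEXT);
INSERT INTO table1 VALUES (10, '1.1', '1.2', '1.3');
INSERT INTO table2 VALUES ('1.1', 'a', 'cat1'), ('1.2', 'b', 'cat1'),
                          ('1.3', 'c', 'cat2'), ('1.4', 'd', 'cat3');
""")

def quote_ident(name):
    """Quote a value for use as a column alias (hypothetical helper)."""
    return '"' + name.replace('"', '""') + '"'

# Read the categories once, then generate one 0/1 column per category.
categories = [row[0] for row in
              conn.execute("SELECT DISTINCT category FROM table2 ORDER BY category")]
flag_columns = ",\n       ".join(
    f"MAX(CASE WHEN t2.category = ? THEN 1 ELSE 0 END) AS {quote_ident(c)}"
    for c in categories
)

# Unpivot the three code columns, join to the lookup table, aggregate per id.
pivot_sql = f"""
SELECT u.id,
       {flag_columns}
FROM (SELECT id, code1 AS code FROM table1
      UNION ALL SELECT id, code2 FROM table1
      UNION ALL SELECT id, code3 FROM table1) AS u
LEFT JOIN table2 AS t2 ON t2.code = u.code
GROUP BY u.id
ORDER BY u.id
"""
pivot_rows = conn.execute(pivot_sql, categories).fetchall()
print(categories)  # ['cat1', 'cat2', 'cat3']
print(pivot_rows)  # [(10, 1, 1, 0)]
```

The category values are passed as bound parameters; only the column aliases are spliced into the SQL text, which is why they go through the quoting helper.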
These answers assume that all 3 codes are available in table 2. If not, then you should use OUTER joins instead of INNER.
Output 1 can be achieved like this:
select t1.ID,
cd1.[Group] as Group1, -- Group is a reserved word, so it must be bracketed
cd2.[Group] as Group2,
cd3.[Group] as Group3
from table1 t1
inner join table2 cd1
on t1.Code1 = cd1.Code
inner join table2 cd2
on t1.Code2 = cd2.Code
inner join table2 cd3
on t1.Code3 = cd3.Code
Output 2 is trickier. Since you want a column for every row in Table2, you could write SQL that writes SQL.
Basically start with this base statement:
select t1.ID,
-- THE BELOW WILL BE GENERATED ONCE PER DISTINCT CATEGORY
Case when cd1.Category = '' OR
cd2.Category = '' OR
cd3.Category = '' then convert(bit,1) else 0 end as '',
-- END GENERATED CODE
from table1 t1
inner join table2 cd1
on t1.Code1 = cd1.Code
inner join table2 cd2
on t1.Code2 = cd2.Code
inner join table2 cd3
on t1.Code3 = cd3.Code
then you can generate the code in the middle like this:
select distinct 'Case when cd1.Category = '''+t2.Category+''' OR
cd2.Category = '''+t2.Category+''' OR
cd3.Category = '''+t2.Category+''' then convert(bit,1) else 0 end as ['+t2.Category+'],'
from table2 t2
Paste those results into the original SQL statement (strip off the trailing comma) and you should be good to go.
We can use the PIVOT feature and build the query dynamically, something like below:
Query 1
Select * from
(SELECT Id, Code, GroupCode
FROM Table2 join Table1
ON Table1.Code1 = Table2.Code
OR Table1.Code2 = Table2.Code
OR Table1.Code3 = Table2.Code
) ps
PIVOT
(
Max (GroupCode)
FOR Code IN
( [1.1], [1.2], [1.3])
) AS Result
Query 2
Select * from
(SELECT Id, GroupCode, Category
FROM Table2 join Table1
ON Table1.Code1 = Table2.Code
OR Table1.Code2 = Table2.Code
OR Table1.Code3 = Table2.Code
) ps
PIVOT
(
Count (GroupCode)
FOR Category IN
( [cat1], [cat2], [cat3])
) AS Result
Unfortunately you're stuck with a bad design for Table1. A better approach would have been to have 3 rows for ID 10.
But, given your current design, your query will look something like this:
SELECT ID, G1.[Group] Group1, G2.[Group] Group2, G3.[Group] Group3
FROM Table1 T1
INNER JOIN Table2 G1 ON T1.Code1 = G1.Code
INNER JOIN Table2 G2 ON T1.Code2 = G2.Code
INNER JOIN Table2 G3 ON T1.Code3 = G3.Code
and
SELECT ID, G1.Category Cat1, G2.Category Cat2, G3.Category Cat3
FROM Table1 T1
INNER JOIN Table2 G1 ON T1.Code1 = G1.Code
INNER JOIN Table2 G2 ON T1.Code2 = G2.Code
INNER JOIN Table2 G3 ON T1.Code3 = G3.Code
The PIVOT and CROSS APPLY keywords in MSSQL would help you here, though it's not exactly clear what you are trying to accomplish. CROSS APPLY joins each row to a correlated subquery, so each join can produce different output; PIVOT does a crosstab on your data.
For table 1 it might be easier if you mash it together into a more normalized style.
WITH cteTab1 (Id, Code) AS
(
SELECT Id, Code1 FROM Table1
UNION ALL
SELECT Id, Code2 FROM Table1
UNION ALL
SELECT Id, Code3 FROM Table1)
SELECT *
FROM Table2 INNER JOIN cteTab1 ON Table2.Code = cteTab1.Code

Is it possible to join two tables of multiple rows by only the first ID in each table?

Let's say I have two tables of data:
Table A
ID Colors
-- ------
1 Blue
1 Green
1 Red
Table B
ID States
-- ------
1 MD
1 VA
1 WY
1 CA
Is it possible to join these tables so that I get the following instead of 12 rows?
ID Colors States
-- ------ ------
1 Blue MD
1 Green VA
1 Red WY
1 CA
There's no association between the colors and states columns and the order of the columns doesn't matter. (e.g. Blue can be next to MD, VA, WY, or CA) The number of items in each column (Colors or States) per ID is not equal.
Thanks.
You can do this by using row_number() to create a fake join column:
select coalesce(a.id, b.id) as id, a.colors, b.states
from (select a.*, row_number() over (order by id) as seqnum
from a
) a full outer join
(select b.*, row_number() over (order by id) as seqnum
from b
) b
on b.seqnum = a.seqnum
Actually, in Oracle, you can also just use rownum:
select coalesce(a.id, b.id) as id, a.colors, b.states
from (select a.*, rownum as seqnum
from a
) a full outer join
(select b.*, rownum as seqnum
from b
) b
on b.seqnum = a.seqnum
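The fake join column can be tried out with Python's sqlite3 module. In this sketch the full outer join is emulated with a list of (id, seqnum) slots and two left joins, because older SQLite builds lack FULL OUTER JOIN. Partitioning by id and ordering alphabetically are assumptions added so that the result is deterministic and also works with more than one id; window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, colors TEXT);
CREATE TABLE b (id INTEGER, states TEXT);
INSERT INTO a VALUES (1, 'Blue'), (1, 'Green'), (1, 'Red');
INSERT INTO b VALUES (1, 'MD'), (1, 'VA'), (1, 'WY'), (1, 'CA');
""")

# Number the rows of each table per id, then line them up by that number.
query = """
WITH a_n AS (SELECT id, colors,
                    ROW_NUMBER() OVER (PARTITION BY id ORDER BY colors) AS seqnum
             FROM a),
     b_n AS (SELECT id, states,
                    ROW_NUMBER() OVER (PARTITION BY id ORDER BY states) AS seqnum
             FROM b),
     slots AS (SELECT id, seqnum FROM a_n
               UNION
               SELECT id, seqnum FROM b_n)
SELECT s.id, a_n.colors, b_n.states
FROM slots AS s
LEFT JOIN a_n ON a_n.id = s.id AND a_n.seqnum = s.seqnum
LEFT JOIN b_n ON b_n.id = s.id AND b_n.seqnum = s.seqnum
ORDER BY s.id, s.seqnum
"""
paired = conn.execute(query).fetchall()
for row in paired:
    print(row)
# (1, 'Blue', 'CA')
# (1, 'Green', 'MD')
# (1, 'Red', 'VA')
# (1, None, 'WY')
```

The result has four rows (the larger of the two counts) instead of the twelve a plain join on id would give.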
You can also use a CTE (Common Table Expression), like so:
WITH TableA (ID, Colors) AS
(
SELECT "ID", Colors
FROM DatabaseName.TableA
)
, Joined AS (
SELECT
a.ID AS "AID",
a.Colors,
b.ID AS "BID",
b."States"
FROM
TableA AS a
RIGHT OUTER JOIN DatabaseName.TableB AS b ON a.ID = b.ID
)
SELECT
AID,
Colors,
States
FROM
Joined

SQL Server - Providing priority to where clause conditions

Please consider the following SQL.
create table #t1 (site int, id int, name varchar(2))
create table #t2 (site int, id int, mark int)
insert into #t1
select 1,1,'A'
union select 1,2,'B'
union select 1,3,'C'
union select 2,2,'D'
union select 2,3,'C'
insert into #t2
select 1,1,10
union select 1,2,20
union select 0,3,30
union select 1,3,40
union select 2,3,40
union select 2,3,40
select distinct a.site, a.id,a.name,b.mark
from #t1 a
inner join #t2 b
on (a.site =b.site or b.site = 0) and a.id = b.id
where a.site=1
It produces the following result
site id name mark
----------------------------
1 1 A 10
1 2 B 20
1 3 C 30
1 3 C 40
It's correct.
But I want each person's data exactly once. The SQL should first check whether there is an entry for the person in #t2 for the specific site. If an entry is found, use it. If not, the person's mark should come from the entry with the same id in site 0.
In this case, I want the result as follows.
site id name mark
----------------------------
1 1 A 10
1 2 B 20
1 3 C 40
But if (1,3,40) isn't in #t2, The result should be as follows.
site id name mark
----------------------------
1 1 A 10
1 2 B 20
1 3 C 30
How can I do this?
I can do it using a Common Table Expression, but I'll run it on about 100 million rows, so please suggest a faster way.
You can roll all of the conditions into the on clause:
declare @target_site as Int = 1
select distinct a.site, a.id, a.name, b.mark
from #t1 as a inner join
#t2 as b on a.site = @target_site and a.id = b.id and
( a.site = b.site or ( b.site = 0 and not exists ( select 42 from #t2 where site = @target_site and id = a.id ) ) )
Outer join to the #t2 table twice, and use a subquery to ensure that only records that have a site match or a site-0 entry are included.
Select distinct a.site, a.id, a.name,
coalesce(sm.mark, zs.mark) mark
from #t1 a
Left Join #t2 sm -- for site match
on sm.id = a.id
And sm.site = a.site
Left Join #t2 zs -- for zero site
on zs.id = a.id
And zs.site = 0
Where Exists (Select * From #t2
Where id = a.id
And Site In (a.Site, 0))
And a.site=1
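The "site-specific row first, site 0 as fallback" behaviour can be checked with a small sketch in Python's sqlite3 module. The function name marks_for_site is made up for illustration, and the EXISTS filter is written as an equivalent NULL check on the two joined aliases. The data is the sample from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (site INTEGER, id INTEGER, name TEXT);
CREATE TABLE t2 (site INTEGER, id INTEGER, mark INTEGER);
INSERT INTO t1 VALUES (1, 1, 'A'), (1, 2, 'B'), (1, 3, 'C'), (2, 2, 'D'), (2, 3, 'C');
INSERT INTO t2 VALUES (1, 1, 10), (1, 2, 20), (0, 3, 30), (1, 3, 40), (2, 3, 40);
""")

# sm = site match, zs = zero site; COALESCE prefers the site-specific mark.
QUERY = """
SELECT DISTINCT a.site, a.id, a.name, COALESCE(sm.mark, zs.mark) AS mark
FROM t1 AS a
LEFT JOIN t2 AS sm ON sm.id = a.id AND sm.site = a.site
LEFT JOIN t2 AS zs ON zs.id = a.id AND zs.site = 0
WHERE a.site = ?
  AND (sm.id IS NOT NULL OR zs.id IS NOT NULL)
ORDER BY a.id
"""

def marks_for_site(connection, site):
    """Return one row per person for the given site (hypothetical helper)."""
    return connection.execute(QUERY, (site,)).fetchall()

with_site_row = marks_for_site(conn, 1)
print(with_site_row)   # [(1, 1, 'A', 10), (1, 2, 'B', 20), (1, 3, 'C', 40)]

# Remove the site-specific entry for id 3: the site-0 mark now applies.
conn.execute("DELETE FROM t2 WHERE site = 1 AND id = 3")
with_fallback = marks_for_site(conn, 1)
print(with_fallback)   # [(1, 1, 'A', 10), (1, 2, 'B', 20), (1, 3, 'C', 30)]
```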