SQL: rows that share several values on a specific column


I have a table Visited with 2 columns:
ID | City
ID is an integer, City is a string.
Note that none of the columns is a key by itself - we can have the same ID visiting several cities, and several different IDs in the same city.
Given a specific ID, I want to return all the IDs in the table that visited at least half of the places that the input ID did (not including themselves)
edit: We only count places that are the same.
so if
ID 1 visited cities a,b,c.
ID 2 visited b,c,d.
ID 3 visited c,d,e.
then for ID=1 we return only [2], because out of the three cities ID1 visited, ID3 visited only one

Inner join the visited table with the list of cities visited by the specific id, then select ids with at least half of the number of rows when grouped by id.
with u as
(select city as visitedBySpecificId from visited where id = *specificId*),
v as
(select * from visited inner join u on city = visitedBySpecificId where id <> *specificId*)
(select id from v group by id having count(*) >= (select count(*) from u)/2.0)
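A minimal sketch of this query run against the sample data from the question, using Python's built-in sqlite3 module purely as a convenient harness. The SQLite dialect and the function name `similar_visitors` are assumptions, not part of the original answer. It also adds `distinct`, so repeated (ID, City) rows would not be counted twice.

```python
import sqlite3

def similar_visitors(conn, specific_id):
    """Return IDs that visited at least half of the cities specific_id visited."""
    sql = """
        with u as (
            select distinct city from visited where id = :sid
        ),
        v as (
            select distinct visited.id, visited.city
            from visited
            inner join u on visited.city = u.city
            where visited.id <> :sid
        )
        select id from v
        group by id
        having count(*) >= (select count(*) from u) / 2.0
        order by id
    """
    return [row[0] for row in conn.execute(sql, {"sid": specific_id})]

conn = sqlite3.connect(":memory:")
conn.execute("create table visited (id integer, city text)")
conn.executemany(
    "insert into visited values (?, ?)",
    [(1, "a"), (1, "b"), (1, "c"),
     (2, "b"), (2, "c"), (2, "d"),
     (3, "c"), (3, "d"), (3, "e")],
)
print(similar_visitors(conn, 1))  # [2]
```

ID 1 visited three cities, so the threshold is 1.5 shared cities: ID 2 shares two (b, c) and qualifies, ID 3 shares only one (c) and does not.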

Join them and compare the counts.
create table suspect_tracking (id int, city varchar(30))
insert into suspect_tracking values
(1, 'Brussels'), (1,'London'), (1,'Paris')
, (1,'New York'), (1,'Bangkok'), (1, 'Hong Kong')
, (1,'Dubai'), (1,'Singapoor'), (1,'Rome')
, (1,'Macau'), (1, 'Istanbul'), (1,'Kuala Lumpur')
, (1,'Dehli'), (1,'Tokyo'), (1,'Moscow')
, (2,'New York'), (2,'Bangkok'), (2, 'Hong Kong')
, (2,'Dubai'), (2,'Singapoor'), (2,'Rome')
, (2,'Macau'), (2, 'Istanbul'), (2,'Kuala Lumpur')
, (3,'Macau'), (3, 'Istanbul'), (3,'Kuala Lumpur')
, (3,'Dehli'), (3,'Tokyo'), (3,'Moscow')
with cte_suspects as (
select id, city
from suspect_tracking
group by id, city
)
, cte_prime_suspect as (
select distinct id, city
from suspect_tracking
where id = 1
)
, cte_prime_total as (
select id, count(city) as cities
from cte_prime_suspect
group by id
)
select sus.id
from cte_prime_suspect prime
join cte_prime_total primetot
on primetot.id = prime.id
join cte_suspects sus
on sus.city = prime.city and sus.id <> prime.id
group by prime.id, sus.id, primetot.cities
having count(sus.city) >= primetot.cities / 2.0
| id |
| -: |
| 2 |
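The threshold in the `having` clause depends on how the division is evaluated: in SQL Server, as in SQLite, dividing one integer by another truncates, so an odd city count rounds the threshold down. A small sketch of the difference, using Python's sqlite3 module as a stand-in engine (an assumption; the behaviour should be confirmed on the target database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Integer / integer truncates; dividing by 2.0 keeps the fraction.
int_half, real_half = conn.execute("select 15 / 2, 15 / 2.0").fetchone()
print(int_half, real_half)  # 7 7.5

# With 15 cities, a suspect sharing 7 of them passes the truncated test
# but not the exact one.
passes_truncated, passes_exact = conn.execute(
    "select 7 >= 15 / 2, 7 >= 15 / 2.0"
).fetchone()
print(passes_truncated, passes_exact)  # 1 0
```

With the sample data above the result is the same either way (ID 2 shares 9 cities, ID 3 shares 6), but a suspect sharing exactly 7 of the 15 cities would be treated differently.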

Related

How to synthesize attribute for joined tables

I have a view defined like this:
CREATE VIEW [dbo].[PossiblyMatchingContracts] AS
SELECT
C.UniqueID,
CC.UniqueID AS PossiblyMatchingContracts
FROM [dbo].AllContracts AS C
INNER JOIN [dbo].AllContracts AS CC
ON C.SecondaryMatchCodeFB = CC.SecondaryMatchCodeFB
OR C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeLB
OR C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeBB
OR C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeBB
OR C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeLB
WHERE C.UniqueID NOT IN
(
SELECT UniqueID FROM [dbo].DefinitiveMatches
)
AND C.AssociatedUser IS NULL
AND C.UniqueID <> CC.UniqueID
This basically finds contracts where, for example, the first name and the birthday match. This works great. Now I want to add a synthetic attribute to each row, with the value taken from only one source row.
Let me give you an example to make it clearer. Suppose I have the following table:
UniqueID | FirstName | LastName | Birthday
1 | Peter | Smith | 1980-11-04
2 | Peter | Gray | 1980-11-04
3 | Peter | Gray-Smith| 1980-11-04
4 | Frank | May | 1985-06-09
5 | Frank-Paul| May | 1985-06-09
6 | Gina | Ericson | 1950-11-04
The resulting view should look like this:
UniqueID | PossiblyMatchingContracts | SyntheticID
1 | 2 | PeterSmith1980-11-04
1 | 3 | PeterSmith1980-11-04
2 | 1 | PeterSmith1980-11-04
2 | 3 | PeterSmith1980-11-04
3 | 1 | PeterSmith1980-11-04
3 | 2 | PeterSmith1980-11-04
4 | 5 | FrankMay1985-06-09
5 | 4 | FrankMay1985-06-09
6 | NULL | NULL [or] GinaEricson1950-11-04
Notice that the SyntheticID column uses ONLY values from one of the matching source rows. It doesn't matter which one. I am exporting this view to another application and need to be able to identify each "match group" afterwards.
Is it clear what I mean? Any ideas how this could be done in sql?
Maybe it helps to elaborate a bit on the actual use case:
I am importing contracts from different systems. To account for the possibility of typos or people that have married but the last name was only updated in one system, I need to find so called 'possible matches'. Two or more contracts are considered a possible match if they contain the same birthday plus the same first, last or birth name. That implies, that if contract A matches contract B, contract B also matches contract A.
The target system uses multivalue reference attributes to store these relationships. The ultimate goal is to create user objects for these contracts. The first catch is that there shall be only one user object for multiple matching contracts, which is why I create these matches in the view. The second catch is that user objects are created by workflows, which run in parallel for each contract. To avoid creating multiple user objects for matching contracts, each workflow needs to check whether there is already a matching user object, or another workflow that is about to create one. Because the workflow engine is extremely slow compared to SQL, the workflows should not repeat the whole matching test. So the idea is to let the workflow check only for the 'syntheticID'.

I have solved it with a multi-step approach:
1. Create the list of possible first-level matches for each contract
2. Create the base group list, assigning a different group to each contract (as if they were not related to anybody)
3. Iterate over the matches list, updating the group list when more contracts need to be added to a group
4. Recursively build up the SyntheticID from the final group list
5. Output results
First of all, let me explain what I have understood, so you can tell me whether my approach is correct.
1) Matching propagates in a "cascade"
I mean, if "Peter Smith" is grouped with "Peter Gray", then all Smiths and all Grays are related (if they have the same birth date), so Luke Smith can be in the same group as John Gray.
2) I have not understood what you mean by "birth name"
You say contracts match on "first, last or birth name". Sorry, I'm Italian and I thought birth name and first name were the same; also, there is no such column in your data. Maybe it is related to the dash between names?
When FirstName is Frank-Paul, does it mean it should match both Frank and Paul?
When LastName is Gray-Smith, does it mean it should match both Gray and Smith?
In the following code I have simply ignored this problem, but it could be handled if needed (I already made an attempt: splitting the names, unpivoting them and treating them as a double match).
Step Zero: some declaration and prepare base data
declare @cli as table (UniqueID int primary key, FirstName varchar(20), LastName varchar(20), Birthday varchar(20))
declare @comb as table (id1 int, id2 int, done bit)
declare @grp as table (ix int identity primary key, grp int, id int, unique (grp,ix))
declare @str_id as table (grp int primary key, SyntheticID varchar(1000))
declare @id1 as int, @g int
;with
t as (
select *
from (values
(1 , 'Peter' , 'Smith' , '1980-11-04'),
(2 , 'Peter' , 'Gray' , '1980-11-04'),
(3 , 'Peter' , 'Gray-Smith', '1980-11-04'),
(4 , 'Frank' , 'May' , '1985-06-09'),
(5 , 'Frank-Paul', 'May' , '1985-06-09'),
(6 , 'Gina' , 'Ericson' , '1950-11-04')
) x (UniqueID , FirstName , LastName , Birthday)
)
insert into @cli
select * from t
Step One: Create the list of possible 1st level matches for each contract
;with
p as(select UniqueID, Birthday, FirstName, LastName from @cli),
m as (
select p.UniqueID UniqueID1, p.FirstName FirstName1, p.LastName LastName1, p.Birthday Birthday1, pp.UniqueID UniqueID2, pp.FirstName FirstName2, pp.LastName LastName2, pp.Birthday Birthday2
from p
join p pp on (pp.Birthday=p.Birthday) and (pp.FirstName = p.FirstName or pp.LastName = p.LastName)
where p.UniqueID<=pp.UniqueID
)
insert into @comb
select UniqueID1,UniqueID2,0
from m
Step Two: Create the base groups list
insert into @grp
select ROW_NUMBER() over(order by id1), id1 from @comb where id1=id2
Step Three: Iterate the matches list updating the group list
Only loop on contracts that have possible matches and updates only if needed
set @id1 = 0
while not(@id1 is null) begin
set @id1 = (select top 1 id1 from @comb where id1<>id2 and done=0)
if not(@id1 is null) begin
set @g = (select grp from @grp where id=@id1)
update g set grp= @g
from @grp g
inner join @comb c on g.id = c.id2
where c.id2<>@id1 and c.id1=@id1
and grp<>@g
update @comb set done=1 where id1=@id1
end
end
Step Four: Build up the SyntheticID
Recursively add ALL (distinct) first and last names of group to SyntheticID.
I used '_' as separator for birth date, first names and last names, and ',' as separator for the list of names to avoid conflicts.
;with
c as(
select c.*, g.grp
from @cli c
join @grp g on g.id = c.UniqueID
),
d as (
select *, row_number() over (partition by g order by t,s) n1, row_number() over (partition by g order by t desc,s desc) n2
from (
select distinct c.grp g, 1 t, FirstName s from c
union
select distinct c.grp, 2, LastName from c
) l
),
r as (
select d.*, cast(CONVERT(VARCHAR(10), t.Birthday, 112) + '_' + s as varchar(1000)) Names, cast(0 as bigint) i1, cast(0 as bigint) i2
from d
join @cli t on t.UniqueID=d.g
where n1=1
union all
select d.*, cast(r.names + IIF(r.t<>d.t,'_',',') + d.s as varchar(1000)), r.n1, r.n2
from d
join r on r.g = d.g and r.n1=d.n1-1
)
insert into @str_id
select g, Names
from r
where n2=1
Step Five: Output results
select c.UniqueID, case when id2=UniqueID then id1 else id2 end PossibleMatchingContract, s.SyntheticID
from @cli c
left join @comb cb on c.UniqueID in(id1,id2) and id1<>id2
left join @grp g on c.UniqueID = g.id
left join @str_id s on s.grp = g.grp
Here are the results:
UniqueID PossibleMatchingContract SyntheticID
1 2 1980-11-04_Peter_Gray,Gray-Smith,Smith
1 3 1980-11-04_Peter_Gray,Gray-Smith,Smith
2 1 1980-11-04_Peter_Gray,Gray-Smith,Smith
2 3 1980-11-04_Peter_Gray,Gray-Smith,Smith
3 1 1980-11-04_Peter_Gray,Gray-Smith,Smith
3 2 1980-11-04_Peter_Gray,Gray-Smith,Smith
4 5 1985-06-09_Frank,Frank-Paul_May
5 4 1985-06-09_Frank,Frank-Paul_May
6 NULL 1950-11-04_Gina_Ericson
I think that in this way the resulting SyntheticID should also be "unique" for each group
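The grouping in steps two and three amounts to finding connected components among the contracts. As an illustration only, here is the same idea as a small union-find in Python rather than T-SQL. The function names and the choice of the smallest UniqueID as the group label are assumptions; the matching rule (same birthday plus same first or last name) follows the code above.

```python
from itertools import combinations

contracts = [
    (1, "Peter", "Smith", "1980-11-04"),
    (2, "Peter", "Gray", "1980-11-04"),
    (3, "Peter", "Gray-Smith", "1980-11-04"),
    (4, "Frank", "May", "1985-06-09"),
    (5, "Frank-Paul", "May", "1985-06-09"),
    (6, "Gina", "Ericson", "1950-11-04"),
]

def is_match(a, b):
    """Same birthday plus same first name or same last name."""
    return a[3] == b[3] and (a[1] == b[1] or a[2] == b[2])

def group_contracts(rows):
    """Map each UniqueID to the smallest UniqueID in its match group."""
    parent = {row[0]: row[0] for row in rows}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(rows, 2):
        if is_match(a, b):
            root_a, root_b = find(a[0]), find(b[0])
            if root_a != root_b:
                # Keep the smaller ID as the root so labels are stable.
                parent[max(root_a, root_b)] = min(root_a, root_b)
    return {row[0]: find(row[0]) for row in rows}

groups = group_contracts(contracts)
print(groups)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4, 6: 6}
```

Any single row of a group (for instance the one whose UniqueID is the label) can then supply the values for that group's SyntheticID.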

This creates a synthetic value and is easy to change to suit your needs.
DECLARE @T TABLE (
UniqueID INT
,FirstName VARCHAR(200)
,LastName VARCHAR(200)
,Birthday DATE
)
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 1,'Peter','Smith','1980-11-04'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 2,'Peter','Gray','1980-11-04'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 3,'Peter','Gray-Smith','1980-11-04'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 4,'Frank','May','1985-06-09'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 5,'Frank-Paul','May','1985-06-09'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 6,'Gina','Ericson','1950-11-04'
DECLARE @PossibleMatches TABLE (UniqueID INT,[PossibleMatch] INT,SynKey VARCHAR(2000)
)
INSERT INTO @PossibleMatches
SELECT t1.UniqueID [UniqueID],t2.UniqueID [Possible Matches],'Ln=' + t1.LastName + ' Fn=' + t1.FirstName + ' DoB=' + CONVERT(VARCHAR,t1.Birthday,102) [SynKey]
FROM @T t1
INNER JOIN @T t2 ON t1.Birthday=t2.Birthday
AND t1.FirstName=t2.FirstName
AND t1.LastName=t2.LastName
AND t1.UniqueID<>t2.UniqueID
INSERT INTO @PossibleMatches
SELECT t1.UniqueID [UniqueID],t2.UniqueID [Possible Matches],'Fn=' + t1.FirstName + ' DoB=' + CONVERT(VARCHAR,t1.Birthday,102) [SynKey]
FROM @T t1
INNER JOIN @T t2 ON t1.Birthday=t2.Birthday
AND t1.FirstName=t2.FirstName
AND t1.UniqueID<>t2.UniqueID
INSERT INTO @PossibleMatches
SELECT t1.UniqueID,t2.UniqueID,'Ln=' + t1.LastName + ' DoB=' + CONVERT(VARCHAR,t1.Birthday,102) [SynKey]
FROM @T t1
INNER JOIN @T t2 ON t1.Birthday=t2.Birthday
AND t1.LastName=t2.LastName
AND t1.UniqueID<>t2.UniqueID
INSERT INTO @PossibleMatches
SELECT t1.UniqueID,pm.UniqueID,'Ln=' + t1.LastName + ' Fn=' + t1.FirstName + ' DoB=' + CONVERT(VARCHAR,t1.Birthday,102) [SynKey]
FROM @T t1
LEFT JOIN @PossibleMatches pm on pm.UniqueID=t1.UniqueID
WHERE pm.UniqueID IS NULL
SELECT *
FROM @PossibleMatches
ORDER BY UniqueID,[PossibleMatch]

I think this will work for you
SELECT
C.UniqueID,
CC.UniqueID AS PossiblyMatchingContracts,
FIRST_VALUE(CC.FirstName+CC.LastName+CC.Birthday)
OVER (PARTITION BY C.UniqueID ORDER BY CC.UniqueID) as SyntheticID
FROM
[dbo].AllContracts AS C INNER JOIN
[dbo].AllContracts AS CC ON
C.SecondaryMatchCodeFB = CC.SecondaryMatchCodeFB OR
C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeLB OR
C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeBB OR
C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeBB OR
C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeLB
WHERE
C.UniqueID NOT IN(
SELECT UniqueID FROM [dbo].DefinitiveMatches)
AND C.AssociatedUser IS NULL

You can try this:
SELECT
C.UniqueID,
CC.UniqueID AS PossiblyMatchingContracts,
FIRST_VALUE(CC.FirstName+CC.LastName+CC.Birthday)
OVER (PARTITION BY C.UniqueID ORDER BY CC.UniqueID) as SyntheticID
FROM
[dbo].AllContracts AS C
INNER JOIN
[dbo].AllContracts AS CC
ON
C.SecondaryMatchCodeFB = CC.SecondaryMatchCodeFB
OR
C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeLB
OR
C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeBB
OR
C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeBB
OR
C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeLB
WHERE
C.UniqueID NOT IN
(
SELECT UniqueID FROM [dbo].DefinitiveMatches
)
AND
C.AssociatedUser IS NULL
This will generate one extra row per contract (because we left out C.UniqueID <> CC.UniqueID) but will give you the correct solution.

Following an example with some example data extracted from your original post. The idea: Generate all SyntheticID in a CTE, query all records with a "PossibleMatch" and Union it with all records which are not yet included:
DECLARE @t TABLE(
UniqueID int
,FirstName nvarchar(20)
,LastName nvarchar(20)
,Birthday datetime
)
INSERT INTO @t VALUES (1, 'Peter', 'Smith', '1980-11-04');
INSERT INTO @t VALUES (2, 'Peter', 'Gray', '1980-11-04');
INSERT INTO @t VALUES (3, 'Peter', 'Gray-Smith', '1980-11-04');
INSERT INTO @t VALUES (4, 'Frank', 'May', '1985-06-09');
INSERT INTO @t VALUES (5, 'Frank-Paul', 'May', '1985-06-09');
INSERT INTO @t VALUES (6, 'Gina', 'Ericson', '1950-11-04');
WITH ctePrep AS(
SELECT UniqueID, FirstName, LastName, BirthDay,
ROW_NUMBER() OVER (PARTITION BY FirstName, BirthDay ORDER BY FirstName, BirthDay) AS k,
FirstName+LastName+CONVERT(nvarchar(10), Birthday, 126) AS SyntheticID
FROM @t
),
cteKeys AS(
SELECT FirstName, BirthDay, SyntheticID
FROM ctePrep
WHERE k = 1
),
cteFiltered AS(
SELECT
C.UniqueID,
CC.UniqueID AS PossiblyMatchingContracts,
keys.SyntheticID
FROM @t AS C
JOIN @t AS CC ON C.FirstName = CC.FirstName
AND C.Birthday = CC.Birthday
JOIN cteKeys AS keys ON keys.FirstName = c.FirstName
AND keys.Birthday = c.Birthday
WHERE C.UniqueID <> CC.UniqueID
)
SELECT UniqueID, PossiblyMatchingContracts, SyntheticID
FROM cteFiltered
UNION ALL
SELECT UniqueID, NULL, FirstName+LastName+CONVERT(nvarchar(10), Birthday, 126) AS SyntheticID
FROM @t
WHERE UniqueID NOT IN (SELECT UniqueID FROM cteFiltered)
Hope this helps. The result looked OK to me:
UniqueID PossiblyMatchingContracts SyntheticID
---------------------------------------------------------------
2 1 PeterSmith1980-11-04
3 1 PeterSmith1980-11-04
1 2 PeterSmith1980-11-04
3 2 PeterSmith1980-11-04
1 3 PeterSmith1980-11-04
2 3 PeterSmith1980-11-04
4 NULL FrankMay1985-06-09
5 NULL Frank-PaulMay1985-06-09
6 NULL GinaEricson1950-11-04

Tested in SSMS, it works perfect. :)
--create table structure
create table #temp
(
uniqueID int,
firstname varchar(15),
lastname varchar(15),
birthday date
)
--insert data into the table
insert #temp
select 1, 'peter','smith','1980-11-04'
union all
select 2, 'peter','gray','1980-11-04'
union all
select 3, 'peter','gray-smith','1980-11-04'
union all
select 4, 'frank','may','1985-06-09'
union all
select 5, 'frank-paul','may','1985-06-09'
union all
select 6, 'gina','ericson','1950-11-04'
select * from #temp
--solution is as below
select ab.uniqueID
, PossiblyMatchingContracts
, c.firstname+c.lastname+cast(c.birthday as varchar) as synID
from
(
select a.uniqueID
, case
when a.uniqueID < min(b.uniqueID)over(partition by a.uniqueid)
then a.uniqueID
else min(b.uniqueID)over(partition by a.uniqueid)
end as SmallestID
, b.uniqueID as PossiblyMatchingContracts
from #temp a
left join #temp b
on (a.firstname = b.firstname OR a.lastname = b.lastname) AND a.birthday = b.birthday AND a.uniqueid <> b.uniqueID
) as ab
left join #temp c
on ab.SmallestID = c.uniqueID

Say we have following table (a VIEW in your case):
UniqueID PossiblyMatchingContracts SyntheticID
1 2 G1
1 3 G2
2 1 G3
2 3 G4
3 1 G4
3 4 G6
4 5 G7
5 4 G8
6 NULL G9
In your case you can set the initial SyntheticID to a string like PeterSmith1980-11-04, using UniqueID for each line. The following recursive CTE query divides all lines into unconnected groups and selects MAX(SyntheticID) within each group as the new SyntheticID for all lines in that group.
WITH CTE AS
(
SELECT CAST(','+CAST(UniqueID AS Varchar(100)) +','+ CAST(PossiblyMatchingContracts as Varchar(100))+',' as Varchar(MAX)) as GroupCont,
SyntheticID
FROM PossiblyMatchingContracts
UNION ALL
SELECT CAST(GroupCont+CAST(UniqueID AS Varchar(100)) +','+ CAST(PossiblyMatchingContracts as Varchar(100))+',' AS Varchar(MAX)) as GroupCont,
pm.SyntheticID
FROM CTE
JOIN PossiblyMatchingContracts as pm
ON
(
CTE.GroupCont LIKE '%,'+CAST(pm.UniqueID AS Varchar(100))+',%'
OR
CTE.GroupCont LIKE '%,'+CAST(pm.PossiblyMatchingContracts AS Varchar(100))+',%'
)
AND NOT
(
CTE.GroupCont LIKE '%,'+CAST(pm.UniqueID AS Varchar(100))+',%'
AND
CTE.GroupCont LIKE '%,'+CAST(pm.PossiblyMatchingContracts AS Varchar(100))+',%'
)
)
SELECT pm.UniqueID,
pm.PossiblyMatchingContracts,
ISNULL(
(SELECT MAX(SyntheticID) FROM CTE WHERE
(
CTE.GroupCont LIKE '%,'+CAST(pm.UniqueID AS Varchar(100))+',%'
OR
CTE.GroupCont LIKE '%,'+CAST(pm.PossiblyMatchingContracts AS Varchar(100))+',%'
))
,pm.SyntheticID) as SyntheticID
FROM PossiblyMatchingContracts pm

SQL Select with Priority

I need to select top 1 most valid discount for a given FriendId.
I have the following tables:
DiscountTable - describes different discount types
DiscountId, Percent, Type, Rank
1 , 20 , Friend, 2
2 , 10 , Overwrite, 1
Then I have another two tables (both list FriendIds)
Friends
101
102
103
Overwrites
101
105
I have to select top 1 most valid discount for a given FriendId. So for the above data this would be sample output
Id = 101 => gets "Overwrite" discount (higher rank)
Id = 102 => gets "Friend" discount (only in friends table)
Id = 103 => gets "Friend" discount (only in friends table)
Id = 105 => gets "Overwrite" discount
Id = 106 => gets NO discount, as it exists in neither the Friends nor the Overwrites table
INPUT => SINGLE friendId (int).
OUTPUT => Single DISCOUNT Record (DiscountId, Percent, Type)
Overwrites and Friend tables are the same. They only hold list of Ids (single column)

Having multiple tables of identical structure is usually bad practice, a single table with ID and Type would suffice, you could then use it in a JOIN to your DiscountTable:
;WITH cte AS (SELECT ID,[Type] = 'Friend'
FROM Friends
UNION ALL
SELECT ID,[Type] = 'Overwrite'
FROM Overwrites
)
SELECT TOP 1 DT.DiscountId, DT.[Percent], DT.[Type]
FROM cte a
JOIN DiscountTable DT
ON a.[Type] = DT.[Type]
WHERE a.ID = 105
ORDER BY DT.[Rank]
Note, non-existent ID values will not return.
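A runnable sketch of the same idea that returns the whole discount record, using Python's sqlite3 module as a stand-in database, so `LIMIT 1` takes the place of `TOP 1`. The column name `FriendId` in both list tables and the function name `best_discount` are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table DiscountTable (DiscountId int, Percent int, Type text, Rank int);
    insert into DiscountTable values (1, 20, 'Friend', 2), (2, 10, 'Overwrite', 1);
    create table Friends (FriendId int);
    insert into Friends values (101), (102), (103);
    create table Overwrites (FriendId int);
    insert into Overwrites values (101), (105);
""")

def best_discount(conn, friend_id):
    """Return (DiscountId, Percent, Type) with the best rank, or None."""
    sql = """
        with eligible as (
            select FriendId, 'Friend' as Type from Friends
            union all
            select FriendId, 'Overwrite' from Overwrites
        )
        select dt.DiscountId, dt.Percent, dt.Type
        from eligible e
        join DiscountTable dt on dt.Type = e.Type
        where e.FriendId = ?
        order by dt.Rank      -- rank 1 is the highest priority
        limit 1
    """
    return conn.execute(sql, (friend_id,)).fetchone()

for fid in (101, 102, 105, 106):
    print(fid, best_discount(conn, fid))
# 101 (2, 10, 'Overwrite')
# 102 (1, 20, 'Friend')
# 105 (2, 10, 'Overwrite')
# 106 None
```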

This will get you all the FriendIds and the associated discount of the highest rank. It's an older hack that doesn't require TOP or row numbering.
select
elig.FriendId,
min([Rank] * 10000 + DiscountId) % 10000 as DiscountId,
min([Rank] * 10000 + [Percent]) % 10000 as [Percent]
from
DiscountTable as dt
inner join (
select FriendId, 'Friend' as Type from Friends union all
select FriendId, 'Overwrite' from Overwrites
) as elig /* for eligible? */
on elig.Type = dt.Type
group by
elig.FriendId
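The arithmetic behind this trick: multiplying the rank by 10000 and adding the payload makes `min` pick the value belonging to the lowest rank, and the modulo then strips the rank back off. A sketch in plain Python with the two discounts from the question; it assumes every payload value is a non-negative integer below 10000, which the trick silently requires.

```python
SCALE = 10_000  # payload values must stay below this

# (rank, discount_id, percent) for an ID that is eligible for both discounts
eligible = [(2, 1, 20), (1, 2, 10)]

def pick(rows, payload_index):
    """Pack rank and payload into one integer, take the min, unpack the payload."""
    packed = min(row[0] * SCALE + row[payload_index] for row in rows)
    return packed % SCALE

# Discount id: min(20001, 10002) = 10002 -> 2
# Percent:     min(20020, 10010) = 10010 -> 10
print(pick(eligible, 1), pick(eligible, 2))  # 2 10
```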

create table discounts (id int, percent1 int, type1 varchar(12), rank1 int)
insert into discounts
values (1 , 20 , 'Friend', 2),
(2 , 10 , 'Overwrite', 1)
create table friends (friendid int)
insert into friends values (101),(102), (103)
create table overwrites (overwriteid int)
insert into overwrites values (101),(105)
select ids, isnull(percent1,0) as discount from (
select case when friendid IS null and overwriteid is null then 'no discount'
when friendid is null and overwriteid is not null then 'overwrite'
when friendid is not null and overwriteid is null then 'friend'
when friendid is not null and overwriteid is not null then (select top 1 TYPE1 from discounts order by rank1)
else '' end category
,ids
from tcase left outer join friends
on tcase.ids = friends.friendid
left join overwrites
on tcase.ids = overwrites.overwriteid
) category1 left join discounts
on category1.category=discounts.type1

pivot/cross distinct row data to colums with postgres

I have distinct data that I want to pivot/cross, for instance
Given table A with
name tag
Bob sport
Bob action
Bob comedy
Tom action
Tom drama
Sue sport
I'd like a query that transforms the data to
name sport action comedy drama
Bob 1 1 1 0
Tom 0 1 0 1
Sue 1 0 0 0
For any number n of distinct tags.
How would I create this transformation using sql if I didn't know the distinct tags before I begin.

Some simple solutions adequate for some cases. Using this table (SQL Fiddle is not working right now)
create table a (
name text,
tag text
);
insert into a (name, tag) values
('Bob', 'sport'),
('Bob', 'action'),
('Bob', 'comedy'),
('Tom', 'action'),
('Tom', 'drama'),
('Sue', 'sport');
A simple arrays aggregation if they can be split somewhere else
select
name,
array_agg(tag order by tag) as tags,
array_agg(total order by tag) as totals
from (
select name, tag, count(a.name) as total
from
a
right join (
(select distinct tag from a) t
cross join
(select distinct name from a) n
) c using (name, tag)
group by name, tag
) s
group by name
order by 1
;
name | tags | totals
------+-----------------------------+-----------
Bob | {action,comedy,drama,sport} | {1,1,0,1}
Sue | {action,comedy,drama,sport} | {0,0,0,1}
Tom | {action,comedy,drama,sport} | {1,0,1,0}
For JSON aware clients a set of JSON objects
select format(
'{%s:{%s}}',
to_json(name),
string_agg(o, ',')
)::json as o
from (
select name,
format(
'%s:%s',
to_json(tag),
to_json(count(a.name))
) as o
from
a
right join (
(select distinct tag from a) t
cross join
(select distinct name from a) n
) c using (name, tag)
group by name, tag
) s
group by name
;
o
-----------------------------------------------------
{"Bob":{"action":1,"comedy":1,"drama":0,"sport":1}}
{"Sue":{"action":0,"comedy":0,"drama":0,"sport":1}}
{"Tom":{"action":1,"comedy":0,"drama":1,"sport":0}}
or a single JSON object
select format('{%s}', string_agg(o, ','))::json as o
from (
select format(
'%s:{%s}',
to_json(name),
string_agg(o, ',')
) as o
from (
select name,
format(
'%s:%s',
to_json(tag),
to_json(count(a.name))
) as o
from
a
right join (
(select distinct tag from a) t
cross join
(select distinct name from a) n
) c using (name, tag)
group by name, tag
) s
group by name
) s
;
o
---------------------------------------------------------------------------------------------------------------------------------------------------------
{"Bob":{"action":1,"comedy":1,"drama":0,"sport":1},"Sue":{"action":0,"comedy":0,"drama":0,"sport":1},"Tom":{"action":1,"comedy":0,"drama":1,"sport":0}}
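If one column per tag is really required, the column list has to be generated, because a plain SQL query needs its output columns fixed when it is parsed. A common workaround is to read the distinct tags first and build the pivot query from them in client code. A minimal sketch of that two-step pattern in Python, with sqlite3 standing in for PostgreSQL (an assumption, as is the helper name `pivot_tags`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table a (name text, tag text)")
conn.executemany("insert into a values (?, ?)", [
    ("Bob", "sport"), ("Bob", "action"), ("Bob", "comedy"),
    ("Tom", "action"), ("Tom", "drama"), ("Sue", "sport"),
])

def pivot_tags(conn):
    """Return (header, rows) with one 0/1 column per distinct tag."""
    tags = [t for (t,) in conn.execute("select distinct tag from a order by tag")]
    # One conditional aggregate per tag. The tag value is bound as a parameter,
    # so only the number of columns is generated, never raw tag text.
    columns = ", ".join("max(tag = ?)" for _ in tags)
    sql = f"select name, {columns} from a group by name order by name"
    rows = conn.execute(sql, tags).fetchall()
    return ["name"] + tags, rows

header, rows = pivot_tags(conn)
print(header)  # ['name', 'action', 'comedy', 'drama', 'sport']
for row in rows:
    print(row)
# ('Bob', 1, 1, 0, 1)
# ('Sue', 0, 0, 0, 1)
# ('Tom', 1, 0, 1, 0)
```

The sketch assumes the table holds at least one tag; with an empty table the generated column list would be empty and the query would not parse.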

How to return one row from group by multiple columns

I am trying to extract a list of unique customers from a database where some customers are listed more than once. The (almost) duplicate rows exist because customers have been moved from one division to another or because the customers have been registered with another address (or both).
So my challenge is in data that looks something like this:
ID Customer Division Address
-----------------------------------
1 A M X
1 A L X
2 B N Y
2 B N Z
3 C P W
3 C T S
I want my select statement to return one row for each customer (I don't care which one).
ID Customer Division Address
-----------------------------------
1 A M X
2 B N Y
3 C P W
I am using SQL Server 2008. I think I need to do a "GROUP BY" the last two columns but I don't know how to get just one row out of it.
I hope someone can help me!
(Yes, I know the problem should be solved at the source but unfortunately that is not possible within any reasonable time-frame...).

select ID, Customer,Division, Address from
(
SELECT
ID, Customer,Division, Address,
ROW_NUMBER() OVER (PARTITION BY Customer ORDER BY ID) as RN
FROM T
) t1
WHERE RN=1
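One thing to note about the ordering inside `ROW_NUMBER()`: the duplicates of a customer share the same ID, so ordering by ID alone leaves the choice of row up to the engine. That is fine when any row will do; if a repeatable pick is wanted, a tie-breaker can be added. A sketch using Python's sqlite3 module as a stand-in for SQL Server (an assumption; window functions need SQLite 3.25 or later), with Division and Address as an arbitrary tie-breaker:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table T (ID int, Customer text, Division text, Address text)")
conn.executemany("insert into T values (?, ?, ?, ?)", [
    (1, "A", "M", "X"), (1, "A", "L", "X"),
    (2, "B", "N", "Y"), (2, "B", "N", "Z"),
    (3, "C", "P", "W"), (3, "C", "T", "S"),
])

rows = conn.execute("""
    select ID, Customer, Division, Address
    from (
        select ID, Customer, Division, Address,
               row_number() over (
                   partition by Customer
                   order by Division, Address   -- makes the pick repeatable
               ) as rn
        from T
    ) t1
    where rn = 1
    order by ID
""").fetchall()

for row in rows:
    print(row)
# (1, 'A', 'L', 'X')
# (2, 'B', 'N', 'Y')
# (3, 'C', 'P', 'W')
```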

Try this one -
DECLARE @temp TABLE
(
ID INT
, Customer CHAR(1)
, Division CHAR(1)
, [Address] CHAR(1)
)
INSERT INTO @temp (ID, Customer, Division, [Address])
VALUES
(1, 'A', 'M', 'X'),
(1, 'A', 'L', 'X'),
(2, 'B', 'N', 'Y'),
(2, 'B', 'N', 'Z'),
(3, 'C', 'P', 'W'),
(3, 'C', 'T', 'S')
SELECT t.id
, t.Customer
, t.Division
, t.[Address]
FROM
(
SELECT *
, rn = ROW_NUMBER() OVER (PARTITION BY Customer ORDER BY 1/0)
FROM @temp
) t
WHERE T.rn = 1
SELECT ID, Customer, Division = MAX(Division), [Address] = MAX([Address])
FROM @temp
GROUP BY ID, Customer
Output -
id Customer Division Address
----------- -------- -------- -------
1 A M X
2 B N Y
3 C P W
ID Customer Division Address
----------- -------- -------- -------
1 A M X
2 B N Z
3 C T W

query to count number of unique relations

I have 3 tables:
t_user (id, name)
t_user_deal (id, user_id, deal_id)
t_deal (id, title)
Multiple users can be linked to the same deal. (I'm using Oracle, but it should be similar; I can adapt it.)
How can I get all the users (name), each with the number of unique users they made a deal with?
Let me explain with some data:
t_user:
id, name
1, joe
2, mike
3, John
t_deal:
id, title
1, deal number 1
2, deal number 2
t_user_deal:
id, user_id, deal_id
1, 1, 1
2, 2, 1
3, 1, 2
4, 3, 2
the result I expect:
user_name, number of unique user he made a deal with
Joe, 2
Mike, 1
John, 1
I've tried this but I didn't get the expected result:
SELECT tu.name,
count(tu.id) AS nbRelations
FROM t_user tu
INNER JOIN t_user_deal tud ON tu.id = tud.user_id
INNER JOIN t_deal td ON tud.deal_id = td.id
WHERE
(
td.id IN
(
SELECT DISTINCT td.id
FROM t_user_deal tud2
INNER JOIN t_deal td2 ON tud2.deal_id = td2.id
WHERE tud.id <> tud2.user_id
)
)
GROUP BY tu.id
ORDER BY nbRelations DESC
thanks for your help

This should get you the result
SELECT id1, count(id2),name
FROM (
SELECT distinct tud1.user_id id1 , tud2.user_id id2
FROM t_user_deal tud1, t_user_deal tud2
WHERE tud1.deal_id = tud2.deal_id
and tud1.user_id <> tud2.user_id) tab, t_user tu
WHERE tu.id = id1
GROUP BY id1,name
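A variation on the same self-join, written with explicit joins and `count(distinct ...)` and run against the sample data from the question. Python's sqlite3 module is used only as a convenient harness (an assumption; the SQL itself is plain enough to adapt to Oracle). The left joins keep users who have no deal partner, with a count of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table t_user (id int, name text);
    create table t_user_deal (id int, user_id int, deal_id int);
    insert into t_user values (1, 'joe'), (2, 'mike'), (3, 'John');
    insert into t_user_deal values (1, 1, 1), (2, 2, 1), (3, 1, 2), (4, 3, 2);
""")

rows = conn.execute("""
    select tu.name, count(distinct other.user_id) as nb_relations
    from t_user tu
    left join t_user_deal mine on mine.user_id = tu.id
    left join t_user_deal other
           on other.deal_id = mine.deal_id
          and other.user_id <> mine.user_id   -- partners only, not the user
    group by tu.id, tu.name
    order by nb_relations desc, tu.id
""").fetchall()

print(rows)  # [('joe', 2), ('mike', 1), ('John', 1)]
```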

Something like
select name, NVL (i.ud, 0) ud from t_user join (
SELECT user_id, count(*) ud from t_user_deal group by user_id) i on t_user.id = i.user_id
where i.ud > 0
Unless I'm missing something here. It actually sounds like your question references having a second user in the t_user_deal table. The model you've described here doesn't include that.

PostgreSQL example:
create table t_user (id int, name varchar(255)) ;
create table t_deal (id int, title varchar(255)) ;
create table t_user_deal (id int, user_id int, deal_id int) ;
insert into t_user values (1, 'joe'), (2, 'mike'), (3, 'john') ;
insert into t_deal values (1, 'deal 1'), (2, 'deal 2') ;
insert into t_user_deal values (1, 1, 1), (2, 2, 1), (3, 1, 2), (4, 3, 2) ;
And the query.....
SELECT
name, COUNT(DISTINCT deal_id)
FROM
t_user INNER JOIN t_user_deal ON (t_user.id = t_user_deal.user_id)
GROUP BY
user_id, name ;
The DISTINCT might not be necessary (in the COUNT(), that is). Depends on how clean your data is (e.g., no duplicate rows!)
Here's the result in PostgreSQL:
name | count
------+-------
joe | 2
mike | 1
john | 1
(3 rows)
