postgreSQL Reversing REGEXP_SPLIT TO TABLE - sql

I'm working on an assignment where the question is:
"List all of the actor names, semi-colon separated, that acted in the same movie. Your final list should
include both the movie title and semi-colon separated actor names"
Currently I have
SELECT title, name as actor FROM (SELECT *, count(*) OVER (partition by movies.movieid) as count FROM actors
JOIN movieroles on actors.actorid = movieroles.actorid
JOIN movies on movieroles.movieid = movies.movieid)X
WHERE X.count>1
GROUP BY title, actor;
which outputs
title | actor
------------------------------------+-----------------------
The Terminator | Arnold Schwarzenegger
The Terminator | Michael Biehn
Sherlock Holmes: A Game of Shadows | Jude Law
The Terminator | Linda Hamilton
Sherlock Holmes: A Game of Shadows | Rachel McAdams
However, I need to produce
"title" "actor"
------------------------------------------------------------------------------
Sherlock Holmes: A Game of Shadows Jude Law;Rachel McAdams
The Terminator Linda Hamilton;Michael Biehn;Arnold Schwarzenegger
So my question is how would one approach reversing the split_part or a similar function of sorts to give desired output?

Just STRING_AGG() function is needed with ; argument as a separator with grouping by title column :
SELECT title, STRING_AGG(DISTINCT name, ';') as actors
FROM actors a
JOIN movieroles mr ON a.actorid = mr.actorid
JOIN movies m ON m.movieid = mr.movieid
GROUP BY title;
and an optional order by clause might be used
( syntax : STRING_AGG ( expression, separator [order_by_clause] ) ) .
Aliasing tables would be elegant for readability of the query as a later reference.

Related

Lookup from a comma separated SQL column

I have a user table which has comma separated ids in one of the columns, like:
Id
Name
PrimaryTeamId
SecondaryTeamIds
1
John
123
456,789,669
2
Ringo
123
456,555
and a secondary table which contains the team names
Id
TeamId
TeamName
1
456
Red Team
2
669
Blue Team
3
789
Purple Team
4
555
Black Team
5
123
Orange Team
I'm trying to create a view which gives the following format:
Name
Primary Team
Secondary Teams
John
Orange Team
Red Team, Purple Team, Blue Team
Ringo
Orange Team
Red Team, Black Team
I have created
select
u.Name,
t.TeamName as 'Primary Team'
SELECT ... ?? as 'Secondary Teams'
from
users u
inner join teams t on u.PrimaryTeamId = t.TeamId
I've tried numerous things but can't seem to put it together. I can't seem to find the same use case here or elsewhere. I do control the data coming in so I could parse those values out relationally to begin with or do some kind of lookup on the ETL side, but would like to figure it out.
If the sequence of Secondary Teams is essential, you can parse the string via OpenJSON while preserving the sequence [key].
Then it becomes a small matter of string_agg()
Example or dbFiddle
Select A.ID
,A.Name
,PrimaryTeam = C.TeamName
,B.SecendaryTeams
from YourTable A
Cross Apply (
Select SecendaryTeams = string_agg(B2.TeamName,', ') within group (order by B1.[Key])
From OpenJSON( '["'+replace(string_escape([SecondaryTeamIds],'json'),',','","')+'"]' ) B1
Join YourTeams B2 on B1.Value=B2.TeamID
) B
Join YourTeams C on A.[PrimaryTeamId]=C.TeamId
Results
ID Name PrimaryTeam SecendaryTeams
1 John Orange Team Red Team, Purple Team, Blue Team
2 Ringo Orange Team Red Team, Black Team
I played around with this a little bit and I found you can do it using two functions, STRING_SPLIT and STRING_AGG.
STRING_SPLIT allows you to convert a NVARCHAR and split it into a table where each row is a value, and STRING_AGG allows you to do the opposite, Join a table into a NVARCHAR. Then I just used a JOIN in between.
Maybe its not the cleanest solution but it does the job. Also its a bit inefitient but using native functions instead of loops help a lot.
I attach a working example. In this online editor I just had one table so I joined it with itself but it must work joining with other tables.
SELECT
*,
(
SELECT STRING_AGG(CAST(Val AS NVARCHAR), ', ') -- concatenates the rows together
FROM
(
SELECT demo.hint AS Val
FROM STRING_SPLIT((SELECT d.name FROM demo as d where id = demo.id), ',') -- splits the row by comas
JOIN demo ON value = demo.id -- joins so that the values are replaced with names
) Vals
) as JointValues -- name of the column with the joint values
FROM demo

SQL query without join

I have the following tables
Table food Table Race Table animal
+------------+--------------+ +------------+--------------+ +------------+--------------+
| Quantity | animal_id | | race_code | race_name | | animal_id | race_code |
+------------+--------------+ +------------+--------------+ +------------+--------------+
I was asked to calculate the average food quantity for every race (race_name). The challenge here is that I should not use JOIN because we have not studied it yet.
I have written the following query:
select AVG(f.quantity),r.race_name from food f, race r
group by r.race_name;
but it doesn't work as I want it to be since it returns the same average food quantity for all races. I know I have to use the animal table to link the other 2 but I didn't know how. I should be using subqueries
That question is exactly the same as your previous, where you had to use SUM (instead of AVG). No difference at all.
Ah, sorry - it wasn't you, but your school colleague, here
Saying that you "didn't learn joins", well - what do you call what you posted here, then? That's a cross join and will produce Cartesian product, once you fix the error you got by not including non-aggregated column into the group by clause and include additional joins required to return desired result.
The "old" syntax is
select r.name,
avg(f.quantity) avg_quantity
from race r, animal a, food f
where a.race_code = r.race_code
and f.animal_id = a.animal_id
group by r.name;
What you "didn't learn yet" does the same, but looks differently:
from race r join animal a on a.race_code = r.race_code
join food f on f.animal_id = a.animal_id
The rest of the query remains the same.
Nowadays, you should use JOINs to join tables, and put conditions into the WHERE clause. For example, condition would be that you want to calculate averages for donkeys only. As you don't have it, you don't need it.
You still have to do some matching of related rows. If not explicitly with JOIN you can do it in the WHERE clause. Ie something like
select AVG(f.quantity),r.race_name
from food f, race r, animal a
where f.animal_id = a.animal_id and a.race_code = r.race_code
group by r.race_name;
select race_name ,(select avg(quantity) from food where animal_id in (select animal_id from animal a where r.race_code = a.race_code))
from race r

Querying for swapped columns in SQL database

We've had a few cases of people entering in first names where last names should be and vice versa. So I'm trying to come up with a SQL search to match the swapped columns. For example, someone may have entered the record as first_name = Smith, last_name = John by accident. Later, another person may see that John Smith is not in the database and enter a new user as first_name = John, last_name = Smith, when in fact it is the same person.
I used this query to help narrow my search:
SELECT person_id, first_name, last_name
FROM people
WHERE first_name IN (
SELECT last_name FROM people
) AND last_name IN (
SELECT first_name FROM people
);
But if we have people named John Allen, Allen Smith, and Smith John, they would all be returned even though none of those are actually duplicates. In this case, it's actually good enough that I can see the duplicates in my particular data set, but I'm wondering if there's a more precise way to do this.
I would do a self join like this:
SELECT p1.person_id, p1.first_name, p1.last_name
FROM people p1
join people p2 on p1.first_name = p2.last_name and p1.last_name = p2.first_name
To also find typos on names I recommend this:
SELECT p1.person_id, p1.first_name, p1.last_name
FROM people p1
join people p2 on soundex(p1.first_name) = soundex(p2.last_name) and
soundex(p1.last_name) = soundex(p2.first_name)
soundex is a neat function that "hashes" words in a way that two words that sound the same get the same hash. This means Anne and Ann will have the same soundex. So if you had an Anne Smith and a Smith Ann the query above would find them as a match.
Interesting. This is a problem that I cover in Data Analysis Using SQL and Excel (note: I only very rarely mention books in my answers or comments).
The idea is to summarize the data to get a likelihood of a mismatch. So, look at the number of times a name appears as a first name and as a last name and then combine these. So:
with names as (
select first_name as name, 1.0 as isf, 0.0 as isl
from people
union all
select last_name, 0, 1
from people
),
nl as (
select name, sum(isf) as numf, sum(isl) as numl,
avg(isf) as p_f, avg(isl) as p_l
from names
group by name
)
select p.*
from people p join
nl nlf
on p.first_name = nlf.name join
nl nll
on p.last_name = nll.name
order by (coalesce(nlf.p_l, 0) + coalesce(nll.p_f, 0));
This orders the records by a measure of mismatch of the names -- the sum of the probabilities of the first name used by a last name and a last name used as a first name.

Populating column for Oracle Text search from 2 tables

I am investigating the benefits of Oracle Text search, and currently am looking at collecting search text data from multiple (related) tables and storing the data in the smaller table in a 1-to-many relationship.
Consider these 2 simple tables, house and inhabitants, and there are NEVER any uninhabited houses:
HOUSE
ID Address Search_Text
1 44 Some Road
2 31 Letsby Avenue
3 18 Moon Crescent
INHABITANT
ID House Name Nickname
1 1 Jane Doe Janey
2 1 John Doe JD
3 2 Jo Smythe Smithy
4 2 Percy Plum PC
5 3 Apollo Lander Moony
I want to to write SQL that updates the HOUSE.Search_Text column with text from INHABITANT. Now because this is a 1-to-many, the SQL needs to collate the data in INHABITANT for each matching row in house, and then combine the data (comma separated) and update the Search_Text field.
Once done, the Oracle Text search index on HOUSE.Search_Text will return me HOUSEs that match the search criteria, and I can look up INHABITANTs accordingly.
Of course, this is a very simplified example, I want to pick up data from many columns and Full Text Search across fields in both tables.
With the help of a colleague we've got:
select id, ADDRESS||'; '||Names||'; '||Nicknames as Search_Text
from house left join(
SELECT distinct house_id,
LISTAGG(NAME, ', ') WITHIN GROUP (ORDER BY NAME) OVER (PARTITION BY house_id) as Names,
LISTAGG(NICKNAME, ', ') WITHIN GROUP (ORDER BY NICKNAME) OVER (PARTITION BY house_id) as Nicknames
FROM INHABITANT)
i on house.id = i.house_id;
which returns:
1 44 Some Road; Jane Doe, John Doe; JD, Janey
2 31 Letsby Avenue; Jo Smythe, Percy Plum; PC, Smithy
3 18 Moon Crescent; Apollo Lander; Moony
Some questions:
Is this an efficient query to return this data? I'm slightly
concerned about the distinct.
Is this the right way to use Oracle Text search across multiple text fields?
How to update House.Search_Text with the results above? I think I need a correlated subquery, but can't quite work it out.
Would it be more efficient to create a new table containing House_ID and Search_Text only, rather than update House?

Fuzzy grouping in SQL

I need to modify a SQL table to group slightly mismatched names, and assign all elements in the group a standardized name.
For instance, if the initial table looks like this:
Name
--------
Jon Q
John Q
Jonn Q
Mary W
Marie W
Matt H
I would like to create a new table or add a field to the existing one like this:
Name | StdName
--------------------
Jon Q | Jon Q
John Q | Jon Q
Jonn Q | Jon Q
Mary W | Mary W
Marie W | Mary W
Matt H | Matt H
In this case, I've chosen the first name to assign as the "standardized name," but I don't actually care which one is chosen -- ultimately the final "standardized name" will be hashed into a unique person ID. (I'm also open to alternative solutions that go directly to a numerical ID.) I will have birthdates to match on as well, so the accuracy of the name matching doesn't actually need to be all that precise in practice. I've looked into this a bit and will probably use the Jaro-Winkler algorithm (see e.g. here).
If I knew that the names were all in pairs, this would be a relatively easy query, but there can be an arbitrary number of the same name.
I can easily conceptualize how to do this query in a procedural language, but I'm not very familiar with SQL. Unfortunately I don't have direct access to the data -- it's sensitive data and so somebody else (a bureaucrat) has to run the actual query for me. The specific implementation will be SQL Server, but I'd prefer an implementation-agnostic solution.
EDIT:
In response to a comment, I had the following procedural approach in mind. It's in Python, and I replaced the Jaro-Winkler with simply matching on the first letter of the name, for the sake of having a working code example.
nameList = ['Jon Q', 'John Q', 'Jonn Q', 'Mary W', 'Marie W', 'Larry H']
stdList = nameList[:]
# loop over all names
for i1, name1 in enumerate(stdList):
# loop over later names in list to find matches
for i2, name2 in enumerate(stdList[i1+1:]):
# If there's a match, replace latter with former.
if (name1[0] == name2[0]):
stdList[i1+1+i2] = name1
print stdList
The result is ['Jon Q', 'Jon Q', 'Jon Q', 'Mary W', 'Mary W', 'Larry H'].
Just a thought, but you might be able to use the SOUNDEX() function. This will create a value for the names that are similar.
If you started with something like this:
select name, soundex(name) snd,
row_number() over(partition by soundex(name)
order by soundex(name)) rn
from yt;
See SQL Fiddle with Demo. Which would give a result for each row that is similar along with a row_number() so you could return only the first value for each group. For example, the above query will return:
| NAME | SND | RN |
-----------------------
| Jon Q | J500 | 1 |
| John Q | J500 | 2 |
| Jonn Q | J500 | 3 |
| Matt H | M300 | 1 |
| Mary W | M600 | 1 |
| Marie W | M600 | 2 |
Then you could select all of the rows from this result where the row_number() is equal to 1 and then join back to your main table on the soundex(name) value:
select t1.name,
t2.Stdname
from yt t1
inner join
(
select name as stdName, snd, rn
from
(
select name, soundex(name) snd,
row_number() over(partition by soundex(name)
order by soundex(name)) rn
from yt
) d
where rn = 1
) t2
on soundex(t1.name) = t2.snd;
See SQL Fiddle with Demo. This gives a result:
| NAME | STDNAME |
---------------------
| Jon Q | Jon Q |
| John Q | Jon Q |
| Jonn Q | Jon Q |
| Mary W | Mary W |
| Marie W | Mary W |
| Matt H | Matt H |
Assuming you copy and paste the jaro-winkler implementation from SSC (registration required), the following code will work. I tried to build a SQLFiddle for it but it kept going belly up when I was building the schema.
This implementation has a cheat---I'm using a cursor. Generally, cursors are not conducive to performance but in this case, you need to be able to compare the set against itself. There's probably a graceful number/tally table approach to eliminate the declared cursor.
DECLARE #SRC TABLE
(
source_string varchar(50) NOT NULL
, ref_id int identity(1,1) NOT NULL
);
-- Identify matches
DECLARE #WORK TABLE
(
source_ref_id int NOT NULL
, match_ref_id int NOT NULL
);
INSERT INTO
#src
SELECT 'Jon Q'
UNION ALL SELECT 'John Q'
UNION ALL SELECT 'JOHN Q'
UNION ALL SELECT 'Jonn Q'
-- Oops on matching joan to jon
UNION ALL SELECT 'Joan Q'
UNION ALL SELECT 'june'
UNION ALL SELECT 'Mary W'
UNION ALL SELECT 'Marie W'
UNION ALL SELECT 'Matt H';
-- 2 problems to address
-- duplicates in our inbound set
-- duplicates against a reference set
--
-- Better matching will occur if names are split into ordinal entities
-- Splitting on whitespace is always questionable
--
-- Mat, Matt, Matthew
DECLARE CSR CURSOR
READ_ONLY
FOR
SELECT DISTINCT
S1.source_string
, S1.ref_id
FROM
#SRC AS S1
ORDER BY
S1.ref_id;
DECLARE #source_string varchar(50), #ref_id int
OPEN CSR
FETCH NEXT FROM CSR INTO #source_string, #ref_id
WHILE (##fetch_status <> -1)
BEGIN
IF (##fetch_status <> -2)
BEGIN
IF NOT EXISTS
(
SELECT * FROM #WORK W WHERE W.match_ref_id = #ref_id
)
BEGIN
INSERT INTO
#WORK
SELECT
#ref_id
, S.ref_id
FROM
#src S
-- If we have already matched the value, skip it
LEFT OUTER JOIN
#WORK W
ON W.match_ref_id = S.ref_id
WHERE
-- Don't match yourself
S.ref_id <> #ref_id
-- arbitrary threshold, will need to examine this for sanity
AND dbo.fn_calculateJaroWinkler(#source_string, S.source_string) > .95
END
END
FETCH NEXT FROM CSR INTO #source_string, #ref_id
END
CLOSE CSR
DEALLOCATE CSR
-- Show me the list of all the unmatched rows
-- plus the retained
;WITH MATCHES AS
(
SELECT
S1.source_string
, S1.ref_id
, S2.source_string AS match_source_string
, S2.ref_id AS match_ref_id
FROM
#SRC S1
INNER JOIN
#WORK W
ON W.source_ref_id = S1.ref_id
INNER JOIN
#SRC S2
ON S2.ref_id = W.match_ref_id
)
, UNMATCHES AS
(
SELECT
S1.source_string
, S1.ref_id
, NULL AS match_source_string
, NULL AS match_ref_id
FROM
#SRC S1
LEFT OUTER JOIN
#WORK W
ON W.source_ref_id = S1.ref_id
LEFT OUTER JOIN
#WORK S2
ON S2.match_ref_id = S1.ref_id
WHERE
W.source_ref_id IS NULL
and s2.match_ref_id IS NULL
)
SELECT
M.source_string
, M.ref_id
, M.match_source_string
, M.match_ref_id
FROM
MATCHES M
UNION ALL
SELECT
M.source_string
, M.ref_id
, M.match_source_string
, M.match_ref_id
FROM
UNMATCHES M;
-- To specifically solve your request
SELECT
S.source_string AS Name
, COALESCE(S2.source_string, S.source_string) As StdName
FROM
#SRC S
LEFT OUTER JOIN
#WORK W
ON W.match_ref_id = S.ref_id
LEFT OUTER JOIN
#SRC S2
ON S2.ref_id = W.source_ref_id
query output 1
source_string ref_id match_source_string match_ref_id
Jon Q 1 John Q 2
Jon Q 1 JOHN Q 3
Jon Q 1 Jonn Q 4
Jon Q 1 Joan Q 5
june 6 NULL NULL
Mary W 7 NULL NULL
Marie W 8 NULL NULL
Matt H 9 NULL NULL
query output 2
Name StdName
Jon Q Jon Q
John Q Jon Q
JOHN Q Jon Q
Jonn Q Jon Q
Joan Q Jon Q
june june
Mary W Mary W
Marie W Marie W
Matt H Matt H
There be dragons
Over on SuperUser, I talked about my experience matching people. In this section, I'll list some things to be aware of.
Speed
As part of your matching, hooray in that you have a birthday to augment the match process. I would actually propose you generate a match based exclusively on birthdate first. That is an exact match and one that, with a proper index, SQL Server will be able to quickly include/exclude rows. Because you're going to need it. The TSQL implementation is dog slow. I've been running the equivalent match against a dataset of 28k names (names that had been listed as conference attendees). There ought to be some good overlap there and while I did fill #src with data, it is a table variable with all that that implies but it's been running now for 15 minutes and still hasn't completed.
It's slow for a number of reasons but things that jumped out at me are all the looping and string manipulation in the functions. That is not where SQL Server shines. If you have a need to do a lot of this, it might be a good idea to convert them into CLR methods so at least you can leverage the strength of the .NET libraries for some of the manipulations.
One of the matches we used to use was the Double Metaphone and it would generate a pair of possible phonetic interpretations of the name. Instead of computing that every time, compute it once and store it alongside the name. That would help speed some of the matching. Unfortunately, it doesn't look like JW lends itself to breaking it down like that.
Look at iterating too. We'd first try the algs that we knew were fast. 'John' = 'John' so there's no need to pull out the big guns so we'd try a first pass of straight name checks. If we didn't find a match, we'd try harder. The hope was that by taking various swipes at matching we'd get the low hanging fruit as fast as possible and worry about the harder matches later.
Names
In my SU answer and in the code comments, I mention nicknames. Bill and Billy are going to match. Billy, Liam and William are definitely not going to match even though they may be the same person. You might want to look at a list like this to provide translation between nickname and full name. After running a set of matches on the supplied name, maybe we'd try looking for a match based on the possible root name.
Obviously, there are draw backs to this approach. For example, my grandfather-in-law is Max. Just Max. Not Maximilian, Maximus or any other things you might thing.
Your supplied names look like it's first and last concatenated together. Future readers, if you ever have the opportunity to capture individual portions of a name, please do so. There are products out there that will split names and try to match them up against directories to try and guess whether something is first/middle name or a surname but then you have people like "Robar Mike". If you saw that name there, you'd think Robar is a last name and you'd also pronounce it like "robber." Instead, Robar (say it with a French accent) is his first name and Mike is his last name. At any rate, I think you'll have a better matching experience if you can split first and last out into separate fields and match the individual pieces together. An exact last name match plus a partial first name match might suffice, especially in cases where legally they are "Franklin Roosevelt" and you have a candidate of "F. Roosevelt" Perhaps you have a rule that an initial letter can match. Or you don't.
Noise - as referenced in the JW post and my answer, strip out crap (punctuation, stop words, etc) for matching purposes. Also watch out for honorific tites (phd, jd, etc) and generationals (II, III, JR, SR). Our rule was a candidate with/without a generational could match one in the opposite state (Bob Jones Jr == Bob Jones) or could exactly match the generation (Bob Jones Sr = Bob Jones Sr) but you'd never want to match if both records supplied them and they were conflicting (Bob Jones Sr != Bob Jones Jr).
Case sensitivity, always check your database and tempdb to make sure you aren't making case sensitive matches. And if you are, convert everything to upper or lower for purposes of matching but don't ever throw the supplied casing away. Good luck trying to determine whether latessa should be Latessa, LaTessa or something else.
My query is coming up on a hour's worth of processing with no rows returned so I'm going to kill it and turn in. Best of luck, happy matching.