Similarity search for name surname - sql

I have a column name which contains name surname (name space surname) and I would like to search it based on
name, surname but I would like to match cases where people accidentally inserted surname name in a different order
misspelled names surnames by 1-2 characters.

You should read about the pg_trgm extension and its function similarity(). A few examples below.
Example data:
create table my_table(id serial primary key, name text);
insert into my_table (name) values
('John Wilcock'),
('Henry Brown'),
('Jerry Newcombe');
create extension if not exists pg_trgm; -- install the extension
Example 1:
select *,
similarity(name, 'john wilcock') as "john wilcock",
similarity(name, 'wilcock john') as "wilcock john"
from my_table;
id | name | john wilcock | wilcock john
----+----------------+--------------+--------------
1 | John Wilcock | 1 | 1
2 | Henry Brown | 0 | 0
3 | Jerry Newcombe | 0.037037 | 0.037037
(3 rows)
Example 2:
select *,
similarity(name, 'henry brwn') as "henry brwn",
similarity(name, 'brovn henry') as "brovn henry"
from my_table;
id | name | henry brwn | brovn henry
----+----------------+------------+-------------
1 | John Wilcock | 0 | 0
2 | Henry Brown | 0.642857 | 0.6
3 | Jerry Newcombe | 0.04 | 0.0384615
(3 rows)
Example 3:
select *
from my_table
where similarity(name, 'J Newcombe') >= 0.6;
id | name
----+----------------
3 | Jerry Newcombe
(1 row)

To counter the exchanged parts of the name you could use split_part() to split the name in its two parts and compare both of them, something similar to the following:
SELECT *
FROM person
WHERE split_part(name, ' ', 1) IN ('<given_name_searched_for>'
'<surname_searched_for>')
OR split_part(name, ' ', 2) IN ('<given_name_searched_for>'
'<surname_searched_for>');
Or have a look at the other string functions and operators. -- there a variants of split functions using regular expressions, e.g..
Are there names like 'John F. Kennedy', that is, with more than one token? Are there names with more than one contiguous spaces? Bear in mind that these have to be addressed with further means if any. (Such things can get hairy. If possible consider revising your design and use a separate column for the surname.)
For the similarity part: PostgreSQL provides some modules, that might be useful here:
fuzzystrmatch
pg_trm

Related

Reorganize multiple rows in a new table with more columns

I have a table that looks like that:
+----------+----------+----------+----------+--------------------------+
| Club | Role | Name | Lastname | Email |
+----------+----------+----------+----------+--------------------------+
| Porto | 1 | Peter | Pan | peter.pan#mail.com |
| Porto | 2 | Michelle | Obama | michelle.obama#mail.com |
| Monaco | 1 | Serena | Williams | serena.williams#mail.com |
| Monaco | 2 | David | Beckham | david.beckham#mail.com |
+----------+----------+----------+----------+--------------------------+
and i want to get a table like that:
+----------+-----------------+-----------------+---------------------------------+-----------------+-----------------+---------------------------------+
| Club | Role 1 Name | Role 1 Lastname | Role 1 Email | Role 2 Name | Role 2 Lastname | Role 2 Email |
+----------+-----------------+-----------------+---------------------------------+-----------------+-----------------+---------------------------------+
| Porto | Peter | Pan | peter.pan#mail.com | Michelle | Obama | michelle.obama#mail.com |
| Monaco | Serena | Williams | serena.williams#mail.com | David | Beckham | david.beckham#mail.com |
+----------+-----------------+-----------------+---------------------------------+-----------------+-----------------+---------------------------------+
where the persons with different roles in each club puts in the same row.
I would ideally like to find a way to do that in Excel, but i am not sure if its possible. If not, SQL code would also help a lot.
Here is what I could come up with for an excel formula. Hopefully it can push you in the right direction.
This formula is assuming that your first table exists at the range A1:E5 and the second table exists at the range G1:M3. It is also assuming that the second table's column names are just repeating without the Role number attached to the front of it (same as the first table). This formula is an array formula, so you have to make sure to do CTRL+SHIFT+ENTER when inputting it.
{=INDEX($A$2:$E$5,
MATCH(1,($G2=$A$2:$A$5)*((FLOOR((COLUMN() - COLUMN($H$1) ) / 3,1) + 1)=$B$2:$B$5),0),
MATCH(H$1,$A$1:$E$1,0))}
The first part is using the INDEX forumla which pulls data from the range suppled ($A$2:$E$5) based on the row and column numbers supplied by the following MATCH formulas.
The first MATCH is supplying the row number for when the result of the lookup array section is equal to 1. I am checking two conditions, the first is to check for the "Club" name ($G2=$A$2:$A$5) and the second is to check for which "Role" we are currently on ((FLOOR((COLUMN() - COLUMN($H$1) ) / 3,1) + 1). This is using the FLOOR function to round the result down to the whole number and dividing by the number of columns (3 in this case: Name, Lastname, and Email).
The final MATCH is pulling the column number based on the header names from both tables. If you wanted to incorporate the changing names of the roles in the column headers, you could change this part to something like this:
{=INDEX($A$2:$E$5,
MATCH(1,($G2=$A$2:$A$5)*((FLOOR((COLUMN() - COLUMN($H$1) ) / 3,1) + 1)=$B$2:$B$5),0),
MOD(COLUMN() - COLUMN($H$1),3) + 3)}
I am adding 3 to the end of the mod because of the original range that was selected for table 1. The columns that we want to pull start at location 3 in the range.
If you want to do this in Oracle Sql, there's a nice approach in analytical sql.
To convert/swap rows to columns or columns to rows, we can use Pivot or Unpivot operators.
In you example use below query to covert data as you like,
select * from
(
with all_roles as
(select 1 role from dual union all
select 2 role from dual),
ddata as
(select 1 c_role, 'porto' club, 'peter' fname,'pan' lname,'peter.pan#mail.com' email from dual union all
select 2 c_role, 'porto' club, 'Michelle' fname, 'Obama' lname,'michelle.obama#mail.com' email from dual union all
select 1 c_role, 'monaco' club, 'Serena' fname, 'Williams' lname,'serena.williams#mail.com' email from dual union all
select 2 c_role, 'monaco' club, 'David' fname, 'Beckham' lname,'david.beckham#mail.com' email from dual )
(select role, club, fname,lname, email from ddata,all_roles
where all_roles.role=ddata.c_role)) all_data
pivot (
max(fname) fname,
max(lname) lname,
max(email) email
for role in ( 1 role1, 2 role2 )
)
order by club;

Specify Which Column Comes First SQL

I am processing a large list of church members in order to send them a letter. We want the letter to say "Dear John & Jane Smith". We will use Word to do the mail merge from an Excel sheet. The important thing is the male name has to always come first.
Each individual has their own row in the table I am using. They have a unique ID as well as a family ID. I am using that family ID to put families together on the same row. Currently I have the male name and the female name separated using MAX(CASE WHEN) in order to specify what goes where. It looks something like this:
+-----------+------------+--------------------------+
| family id | male name | female name | last name |
+-----------+------------+--------------------------+
| 1234 | john | jane | doe |
| 1235 | bob | cindy | smith |
| 1236 | NULL | susan | jones |
| 1237 | jim | NULL | taylor |
+-----------+------------+--------------------------+
But I run into a problem when the family only has one member.
Here's a part of the query I have:
SELECT
fm.family_id AS 'Family ID',
MAX(CASE WHEN PB.gender like 'm' and FM.role_luid=29 THEN PB.nick_name END)
AS 'Male Name',
MAX(CASE WHEN PB.gender like 'f' and FM.role_luid=29 THEN PB.nick_name END)
AS 'Female Name',
PB.last_name AS 'Last Name',
FROM core_family F
I was thinking that I need to combine rows using STUFF or something like that, but I'd need some way of specifying which column comes first so that the male name always comes first. Essentially, as stated above, I need the letter to read "Dear John & Jane Smith" for families with two people and "Dear John Smith" for families with one person. So I am hoping my results might look like:
+-----------+--------------+-----------+
| family id | First name | last name |
+-----------+--------------+-----------+
| 1234 | john & jane | doe |
| 1235 | bob & cindy | smith |
| 1236 | susan | jones |
| 1237 | jim | taylor |
+-----------+--------------+-----------+
You can use your intermediate table (assuming you don't have 3 names for a family id).
From the table you indicated use:
select
id
, coalesce(male_name+' & '+female_name,male_name, female_name)
, last_name
from F;
Here is an example with your data
Basically if you concatenate using + in Sql Server you will get null. So if either male or female name is NULL, you get NULL. Coalesce will move on to the next value if it sees NULL. This way you either get a pair with '&' or a single name for each family.
I've created some test data. This technique works with the test data.
CREATE TABLE #Temp (FamID INT,
MaleName VARCHAR(20),
FemaleName VARCHAR(20),
LName VARCHAR(20))
INSERT #Temp
VALUES (1234, 'John' ,'Jane' , 'Doe' ),
(1235, 'Bob' , 'Cindy' , 'Smith'),
(1236 , NULL , 'Susan' , 'Jones'),
(1237 , 'Jim' , NULL , 'Taylor')
Here is your query.
SELECT FamID,
ISNULL(MaleName+' ','') +
CASE WHEN MaleName IS NULL OR FemaleName IS NULL THEN '' ELSE 'and ' END+
ISNULL(FemaleName,'') AS FirstName,
LName
FROM #Temp
You can use like this
SELECT
fm.family_id AS 'Family ID',
MAX(CASE WHEN PB.gender like 'm' and FM.role_luid=29 THEN PB.nick_name END)
+ '&'+
MAX(CASE WHEN PB.gender like 'f' and FM.role_luid=29 THEN PB.nick_name END)
AS 'First Name',
PB.last_name AS 'Last Name',
FROM core_family F

How to get initials easily out of text field using Postgres

I am using Postgres version 9.4 and I have a full_name field in a table.
In some cases, I want to put initials instead of the full_name of the person in my table.
Something like:
Name | Initials
------------------------
Joe Blow | J. B.
Phil Smith | P. S.
The full_name field is a string value (obviously) and I think the best way to go about this is to split the string into an array foreach space i.e.:
select full_name, string_to_array(full_name,' ') initials
from my_table
This produces the following result-set:
Eric A. Korver;{Eric,A.,Korver}
Ignacio Bueno;{Ignacio,Bueno}
Igmar Mendoza;{Igmar,Mendoza}
Now, the only thing I am missing is how to loop through each array element and pull the 1st character out of it. I will end up using substring() to get the initial character of each element - however I am just stuck on how to loop through them on-the-fly..
Anybody have a simple way to go about this?
Use unnest with string_agg:
select full_name, string_agg(substr(initials, 1,1)||'.', ' ') initials
from (
select full_name, unnest(string_to_array(full_name,' ')) initials
from my_table
) sub
group by 1;
full_name | initials
------------------------+-------------
Phil Smith | P. S.
Joe Blow | J. B.
Jose Maria Allan Pride | J. M. A. P.
Eric A. Korver | E. A. K.
(4 rows)
In Postgres 14+ you can replace unnest(string_to_array(...)) with string_to_table(...).
Test it in db<>fiddle.
You can also create a helper function for this, in case you want to use similar logic in multiple queries. Check this out
--
-- Function to extract a person's initials from the full name.
--
DROP FUNCTION IF EXISTS get_name_initials(TEXT);
CREATE OR REPLACE FUNCTION get_name_initials(full_name TEXT)
RETURNS TEXT AS $$
DECLARE
result TEXT :='';
part VARCHAR :='';
BEGIN
FOREACH part IN ARRAY string_to_array($1, ' ') LOOP
result := result || substr(part, 1, 1) || '.';
END LOOP;
RETURN result;
END;
$$ LANGUAGE plpgsql;
Now you can simply use this function to get the initials like this.
SELECT full_name, get_name_initials(full_name) as initials
FROM my_table;
SELECT get_name_initials('Phil Smith'); -- Returns P. H.
SELECT get_name_initials('Joe Blow'); -- Returns J. B.
SqlFiddleDemo
WITH add_id AS (
SELECT n.*, row_number() OVER (ORDER BY "Name") AS id
FROM names n
),
split_names AS (
SELECT id, regexp_split_to_table("Name", E'\\s+') AS single_name
FROM add_id
),
initials AS (
SELECT id, left(single_name, 1) || '.' AS initial
FROM split_names
),
final AS (
SELECT id, string_agg(initial, ' ')
FROM initials
GROUP BY id
)
SELECT a.*, f.*
FROM add_id a
JOIN final f USING (id)
For debug I create the Initial to Show how match the string_agg
| Name | Initials | id | id | string_agg |
|----------------|----------|----|----|------------|
| Eric A. Korver | E. A. K. | 1 | 1 | E. A. K. |
| Igmar Mendoza | I. M. | 2 | 2 | I. M. |
| Ignacio Bueno | I. B. | 3 | 3 | I. B. |
| Joe Blow | J. B. | 4 | 4 | J. B. |
| Phil Smith | P. S. | 5 | 5 | P. S. |
After some work I got a compact version SqlFiddleDemo
SELECT "Name", string_agg(left(single_name, 1) || '.', '') AS Initials
FROM (
SELECT
"Name",
regexp_split_to_table("Name", E'\\s+') AS single_name
FROM names
) split_names
GROUP BY "Name"
OUTPUT
| Name | initials |
|----------------|----------|
| Eric A. Korver | E.K.A. |
| Igmar Mendoza | M.I. |
| Ignacio Bueno | I.B. |
| Joe Blow | B.J. |
| Phil Smith | P.S. |

Splitting a string column in BigQuery

Let's say I have a table in BigQuery containing 2 columns. The first column represents a name, and the second is a delimited list of values, of arbitrary length. Example:
Name | Scores
-----+-------
Bob |10;20;20
Sue |14;12;19;90
Joe |30;15
I want to transform into columns where the first is the name, and the second is a single score value, like so:
Name,Score
Bob,10
Bob,20
Bob,20
Sue,14
Sue,12
Sue,19
Sue,90
Joe,30
Joe,15
Can this be done in BigQuery alone?
Good news everyone! BigQuery can now SPLIT()!
Look at "find all two word phrases that appear in more than one row in a dataset".
There is no current way to split() a value in BigQuery to generate multiple rows from a string, but you could use a regular expression to look for the commas and find the first value. Then run a similar query to find the 2nd value, and so on. They can all be merged into only one query, using the pattern presented in the above example (UNION through commas).
Trying to rewrite Elad Ben Akoune's answer in Standart SQL, the query becomes like this;
WITH name_score AS (
SELECT Name, split(Scores,';') AS Score
FROM (
(SELECT * FROM (SELECT 'Bob' AS Name ,'10;20;20' AS Scores))
UNION ALL
(SELECT * FROM (SELECT 'Sue' AS Name ,'14;12;19;90' AS Scores))
UNION ALL
(SELECT * FROM (SELECT 'Joe' AS Name ,'30;15' AS Scores))
))
SELECT name, score
FROM name_score
CROSS JOIN UNNEST(name_score.score) AS score;
And this outputs;
+------+-------+
| name | score |
+------+-------+
| Bob | 10 |
| Bob | 20 |
| Bob | 20 |
| Sue | 14 |
| Sue | 12 |
| Sue | 19 |
| Sue | 90 |
| Joe | 30 |
| Joe | 15 |
+------+-------+
If someone is still looking for an answer
select Name,split(Scores,';') as Score
from (
# replace the inner custome select with your source table
select *
from
(select 'Bob' as Name ,'10;20;20' as Scores),
(select 'Sue' as Name ,'14;12;19;90' as Scores),
(select 'Joe' as Name ,'30;15' as Scores)
);

SQLite, selecting values having same criteria (throughout all table)

I have an sqlite database table similar to the one given below
Name | Surname | AddrType | Age | Phone
John | Kruger | Home | 23 | 12345
Sarah | Kats | Home | 33 | 12345
Bill | Kruger | Work | 15 | 12345
Lars | Kats | Home | 54 | 12345
Javier | Roux | Work | 45 | 12345
Ryne | Hutt | Home | 36 | 12345
I would like to select Name values matching same "Surname" value for each of the rows in the table.
For example, for the first line the query would be "select Name from myTable where Surname='Kruger'" whereas for the second line the query would be "select Name from myTable where Surname='Kats' and so an....
Is it possible to traverse through the whole table and select all values like that?
PS : I will use these method in a C++ application, the alternative method is to use sqlite3_exec() and process each row one by one. I just want to know if there is any other possible way for the same approach.
I'd do:
sqlite> SELECT group_concat(Name, '|') Names FROM People GROUP BY Surname;
Names
----------
Ryne
Sarah|Lars
John|Bill
Javier
Then split each value of "Names" in C++ using the "|" separator (or any other you choose in group_concat function.
Basically you just want to exclude any records that don't have a buddy.
Something simple like joining the table against itself should work:
SELECT a.Name
FROM tab AS a
JOIN tab AS b
ON a.Surname = b.Surname;
Just returning the full sorted table and doing the duplicate check yourself may be faster if incidence is high (and will always be high for all sets of data). That would be a pretty strong assumption though.
SELECT Name
FROM tab
SORT BY Surname;