Pandas split string on matching substring from list

I have not been able to find out how to strip, from each row of a Series, the substrings that match any value in a list (the list is not part of the DataFrame). In other words, I need to split off or extract the substrings that match any value in a dynamic list. There are many answers on how to flag such rows as True/False, or how to split on a match against a static list, but I am stuck trying to combine the two tasks. Any help will be greatly appreciated.
Example:
Series - Mr. John Doe, Ms. Jane Smith, Dr. Who, Dr. No, Doctor Doolittle, Mister X, Batman
List 1 - Dr., Doctor
Output - Mr. John Doe, Ms. Jane Smith, Who, No, Doolittle, Mister X, Batman
List 2 - Mr., Mister
Output - John Doe, Ms. Jane Smith, Dr. Who, Dr. No, Doctor Doolittle, X, Batman

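import re
import pandas as pd

s = pd.Series('Mr. John Doe, Ms. Jane Smith, Dr. Who, Dr. No, Doctor Doolittle, Mister X, Batman'.split(', '))
l = ['Dr. ', 'Doctor ']
# Escape each prefix so '.' is matched literally, then join the prefixes into one alternation.
# regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal by default.
list(s.str.replace('|'.join(map(re.escape, l)), '', regex=True))
Out:
['Mr. John Doe',
'Ms. Jane Smith',
'Who',
'No',
'Doolittle',
'Mister X',
'Batman']
l = ['Mr. ', 'Mister ']
list(s.str.replace('|'.join(map(re.escape, l)), '', regex=True))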
Out:
['John Doe',
'Ms. Jane Smith',
'Dr. Who',
'Dr. No',
'Doctor Doolittle',
'X',
'Batman']
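The same idea can be sketched without pandas, using only the standard `re` module. This is a minimal illustration, not the answer's own code: the helper name `strip_prefixes` is hypothetical, and it adds two assumptions the answer does not make, namely that a prefix should only be removed at the start of a name and only when followed by whitespace.

```python
import re

def strip_prefixes(names, prefixes):
    """Remove any of the given prefixes from the start of each name."""
    # re.escape keeps characters such as '.' literal.
    # Longer prefixes go first so a short prefix cannot shadow a longer one.
    ordered = sorted(prefixes, key=len, reverse=True)
    alternatives = "|".join(re.escape(p) for p in ordered)
    # '^' anchors the match at the start; '\s+' requires whitespace after the prefix.
    pattern = re.compile(r"^(?:{})\s+".format(alternatives))
    return [pattern.sub("", name) for name in names]

names = "Mr. John Doe, Ms. Jane Smith, Dr. Who, Dr. No, Doctor Doolittle, Mister X, Batman".split(", ")
print(strip_prefixes(names, ["Dr.", "Doctor"]))
print(strip_prefixes(names, ["Mr.", "Mister"]))
```

Because the list is only read when the pattern is built, it can change between calls, which covers the "dynamic list" requirement.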

Related

Get name(s) from a JSON-format column that are not in two other name columns

I need to create a column with the name(s) of the Supervisors, taken from a JSON-format column: the participants that do not appear in either of two other name columns (Employee and Client). There can be several supervisors at the same time, or none at all.
Id | Employee | Client | AllParticipants
1 | Justin Bieber | Ariana Grande | [{"ParticipantName":"Justin Bieber"},{"ParticipantName":"Ariana Grande"}]
2 | Lionel Messi | Christiano Ronaldo | [{"ParticipantName":"Christiano Ronaldo"},{"ParticipantName":"Lionel Messi"}]
3 | Nicolas Cage | Robert De Niro | [{"ParticipantName":"Robert De Niro"},{"ParticipantName":"Nicolas Cage"},{"ParticipantName":"Brad Pitt"}]
4 | Harry Potter | Ron Weasley | [{"ParticipantName":"Ron Weasley"},{"ParticipantName":"Albus Dumbldor"},{"ParticipantName":"Harry Potter"},{"ParticipantName":"Lord Voldemort"}]
5 | Tom Holland | Henry Cavill | [{"ParticipantName":"Henry Cavill"},{"ParticipantName":"Tom Holland"}]
6 | Spider Man | Venom | [{"ParticipantName":"Venom"},{"ParticipantName":"Iron Man"},{"ParticipantName":"Superman"},{"ParticipantName":"Spider Man"}]
7 | Andrew Garfield | Leonardo DiCaprio | [{"ParticipantName":"Tom Cruise"},{"ParticipantName":"Andrew Garfield"},{"ParticipantName":"Leonardo DiCaprio"}]
8 | Dwayne Johnson | Jennifer Lawrence | [{"ParticipantName":"Jennifer Lawrence"},{"ParticipantName":"Dwayne Johnson"}]
The output column I need:
Supervisors
NULL
NULL
Brad Pitt
Albus Dumbldor, Lord Voldemort
NULL
Iron Man, Superman
Tom Cruise
NULL
I've tried creating extra columns to feed into a CASE expression afterwards, but that seems too complex.
SELECT *,
JSON_VALUE(w.AllParticipants,'$[0].ParticipantName') AS ParticipantName1,
JSON_VALUE(w.AllParticipants,'$[1].ParticipantName') AS ParticipantName2,
JSON_VALUE(w.AllParticipants,'$[2].ParticipantName') AS ParticipantName3,
JSON_VALUE(w.AllParticipants,'$[3].ParticipantName') AS ParticipantName4
FROM Work AS w
I'm wondering if there is an easy way to compare values and extract only unique ones.
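One way to think about the comparison is as a set difference: parse the JSON array, then keep the names that are neither the Employee nor the Client. The sketch below shows that logic in Python as post-processing, not as a SQL solution; the function name `supervisors` and the tuple layout of `rows` are assumptions made for illustration, and the sample data is copied from rows 3 to 5 of the question.

```python
import json

def supervisors(employee, client, all_participants_json):
    """Return the participants that are neither the employee nor the client."""
    known = {employee, client}
    names = [p["ParticipantName"] for p in json.loads(all_participants_json)]
    extra = [name for name in names if name not in known]
    # None stands in for SQL NULL when there is no supervisor.
    return ", ".join(extra) if extra else None

# (Id, Employee, Client, AllParticipants), taken from the question's sample table.
rows = [
    (3, "Nicolas Cage", "Robert De Niro",
     '[{"ParticipantName":"Robert De Niro"},{"ParticipantName":"Nicolas Cage"},{"ParticipantName":"Brad Pitt"}]'),
    (4, "Harry Potter", "Ron Weasley",
     '[{"ParticipantName":"Ron Weasley"},{"ParticipantName":"Albus Dumbldor"},{"ParticipantName":"Harry Potter"},{"ParticipantName":"Lord Voldemort"}]'),
    (5, "Tom Holland", "Henry Cavill",
     '[{"ParticipantName":"Henry Cavill"},{"ParticipantName":"Tom Holland"}]'),
]

results = {row_id: supervisors(emp, cli, js) for row_id, emp, cli, js in rows}
print(results)
```

The order of the remaining names follows their order in the JSON array, which matches the expected output in the question.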

Extract Conditional Middle name and last name

I have a data frame column that holds a full name as first name, middle name, last name. Some records have no middle name, and I want to populate the middle name conditionally, depending on which pattern the record follows, but I'm not sure how to achieve this.
import pandas as pd
name_df = pd.read_csv(r"NameData1.txt",delimiter=",")
splitted_name=name_df.name.str.split(' ',expand=True).fillna('No Value')
##splited_name['middle_name']= splited_name.apply(lambda x : x[1] if x[2] != 'No Value' else '' )
name_df['Middle_name']=name_df.apply(lambda splited_name : splited_name[1] if splited_name[2] != 'No Value' else '')
name_df
I want to display the middle name only when it's there else the last name should be populated.
Sample records:
Id,name
1,TOM M SMITH
2,Gary SMITH
3,John C Doe
4,Hary Knox
5,Rakesh Vaidya
6,John Doe Doe
Use numpy.where to set the new columns by condition. The missing values (None) left by the split are tested with Series.notna:
import numpy as np

splitted_name = name_df.name.str.split(expand=True)
name_df['First_name'] = splitted_name[0]
# A third token exists only when the name has a middle part.
name_df['Middle_name'] = np.where(splitted_name[2].notna(), splitted_name[1], '')
name_df['Last_name'] = np.where(splitted_name[2].notna(), splitted_name[2], splitted_name[1])
print(name_df)
   Id           name  First_name  Middle_name  Last_name
0   1    TOM M SMITH         TOM            M      SMITH
1   2     Gary SMITH        Gary                   SMITH
2   3     John C Doe        John            C        Doe
3   4      Hary Knox        Hary                    Knox
4   5  Rakesh Vaidya      Rakesh                  Vaidya
5   6   John Doe Doe        John          Doe        Doe
I want to display the middle name only when it's there; otherwise the last name should be populated.
So you can do the below using str.split():
# The second token is the middle name when there is one, otherwise the last name.
df['middle_or_last'] = df['name'].str.split().str[1]
print(df)
   Id           name  middle_or_last
0   1    TOM M SMITH               M
1   2     Gary SMITH           SMITH
2   3     John C Doe               C
3   4      Hary Knox            Knox
4   5  Rakesh Vaidya          Vaidya
5   6   John Doe Doe             Doe

SQL UPDATE that preserves the existing data

I have a column named 'Complete name'.
I need to change the last name 'Smiht' to 'Smith' for everyone who has it, without losing the first name or the second last name.
For example, now I have:
John Smiht G.
Sarah Connor Smiht
John Ford Connor
James Smiht Ford
And the result of the update has to be the same data, but with Smiht replaced by Smith:
John Smith G.
Sarah Connor Smith
John Ford Connor
James Smith Ford
Thanks!
The generic method is something like this:
update t
set CompleteName = replace(CompleteName, ' Smiht', ' Smith')
where CompleteName like '% Smiht%';
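To check the behaviour of such an update on the sample data, here is a small sketch that runs it against an in-memory SQLite database from Python. SQLite is only a stand-in, since the question does not say which database is used; the table name `t` and column name `CompleteName` follow the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (CompleteName TEXT)")
conn.executemany(
    "INSERT INTO t (CompleteName) VALUES (?)",
    [("John Smiht G.",), ("Sarah Connor Smiht",), ("John Ford Connor",), ("James Smiht Ford",)],
)

# Replace the misspelled last name in place; the rest of each value is untouched.
conn.execute(
    "UPDATE t SET CompleteName = replace(CompleteName, ' Smiht', ' Smith') "
    "WHERE CompleteName LIKE '% Smiht%'"
)

result = [row[0] for row in conn.execute("SELECT CompleteName FROM t ORDER BY rowid")]
print(result)
conn.close()
```

The leading space in `' Smiht'` restricts the match to last names, so a first name spelled the same way would not be changed. A longer last name that merely starts with "Smiht" would still match, so the pattern may need tightening for real data.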

SQL Sorting table based on two interchangeable fields

I want to sort a table with 3 columns (time, source, recipient) so that the rows follow the order of the communication. Rows in which the same two people are conversing, in either direction, should be listed together and ordered by time. The goal is to see each conversation between the same pair of people in time order. An example:
time|source|recipient
1 paul amy
2 amy paul
3 amy paul
5 paul jane
8 amy paul
9 jane paul
10 paul amy
11 paul jane
the end result would be like
1 paul amy
2 amy paul
3 amy paul
8 amy paul
10 paul amy
5 paul jane
9 jane paul
11 paul jane
Your question is a bit vague. My educated guess is you want this:
SELECT *
FROM tbl
ORDER BY GREATEST(source, recipient), LEAST(source, recipient), "time";
The manual about GREATEST and LEAST.
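The ordering works because GREATEST and LEAST turn each (source, recipient) pair into the same key regardless of direction. A minimal Python sketch of that sort key, using the sample rows from the question (the name `conversation_key` is hypothetical):

```python
# (time, source, recipient) rows from the question.
messages = [
    (1, "paul", "amy"),
    (2, "amy", "paul"),
    (3, "amy", "paul"),
    (5, "paul", "jane"),
    (8, "amy", "paul"),
    (9, "jane", "paul"),
    (10, "paul", "amy"),
    (11, "paul", "jane"),
]

def conversation_key(message):
    """Sort by the direction-independent pair first, then by time."""
    time, source, recipient = message
    # max/min play the role of GREATEST/LEAST in the SQL answer.
    return (max(source, recipient), min(source, recipient), time)

ordered = sorted(messages, key=conversation_key)
for row in ordered:
    print(*row)
```

Both ("paul", "amy") and ("amy", "paul") map to the key prefix ("paul", "amy"), so the whole conversation sorts together before time breaks the ties.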

duplicate fields with an inner join

I'm having trouble understanding how to do a multi-table join without generating lots of duplicate fields.
Let's say that I have three tables:
family: id, name
parent: id, family, name
child: id, family, name
If I do a simple select:
select family.id, family.name from family
order by family.id;
I get a simple list:
ID Name
1 Smith
2 Jones
3 Wong
If I add an inner join:
select family.id, family.name, parent.first_name, parent.last_name
from family
inner join parent
on parent.family = family.id
order by family.id;
I get some duplicated fields:
ID Name Parent
1 Smith Howard Smith
1 Smith Janet Smith
2 Jones Phil Jones
2 Jones Harriet Jones
3 Wong Billy Wong
3 Wong Rachel Wong
And if I add another inner join:
select family.id, family.name, parent.first_name, parent.last_name, child.first_name, child.last_name
from family
inner join parent
on parent.family = family.id
inner join child
on child.family = family.id
order by family.id;
I get even more duplicated fields:
ID Name Parent Child
1 Smith Howard Smith Peter Smith
1 Smith Howard Smith Sally Smith
1 Smith Howard Smith Fred Smith
1 Smith Janet Smith Peter Smith
1 Smith Janet Smith Sally Smith
1 Smith Janet Smith Fred Smith
2 Jones Phil Jones Mark Jones
2 Jones Phil Jones Melissa Jones
2 Jones Harriet Jones Mark Jones
2 Jones Harriet Jones Melissa Jones
3 Wong Billy Wong Mary Wong
3 Wong Billy Wong Jennifer Wong
3 Wong Rachel Wong Mary Wong
3 Wong Rachel Wong Jennifer Wong
What I would prefer, because it's more human readable, is something like this:
ID Name  Parent         Child
1  Smith Howard Smith   Peter Smith
         Janet Smith    Sally Smith
                        Fred Smith
2  Jones Phil Jones     Mark Jones
         Harriet Jones  Melissa Jones
3  Wong  Billy Wong     Mary Wong
         Rachel Wong    Jennifer Wong
I know that one of the benefits of an inner join is to avoid presenting excess information through a Cartesian product. But it seems that I get something similar with a multi-table join. Is there a way to summarize each group as shown above or will this require post-processing with a scripting language like Python?
Thanks,
--Dan
This is precisely the way relational databases work: each row must contain all of its information in itself, with every single field that you request. In other words, each row needs to make sense in isolation from all other rows. If you run a single query and need all three levels of information, you have to eliminate the duplicates yourself to get the desired formatting.
Alternatively, you can run three separate queries and then do in-memory joins in code. Although this may be desirable in certain rare situations, it is generally a poor use of your development time, because an RDBMS is usually much more efficient at joining relational data.
You've hit it on the head. You'll need some post processing to get the results you're looking for.
SQL query results are always simple tabular data, so to get the results you're looking for would definitely not be a pretty query. You could do it, but it would involve quite a bit of query voodoo, storing things in temporary tables or using cursors, or some other funky workaround.
I'd definitely suggest using an external application to retrieve your data and format it appropriately from there.
ORMs like Entity Framework in .NET can probably do this pretty easily, but you could definitely do this with a few nested collections or dictionaries in any language.
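As a rough illustration of the "nested collections or dictionaries" approach, the sketch below post-processes the joined rows in Python. It assumes the rows arrive as (family id, family name, parent, child) tuples, and the function name `summarize` is hypothetical. Dictionaries are used as ordered sets, so two parents or two children with identical names in one family would be merged; real code would key on the parent and child ids instead.

```python
from itertools import zip_longest

# Rows as returned by the three-table join (families 1 and 2 from the question).
rows = [
    (1, "Smith", "Howard Smith", "Peter Smith"),
    (1, "Smith", "Howard Smith", "Sally Smith"),
    (1, "Smith", "Howard Smith", "Fred Smith"),
    (1, "Smith", "Janet Smith", "Peter Smith"),
    (1, "Smith", "Janet Smith", "Sally Smith"),
    (1, "Smith", "Janet Smith", "Fred Smith"),
    (2, "Jones", "Phil Jones", "Mark Jones"),
    (2, "Jones", "Phil Jones", "Melissa Jones"),
    (2, "Jones", "Harriet Jones", "Mark Jones"),
    (2, "Jones", "Harriet Jones", "Melissa Jones"),
]

def summarize(rows):
    """Collapse the join output into one line per parent/child slot per family."""
    families = {}
    for fam_id, fam_name, parent, child in rows:
        # Dict keys act as insertion-ordered sets, dropping the repeated names.
        _, parents, children = families.setdefault(fam_id, (fam_name, {}, {}))
        parents[parent] = None
        children[child] = None
    report = []
    for fam_id, (fam_name, parents, children) in families.items():
        pairs = zip_longest(parents, children, fillvalue="")
        for i, (parent, child) in enumerate(pairs):
            # Show the family id and name only on the first line of each group.
            if i == 0:
                report.append((str(fam_id), fam_name, parent, child))
            else:
                report.append(("", "", parent, child))
    return report

report = summarize(rows)
for line in report:
    print("{:<3}{:<7}{:<15}{}".format(*line))
```

Pairing parents with children by position is purely a layout choice here; the two columns are independent lists, which is why the shorter one is padded with empty strings.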