Extract Conditional Middle name and last name - pandas

I have a column in a data frame that holds a full name as first name, middle name, and last name. Some records have no middle name, and I want to populate the middle name conditionally based on the available pattern, but I am not sure how to achieve this.
import pandas as pd
name_df = pd.read_csv(r"NameData1.txt",delimiter=",")
splitted_name=name_df.name.str.split(' ',expand=True).fillna('No Value')
##splited_name['middle_name']= splited_name.apply(lambda x : x[1] if x[2] != 'No Value' else '' )
name_df['Middle_name']=name_df.apply(lambda splited_name : splited_name[1] if splited_name[2] != 'No Value' else '')
name_df
I want to display the middle name only when it's there else the last name should be populated.
Sample records:
Id,name
1,TOM M SMITH
2,Gary SMITH
3,John C Doe
4,Hary Knox
5,Rakesh Vaidya
6,John Doe Doe

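Use numpy.where to set each new column by condition. The missing values (None) left by the split are tested here with Series.notna:
import numpy as np
# split on whitespace; names with only two parts get None in column 2
splitted_name = name_df.name.str.split(expand=True)
name_df['First_name'] = splitted_name[0]
# a third part exists -> the second part is the middle name
name_df['Middle_name'] = np.where(splitted_name[2].notna(), splitted_name[1], '')
# no third part -> the second part is the last name
name_df['Last_name'] = np.where(splitted_name[2].notna(), splitted_name[2], splitted_name[1])
print(name_df)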
   Id           name First_name Middle_name Last_name
0   1    TOM M SMITH        TOM           M     SMITH
1   2     Gary SMITH       Gary                 SMITH
2   3     John C Doe       John           C       Doe
3   4      Hary Knox       Hary                  Knox
4   5  Rakesh Vaidya     Rakesh                Vaidya
5   6   John Doe Doe       John         Doe       Doe
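The same conditional logic can also be sketched without pandas, as a plain Python function that is easy to test on its own. The helper name split_name is hypothetical, and the sketch assumes every name has two or three whitespace-separated parts, as in the sample records.

```python
def split_name(full_name):
    """Return (first, middle, last); middle is '' for two-part names.

    Assumption: names have two or three whitespace-separated parts,
    as in the sample records. Any other shape raises ValueError.
    """
    parts = full_name.split()
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    if len(parts) == 2:
        # no middle name: the second part is the last name
        return parts[0], "", parts[1]
    raise ValueError(f"unexpected name format: {full_name!r}")


names = ["TOM M SMITH", "Gary SMITH", "John C Doe", "Hary Knox"]
for name in names:
    print(split_name(name))
```

Raising on unexpected shapes is a design choice of this sketch; names with more than three parts would need a rule the question does not specify.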

I want to display the middle name only when it's there, else the last name should be populated.
So you can do the below using str.split():
# the second whitespace-separated token is the middle name if present, otherwise the last name
df['middle_or_last'] = df['name'].str.split().str[1]
print(df)
   Id           name middle_or_last
0   1    TOM M SMITH              M
1   2     Gary SMITH          SMITH
2   3     John C Doe              C
3   4      Hary Knox           Knox
4   5  Rakesh Vaidya         Vaidya
5   6   John Doe Doe            Doe

Related

Loop and calculate month difference based on criteria

I've got a dataset like the one below and I'm not sure where to start. I'm using Aginity Workbench for Netezza, and whenever there is an interaction I want to see whether there is a conversion within 3 months. The solution needs to scale across multiple customers.
Date Customer Interaction Conversion
1/01/2017 John Smith 1 0
1/02/2017 John Smith 0
1/03/2017 John Smith 0
1/04/2017 John Smith 0
1/05/2017 John Smith 0
1/06/2017 John Smith 1 0
1/07/2017 John Smith 1 0
1/08/2017 John Smith 1
1/09/2017 John Smith 0
1/10/2017 John Smith 0
1/11/2017 John Smith 0
1/12/2017 John Smith 0
Ideally the output should look like the below, where the conversion is attributed once based on a three-month window of interactions. So if there are any interactions in subsequent months, the conversion is attributed to the first month of the three-month window. It also needs to flag when an interaction and a conversion happen in the same month.
Date Customer Interaction Conversion 3MonthConversion
1/01/2017 John Smith 1 0 0
1/02/2017 John Smith 0
1/03/2017 John Smith 0
1/04/2017 John Smith 0
1/05/2017 John Smith 0
1/06/2017 John Smith 1 0 1
1/07/2017 John Smith 1 0
1/08/2017 John Smith 1
1/09/2017 John Smith 0
1/10/2017 John Smith 0
1/11/2017 John Smith 0
1/12/2017 John Smith 0
The query below should work. Please let me know if you run into any issues.
-- partition by customer so the look-ahead never reads another customer's rows
select date, customer, interaction, conversion,
case when interaction=1 and (lead(conversion,1) over (partition by customer order by date))=1 then 1
when interaction=1 and (lead(conversion,2) over (partition by customer order by date))=1 then 1
when interaction=1 and (lead(conversion,3) over (partition by customer order by date))=1 then 1
else 0 end as threeMonthconversion
from test_month
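To make the look-ahead explicit, here is a minimal Python sketch of the same idea as lead(conversion, 1..3): an interaction row is flagged when any of the next three rows for that customer has a conversion. Rows are grouped by customer because the question asks for this to scale across customers. The function name, the tuple layout, and the sample rows are all made up for illustration, and the sketch assumes one row per customer per month. Like the query, it does not implement the "attribute only once" or "same month" parts of the question.

```python
from collections import defaultdict


def flag_three_month_conversion(rows, window=3):
    """Flag interaction rows followed by a conversion within `window` rows.

    rows: (month_index, customer, interaction, conversion) tuples,
    assumed to contain one row per customer per month.
    Returns a dict mapping (customer, month_index) to 0 or 1.
    """
    by_customer = defaultdict(list)
    for month, customer, interaction, conversion in rows:
        by_customer[customer].append((month, interaction, conversion))

    flags = {}
    for customer, history in by_customer.items():
        history.sort()  # order by month within each customer
        for i, (month, interaction, _) in enumerate(history):
            # look ahead 1..window rows, like lead(conversion, 1..3)
            upcoming = history[i + 1 : i + 1 + window]
            converted = any(conv == 1 for _, _, conv in upcoming)
            flags[(customer, month)] = 1 if interaction == 1 and converted else 0
    return flags


# made-up rows for illustration only (not the question's data)
sample = [
    (1, "A", 1, 0), (2, "A", 0, 0), (3, "A", 0, 0), (4, "A", 0, 0),
    (5, "A", 0, 0), (6, "A", 1, 0), (7, "A", 0, 0), (8, "A", 0, 1),
    (1, "B", 1, 0), (2, "B", 0, 1),
]
flags = flag_three_month_conversion(sample)
print(flags[("A", 1)], flags[("A", 6)], flags[("B", 1)])
```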

How to populate field based on groupings?

I am trying to create a PowerShell (5.1) script (I'm open to SQL suggestions as well, using SSMS with SQL Server 2014) to eliminate the manual effort of updating a very large data file exported in CSV format.
Here is what the raw data set that needs to be updated looks like:
Parent ID | Parent Owner | Sub ID | Sub Owner | Sub Hours
A         |              | aA     | Rob Green | 0
A         |              | aB     | Rob Green | 6
B         |              | aA     | Jane Doe  | 4
B         |              | aB     | Jane Doe  | 10
B         |              | aC     | Bob Smith | 18
C         |              | cA     | Jane Doe  | 0
C         |              | cB     | Jane Doe  | 6
D         |              | dA     | Bob Smith | 0
D         |              | dB     | Bob Smith | 6
E         |              | dE     | Joe Brown | 0
As you can see, Parent IDs can have one or many Sub Owners and Sub IDs.
My goal is to populate the Parent Owner field based on the following criteria:
For every Parent ID set:
1. If there is only one distinct Sub Owner, then that Sub Owner should be the Parent Owner for all corresponding Parent IDs.
2. If there is only one occurrence of a Parent ID, then that Sub Owner should be the Parent Owner for that Parent ID.
3. If there are multiple Sub Owners for the Parent ID, the Sub Owner with the highest summed Sub Hours should be the Parent Owner for every occurrence of that Parent ID.
To clarify, the criteria apply to the raw data above like so:
Parent ID "A" falls under criterion 1
Parent ID "B" falls under criterion 3
Parent ID "C" falls under criterion 1
Parent ID "D" falls under criterion 1
Parent ID "E" falls under criterion 2
This is what I expect the data above to look like once completed:
Parent ID | Parent Owner | Sub ID | Sub Owner | Sub Hours
A         | Rob Green    | aA     | Rob Green | 0
A         | Rob Green    | aB     | Rob Green | 6
B         | Bob Smith    | aA     | Jane Doe  | 4
B         | Bob Smith    | aB     | Jane Doe  | 10
B         | Bob Smith    | aC     | Bob Smith | 18
C         | Jane Doe     | cA     | Jane Doe  | 0
C         | Jane Doe     | cB     | Jane Doe  | 6
D         | Bob Smith    | dA     | Bob Smith | 0
D         | Bob Smith    | dB     | Bob Smith | 6
E         | Joe Brown    | dE     | Joe Brown | 0
My biggest struggle is criteria 3. I cannot wrap my head around how to do this. Can anyone give me an idea of how I can get my expected output using PS or SQL?
Any help would be GREATLY appreciated!
I vowed to myself that SQL is strictly taboo for me. However, below is an example of a pure PowerShell solution (and I'm pretty sure it can be converted to SQL easily):
# mimic SQL output
$SqlOutput = @"
Parent ID|Parent Owner|Sub ID|Sub Owner|Sub Hours
A||aA|Rob Green|0
A||aB|Rob Green|6
B||aA|Jane Doe|4
B||aB|Jane Doe|10
B||aC|Bob Smith|18
C||cA|Jane Doe|0
C||cB|Jane Doe|6
D||dA|Bob Smith|0
D||dB|Bob Smith|6
E||dE|Joe Brown|0
"@ | ConvertFrom-Csv -Delimiter '|'
# compute an auxiliary variable
$SqlOutputGroups = $SqlOutput |
    Group-Object -Property 'Parent ID', 'Sub Owner' |
    ForEach-Object {
        New-Object psobject -Property @{
            'Parent ID' = ( $_.Name -split ', ')[0]
            'Sub Owner' = ( $_.Name -split ', ')[1]
            Hours       = ( $_.Group |
                Measure-Object -Property 'Sub Hours' -Sum).Sum
        }
    }
# compute Criterium3 as a hashtable
$SqlOutputCriterium3 = @{}
$SqlOutputGroups | Group-Object -Property 'Parent ID' |
    ForEach-Object {
        $SqlOutputCriterium3[$_.Name] = ($_.Group |
            Sort-Object -Property Hours |
            Select-Object -Last 1).'Sub Owner'
    }
# apply Criterium3
$SqlOutput | ForEach-Object {
    $_.'Parent Owner' = $SqlOutputCriterium3.$($_.'Parent ID')
}
# show result in a table format
$SqlOutput | Format-Table -AutoSize
Output: D:\PShell\SO\45963820.ps1
Parent ID Parent Owner Sub ID Sub Owner Sub Hours
--------- ------------ ------ --------- ---------
A         Rob Green    aA     Rob Green 0
A         Rob Green    aB     Rob Green 6
B         Bob Smith    aA     Jane Doe  4
B         Bob Smith    aB     Jane Doe  10
B         Bob Smith    aC     Bob Smith 18
C         Jane Doe     cA     Jane Doe  0
C         Jane Doe     cB     Jane Doe  6
D         Bob Smith    dA     Bob Smith 0
D         Bob Smith    dB     Bob Smith 6
E         Joe Brown    dE     Joe Brown 0
Note that Criterium 3 also covers criteria 1 and 2. However, it does not suffice when several Sub Owners share the same highest sum of Sub Hours for a particular Parent ID. For example, with B||aA|Jane Doe|8 instead of B||aA|Jane Doe|4 in the example above, Jane Doe and Bob Smith would both have a Sub Hours sum of 18 for Parent ID B.
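For comparison, the group, sum, and pick-the-maximum steps can be sketched in Python, with the tie case reported explicitly instead of being resolved silently. The function name pick_parent_owners is hypothetical, and breaking ties alphabetically is an assumption of this sketch, not something the question specifies.

```python
from collections import defaultdict


def pick_parent_owners(rows):
    """Choose a Parent Owner per Parent ID by highest summed Sub Hours.

    rows: iterable of (parent_id, sub_owner, sub_hours).
    Returns (owners, ties): owners maps parent_id -> chosen owner,
    ties maps parent_id -> owners sharing the top total (ties only).
    """
    totals = defaultdict(lambda: defaultdict(float))
    for parent_id, sub_owner, hours in rows:
        totals[parent_id][sub_owner] += float(hours)

    owners, ties = {}, {}
    for parent_id, per_owner in totals.items():
        top = max(per_owner.values())
        leaders = sorted(o for o, h in per_owner.items() if h == top)
        # assumption: alphabetical tie-break, only to stay deterministic
        owners[parent_id] = leaders[0]
        if len(leaders) > 1:
            ties[parent_id] = leaders
    return owners, ties


rows = [
    ("A", "Rob Green", 0), ("A", "Rob Green", 6),
    ("B", "Jane Doe", 4), ("B", "Jane Doe", 10), ("B", "Bob Smith", 18),
    ("C", "Jane Doe", 0), ("C", "Jane Doe", 6),
    ("D", "Bob Smith", 0), ("D", "Bob Smith", 6),
    ("E", "Joe Brown", 0),
]
owners, ties = pick_parent_owners(rows)
print(owners)
print(ties)
```

Returning the ties separately leaves the decision about tied Parent IDs to the caller.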

SQL Sorting table based on two interchangeable fields

I want to sort a table with 3 columns (time, source, recipient) in the order in which communication takes place. If a source and a recipient are conversing with each other, their messages should be listed together, ordered by time. The goal is to see the communication between the same pair of people ordered by time. An example:
time|source|recipient
1 paul amy
2 amy paul
3 amy paul
5 paul jane
8 amy paul
9 jane paul
10 paul amy
11 paul jane
The end result would be:
1 paul amy
2 amy paul
3 amy paul
8 amy paul
10 paul amy
5 paul jane
9 jane paul
11 paul jane
Your question is a bit vague. My educated guess is you want this:
SELECT *
FROM tbl
ORDER BY GREATEST(source, recipient), LEAST(source, recipient), "time";
The manual about GREATEST and LEAST.
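To see why this ordering groups each pair together, here is a small Python sketch of the same sort key, using max and min in place of GREATEST and LEAST on the question's sample rows. The function name conversation_key is hypothetical. Python compares strings by code point, which may differ from the database collation.

```python
messages = [
    (1, "paul", "amy"), (2, "amy", "paul"), (3, "amy", "paul"),
    (5, "paul", "jane"), (8, "amy", "paul"), (9, "jane", "paul"),
    (10, "paul", "amy"), (11, "paul", "jane"),
]


def conversation_key(message):
    """Direction-independent pair key first, then time."""
    time, source, recipient = message
    # max/min make (paul, amy) and (amy, paul) produce the same key
    return (max(source, recipient), min(source, recipient), time)


ordered = sorted(messages, key=conversation_key)
for row in ordered:
    print(*row)
```

Because both directions of a conversation map to the same leading key, the time component only orders rows within a pair.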

Why does my view query split into two?

I am trying to create a view that records the selected attributes for all Computer Science majors.
This is my query to create a view:
DROP VIEW CS_grade_report;
CREATE VIEW CS_grade_report AS
SELECT Student.student_id AS "ID",
student_name AS "Name",
course_number AS "Course #",
credit AS "Credit",
grade AS Grade
FROM Student, Class, Enrolls
WHERE major = 'CSCI'
AND Student.student_id = Enrolls.student_id
AND Class.schedule_num = Enrolls.schedule_num;
SELECT *
FROM CS_grade_report;
And this is what is generated:
ID Name Course # Credit GR
------ ------------------------- -------- ---------- --
600000 John Smith CSCI3200 4 B+
600000 John Smith CSCI3700 3 C
600000 John Smith SPAN1004 3 A-
600000 John Smith CSCI4300 3 A+
600001 Andrew Tram MUSC2406 2 A+
600001 Andrew Tram SPAN1004 3 A
600001 Andrew Tram CSCI3700 3 B-
600002 Jane Doe CSCI4200 3 D+
600003 Michael Jordan CSCI4300 3 A+
600004 Tiger Woods MUSC1000 1 A
600007 Dominique Davis CSCI4300 3 F
ID Name Course # Credit GR
------ ------------------------- -------- ---------- --
600009 Will Smith CSCI3200 4 A
600010 Papa Johns CSCI3200 4 B
600011 John Doe CSCI3200 4 C
600012 Jackie Chan CSCI3200 4 D
600013 Some Guy CSCI3200 4 E
16 rows selected.
I am assuming this is output from sqlplus. There is a "pagesize" option to define when breaks are added. If you only want to see one heading, set the size to a large enough value prior to running your SELECT statement as such:
set pagesize 500
(or whatever size you want)
There are many command options for sqlplus. This link is a good cheat-sheet.

SQL query to return all columns from table, but with a max of 3 duplicate id's

Can someone please lend a hand with this query? I've been fooling with LIMIT or TOP, but I think I'm off track. I want to return all fields from a table, but with a max of 3 duplicate id's in the new table.
Table
id first last
===================
1 John Doe
1 John Doe
1 John Doe
1 John Doe
2 Mary Green
2 Mary Green
3 Stacy Kirk
3 Stacy Kirk
3 Stacy Kirk
3 Stacy Kirk
3 Stacy Kirk
Desired Results (up to 3 ids)
id first last
====================
1 John Doe
1 John Doe
1 John Doe
2 Mary Green
2 Mary Green
3 Stacy Kirk
3 Stacy Kirk
3 Stacy Kirk
Thanks!
Since you mentioned TOP, this is for SQL Server:
SELECT id, first, last
FROM
(
SELECT id, first, last,
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY LAST) rn
FROM TABLE1
) s
WHERE s.rn <= 3
SQLFiddle Demo (SQL Server)
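The effect of the rn <= 3 filter can be sketched in Python as a per-id counter that keeps at most three rows for each id. The function name limit_per_id is hypothetical. Unlike the query, which numbers rows by last name within each id, this sketch simply keeps the first three rows in input order.

```python
from collections import Counter


def limit_per_id(rows, max_per_id=3):
    """Keep at most `max_per_id` rows for each id (first tuple element)."""
    seen = Counter()
    kept = []
    for row in rows:
        row_id = row[0]
        if seen[row_id] < max_per_id:
            kept.append(row)
            seen[row_id] += 1
    return kept


rows = (
    [(1, "John", "Doe")] * 4
    + [(2, "Mary", "Green")] * 2
    + [(3, "Stacy", "Kirk")] * 5
)
result = limit_per_id(rows)
for row in result:
    print(*row)
```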