Pandas groupby multiple keys selecting unique values and transforming - pandas

I have a data frame df=
Owner Manager Date Hours City
John Jerry 1/2/16 10 LA
John Jerry 1/2/16 10 SF
Mary Jerry 1/2/16 9 LA
Zach Joe 1/3/16 5 SD
Wendy Joe 1/3/16 4 SF
Hal Joe 1/4/16 2 SD
... 100,000 entries
I would like to group by 'Manager' and 'Date', then select unique values of 'Owner' and sum 'Hours' of that selection, finally transforming the sum to a new column 'Hours_by_Manager'.
My desired output is:
Owner Manager Date Hours City Hours_by_Manager
John Jerry 1/2/16 10 LA 19
John Jerry 1/2/16 10 SF 19
Mary Jerry 1/2/16 9 LA 19
Zach Joe 1/3/16 5 SD 9
Wendy Joe 1/3/16 4 SF 9
Hal Joe 1/4/16 2 SD 2
I tried using pandas 'groupby' like this:
df['Hours_by_Manager']=df.groupby(['Manager','Date'])['Hours'].transform(lambda x: sum(x.unique()))
This gives me what I want, but only because each 'Owner' happens to have a different value of 'Hours'. What I'm looking for is something like this:
df['Hours_by_Manager']=df.groupby(['Manager','Date'])['Owner'].unique()['Hours']transform(lambda x: sum(x))
Which obviously is not syntactically correct. I know I could use for loops, but I would like to keep things vectorized. Any suggestions?

import pandas as pd
df = pd.DataFrame({'City': ['LA', 'SF', 'LA', 'SD', 'SF', 'SD'],
'Date': ['1/2/16', '1/2/16', '1/2/16', '1/3/16', '1/3/16', '1/4/16'],
'Hours': [10, 10, 9, 5, 4, 2],
'Manager': ['Jerry', 'Jerry', 'Jerry', 'Joe', 'Joe', 'Joe'],
'Owner': ['John', 'John', 'Mary', 'Zach', 'Wendy', 'Hal']})
uniques = df.drop_duplicates(subset=['Hours','Owner','Date'])
hours = uniques.groupby(['Manager', 'Date'])['Hours'].sum().reset_index()
hours = hours.rename(columns={'Hours':'Hours_by_Manager'})
result = pd.merge(df, hours, how='left')
print(result)
yields
City Date Hours Manager Owner Hours_by_Manager
0 LA 1/2/16 10 Jerry John 19
1 SF 1/2/16 10 Jerry John 19
2 LA 1/2/16 9 Jerry Mary 19
3 SD 1/3/16 5 Joe Zach 9
4 SF 1/3/16 4 Joe Wendy 9
5 SD 1/4/16 2 Joe Hal 2
Explanation:
An Owner on a given Date works a unique number of Hours. So let's first create a table of unique ['Hours','Owner','Date'] rows:
uniques = df.drop_duplicates(subset=['Hours','Owner','Date'])
# alternatively, uniques = df.groupby(['Hours','Owner','Date']).first().reset_index()
# City Date Hours Manager Owner
# 0 LA 1/2/16 10 Jerry John
# 2 LA 1/2/16 9 Jerry Mary
# 3 SD 1/3/16 5 Joe Zach
# 4 SF 1/3/16 4 Joe Wendy
# 5 SD 1/4/16 2 Joe Hal
Now we can group by ['Manager', 'Date'] and sum the Hours:
hours = uniques.groupby(['Manager', 'Date'])['Hours'].sum().reset_index()
Manager Date Hours
0 Jerry 1/2/16 19
1 Joe 1/3/16 9
2 Joe 1/4/16 2
The hours['Hours'] column contains the values we want in df['Hours_by_Manager'].
hours = hours.rename(columns={'Hours':'Hours_by_Manager'})
So now we can merge df and hours to obtain the desired result:
result = pd.merge(df, hours, how='left')
# City Date Hours Manager Owner Hours_by_Manager
# 0 LA 1/2/16 10 Jerry John 19
# 1 SF 1/2/16 10 Jerry John 19
# 2 LA 1/2/16 9 Jerry Mary 19
# 3 SD 1/3/16 5 Joe Zach 9
# 4 SF 1/3/16 4 Joe Wendy 9
# 5 SD 1/4/16 2 Joe Hal 2
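A merge-free variant is also possible. The following is only a minimal sketch, assuming (as above) that an Owner logs a single Hours value per Manager and Date; the name first_seen is just illustrative. It zeroes out the repeated Owner rows and lets transform broadcast the group total back onto every row:

```python
import pandas as pd

df = pd.DataFrame({'City': ['LA', 'SF', 'LA', 'SD', 'SF', 'SD'],
                   'Date': ['1/2/16', '1/2/16', '1/2/16', '1/3/16', '1/3/16', '1/4/16'],
                   'Hours': [10, 10, 9, 5, 4, 2],
                   'Manager': ['Jerry', 'Jerry', 'Jerry', 'Joe', 'Joe', 'Joe'],
                   'Owner': ['John', 'John', 'Mary', 'Zach', 'Wendy', 'Hal']})

# True only for the first row of each (Owner, Manager, Date) combination
first_seen = ~df.duplicated(subset=['Owner', 'Manager', 'Date'])

# Count each Owner's Hours once per group, then broadcast the group total
df['Hours_by_Manager'] = (
    df['Hours'].where(first_seen, 0)
    .groupby([df['Manager'], df['Date']])
    .transform('sum')
)
print(df)
```

Because transform returns a result aligned to the original index, no reset_index, rename or merge step is needed.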

Related

aggregate data between two dates with two dataframes

I have the following DF.
Assume this table has all the sales reps and all the quarter (Q) end dates for the last 20 years.
Q End date  Rep    Var1
03/31/2010  Bob    11
03/31/2010  Alice  12
03/31/2010  Jack   13
06/30/2010  Bob    14
06/30/2010  Alice  15
06/30/2010  Jack   16
I also have a table of transaction events:
Sell Date   Rep
04/01/2009  Bob
03/01/2010  Bob
02/01/2010  Jack
02/01/2010  Jack
I am trying to add a column to the first DF that counts, per Q end date and per Rep, the transactions that happened in the 12 months prior to the Q end date.
The result should look like this:
Q End date  Rep    Var1  Trailing 12M transactions
03/31/2010  Bob    11    2
03/31/2010  Alice  12    0
03/31/2010  Jack   13    2
06/30/2010  Bob    14    1
06/30/2010  Alice  15    0
06/30/2010  Jack   16    2
My table has 2000-3000 sales reps per quarter for ~20 years, and the number of transactions per trailing 12 months can range from 0 to roughly 7k.
Any help here would be appreciated. Thanks!
Try:
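import pandas as pd

df1["Q End date"] = pd.to_datetime(df1["Q End date"])
df2["Sell Date"] = pd.to_datetime(df2["Sell Date"])
# Sort by date and index on it so that .loc can slice a date range
df2 = df2.sort_values(by="Sell Date").set_index("Sell Date")
# Per row: count this rep's sales in the year ending on the Q end date (both ends inclusive)
df1["Trailing 12M transactions"] = df1.apply(
    lambda x: df2.loc[
        x["Q End date"] - pd.DateOffset(years=1) : x["Q End date"], "Rep"
    ]
    .eq(x["Rep"])
    .sum(),
    axis=1,
)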
print(df1)
Prints:
Q End date Rep Var1 Trailing 12M transactions
0 2010-03-31 Bob 11 2
1 2010-03-31 Alice 12 0
2 2010-03-31 Jack 13 2
3 2010-06-30 Bob 14 1
4 2010-06-30 Alice 15 0
5 2010-06-30 Jack 16 2
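With 2000-3000 reps per quarter, a row-wise apply may be slow. A possible alternative, sketched here under the assumption that the same inclusive one-year window is wanted, is to sort each rep's sale dates once and count with a binary search. The names sales_by_rep, empty and counts are illustrative and not part of the answer above:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "Q End date": ["03/31/2010"] * 3 + ["06/30/2010"] * 3,
    "Rep": ["Bob", "Alice", "Jack"] * 2,
    "Var1": [11, 12, 13, 14, 15, 16],
})
df2 = pd.DataFrame({
    "Sell Date": ["04/01/2009", "03/01/2010", "02/01/2010", "02/01/2010"],
    "Rep": ["Bob", "Bob", "Jack", "Jack"],
})
df1["Q End date"] = pd.to_datetime(df1["Q End date"], format="%m/%d/%Y")
df2["Sell Date"] = pd.to_datetime(df2["Sell Date"], format="%m/%d/%Y")

# One sorted array of sale dates per rep
sales_by_rep = {rep: np.sort(dates.to_numpy())
                for rep, dates in df2.groupby("Rep")["Sell Date"]}
empty = np.array([], dtype="datetime64[ns]")  # for reps without any sales

counts = pd.Series(0, index=df1.index)
for rep, q_ends in df1.groupby("Rep")["Q End date"]:
    sales = sales_by_rep.get(rep, empty)
    ends = q_ends.to_numpy()
    starts = (q_ends - pd.DateOffset(years=1)).to_numpy()
    # Number of sales with start <= sale date <= end
    hi = np.searchsorted(sales, ends, side="right")
    lo = np.searchsorted(sales, starts, side="left")
    counts.loc[q_ends.index] = hi - lo

df1["Trailing 12M transactions"] = counts
print(df1)
```

The loop runs once per rep rather than once per row, and each lookup is a binary search over that rep's sales only.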

Conditionally concatenate rows of a dataframe and process additional columns based on the condition

I have an input dataframe that looks like the following:
NAME TEXT START END
Tim Tim Wagner is a teacher. 10 20.5
Tim He is from Cleveland, Ohio. 20.5 40
Frank Frank is a musician. 40 50
Tim He like to travel with his family 50 62
Frank He is a performing artist who plays the cello. 62 70
Frank He performed at the Carnegie Hall last year. 70 85
Frank It was fantastic listening to him. 85 90
I want the output dataframe to look as follows:
NAME TEXT START END
Tim Tim Wagner is a teacher. He is from Cleveland, Ohio. 10 40
Frank Frank is a musician 40 50
Tim He like to travel with his family 50 62
Frank He is a performing artist who plays the cello. He performed at the Carnegie Hall last year. It was fantastic listening to him. 62 90
Appreciate your help on this.
Thanks
Try:
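# Label each run of consecutive rows that share the same NAME
grp = (df['NAME'] != df['NAME'].shift()).cumsum().rename('group')
# Select the columns with a list (double brackets); a bare tuple of names is rejected by current pandas
df.groupby(['NAME', grp], sort=False)[['TEXT','START','END']]\
  .agg({'TEXT':lambda x: ' '.join(x), 'START': 'min', 'END':'max'})\
  .reset_index().drop('group', axis=1)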
Output:
NAME TEXT START END
0 Tim Tim Wagner is a teacher. He is from Cleveland,... 10.0 40.0
1 Frank Frank is a musician. 40.0 50.0
2 Tim He like to travel with his family 50.0 62.0
3 Frank He is a performing artist who plays the cello.... 62.0 90.0
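To see why the shift/cumsum step works, here is a self-contained sketch on the data from the question. It is a variant rather than the exact code above: it groups on the run label alone and uses named aggregation, and run_id and out are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "NAME": ["Tim", "Tim", "Frank", "Tim", "Frank", "Frank", "Frank"],
    "TEXT": ["Tim Wagner is a teacher.",
             "He is from Cleveland, Ohio.",
             "Frank is a musician.",
             "He like to travel with his family",
             "He is a performing artist who plays the cello.",
             "He performed at the Carnegie Hall last year.",
             "It was fantastic listening to him."],
    "START": [10, 20.5, 40, 50, 62, 70, 85],
    "END": [20.5, 40, 50, 62, 70, 85, 90],
})

# A row starts a new run whenever NAME differs from the row above it;
# the cumulative sum of those flags gives every run its own number.
run_id = (df["NAME"] != df["NAME"].shift()).cumsum().rename("run_id")
print(run_id.tolist())

out = (
    df.groupby(run_id, sort=False)
      .agg(NAME=("NAME", "first"),
           TEXT=("TEXT", " ".join),
           START=("START", "min"),
           END=("END", "max"))
      .reset_index(drop=True)
)
print(out)
```

Grouping on NAME alone would merge both Tim blocks into one row, which is why the run label is needed.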

If last names are similar in [Name] column, fill in missing values of another column

Below is a sample of a much larger dataframe.
Fare Cabin Pclass Ticket Name
257 86.5000 B77 1 110152 Cherry, Miss. Gladys
759 86.5000 B77 1 110152 Rothes, the Countess. of (Lucy Noel Martha Dye...
504 86.5000 B79 1 110152 Maioni, Miss. Roberta
262 79.6500 E67 1 110413 Taussig, Mr. Emil
558 79.6500 E67 1 110413 Taussig, Mrs. Emil (Tillie Mandelbaum)
585 79.6500 NaN 1 110413 Taussig, Miss. Ruth
475 52.0000 A14 1 110465 Clifford, Mr. George Quincy
110 52.0000 C110 1 110465 Porter, Mr. Walter Chamberlain
335 26.0000 C106 1 110469 Maguire, Mr. John Edward
158 26.5500 D22 1 110489 Borebank, Mr. John James
430 26.5500 C52 1 110564 Bjornstrom-Steffansson, Mr. Mauritz Hakan
236 75.2500 D37 1 110813 Warren, Mr. Frank Manley
366 75.2500 D37 1 110813 Warren, Mrs. Frank Manley (Anna Sophia Atkinson)
191 26.0000 NaN 1 111163 Salomon, Mr. Abraham L
170 33.5000 B19 1 111240 Van der hoef, Mr. Wyckoff
462 38.5000 E63 1 111320 Gee, Mr. Arthur H
329 57.9792 NaN 1 111361 Hippach, Miss. Jean Gertrude
523 57.9792 B18 1 111361 Hippach, Mrs. Louis Albert (Ida Sophia Fischer)
I want to fill in missing "Cabin" values with someone else's "Cabin" value, but only if
that other person (the one who has a cabin value) has the same last name and is also in the vicinity (that is, one row above or one row below).
So in the dataframe above, the NaN Cabin value of [Taussig, Miss. Ruth] would be replaced with [E67], the cabin value of [Taussig, Mrs. Emil], who is one row above her, because both conditions are met (same last name and in the vicinity).
And the missing cabin value of [Hippach, Miss. Jean Gertrude] would be replaced with
[B18], the Cabin value of [Hippach, Mrs. Louis Albert (Ida Sophia Fischer)].
I tried to think of iteration but this is as far as I got
for x in df.Name.str.split(',')[x][0] ==df.Name.str.split(',')[x+1][0]:
if df.Cabin[x] or df.Cabin[x+1] == np.nan:
df.Cabin.replace(np.nan,
I want to make sure the np.nan value is replaced with an actual cabin value and not with another np.nan. I couldn't figure out how to do that.
Thanks.
Starting with your DataFrame
print(df)
Fare Cabin Pclass Ticket \
0 86.5000 B77 1 110152
1 86.5000 B77 1 110152
2 86.5000 B79 1 110152
3 79.6500 E67 1 110413
4 79.6500 E67 1 110413
5 79.6500 NaN 1 110413
6 52.0000 A14 1 110465
7 52.0000 C110 1 110465
8 26.0000 C106 1 110469
9 26.5500 D22 1 110489
10 26.5500 C52 1 110564
11 75.2500 D37 1 110813
12 75.2500 D37 1 110813
13 26.0000 NaN 1 111163
14 33.5000 B19 1 111240
15 38.5000 E63 1 111320
16 57.9792 NaN 1 111361
17 57.9792 B18 1 111361
Name
0 Cherry, Miss. Gladys
1 Rothes, the Countess. of (Lucy Noel Martha Dye...
2 Maioni, Miss. Roberta
3 Taussig, Mr. Emil
4 Taussig, Mrs. Emil (Tillie Mandelbaum)
5 Taussig, Miss. Ruth
6 Clifford, Mr. George Quincy
7 Porter, Mr. Walter Chamberlain
8 Maguire, Mr. John Edward
9 Borebank, Mr. John James
10 Bjornstrom-Steffansson, Mr. Mauritz Hakan
11 Warren, Mr. Frank Manley
12 Warren, Mrs. Frank Manley (Anna Sophia Atkinson)
13 Salomon, Mr. Abraham L
14 Van der hoef, Mr. Wyckoff
15 Gee, Mr. Arthur H
16 Hippach, Miss. Jean Gertrude
17 Hippach, Mrs. Louis Albert (Ida Sophia Fischer)
Create a new column with just the last name. Note: there might be a better way to do this with pandas str methods, but I couldn't get anything to work.
df['LastName'] = df['Name'].map(lambda x : x[:x.find(',')])
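For what it's worth, the str accessor can do the same thing. A small sketch on three of the names above, assuming every name contains a comma after the last name (names and last are illustrative variables):

```python
import pandas as pd

names = pd.Series([
    "Taussig, Mr. Emil",
    "Van der hoef, Mr. Wyckoff",
    "Bjornstrom-Steffansson, Mr. Mauritz Hakan",
])

# Split once on the first comma and keep the part before it
last = names.str.split(",", n=1).str[0]
print(last.tolist())
```

Unlike slicing with x.find(','), this does not drop the final character when a name has no comma at all; it simply returns the whole string.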
Then we leverage pandas' shift and boolean indexing to see if the passenger above has the same last name (i.e. the Taussig case)
mask = (df['Cabin'].isnull()) & (df['LastName'] == df['LastName'].shift())
df.loc[mask,'Cabin'] = df['Cabin'].shift()
and then the passenger below, by passing -1 to shift() (i.e. the Hippach case)
mask = (df['Cabin'].isnull()) & (df['LastName'] == df['LastName'].shift(-1))
df.loc[mask,'Cabin'] = df['Cabin'].shift(-1)
print(df)
Fare Cabin Pclass Ticket \
0 86.5000 B77 1 110152
1 86.5000 B77 1 110152
2 86.5000 B79 1 110152
3 79.6500 E67 1 110413
4 79.6500 E67 1 110413
5 79.6500 E67 1 110413
6 52.0000 A14 1 110465
7 52.0000 C110 1 110465
8 26.0000 C106 1 110469
9 26.5500 D22 1 110489
10 26.5500 C52 1 110564
11 75.2500 D37 1 110813
12 75.2500 D37 1 110813
13 26.0000 NaN 1 111163
14 33.5000 B19 1 111240
15 38.5000 E63 1 111320
16 57.9792 B18 1 111361
17 57.9792 B18 1 111361
Name LastName
0 Cherry, Miss. Gladys Cherry
1 Rothes, the Countess. of (Lucy Noel Martha Dye... Rothes
2 Maioni, Miss. Roberta Maioni
3 Taussig, Mr. Emil Taussig
4 Taussig, Mrs. Emil (Tillie Mandelbaum) Taussig
5 Taussig, Miss. Ruth Taussig
6 Clifford, Mr. George Quincy Clifford
7 Porter, Mr. Walter Chamberlain Porter
8 Maguire, Mr. John Edward Maguire
9 Borebank, Mr. John James Borebank
10 Bjornstrom-Steffansson, Mr. Mauritz Hakan Bjornstrom-Steffansson
11 Warren, Mr. Frank Manley Warren
12 Warren, Mrs. Frank Manley (Anna Sophia Atkinson) Warren
13 Salomon, Mr. Abraham L Salomon
14 Van der hoef, Mr. Wyckoff Van der hoef
15 Gee, Mr. Arthur H Gee
16 Hippach, Miss. Jean Gertrude Hippach
17 Hippach, Mrs. Louis Albert (Ida Sophia Fischer) Hippach
groupby + fillna
# back fills, then forward fills
def bffill(x):
    return x.bfill().ffill()
# group by last name; transform returns a result aligned to the original rows
df['Cabin'] = df.groupby(df.Name.str.split(',').str[0]).Cabin.transform(bffill)
df
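The two answers are not equivalent: filling within a last-name group ignores the "one above or one below" condition. The sketch below contrasts them on a made-up four-row frame (the passenger "Taussig, Mr. Other" and the shortened Porter name are hypothetical, added only to show the difference):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Taussig, Mr. Emil", "Taussig, Miss. Ruth",
             "Porter, Mr. Walter", "Taussig, Mr. Other"],
    "Cabin": ["E67", None, "C110", None],
})
last = df["Name"].str.split(",").str[0]
cabin = df["Cabin"]

# Neighbour-only fill: take the cabin from the row above or below,
# but only when that row has the same last name.
above = cabin.shift().where(last.eq(last.shift()))
below = cabin.shift(-1).where(last.eq(last.shift(-1)))
neighbour_fill = cabin.fillna(above).fillna(below)

# Group fill: any row with the same last name can donate a cabin.
group_fill = cabin.groupby(last).transform(lambda s: s.bfill().ffill())

print(pd.DataFrame({"neighbour": neighbour_fill, "group": group_fill}))
```

The last row stays missing under the neighbour-only rule (its neighbour is a Porter) but receives E67 under the group fill, so the choice depends on whether the vicinity condition really matters.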

How to write a query to identify names with similar sounds?

How do I write a query to identify names (possibly including non-English names) that sound similar? Soundex does not seem to handle non-English names well.
The code should be able to identify that, for example, the following pairs (or most of them) are names that sound similar:
Helena - Elena
Violet - Viola
Beatrix - Beatrice
Madeline - Madeleine (ma-duh-LINE vs ma-duh-LEN)
Alice - Elise
Madeline - Adeline
Kristen - Kirsten
Lily - Millie
Charlotte - Scarlett
Zara / Lara / Sara / Mara
Elena - Alana
Emily - Emmeline
Amelia - Amalia
Stella - Bella - Ella
Isabel - Isabeau
Holly - Hallie
Laura - Lara
Fiona - Finola
Louise - Eloise
Cara - Clara
Susanna vs Susannah
Nora vs Norah
Talia vs Tahlia vs Thalia
Catherine vs Katherine
Cecilia vs Cecelia
Lucy vs Lucie
Vivian vs Vivien
Lillian vs Lilian
Gwendolen vs Gwendolyn
Sofia vs Sophia
Isabel vs Isobel vs Isabelle
Seraphina vs Serafina
Juliet vs Juliette
Annabel vs Annabelle
Emily vs Emilie
Elisabeth vs Elizabeth
...and non-English names too.
Would it help to use an algorithm like Levenshtein distance to compare the similarity between two sequences?
https://en.wikipedia.org/wiki/Levenshtein_distance
Particularly in Oracle, you can use utl_match.
For example:
--Find closest names based on UTL_MATCH.EDIT_DISTANCE.
with names as
(
--Names data.
select column_value name
from table(sys.odcivarchar2list('Adeline','Alana','Alice','Amalia','Amelia','Annabel',
'Annabelle','Beatrice','Beatrix','Bella','Cara','Catherine','Cecelia','Cecilia',
'Charlotte','Clara','Elena','Elisabeth','Elise','Elizabeth','Ella','Eloise','Emilie',
'Emily','Emmeline','Finola','Fiona','Gwendolen','Gwendolyn','Hallie','Helena','Holly',
'Isabeau','Isabel','Isabelle','Isobel','Juliet','Juliette','Katherine','Kirsten',
'Kristen','Lara','Laura','Lilian','Lillian','Lily','Louise','Lucie','Lucy',
'Madeleine','Madeline','Mara','Millie','Nora','Norah','Sara','Scarlett','Serafina',
'Seraphina','Sofia','Sophia','Stella','Susanna','Susannah','Tahlia','Talia','Thalia',
'Viola','Violet','Vivian','Vivien','Zara'))
)
--Name with the closest matches.
select name1, edit_distance, listagg(name2, ',') within group (order by name2) names
from
(
--Compare strings.
select names1.name name1, names2.name name2
,utl_match.edit_distance(names1.name, names2.name) edit_distance
,min(utl_match.edit_distance(names1.name, names2.name))
over (partition by names1.name) min_edit_distance
from names names1
cross join names names2
--This cross join could get expensive. It may help to add conditions here to
--filter out obvious non-matches. For example, maybe throw out rows where the
--string length is vastly different?
where names1.name <> names2.name
order by 1, 3, 2
)
where edit_distance = min_edit_distance
group by name1, edit_distance
order by 1;
Results:
NAME1 EDIT_DISTANCE NAMES
----- ------------- -----
Adeline 2 Madeline
Alana 2 Clara,Elena
Alice 2 Elise
Amalia 1 Amelia
Amelia 1 Amalia
Annabel 2 Annabelle
Annabelle 2 Annabel
Beatrice 2 Beatrix
Beatrix 2 Beatrice
Bella 2 Ella,Stella
Cara 1 Clara,Lara,Mara,Sara,Zara
Catherine 1 Katherine
Cecelia 1 Cecilia
Cecilia 1 Cecelia
Charlotte 4 Scarlett
Clara 1 Cara
Elena 2 Alana,Ella,Helena
Elisabeth 1 Elizabeth
Elise 1 Eloise
Elizabeth 1 Elisabeth
Ella 2 Bella,Elena
Eloise 1 Elise
Emilie 2 Emily
Emily 2 Emilie,Lily
Emmeline 3 Adeline,Emilie,Madeline
Finola 2 Fiona,Viola
Fiona 2 Finola,Viola
Gwendolen 1 Gwendolyn
Gwendolyn 1 Gwendolen
Hallie 2 Millie
Helena 2 Elena
Holly 3 Bella,Ella,Emily,Hallie,Lily
Isabeau 2 Isabel
Isabel 1 Isobel
Isabelle 2 Isabel
Isobel 1 Isabel
Juliet 2 Juliette
Juliette 2 Juliet
Katherine 1 Catherine
Kirsten 2 Kristen
Kristen 2 Kirsten
Lara 1 Cara,Laura,Mara,Sara,Zara
Laura 1 Lara
Lilian 1 Lillian
Lillian 1 Lilian
Lily 2 Emily,Lucy
Louise 3 Elise,Eloise,Lucie
Lucie 2 Lucy
Lucy 2 Lily,Lucie
Madeleine 1 Madeline
Madeline 1 Madeleine
Mara 1 Cara,Lara,Sara,Zara
Millie 2 Hallie
Nora 1 Norah
Norah 1 Nora
Sara 1 Cara,Lara,Mara,Zara
Scarlett 4 Charlotte
Serafina 2 Seraphina
Seraphina 2 Serafina
Sofia 2 Sophia
Sophia 2 Sofia
Stella 2 Bella
Susanna 1 Susannah
Susannah 1 Susanna
Tahlia 1 Talia
Talia 1 Tahlia,Thalia
Thalia 1 Talia
Viola 2 Finola,Fiona,Violet
Violet 2 Viola
Vivian 1 Vivien
Vivien 1 Vivian
Zara 1 Cara,Lara,Mara,Sara
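For readers who want to see what an edit distance measures, here is a minimal Python sketch of the classic Levenshtein dynamic programme, together with a closest-match lookup similar in spirit to the query above. The functions levenshtein and closest are illustrative helpers, not part of Oracle's UTL_MATCH, and the comparison is case-sensitive:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def closest(name, candidates):
    """Return (smallest distance, sorted names at that distance)."""
    distances = {other: levenshtein(name, other)
                 for other in candidates if other != name}
    best = min(distances.values())
    return best, sorted(n for n, d in distances.items() if d == best)


names = ["Amalia", "Amelia", "Kirsten", "Kristen", "Nora", "Norah"]
for name in names:
    print(name, *closest(name, names))
```

Note that edit distance compares spelling, not pronunciation, so it is only a proxy for "sounds similar".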

Retrieve top 48 unique records from database based on a sorted Field

I have a database table that I need some SQL for (it is defeating me so far!).
Imagine there are 192 athletic clubs that each take part in 12 track meets per season.
That makes 2304 individual performances per season (for example, in the 100 metres).
I would like to find the top 48 (unique) individual performances in the table; these 48 athletes then take part in the end-of-season World Championships.
So imagine the 2 fastest times are both set by "John Smith", but he can only be entered once in the World Championships. I would then look for the next fastest time not set by "John Smith", and so on until I have 48 unique athletes.
I hope that makes sense.
Thanks in advance if anyone can help.
PS
I did have a nice screenshot that would explain it much better, but as a newish user I cannot post images.
I'll try a copy-and-paste version instead...
ID AthleteName AthleteID Time
1 Josh Lewis 3 11.99
2 Joe Dundee 4 11.31
3 Mark Danes 5 13.44
4 Josh Lewis 3 13.12
5 John Smith 1 11.12
6 John Smith 1 12.18
7 John Smith 1 11.22
8 Adam Bennett 6 11.33
9 Ronny Bower 7 12.88
10 John Smith 1 13.49
11 Adam Bennett 6 12.55
12 Mark Danes 5 12.12
13 Carl Tompkins 2 13.11
14 Joe Dundee 4 11.28
15 Ronny Bower 7 12.14
16 Carl Tompkins 2 11.88
17 Nigel Downs 8 14.14
18 Nigel Downs 8 12.19
Top 4 unique individual performances
1 John Smith 1 11.12
3 Joe Dundee 4 11.28
5 Adam Bennett 6 11.33
6 Carl Tompkins 2 11.88
Basically something like this:
select top 48 *
from (
select athleteId,min(time) as bestTime
from theRaces
where raceId = '123' -- e.g., 123=100 meters
group by athleteId
) x
order by bestTime
Try this:
select x.ID, x.AthleteName , x.AthleteID , x.Time
from (
select rownum tr_count,v.AthleteID AthleteID, v.AthleteName AthleteName, v.Time Time,v.id id
from
(
select
tr1.AthleteName AthleteName, tr1.Time time,min(tr1.id) id, tr1.AthleteID AthleteID
from theRaces tr1
where time =
(select min(time) from theRaces tr2 where tr2.athleteId = tr1.athleteId)
group by tr1.AthleteName, tr1.AthleteID, tr1.Time
having tr1.Time = ( select min(tr2.time) from theRaces tr2 where tr1.AthleteID =tr2.AthleteID)
order by tr1.time
) v
) x
where x.tr_count <= 48
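The same "best time per athlete, then take the fastest N" logic can be checked outside the database. This is only a sketch in pandas on the sample rows from the question, with N = 4 to match the expected output; the names best_rows and top are illustrative:

```python
import pandas as pd

rows = [
    (1, "Josh Lewis", 3, 11.99), (2, "Joe Dundee", 4, 11.31),
    (3, "Mark Danes", 5, 13.44), (4, "Josh Lewis", 3, 13.12),
    (5, "John Smith", 1, 11.12), (6, "John Smith", 1, 12.18),
    (7, "John Smith", 1, 11.22), (8, "Adam Bennett", 6, 11.33),
    (9, "Ronny Bower", 7, 12.88), (10, "John Smith", 1, 13.49),
    (11, "Adam Bennett", 6, 12.55), (12, "Mark Danes", 5, 12.12),
    (13, "Carl Tompkins", 2, 13.11), (14, "Joe Dundee", 4, 11.28),
    (15, "Ronny Bower", 7, 12.14), (16, "Carl Tompkins", 2, 11.88),
    (17, "Nigel Downs", 8, 14.14), (18, "Nigel Downs", 8, 12.19),
]
races = pd.DataFrame(rows, columns=["ID", "AthleteName", "AthleteID", "Time"])

# Row holding each athlete's fastest time (one row per AthleteID)
best_rows = races.loc[races.groupby("AthleteID")["Time"].idxmin()]

# Fastest N distinct athletes
top = best_rows.nsmallest(4, "Time")
print(top)
```

Grouping on AthleteID rather than AthleteName keeps one entry per athlete even if a name is spelled inconsistently.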