Reform Pandas Data frame based on column name - pandas

I have a dataframe
   text                        label       title                 version
0  Alice is in                 Seattle     SA                    1
1  Alice is in wonderland.     Portlang    SA                    2
2  Mallory has done the task.  Gotland     sometitle             4
3  Mallory has done the task.  california  sometitle             4
4  Mallory has                 california  sometitle             2
5  Bob is final.               Portland    some different title  3
6  Mallory has done            Portland    sometitle             3
For each title, I want to find the text with the highest version and its corresponding labels, with the labels spread out as separate columns.
Here is the final result:
   text                        Seattle  Portlang  Gotland  california  Portland  title
0  Alice is in wonderland.     0        1         0        0           0         SA
1  Mallory has done the task.  0        0         1        1           0         sometitle
2  Bob is final.               0        0         0        0           1         some different title
Thanks in advance,

Use pivot_table. First mask out the rows that do not hold the highest version for their title, then pivot your dataframe:
out = (
    df.assign(dummy=1)
      .mask(df.groupby('title')['version'].rank(method='dense', ascending=False) > 1)
      .pivot_table('dummy', ['title', 'text'], 'label', fill_value=0)
      .reset_index()
      .rename_axis(columns=None)
)
Output:
>>> out
   title                 text                        Gotland  Portland  Portlang  california
0  SA                    Alice is in wonderland.     0        0         1         0
1  some different title  Bob is final.               0        1         0         0
2  sometitle             Mallory has done the task.  1        0         0         1
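For anyone who wants to try this end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question and swaps in a different but equivalent technique: filter to the highest version per title with transform("max"), then count labels with pd.crosstab. The variable name latest is only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "text": ["Alice is in", "Alice is in wonderland.", "Mallory has done the task.",
             "Mallory has done the task.", "Mallory has", "Bob is final.",
             "Mallory has done"],
    "label": ["Seattle", "Portlang", "Gotland", "california", "california",
              "Portland", "Portland"],
    "title": ["SA", "SA", "sometitle", "sometitle", "sometitle",
              "some different title", "sometitle"],
    "version": [1, 2, 4, 4, 2, 3, 3],
})

# Keep only the rows whose version equals the maximum version of their title.
latest = df[df["version"] == df.groupby("title")["version"].transform("max")]

# Count label occurrences per (title, text) pair; missing pairs become 0.
out = (
    pd.crosstab([latest["title"], latest["text"]], latest["label"])
    .reset_index()
    .rename_axis(columns=None)
)
print(out)
```

As with the pivot_table version, labels that only occur on lower versions (Seattle here) do not get a column; reindexing the columns with fill_value=0 would add them back if needed.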

Related

Pandas: How to alter entry in current row dependant on selection of all rows up to current row

Trying to learn pandas using English football scores.
Here is part of a list of football matches in date order.
"FTR" is the Full Time Result: "A" - win for the away team, "H" - win for the home team, "D"- a draw.
I created columns "HTWTD" - home team wins to date, and "ATWTD" - away team wins to date, to hold the number of wins the home and away teams have had up until that point. I populated the columns with 0s then put a 1 in the HTWTD when the FTR was H, and a 1 in the ATWTD where the FTR was A. This obviously only produces correct data for the first time each team plays.
When we get to row 9, Leeds wins a match having already won one in row 2. The HTWTD in row 9 should read 2 i.e at this point Leeds has won 2 games.
To my untrained mind the process should be...
Look at the row above, if Leeds features, get the corresponding HTWTD or ATWTD score, add 1 to it and put it in the current row HTWTD or ATWTD column. If Leeds doesn't feature (and you are not at the first row) go up one row.
Having googled around I haven't found anything about how to select only rows above current row, then alter entry in current row depending on test on selected rows.
I could probably write a little python function to do this, but is there a pandas way to go about it?
Row  Date        HomeTeam          AwayTeam     FTR  HTWTD  ATWTD
0    12/09/2020  Fulham            Arsenal      A    0      1
1    12/09/2020  Crystal Palace    Southampton  H    1      0
2    12/09/2020  Liverpool         Leeds        H    0      1
3    12/09/2020  West Ham          Newcastle    A    0      1
4    13/09/2020  West Brom         Leicester    A    0      1
5    13/09/2020  Tottenham         Everton      A    0      1
6    14/09/2020  Brighton          Chelsea      A    0      1
7    14/09/2020  Sheffield United  Wolves       A    0      1
8    19/09/2020  Everton           West Brom    H    1      0
9    19/09/2020  Leeds             Fulham       H    1      0
IIUC, you can use .eq() to get a boolean Series for each condition, then combine .groupby with .cumsum() to keep a running count of the True values per HomeTeam and per AwayTeam:
df['home_wins'] = df['FTR'].eq('H')
df['away_wins'] = df['FTR'].eq('A')
df['HTWTD'] = df.groupby('HomeTeam')['home_wins'].cumsum()
df['ATWTD'] = df.groupby('AwayTeam')['away_wins'].cumsum()
df.drop(['home_wins', 'away_wins'], axis=1)
Out[1]:
Row Date HomeTeam AwayTeam FTR HTWTD ATWTD
0 0 12/09/2020 Fulham Arsenal A 0 1
1 1 12/09/2020 Crystal Palace Southampton H 1 0
2 2 12/09/2020 Liverpool Leeds H 1 0
3 3 12/09/2020 West Ham Newcastle A 0 1
4 4 13/09/2020 West Brom Leicester A 0 1
5 5 13/09/2020 Tottenham Everton A 0 1
6 6 14/09/2020 Brighton Chelsea A 0 1
7 7 14/09/2020 Sheffield United Wolves A 0 1
8 8 19/09/2020 Everton West Brom H 1 0
9 9 19/09/2020 Leeds Fulham H 1 0
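Note that grouping by HomeTeam and AwayTeam separately counts home wins and away wins independently. If the goal is a team's total wins to date regardless of venue, one possible approach is to reshape to one row per team per match, take a cumulative sum per team, and pivot back. This is only a sketch under that reading of the question; the helper columns match, team, won, col and wins are made-up names.

```python
import pandas as pd

# Teams and results copied from the question's table.
df = pd.DataFrame({
    "HomeTeam": ["Fulham", "Crystal Palace", "Liverpool", "West Ham", "West Brom",
                 "Tottenham", "Brighton", "Sheffield United", "Everton", "Leeds"],
    "AwayTeam": ["Arsenal", "Southampton", "Leeds", "Newcastle", "Leicester",
                 "Everton", "Chelsea", "Wolves", "West Brom", "Fulham"],
    "FTR": ["A", "H", "H", "A", "A", "A", "A", "A", "H", "H"],
})

# Long format: one row per (match, team), flagged with whether that team won.
long = pd.concat(
    [
        pd.DataFrame({"match": df.index, "team": df["HomeTeam"],
                      "won": df["FTR"].eq("H"), "col": "HTWTD"}),
        pd.DataFrame({"match": df.index, "team": df["AwayTeam"],
                      "won": df["FTR"].eq("A"), "col": "ATWTD"}),
    ],
    ignore_index=True,
).sort_values("match", kind="mergesort")  # keep matches in their original order

# Running number of wins per team, home and away combined.
long["wins"] = long.groupby("team")["won"].cumsum()

# Back to one row per match, one column per side.
wide = long.pivot(index="match", columns="col", values="wins")
df["HTWTD"] = wide["HTWTD"]
df["ATWTD"] = wide["ATWTD"]
print(df)
```

With the FTR values as listed, Everton's HTWTD in row 8 comes out as 2: the away win in row 5 plus the home win in row 8.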

pandas conditional fill binary [duplicate]

I have a df
name cars
john honda,kia
tom honda,kia,nissan
jack toyota
johnny honda,kia
tommy honda,kia,nissan
jacky toyota
What is the best way, using pandas, to add a column per car to the existing df, holding 1 if the car is present and 0 otherwise? The result would look like this:
name cars honda kia nissan toyota
john honda,kia 1 1 0 0
tom honda,kia,nissan 1 1 1 0
jack toyota 0 0 0 1
johnny honda,kia 1 1 0 0
tommy honda,kia,nissan 1 1 1 0
jacky toyota 0 0 0 1
I tried using np.where with multiple conditions as described here, but I don't think it's the right approach.
That's exactly what pd.Series.str.get_dummies does. Just join its result to your dataframe without the cars column:
>>> df.drop(columns=['cars']).join(df['cars'].str.get_dummies(sep=','))
name honda kia nissan toyota
0 john 1 1 0 0
1 tom 1 1 1 0
2 jack 0 0 0 1
3 johnny 1 1 0 0
4 tommy 1 1 1 0
5 jacky 0 0 0 1
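The desired output in the question keeps the cars column, so a small variation that skips the drop may be closer to what was asked. A self-contained sketch with the question's data:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["john", "tom", "jack", "johnny", "tommy", "jacky"],
    "cars": ["honda,kia", "honda,kia,nissan", "toyota",
             "honda,kia", "honda,kia,nissan", "toyota"],
})

# Split each string on "," and build one 0/1 indicator column per distinct car,
# then attach those columns to the original frame (cars column kept).
out = df.join(df["cars"].str.get_dummies(sep=","))
print(out)
```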

Converting categorical column into a single dummy variable column

Consider I have the following dataframe:
Survived Pclass Sex Age Fare
0 0 3 male 22.0 7.2500
1 1 1 female 38.0 71.2833
2 1 3 female 26.0 7.9250
3 1 1 female 35.0 53.1000
4 0 3 male 35.0 8.0500
I used the get_dummies() function to create dummy variables. The code and output are as follows:
one_hot = pd.get_dummies(dataset, columns = ['Sex'])
This will return:
Survived Pclass Age Fare Sex_female Sex_male
0 0 3 22 7.2500 0 1
1 1 1 38 71.2833 1 0
2 1 3 26 7.9250 1 0
3 1 1 35 53.1000 1 0
4 0 3 35 8.0500 0 1
What I would like to have is a single column for Sex having the values 0 or 1 instead of 2 columns.
Interestingly, when I used get_dummies() on a different dataframe, it worked just like I wanted.
For the following dataframe:
Category Message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup final...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
With the code:
one_hot = pd.get_dummies(dataset, columns = ['Category'])
It returns:
Message ... Category_spam
0 Go until jurong point, crazy.. Available only ... ... 0
1 Ok lar... Joking wif u oni... ... 0
2 Free entry in 2 a wkly comp to win FA Cup fina... ... 1
3 U dun say so early hor... U c already then say... ... 0
4 Nah I don't think he goes to usf, he lives aro... ... 0
Why does get_dummies() work differently on these two dataframes?
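How can I make sure I get the 2nd output every time?
Here are a couple of ways to do it:
from sklearn.preprocessing import LabelEncoder

# Option 1: scikit-learn. Classes are sorted, so female -> 0 and male -> 1.
lbl = LabelEncoder()
df['Sex_encoded'] = lbl.fit_transform(df['Sex'])

# Option 2: pandas only, with an explicit mapping (this is what the output below shows)
df['Sex_encoded'] = df['Sex'].map({'male': 0, 'female': 1})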
Survived Pclass Sex Age Fare Sex_encoded
0 0 3 male 22.0 7.2500 0
1 1 1 female 38.0 71.2833 1
2 1 3 female 26.0 7.9250 1
3 1 1 female 35.0 53.1000 1
4 0 3 male 35.0 8.0500 0
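On the "why" part: going only by the truncated output in the question, the `...` hides the middle columns, so a Category_ham column is probably there as well and get_dummies likely behaved the same way on both frames. If you want to stay with get_dummies and still end up with a single column, its drop_first parameter drops the first level of each encoded column. A sketch using the Titanic rows from the question (dtype=int is passed so the result is 0/1 on pandas versions that default to booleans):

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# drop_first=True removes the first category (female), leaving only Sex_male.
one_hot = pd.get_dummies(df, columns=["Sex"], drop_first=True, dtype=int)
print(one_hot)
```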

Manipulating series in a dataframe

My dataframe has a list of comma separated values in one column. I want to find the list of distinct entries, create a new column for each distinct entry in the dataframe, then fill the new columns with 1 or 0 depending on whether the row has the city name.
The idea is to use the new columns in building a logistic regression model.
As an example
Before
Name City
Jack NewYork,Chicago,Seattle
Jill Seattle, SanFrancisco
Ted Chicago,SanFrancisco
Bill NewYork,Seattle
After
Name NewYork Chicago Seattle SanFrancisco
Jack 1 1 1 0
Jill 0 0 1 1
Ted 0 1 0 1
Bill 1 0 1 0
You can do this with the get_dummies str method:
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Jack", "Jill", "Ted", "Bill"],
     "City": ["NewYork,Chicago,Seattle", "Seattle,SanFrancisco", "Chicago,SanFrancisco", "NewYork,Seattle"]}
)
print(pd.concat((df, df.City.str.get_dummies(",")), axis=1))
Result:
Name City Chicago NewYork SanFrancisco Seattle
0 Jack NewYork,Chicago,Seattle 1 1 0 1
1 Jill Seattle,SanFrancisco 0 0 1 1
2 Ted Chicago,SanFrancisco 1 0 1 0
3 Bill NewYork,Seattle 0 1 0 1
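One thing to watch: in the question's "Before" table, Jill's entry has a space after the comma ("Seattle, SanFrancisco"). Splitting on "," alone would then produce a separate " SanFrancisco" column with a leading space. A sketch that normalises whitespace around the commas first, and drops the City column to match the "After" layout (the variable names cities and out are just illustrative):

```python
import pandas as pd

# Data as written in the question, including the space after the comma for Jill.
df = pd.DataFrame({
    "Name": ["Jack", "Jill", "Ted", "Bill"],
    "City": ["NewYork,Chicago,Seattle", "Seattle, SanFrancisco",
             "Chicago,SanFrancisco", "NewYork,Seattle"],
})

# Remove any whitespace around the commas and at the ends of each string.
cities = df["City"].str.replace(r"\s*,\s*", ",", regex=True).str.strip()

# One 0/1 column per distinct city, placed next to the Name column.
out = pd.concat([df[["Name"]], cities.str.get_dummies(sep=",")], axis=1)
print(out)
```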

how to apply one hot encoding or get dummies on 2 columns together in pandas?

I have the dataframe below, which contains sample values like:
df = pd.DataFrame([["London", "Cambridge", 20], ["Cambridge", "London", 10], ["Liverpool", "London", 30]], columns= ["city_1", "city_2", "id"])
city_1 city_2 id
London Cambridge 20
Cambridge London 10
Liverpool London 30
I need the output dataframe below, which is built by joining the 2 city columns together and then applying one hot encoding:
id London Cambridge Liverpool
20 1 1 0
10 1 1 0
30 1 0 1
Currently, I am using the code below, which works on one column at a time. Could you please advise if there is a pythonic way to get the above output?
output_df = pd.get_dummies(df, columns=['city_1', 'city_2'])
which results in
id city_1_Cambridge city_1_London and so on columns
You can add the parameters prefix_sep and prefix to get_dummies and then use max if you want only 1 or 0 values (dummies or indicator columns), or sum if you need to count the 1 values:
output_df = (pd.get_dummies(df, columns=['city_1', 'city_2'], prefix_sep='', prefix='')
               .max(axis=1, level=0))
print (output_df)
id Cambridge Liverpool London
0 20 1 0 1
1 10 1 0 1
2 30 0 1 1
Or, if you want to process all columns except id, first move the column(s) you don't want processed into the index with DataFrame.set_index, then use get_dummies with max, and finally add DataFrame.reset_index:
output_df = (pd.get_dummies(df.set_index('id'), prefix_sep='', prefix='')
               .max(axis=1, level=0)
               .reset_index())
print (output_df)
id Cambridge Liverpool London
0 20 1 0 1
1 10 1 0 1
2 30 0 1 1
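A caveat for newer installs: as far as I know, the level argument of max and sum was deprecated in pandas 1.3 and is no longer accepted from 2.0 on. A sketch of the same idea that avoids it by transposing, grouping the duplicated column labels, and transposing back (dtype=int keeps the dummies as 0/1 rather than booleans; the intermediate name dummies is illustrative):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    [["London", "Cambridge", 20], ["Cambridge", "London", 10], ["Liverpool", "London", 30]],
    columns=["city_1", "city_2", "id"],
)

# Empty prefixes make both city columns produce plain city names,
# so the same city can appear as a duplicated column label.
dummies = pd.get_dummies(df.set_index("id"), prefix="", prefix_sep="", dtype=int)

# Collapse the duplicated labels: transpose, group by label, take max, transpose back.
output_df = dummies.T.groupby(level=0).max().T.reset_index()
print(output_df)
```

Swap max for sum if you need the number of occurrences instead of a 0/1 indicator.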