pandas conditional fill binary [duplicate] - pandas

I have a DataFrame:
name    cars
john    honda,kia
tom     honda,kia,nissan
jack    toyota
johnny  honda,kia
tommy   honda,kia,nissan
jacky   toyota
What is the best way, using pandas, to add a column for each car to the existing df, holding 1 if the car is present and 0 otherwise? The result should look like this:
name    cars              honda  kia  nissan  toyota
john    honda,kia         1      1    0       0
tom     honda,kia,nissan  1      1    1       0
jack    toyota            0      0    0       1
johnny  honda,kia         1      1    0       0
tommy   honda,kia,nissan  1      1    1       0
jacky   toyota            0      0    0       1
I tried using np.where with multiple conditions, as described here, but I don't think it's the right approach.

That's exactly what pd.Series.str.get_dummies does; just join its result to your dataframe without the cars column:
>>> df.drop(columns=['cars']).join(df['cars'].str.get_dummies(sep=','))
   name    honda  kia  nissan  toyota
0  john    1      1    0       0
1  tom     1      1    1       0
2  jack    0      0    0       1
3  johnny  1      1    0       0
4  tommy   1      1    1       0
5  jacky   0      0    0       1

Related

Reform Pandas Data frame based on column name

I have a dataframe
   text                        label       title                 version
0  Alice is in                 Seattle     SA                    1
1  Alice is in wonderland.     Portlang    SA                    2
2  Mallory has done the task.  Gotland     sometitle             4
3  Mallory has done the task.  california  sometitle             4
4  Mallory has                 california  sometitle             2
5  Bob is final.               Portland    some different title  3
6  Mallory has done            Portland    sometitle             3
For each title, I want the text with the highest version and its corresponding labels, with each label spread into its own column.
Here is the final result:
   text                        Seattle  Portlang  Gotland  california  Portland  title
0  Alice is in wonderland.     0        1         0        0           0         SA
1  Mallory has done the task.  0        0         1        1           0         sometitle
2  Bob is final.               0        0         0        0           1         some different title
Thanks in advance,
Use pivot_table. First mask out the rows that do not carry the highest version for their title, then pivot your dataframe:
out = (
    df.assign(dummy=1)
      .mask(df.groupby('title')['version'].rank(method='dense', ascending=False) > 1)
      .pivot_table('dummy', ['title', 'text'], 'label', fill_value=0)
      .reset_index()
      .rename_axis(columns=None)
)
Output:
>>> out
   title                 text                        Gotland  Portland  Portlang  california
0  SA                    Alice is in wonderland.     0        0         1         0
1  some different title  Bob is final.               0        1         0         0
2  sometitle             Mallory has done the task.  1        0         0         1
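The same result can also be reached in two explicit steps, which may be easier to follow: keep only the rows whose version equals the maximum for their title, then count labels with pd.crosstab. This is a sketch of an alternative, not the answer's method; the names is_latest and latest are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["Alice is in", "Alice is in wonderland.",
             "Mallory has done the task.", "Mallory has done the task.",
             "Mallory has", "Bob is final.", "Mallory has done"],
    "label": ["Seattle", "Portlang", "Gotland", "california",
              "california", "Portland", "Portland"],
    "title": ["SA", "SA", "sometitle", "sometitle",
              "sometitle", "some different title", "sometitle"],
    "version": [1, 2, 4, 4, 2, 3, 3],
})

# Step 1: keep rows whose version is the highest within their title.
is_latest = df["version"] == df.groupby("title")["version"].transform("max")
latest = df[is_latest]

# Step 2: one column per label, counting occurrences per (title, text).
out = (
    pd.crosstab([latest["title"], latest["text"]], latest["label"])
      .reset_index()
      .rename_axis(columns=None)
)
print(out)
```

Labels that only occur on non-maximal versions (here Seattle) do not appear as columns, just as in the pivot_table output above.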

How to duplicate row based on int column

If I have a table like this in Hive:
name impressions sampling_rate
------------------------------------
paul 34 1
emma 0 3
greg 0 5
How can I duplicate each row in a select statement by the sampling_rate column so that it would look like this:
name impressions sampling_rate
------------------------------------
paul 34 1
emma 0 3
emma 0 3
emma 0 3
greg 0 5
greg 0 5
greg 0 5
greg 0 5
greg 0 5
Using space() you can produce a string of spaces of length sampling_rate - 1, split it on the space character, and explode the resulting array with a lateral view; this duplicates each row sampling_rate times.
Demo:
with your_table as ( -- demo data; use your table instead of this CTE
  select stack(3, -- number of tuples
    'paul', 34, 1,
    'emma',  0, 3,
    'greg',  0, 5
  ) as (name, impressions, sampling_rate)
)
select t.*
from your_table t -- use your table here
lateral view explode(split(space(t.sampling_rate - 1), ' ')) e
Result:
name impressions sampling_rate
------------------------------------
paul 34 1
emma 0 3
emma 0 3
emma 0 3
greg 0 5
greg 0 5
greg 0 5
greg 0 5
greg 0 5
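Why this yields exactly sampling_rate copies: a string of n - 1 separators splits into n (empty) pieces, and explode emits one row per piece. The counting can be checked in Python, whose str.split keeps empty pieces; that Hive's split behaves the same way is an assumption here, consistent with the result shown above. The helper name copies is hypothetical.

```python
def copies(sampling_rate: int) -> int:
    """Number of pieces produced by splitting (sampling_rate - 1) spaces on ' '."""
    # n - 1 separators always delimit n pieces, all of them empty strings.
    pieces = (" " * (sampling_rate - 1)).split(" ")
    return len(pieces)


for rate in (1, 3, 5):
    print(rate, copies(rate))
```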

Converting categorical column into a single dummy variable column

Consider I have the following dataframe:
Survived Pclass Sex Age Fare
0 0 3 male 22.0 7.2500
1 1 1 female 38.0 71.2833
2 1 3 female 26.0 7.9250
3 1 1 female 35.0 53.1000
4 0 3 male 35.0 8.0500
I used the get_dummies() function to create dummy variables. The code and output are as follows:
one_hot = pd.get_dummies(dataset, columns=['Sex'])
This will return:
Survived Pclass Age Fare Sex_female Sex_male
0 0 3 22 7.2500 0 1
1 1 1 38 71.2833 1 0
2 1 3 26 7.9250 1 0
3 1 1 35 53.1000 1 0
4 0 3 35 8.0500 0 1
What I would like to have is a single column for Sex having the values 0 or 1 instead of 2 columns.
Interestingly, when I used get_dummies() on a different dataframe, it worked just like I wanted.
For the following dataframe:
Category Message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup final...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
With the code:
one_hot = pd.get_dummies(dataset, columns = ['Category'])
It returns:
Message ... Category_spam
0 Go until jurong point, crazy.. Available only ... ... 0
1 Ok lar... Joking wif u oni... ... 0
2 Free entry in 2 a wkly comp to win FA Cup fina... ... 1
3 U dun say so early hor... U c already then say... ... 0
4 Nah I don't think he goes to usf, he lives aro... ... 0
Why does get_dummies() work differently on these two dataframes?
How can I make sure I get the 2nd output every time?
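Here are a couple of ways to do it:
# Option 1: scikit-learn's LabelEncoder (codes follow sorted label order: female -> 0, male -> 1)
from sklearn.preprocessing import LabelEncoder
lbl = LabelEncoder()
df['Sex_encoded'] = lbl.fit_transform(df['Sex'])
# Option 2: pandas only, with an explicit mapping (this produces the output below)
df['Sex_encoded'] = df['Sex'].map({'male': 0, 'female': 1})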
Survived Pclass Sex Age Fare Sex_encoded
0 0 3 male 22.0 7.2500 0
1 1 1 female 38.0 71.2833 1
2 1 3 female 26.0 7.9250 1
3 1 1 female 35.0 53.1000 1
4 0 3 male 35.0 8.0500 0
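As for staying with get_dummies itself: the second output in the question is truncated by "...", so a Category_ham column may simply be hidden there (a guess from the printout, not something the question confirms). One way to get a single indicator column per binary category is drop_first=True, sketched here on the sample rows. dtype=int is passed because newer pandas versions return booleans by default.

```python
import pandas as pd

df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# drop_first=True drops the first level in sorted order ("female"),
# leaving a single Sex_male column with 1 for male and 0 for female.
one_hot = pd.get_dummies(df, columns=["Sex"], drop_first=True, dtype=int)
print(one_hot)
```

Which level survives depends on sort order, so the explicit map shown above is the safer choice when a specific 0/1 assignment matters.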

Manipulating series in a dataframe

My dataframe has a list of comma separated values in one column. I want to find the list of distinct entries, create a new column for each distinct entry in the dataframe, then fill the new columns with 1 or 0 depending on whether the row has the city name.
The idea is to use the new columns in building a logistic regression model.
As an example
Before
Name City
Jack NewYork,Chicago,Seattle
Jill Seattle, SanFrancisco
Ted Chicago,SanFrancisco
Bill NewYork,Seattle
After
Name NewYork Chicago Seattle SanFrancisco
Jack 1 1 1 0
Jill 0 0 1 1
Ted 0 1 0 1
Bill 1 0 1 0
You can do this with the get_dummies str method:
import pandas as pd
df = pd.DataFrame(
{"Name": ["Jack", "Jill", "Ted", "Bill"],
"City": ["NewYork,Chicago,Seattle", "Seattle,SanFrancisco", "Chicago,SanFrancisco", "NewYork,Seattle"]}
)
print(pd.concat((df, df.City.str.get_dummies(",")), axis=1))
Result:
Name City Chicago NewYork SanFrancisco Seattle
0 Jack NewYork,Chicago,Seattle 1 1 0 1
1 Jill Seattle,SanFrancisco 0 0 1 1
2 Ted Chicago,SanFrancisco 1 0 1 0
3 Bill NewYork,Seattle 0 1 0 1
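The sample data in the question has a space after one comma ("Seattle, SanFrancisco"), which the answer's frame leaves out. With that space present, get_dummies(",") would produce a separate " SanFrancisco" column. A minimal sketch that normalizes the separators first (the regular expression and the name cleaned are illustrative choices):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Jack", "Jill", "Ted", "Bill"],
    "City": ["NewYork,Chicago,Seattle", "Seattle, SanFrancisco",
             "Chicago,SanFrancisco", "NewYork,Seattle"],
})

# Remove whitespace around commas and at both ends of each string.
cleaned = df["City"].str.replace(r"\s*,\s*", ",", regex=True).str.strip()

# One 0/1 column per distinct city, joined back to the names.
result = df[["Name"]].join(cleaned.str.get_dummies(sep=","))
print(result)
```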

Data analysis with pandas

The following df is a summary of my whole dataset, just to illustrate my problem.
The df shows the job applications of each id, and I want to know which combinations of sectors an individual is most likely to apply to.
df
id  education   area_job_application
1   Collage     Construction
1   Collage     Sales
1   Collage     Administration
2   University  Finance
2   University  Sales
3   Collage     Finance
3   Collage     Sales
4   University  Administration
4   University  Sales
4   University  Data analyst
5   University  Administration
5   University  Sales
answer
                Construction  Sales  Administration  Finance  Data analyst
Construction    1             1      1               0        0
Sales           1             5      3               2        1
Administration  1             3      3               0        1
Finance         0             2      0               2        0
Data analyst    0             1      1               0        1
This answer shows that Administration and Sales are the sectors most likely to receive applications from the same id (this is the answer I am looking for). I am also interested in the other combinations, and I think a heatmap would be a very informative way to illustrate this data.
Combinations of a sector with itself are irrelevant (the diagonal of the answer matrix could be 0; its value doesn't matter, since I won't analyse it).
Use crosstab (or groupby with size and unstack) to build an id-by-sector count table, multiply its transpose by itself with DataFrame.dot, and finally reindex to put the index and columns in a custom order:
# build the order dynamically from the unique values of the column
L = df['area_job_application'].unique()
# alternative: df = pd.crosstab(df.id, df.area_job_application)
df = df.groupby(['id', 'area_job_application']).size().unstack(fill_value=0)
df = df.T.dot(df).rename_axis(None).rename_axis(None, axis=1).reindex(columns=L, index=L)
print(df)
                Construction  Sales  Administration  Finance  Data analyst
Construction    1             1      1               0        0
Sales           1             5      3               2        1
Administration  1             3      3               0        1
Finance         0             2      0               2        0
Data analyst    0             1      1               0        1
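Since the question treats same-sector pairs as irrelevant, the diagonal can be zeroed before reading off the most frequent combination. The following is a sketch under that assumption, using the crosstab variant on the sample rows (the education column is left out because it is not needed); the names co, off_diag, and top_pair are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5],
    "area_job_application": [
        "Construction", "Sales", "Administration",
        "Finance", "Sales",
        "Finance", "Sales",
        "Administration", "Sales", "Data analyst",
        "Administration", "Sales",
    ],
})

order = df["area_job_application"].unique()
counts = pd.crosstab(df["id"], df["area_job_application"])
co = counts.T.dot(counts).reindex(index=order, columns=order)

# Zero the diagonal on a copy so same-sector pairs are ignored.
values = co.to_numpy().copy()
np.fill_diagonal(values, 0)
off_diag = pd.DataFrame(values, index=co.index, columns=co.columns)

# Largest off-diagonal count (the first one in row-major order on ties).
i, j = np.unravel_index(values.argmax(), values.shape)
top_pair = (off_diag.index[i], off_diag.columns[j])
print(off_diag)
print(top_pair, values[i, j])
```

A frame like off_diag could then be passed to a plotting function such as seaborn's heatmap to get the visual summary the question mentions.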