Manipulating series in a dataframe - pandas

My dataframe has a list of comma separated values in one column. I want to find the list of distinct entries, create a new column for each distinct entry, then fill the new columns with 1 or 0 depending on whether the row contains that city name.
The idea is to use the new columns in building a logistic regression model.
As an example
Before:
Name  City
Jack  NewYork,Chicago,Seattle
Jill  Seattle, SanFrancisco
Ted   Chicago,SanFrancisco
Bill  NewYork,Seattle

After:
Name  NewYork  Chicago  Seattle  SanFrancisco
Jack  1        1        1        0
Jill  0        0        1        1
Ted   0        1        0        1
Bill  1        0        1        0

You can do this with the str.get_dummies string method:
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Jack", "Jill", "Ted", "Bill"],
     "City": ["NewYork,Chicago,Seattle", "Seattle,SanFrancisco",
              "Chicago,SanFrancisco", "NewYork,Seattle"]}
)
print(pd.concat((df, df.City.str.get_dummies(",")), axis=1))
Result:
   Name  City                     Chicago  NewYork  SanFrancisco  Seattle
0  Jack  NewYork,Chicago,Seattle  1        1        0             1
1  Jill  Seattle,SanFrancisco     0        0        1             1
2  Ted   Chicago,SanFrancisco     1        0        1             0
3  Bill  NewYork,Seattle          0        1        0             1
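One caveat: the sample data in the question contains a stray space in "Seattle, SanFrancisco", and str.get_dummies splits on the literal separator, so " SanFrancisco" and "SanFrancisco" would become two distinct columns. A minimal sketch of normalizing the whitespace first (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Jack", "Jill"],
    "City": ["NewYork,Chicago,Seattle", "Seattle, SanFrancisco"],  # note the stray space
})

# Collapse any whitespace after commas before one-hot encoding, so
# "Seattle, SanFrancisco" and "Seattle,SanFrancisco" encode identically.
cleaned = df["City"].str.replace(r",\s+", ",", regex=True)
out = pd.concat((df, cleaned.str.get_dummies(",")), axis=1)
```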

Related

Reform Pandas Data frame based on column name

I have a dataframe
   text                        label       title                 version
0  Alice is in                 Seattle     SA                    1
1  Alice is in wonderland.     Portlang    SA                    2
2  Mallory has done the task.  Gotland     sometitle             4
3  Mallory has done the task.  california  sometitle             4
4  Mallory has                 california  sometitle             2
5  Bob is final.               Portland    some different title  3
6  Mallory has done            Portland    sometitle             3
The final result I want is the highest-version text for each title and its corresponding label; however, the labels should be pivoted out as columns.
Here is the final result:
   text                        Seattle  Portlang  Gotland  california  Portland  title
0  Alice is in wonderland.     0        1         0        0           0         SA
1  Mallory has done the task.  0        0         1        1           0         sometitle
2  Bob is final.               0        0         0        0           1         some different title
Thanks in advance,
Use pivot_table. First mask out every row except those with the highest version for each title, then pivot your dataframe:
out = (
    df.assign(dummy=1)
      .mask(df.groupby('title')['version'].rank(method='dense', ascending=False) > 1)
      .pivot_table('dummy', ['title', 'text'], 'label', fill_value=0)
      .reset_index()
      .rename_axis(columns=None)
)
Output:
>>> out
   title                 text                        Gotland  Portland  Portlang  california
0  SA                    Alice is in wonderland.     0        0         1         0
1  some different title  Bob is final.               0        1         0         0
2  sometitle             Mallory has done the task.  1        0         0         1
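An equivalent filter, if ties on version are acceptable, is to keep only the rows whose version matches the per-title maximum and then one-hot encode the label column. A sketch on an abbreviated subset of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["Alice is in", "Alice is in wonderland.", "Bob is final."],
    "label": ["Seattle", "Portlang", "Portland"],
    "title": ["SA", "SA", "some different title"],
    "version": [1, 2, 3],
})

# Keep rows whose version equals the maximum version within their title...
top = df[df["version"] == df.groupby("title")["version"].transform("max")]

# ...then turn the label column into 0/1 indicator columns.
out = (pd.get_dummies(top, columns=["label"], prefix="", prefix_sep="", dtype=int)
         .drop(columns="version")
         .reset_index(drop=True))
```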

pandas conditional fill binary [duplicate]

This question already has an answer here:
Quickest way to make a get_dummies type dataframe from a column with a multiple of strings
(1 answer)
Closed 1 year ago.
I have a df
name    cars
john    honda,kia
tom     honda,kia,nissan
jack    toyota
johnny  honda,kia
tommy   honda,kia,nissan
jacky   toyota
What is the best way, using pandas, to add a column per car to the existing df, containing 1 if the car is present and 0 otherwise, so it looks like this:
name    cars              honda  kia  nissan  toyota
john    honda,kia         1      1    0       0
tom     honda,kia,nissan  1      1    1       0
jack    toyota            0      0    0       1
johnny  honda,kia         1      1    0       0
tommy   honda,kia,nissan  1      1    1       0
jacky   toyota            0      0    0       1
I tried using np.where with multiple conditions as described here, but I don't think it's the right approach.
That's exactly what pd.Series.str.get_dummies does; just join its result to your dataframe without the cars column:
>>> df.drop(columns=['cars']).join(df['cars'].str.get_dummies(sep=','))
   name    honda  kia  nissan  toyota
0  john    1      1    0       0
1  tom     1      1    1       0
2  jack    0      0    0       1
3  johnny  1      1    0       0
4  tommy   1      1    1       0
5  jacky   0      0    0       1
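If you also need the long format (one row per name/car pair) along the way, an equivalent route is to split, explode, and cross-tabulate; a sketch on an abbreviated subset of the sample data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["john", "jack"], "cars": ["honda,kia", "toyota"]})

# Split the comma-separated string into a list, then explode to one
# row per (name, car) pair.
long = df.assign(car=df["cars"].str.split(",")).explode("car")

# Cross-tabulate names against cars to get the 0/1 indicator table.
wide = pd.crosstab(long["name"], long["car"])
```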

how to apply one hot encoding or get dummies on 2 columns together in pandas?

I have the dataframe below, containing sample values:
df = pd.DataFrame(
    [["London", "Cambridge", 20], ["Cambridge", "London", 10], ["Liverpool", "London", 30]],
    columns=["city_1", "city_2", "id"],
)
city_1     city_2     id
London     Cambridge  20
Cambridge  London     10
Liverpool  London     30
I need the output dataframe below, which is built by joining the 2 city columns together and applying one-hot encoding after that:
id  London  Cambridge  Liverpool
20  1       1          0
10  1       1          0
30  1       0          1
Currently I am using the code below, which works on only one column at a time. Please could you advise if there is a pythonic way to get the above output?
output_df = pd.get_dummies(df, columns=['city_1', 'city_2'])
which results in
id  city_1_Cambridge  city_1_London  ...and so on
You can add the parameters prefix_sep and prefix to get_dummies, and then use max if you want only 1 or 0 values (dummy/indicator columns), or sum if you need to count the 1 values:
output_df = (pd.get_dummies(df, columns=['city_1', 'city_2'], prefix_sep='', prefix='')
               .max(axis=1, level=0))
print(output_df)

   id  Cambridge  Liverpool  London
0  20  1          0          1
1  10  1          0          1
2  30  0          1          1
Or, if you want to process all columns except id, first move the non-processed column(s) to the index with DataFrame.set_index, then use get_dummies with max, and finally add DataFrame.reset_index:
output_df = (pd.get_dummies(df.set_index('id'), prefix_sep='', prefix='')
               .max(axis=1, level=0)
               .reset_index())
print(output_df)

   id  Cambridge  Liverpool  London
0  20  1          0          1
1  10  1          0          1
2  30  0          1          1
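Note that max(axis=1, level=0) was removed in pandas 2.0. On recent versions the same per-city collapse can be written by transposing, grouping on the duplicated column names, and transposing back; a sketch, assuming the same sample data:

```python
import pandas as pd

df = pd.DataFrame(
    [["London", "Cambridge", 20], ["Cambridge", "London", 10], ["Liverpool", "London", 30]],
    columns=["city_1", "city_2", "id"],
)

# One-hot encode both city columns with empty prefixes, which produces
# duplicate column names for cities appearing in both columns.
dummies = pd.get_dummies(df.set_index("id"), prefix="", prefix_sep="", dtype=int)

# Transpose, take the max within each duplicated city name, transpose back.
output_df = dummies.T.groupby(level=0).max().T.reset_index()
```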

python pandas - set column value of column based on index and or ID of concatenated dataframes

I have a dataframe concatenated from at least two dataframes, i.e.
df1
   Name   Type  ID
0  Joe    A     1
1  Fred   B     2
2  Mike   Both  3
3  Frank  Both  4

df2
   Name  Type  ID
0  Bill  Both  1
1  Jill  Both  2
2  Mill  B     3
3  Hill  A     4

ConcatDf:
   Name   Type  ID
0  Joe    A     1
1  Fred   B     2
2  Mike   Both  3
3  Frank  Both  4
0  Bill   Both  1
1  Jill   Both  2
2  Mill   B     3
3  Hill   A     4
Suppose after they are concatenated, I'd like to set Type for all records from df1 to C and all records from df2 to B. Is this possible?
The indices of the dataframes can be vastly different sizes.
Thanks in advance.
df3 = pd.concat([df1, df2], keys=(1, 2))
df3.loc[1, 'Type'] = 'C'
When you concat you can assign keys to the dataframes. This creates a MultiIndex with the keys separating the concatenated dataframes. Then, when you use .loc with a key, you select that whole group. In the code above we change the Type of all rows of df1 (which has a key of 1) to C.
Use merge with indicator=True to find which rows belong to df1 and which to df2. Next, use np.where to assign 'C' or 'B':
import numpy as np

t = concatdf.merge(df1, how='left', on=concatdf.columns.tolist(), indicator=True)
concatdf['Type'] = np.where(t._merge.eq('left_only'), 'B', 'C')
Out[2185]:
   Name   Type  ID
0  Joe    C     1
1  Fred   C     2
2  Mike   C     3
3  Frank  C     4
0  Bill   B     1
1  Jill   B     2
2  Mill   B     3
3  Hill   B     4
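If you control the concatenation step itself, a simpler route is to overwrite Type per source frame before concatenating; a sketch using abbreviated, hypothetical versions of df1 and df2:

```python
import pandas as pd

# Miniature stand-ins for df1 and df2 from the question.
df1 = pd.DataFrame({"Name": ["Joe", "Fred"], "Type": ["A", "B"], "ID": [1, 2]})
df2 = pd.DataFrame({"Name": ["Bill", "Jill"], "Type": ["Both", "Both"], "ID": [1, 2]})

# Overwrite Type per source frame, then concatenate; no index tricks needed.
concat_df = pd.concat([df1.assign(Type="C"), df2.assign(Type="B")])
```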

Data analysis with pandas

The following df is a summary of my whole dataset, just to illustrate my problem.
The df shows the job applications of each id, and I want to know: which combination of sectors is an individual most likely to apply to?
df
id  education   area_job_application
1   Collage     Construction
1   Collage     Sales
1   Collage     Administration
2   University  Finance
2   University  Sales
3   Collage     Finance
3   Collage     Sales
4   University  Administration
4   University  Sales
4   University  Data analyst
5   University  Administration
5   University  Sales
answer
                Construction  Sales  Administration  Finance  Data analyst
Construction    1             1      1               0        0
Sales           1             5      3               1        1
Administration  1             3      3               0        1
Finance         0             2      0               2        0
Data analyst    0             1      1               0        1
This answer shows that Administration and Sales are the sectors most likely to receive applications from the same id (this is the answer I am looking for). But I am also interested in other combinations; I think a heatmap would be very informative to illustrate this data.
Sector combinations within the same sector are irrelevant (maybe the diagonal of the answer matrix should be 0; the value doesn't matter, I won't analyse it).
Use crosstab (or groupby with size and unstack) first, then compute DataFrame.dot with the transposed DataFrame, and finally add reindex for a custom order of index and columns:
# dynamically create the order from the unique values of the column
L = df['area_job_application'].unique()

# df = pd.crosstab(df.id, df.area_job_application)
df = df.groupby(['id', 'area_job_application']).size().unstack(fill_value=0)
df = df.T.dot(df).rename_axis(None).rename_axis(None, axis=1).reindex(columns=L, index=L)
print(df)
                Construction  Sales  Administration  Finance  Data analyst
Construction    1             1      1               0        0
Sales           1             5      3               2        1
Administration  1             3      3               0        1
Finance         0             2      0               2        0
Data analyst    0             1      1               0        1
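To zero out the diagonal as the question suggests (same-sector pairs are irrelevant), numpy's fill_diagonal can be applied to the co-occurrence matrix in place; a small sketch on abbreviated sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "area_job_application": ["Sales", "Administration", "Sales", "Finance"],
})

# Build the id-by-sector indicator table, then the sector co-occurrence matrix.
counts = pd.crosstab(df["id"], df["area_job_application"])
co = counts.T.dot(counts)

# Zero the diagonal in place, since same-sector pairs are not analysed.
np.fill_diagonal(co.values, 0)
```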