How to explode a list into new columns in pandas

Let's say I have the following df
x
1 ['abc','bac','cab']
2 ['bac']
3 ['abc','cab']
And I would like to take each element of each list and put it into a new column, like so:
abc bac cab
1 1 1 1
2 0 1 0
3 1 0 1
I have looked at several related questions but can't seem to get this right.
Thanks!

One approach with str.join + str.get_dummies:
out = df['x'].str.join(',').str.get_dummies(',')
out:
abc bac cab
0 1 1 1
1 0 1 0
2 1 0 1
Or with explode + pd.get_dummies then groupby max:
out = pd.get_dummies(df['x'].explode()).groupby(level=0).max()
out:
abc bac cab
0 1 1 1
1 0 1 0
2 1 0 1
You can also use pd.crosstab after explode if you want counts instead of dummies:
s = df['x'].explode()
out = pd.crosstab(s.index, s)
out:
x abc bac cab
row_0
0 1 1 1
1 0 1 0
2 1 0 1
*Note: the output is the same here, but it will contain counts if a list has duplicate elements.
DataFrame:
import pandas as pd
df = pd.DataFrame({
'x': [['abc', 'bac', 'cab'], ['bac'], ['abc', 'cab']]
})
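To make the dummies-versus-counts difference concrete, here is a minimal sketch. The input is an assumption made for illustration: it is the same frame as above, except that the second list repeats 'bac'.

```python
import pandas as pd

# Same data as above, except row 1 repeats 'bac' (assumed for illustration).
df = pd.DataFrame({'x': [['abc', 'bac', 'cab'], ['bac', 'bac'], ['abc', 'cab']]})

# One row per list element; the original index label is repeated.
s = df['x'].explode()

# Indicator columns: taking the max per original row collapses repeats to 1.
dummies = pd.get_dummies(s).groupby(level=0).max().astype(int)

# Counts: crosstab tallies every occurrence, so repeats add up.
counts = pd.crosstab(s.index, s)

print(dummies)
print(counts)
```

With this input, row 1 has a 1 under bac in dummies and a 2 under bac in counts.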

I would use MultiLabelBinarizer from scikit-learn:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
s = pd.DataFrame(mlb.fit_transform(df['x']), columns=mlb.classes_, index=df.index)

Related

Change 1st instance of every unique row as 1 in pandas

Hi, let us assume I have a data frame:
Name quantity
0 a 0
1 a 0
2 b 0
3 b 0
4 c 0
And I want something like:
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
Essentially, I want to set quantity to 1 in the first row of every unique name.
Currently I am using code like this:
def store_counter(df):
    unique_names = list(df['Name'].unique())
    df['quantity'] = 0
    for i, row in df.iterrows():
        # mark only the first row seen for each name
        if row['Name'] in unique_names:
            df.loc[i, 'quantity'] = 1
            unique_names.remove(row['Name'])
    return df
This is highly inefficient. Is there a better approach?
Thank you in advance.
Use Series.duplicated with DataFrame.loc:
df.loc[~df.Name.duplicated(), 'quantity'] = 1
print (df)
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
If you need to set both values (0 and 1) at once, use numpy.where:
import numpy as np
df['quantity'] = np.where(df.Name.duplicated(), 0, 1)
print (df)
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
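As a cross-check, the same result can be reached by numbering the rows within each name. This is a sketch of an alternative technique (groupby with cumcount), not the method from the answer above; the frame is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['a', 'a', 'b', 'b', 'c'], 'quantity': 0})

# cumcount numbers rows within each group starting at 0,
# so "equal to 0" marks the first occurrence of every name.
df['quantity'] = df.groupby('Name').cumcount().eq(0).astype(int)
print(df)

# duplicated also accepts keep='last', which flags the last occurrence instead.
last_flag = (~df['Name'].duplicated(keep='last')).astype(int)
print(last_flag.tolist())
```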

insert column to df on sequenced location

I have a df like this:
id month
1 1
1 3
1 4
1 6
I want to transform it into this:
id 1 2 3 4 5 6
1 1 0 1 1 0 1
I've tried using this code:
ndf = df[['id']].join(pd.get_dummies(
df['month'])).groupby('id').max()
but the result looks like this:
id 1 3 4 6
1 1 1 1 1
How can I insert the missing columns (2 and 5) even though they are not in the data?
You can use pd.crosstab instead, then build the full set of columns with pd.RangeIndex from the min and max month, and finally use DataFrame.reindex (optionally followed by DataFrame.reset_index):
import pandas as pd
# RangeIndex excludes its stop value, so add 1 to include the max month
new_cols = pd.RangeIndex(df['month'].min(), df['month'].max() + 1)
res = (
    pd.crosstab(df['id'], df['month'])
    .reindex(columns=new_cols, fill_value=0)
    .reset_index()
)
Output:
>>> res
id 1 2 3 4 5 6
0 1 1 0 1 1 0 1
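For reference, a self-contained sketch of the crosstab and reindex approach. The frame is rebuilt from the question's data, and it assumes the columns should cover every month from the minimum through the maximum, which is why the stop value is max + 1.

```python
import pandas as pd

# Data from the question: a single id with months 1, 3, 4 and 6.
df = pd.DataFrame({'id': [1, 1, 1, 1], 'month': [1, 3, 4, 6]})

# RangeIndex excludes its stop value, hence max + 1 to keep the last month.
new_cols = pd.RangeIndex(df['month'].min(), df['month'].max() + 1)

res = (
    pd.crosstab(df['id'], df['month'])         # counts per (id, month)
    .reindex(columns=new_cols, fill_value=0)   # add the missing months as 0
    .reset_index()
)
print(res)
```

Note that crosstab returns counts. If an id could have the same month more than once and only 0/1 flags are wanted, one option is to clip the values at 1 afterwards.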

Adding new column as a sum of the subsequent columns [duplicate]

I have this df:
id car truck bus bike
0 1 1 0 0
1 0 0 1 0
2 1 1 1 1
I want to add another column, count, placed after id and before car, that holds the row-wise sum of the other columns, like this:
id count car truck bus bike
0 2 1 1 0 0
1 1 0 0 1 0
2 4 1 1 1 1
I know how to add the column using this code:
df.loc[:,'count'] = df.sum(numeric_only=True, axis=1)
but the above code adds the new column in the last position.
How can I place it after id instead?
There are several ways; here are two alternatives. Note that id is numeric as well, so it is dropped before summing; otherwise it would be added into count.
#1. Reordering the columns after creating the count column:
df.loc[:, 'count'] = df.drop(columns='id').sum(numeric_only=True, axis=1)
df = df[['id', 'count', 'car', 'truck', 'bus', 'bike']]
print(df)
# id count car truck bus bike
#0 0 2 1 1 0 0
#1 1 1 0 0 1 0
#2 2 4 1 1 1 1
#2. Inserting a Series at a specific position using the insert function:
df.insert(1, "count", df.drop(columns='id').sum(numeric_only=True, axis=1))
print(df)
# id count car truck bus bike
#0 0 2 1 1 0 0
#1 1 1 0 0 1 0
#2 2 4 1 1 1 1
Try this slight modification of your code, which drops id before summing:
import pandas as pd
df = pd.DataFrame(data={'id':[0,1,2],'car':[1,0,1],'truck':[1,0,1],'bus':[0,1,1],'bike':[0,0,1]})
count = df.drop(columns=['id']).sum(numeric_only=True, axis=1)
df.insert(1, "count", count)
print(df)

Simple addition of different sizes DataFrames in Pandas

I have two very simple addition problems with pandas that I hope you can help me with.
My first question:
Let's say I have the following two dataframes, a_df and b_df:
a = [[1,1,1,1],[0,0,0,0],[1,1,0,0]]
a_df = pd.DataFrame(a)
a_df =
0 1 2 3
0 1 1 1 1
1 0 0 0 0
2 1 1 0 0
b = [1,1,1,1]
b_df = pd.DataFrame(b).T
b_df=
0 1 2 3
0 1 1 1 1
I would like to add b_df to a_df to obtain c_df, such that my expected output is as follows:
c_df =
0 1 2 3
0 2 2 2 2
1 1 1 1 1
2 2 2 1 1
My current method is to replicate b_df to the same size as a_df and then carry out the addition, as shown below. However, this is not very efficient if a_df is very large.
a = [[1,1,1,1],[0,0,0,0],[1,1,0,0]]
a_df = pd.DataFrame(a)
b = [1,1,1,1]
b_df = pd.DataFrame(b).T
b_df = pd.concat([b_df]*len(a_df)).reset_index(drop=True)
c_df = a_df + b_df
Is there another way to add b_df to a_df (without replicating it) to obtain the c_df I want?
My second question is very similar to my first one:
Let's say I have d_df and e_df as follows:
d = [1,1,1,1]
d_df = pd.DataFrame(d)
d_df=
0
0 1
1 1
2 1
3 1
e = [1]
e_df = pd.DataFrame(e)
e_df=
0
0 1
I want to add e_df to d_df such that I would get the following result:
0
0 2
1 2
2 2
3 2
Again, I am currently replicating e_df using the following method (same as in Question 1) before adding it to d_df:
d = [1,1,1,1]
d_df = pd.DataFrame(d)
e = [1]
e_df = pd.DataFrame(e)
e_df = pd.concat([e_df]*len(d_df)).reset_index(drop=True)
f_df = d_df + e_df
Is there a way to do this without replicating e_df?
Please advise. Thank you so much in advance.
Tommy
Try this:
pd.DataFrame(a_df.to_numpy() + b_df.to_numpy())
0 1 2 3
0 2 2 2 2
1 1 1 1 1
2 2 2 1 1
NumPy offers broadcasting, which lets you add arrays of different shapes as long as the shapes are compatible: each dimension must either match or be 1. Here the (1, 4) array is stretched across the rows of the (3, 4) array. Note that converting to NumPy discards the index and column labels, so the result gets default ones.
The NumPy documentation on broadcasting explains this pretty well.
For the first question, convert the one-row DataFrame to a Series:
c_df = a_df + b_df.iloc[0]
print (c_df)
0 1 2 3
0 2 2 2 2
1 1 1 1 1
2 2 2 1 1
The same principle applies to the second:
c_df = d_df + e_df.iloc[0]
print (c_df)
0
0 2
1 2
2 2
3 2
More information can be found in How do I operate on a DataFrame with a Series for every column.
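Putting both cases together, a runnable sketch that keeps the pandas labels. The explicit DataFrame.add call with axis='columns' and the scalar extraction with iat are choices made here for illustration; the answers above use the + operator instead.

```python
import pandas as pd

a_df = pd.DataFrame([[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 0, 0]])
b_df = pd.DataFrame([1, 1, 1, 1]).T

# A Series taken from the single row is aligned on column labels
# and added to every row of a_df; no replication is needed.
c_df = a_df.add(b_df.iloc[0], axis='columns')

d_df = pd.DataFrame([1, 1, 1, 1])
e_df = pd.DataFrame([1])

# A 1x1 frame holds a single value, so pull it out as a scalar;
# scalars broadcast over the whole frame.
f_df = d_df + e_df.iat[0, 0]

print(c_df)
print(f_df)
```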

Pandas truth value of series ambiguous

I am trying to set one column in a dataframe in pandas based on whether another column value is in a list.
I try:
df['IND']=pd.Series(np.where(df['VALUE'] == 1 or df['VALUE'] == 4, 1,0))
But I get: Truth value of a Series is ambiguous.
What is the best way to achieve the functionality:
If VALUE is in (1,4), then IND=1, else IND=0
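Python's or cannot combine two boolean Series, which is what raises the error. Instead, assign the else value first and then overwrite it where a mask built with isin is True: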
df['IND'] = 0
df.loc[df['VALUE'].isin([1,4]), 'IND'] = 1
For multiple conditions, you can do as follows:
mask1 = df['VALUE'].isin([1,4])
mask2 = df['SUBVALUE'].isin([10,40])
df['IND'] = 0
df.loc[mask1 & mask2, 'IND'] = 1
Consider the example below:
df = pd.DataFrame({
'VALUE': [1,1,2,2,3,3,4,4]
})
Output:
VALUE
0 1
1 1
2 2
3 2
4 3
5 3
6 4
7 4
Then,
df['IND'] = 0
df.loc[df['VALUE'].isin([1,4]), 'IND'] = 1
Output:
VALUE IND
0 1 1
1 1 1
2 2 0
3 2 0
4 3 0
5 3 0
6 4 1
7 4 1
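For completeness, a sketch of how the original np.where attempt can be repaired. The error comes from Python's or, which asks each Series for a single truth value; the element-wise operator | with parenthesised comparisons avoids that. The column name IND_ISIN is made up here to show the isin variant side by side.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'VALUE': [1, 1, 2, 2, 3, 3, 4, 4]})

# Element-wise OR; the parentheses are required because | binds
# more tightly than ==.
df['IND'] = np.where((df['VALUE'] == 1) | (df['VALUE'] == 4), 1, 0)

# Equivalent one-liner: isin gives a boolean mask, cast to 0/1.
df['IND_ISIN'] = df['VALUE'].isin([1, 4]).astype(int)

print(df)
```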