Split list column in a dataframe into separate 1/0 entry columns [duplicate] - pandas

I have a dataframe where one column is a list of groups each of my users belongs to. Something like:
index groups
0 ['a','b','c']
1 ['c']
2 ['b','c','e']
3 ['a','c']
4 ['b','e']
And what I would like to do is create a series of dummy columns to identify which groups each user belongs to in order to run some analyses
index a b c d e
0 1 1 1 0 0
1 0 0 1 0 0
2 0 1 1 0 1
3 1 0 1 0 0
4 0 1 0 0 1
pd.get_dummies(df['groups'])
won't work because that just returns a column for each different list in my column.
The solution needs to be efficient as the dataframe will contain 500,000+ rows.

Using s for your df['groups']:
In [21]: s = pd.Series({0: ['a', 'b', 'c'], 1:['c'], 2: ['b', 'c', 'e'], 3: ['a', 'c'], 4: ['b', 'e'] })
In [22]: s
Out[22]:
0 [a, b, c]
1 [c]
2 [b, c, e]
3 [a, c]
4 [b, e]
dtype: object
This is a possible solution:
In [23]: pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)
Out[23]:
a b c e
0 1 1 1 0
1 0 0 1 0
2 0 1 1 1
3 1 0 1 0
4 0 1 0 1
The logic of this is:
.apply(pd.Series) converts the series of lists to a dataframe
.stack() puts everything in one column again (creating a multi-level index)
pd.get_dummies() creates the dummies
.sum(level=0) merges the rows that belong to the same original row back together (summing over the second level and keeping only the original level, level=0)
A slightly different equivalent is pd.get_dummies(s.apply(pd.Series), prefix='', prefix_sep='').sum(level=0, axis=1)
I don't know whether this will be efficient enough, but in any case, if performance is important, storing lists in a dataframe is not a very good idea.
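The level argument of .sum() was removed in pandas 2.0, so on current versions the last step has to be written as a groupby on the index level. Here is a minimal sketch of the same four steps under that assumption; the intermediate variable names (wide, long, dummies, result) are only for illustration:

```python
import pandas as pd

s = pd.Series({0: ['a', 'b', 'c'], 1: ['c'], 2: ['b', 'c', 'e'],
               3: ['a', 'c'], 4: ['b', 'e']})

# Step 1: one column per list position, shorter lists padded with NaN
wide = s.apply(pd.Series)

# Step 2: back to a single column, indexed by (original row, list position)
long = wide.stack()

# Step 3: one indicator column per distinct label (NaN entries are ignored)
dummies = pd.get_dummies(long)

# Step 4: collapse the rows that share the same original row label;
# groupby(level=0).sum() replaces the removed .sum(level=0)
result = dummies.groupby(level=0).sum().astype(int)
print(result)
```

The final astype(int) is only there so the result has the same integer dtype regardless of whether get_dummies returns booleans or small unsigned integers in the installed version.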

Very fast solution in case you have a large dataframe
Using sklearn.preprocessing.MultiLabelBinarizer
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
df = pd.DataFrame(
{'groups':
[['a','b','c'],
['c'],
['b','c','e'],
['a','c'],
['b','e']]
}, columns=['groups'])
s = df['groups']
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(s),columns=mlb.classes_, index=df.index)
Result:
a b c e
0 1 1 1 0
1 0 0 1 0
2 0 1 1 1
3 1 0 1 0
4 0 1 0 1
This worked for me and has also been suggested in other answers.

This is even faster:
pd.get_dummies(df['groups'].explode()).sum(level=0)
Using .explode() instead of .apply(pd.Series).stack()
Comparing with the other solutions:
import timeit
import pandas as pd
setup = '''
import time
import pandas as pd
s = pd.Series({0:['a','b','c'],1:['c'],2:['b','c','e'],3:['a','c'],4:['b','e']})
df = s.rename('groups').to_frame()
'''
m1 = "pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)"
m2 = "df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')"
m3 = "pd.get_dummies(df['groups'].explode()).sum(level=0)"
times = {f"m{i+1}":min(timeit.Timer(m, setup=setup).repeat(7, 1000)) for i, m in enumerate([m1, m2, m3])}
pd.DataFrame([times],index=['ms'])
# m1 m2 m3
# ms 5.586517 3.821662 2.547167
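A sketch of the explode variant written for pandas 2.0 and later, where .sum(level=0) is no longer available. It also adds the all-zero d column from the question's desired output, which none of the approaches produce on their own because d never occurs in the data. The all_groups list is an assumption about which labels you expect:

```python
import pandas as pd

df = pd.DataFrame({'groups': [['a', 'b', 'c'], ['c'], ['b', 'c', 'e'],
                              ['a', 'c'], ['b', 'e']]})

# Hypothetical full list of labels, including ones absent from the data
all_groups = ['a', 'b', 'c', 'd', 'e']

dummies = (
    pd.get_dummies(df['groups'].explode())  # one row per (user, group) pair
    .groupby(level=0).sum()                 # collapse back to one row per user
    .astype(int)
    .reindex(columns=all_groups, fill_value=0)  # add missing labels as zeros
)
print(dummies)
```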

Even though this question was already answered, I have a faster solution:
df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')
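And, in case you have empty groups or NaN, you can filter those rows out first (note that the groups column has to be selected before calling apply):
df.loc[df.groups.str.len() > 0, 'groups'].apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')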
How it works
Inside the lambda, x is your list, for example ['a', 'b', 'c']. So pd.Series will be as follows:
In [2]: pd.Series([1, 1, 1], index=['a', 'b', 'c'])
Out[2]:
a 1
b 1
c 1
dtype: int64
When all these pd.Series are put together, they become a pd.DataFrame and their index labels become columns; a label missing from a given row shows up as NaN in that column, as you can see next:
In [4]: a = pd.Series([1, 1, 1], index=['a', 'b', 'c'])
In [5]: b = pd.Series([1, 1, 1], index=['a', 'b', 'd'])
In [6]: pd.DataFrame([a, b])
Out[6]:
a b c d
0 1.0 1.0 1.0 NaN
1 1.0 1.0 NaN 1.0
Now fillna fills those NaN with 0:
In [7]: pd.DataFrame([a, b]).fillna(0)
Out[7]:
a b c d
0 1.0 1.0 1.0 0.0
1 1.0 1.0 0.0 1.0
And downcast='infer' is to downcast from float to int:
In [11]: pd.DataFrame([a, b]).fillna(0, downcast='infer')
Out[11]:
a b c d
0 1 1 1 0
1 1 1 0 1
P.S.: Using .fillna(0, downcast='infer') is optional.
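Newer pandas releases deprecate the downcast argument of fillna (as far as I know, starting with 2.2). A sketch of the same idea without it, using a plain .fillna(0) followed by .astype(int); the sort_index call is only there to make the column order predictable:

```python
import pandas as pd

df = pd.DataFrame({'groups': [['a', 'b', 'c'], ['c'], ['b', 'c', 'e'],
                              ['a', 'c'], ['b', 'e']]})

indicators = (
    df['groups']
    .apply(lambda labels: pd.Series(1, index=labels))  # one Series of 1s per row
    .fillna(0)           # labels missing from a row become 0
    .astype(int)         # explicit cast instead of downcast='infer'
    .sort_index(axis=1)  # alphabetical column order
)
print(indicators)
```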

You can use explode and crosstab:
s = pd.Series([['a', 'b', 'c'], ['c'], ['b', 'c', 'e'], ['a', 'c'], ['b', 'e']])
s = s.explode()
pd.crosstab(s.index, s)
Output:
col_0 a b c e
row_0
0 1 1 1 0
1 0 0 1 0
2 0 1 1 1
3 1 0 1 0
4 0 1 0 1
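One thing to keep in mind: crosstab counts occurrences, so a label that appears twice in the same list gives 2 rather than 1. If you need strict 1/0 indicators, clipping the counts is one option. A small sketch with made-up data that contains a repeated label:

```python
import pandas as pd

# Row 0 contains 'a' twice (illustrative data, not from the question)
s = pd.Series([['a', 'a', 'b'], ['c']]).explode()

counts = pd.crosstab(s.index, s)    # number of occurrences per row and label
indicators = counts.clip(upper=1)   # cap at 1 to get membership flags
print(indicators)
```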

You can use str.join to join the elements of each list into a single string, and then use str.get_dummies (this assumes the labels are strings that do not contain the '|' separator):
out = df.join(df['groups'].str.join('|').str.get_dummies())
print(out)
groups a b c e
0 [a, b, c] 1 1 1 0
1 [c] 0 0 1 0
2 [b, c, e] 0 1 1 1
3 [a, c] 1 0 1 0
4 [b, e] 0 1 0 1

Related

Python pandas explode on multiple columns in a Cartesian manner

I have a dataframe in which some columns contain lists. I want to explode the columns, but instead of one-to-one I want Cartesian (multiplicative) rows to be generated.
Does it have to go through a for loop, or is something more elegant possible?
df = pd.DataFrame({'A': [[0, 1],5],
'B': 1,
'C': [['a', 'b'], 'phau']})
df.explode(['A','C'])
#Default output
A B C
0 0 1 a
0 1 1 b
1 5 1 phau
#Desired output
A B C
0 0 1 a
0 0 1 b
0 1 1 a
0 1 1 b
1 5 1 phau
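One possible approach (a sketch, not taken from an answer in the thread) is to explode the columns one at a time instead of together. Each call repeats the row for every element of that column, so chaining the calls gives the Cartesian product per original row. This assumes a reasonably recent pandas, where explode handles the repeated index labels produced by the first call:

```python
import pandas as pd

df = pd.DataFrame({'A': [[0, 1], 5],
                   'B': 1,
                   'C': [['a', 'b'], 'phau']})

# Exploding A first duplicates row 0 (once per element of A);
# exploding C afterwards duplicates each of those rows per element of C.
out = df.explode('A').explode('C')
print(out)
```

Scalars such as 5 and 'phau' are left as they are, so row 1 still yields a single row.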

how to count elements of list and make a table?

Hi, I have a dataframe in which one column contains lists.
d = {'col1': {0: 'A', 1: 'A', 2: 'B'},
'col2': {0: ['a', 'b', 'c', 'a'], 1: ['b', 'c'], 2: ['a', 'd', 'e']}}
pd.DataFrame(d)
col1 col2
0 A [a, b, c, a]
1 A [b, c]
2 B [a, d, e]
How can I count each element of the lists and turn the rows into columns? Note that some rows share the same name, such as A.
output:
col2 A A1 B
0 a 2 0 1
1 b 1 1 0
2 c 1 1 0
3 d 0 0 1
4 e 0 0 1
Assuming col2 contains lists, you can use groupby + cumcount to append a counter (1, 2, ...) to each repeated name such as A, and then explode and use crosstab:
u = df.assign(col1=df['col1']+df.groupby("col1").cumcount()
.replace(0,'').astype(str)).explode('col2')
out = pd.crosstab(u['col2'],u['col1']).rename_axis(None,axis=1) #.reset_index()
print(out)
A A1 B
col2
a 2 0 1
b 1 1 0
c 1 1 0
d 0 0 1
e 0 0 1

Re-index to insert missing rows in a multi-indexed dataframe

I have a MultiIndexed DataFrame with three levels of indices. I would like to expand my third level to contain all values in a given range, but only for the existing values in the two upper levels.
For example, assume the first level is name, the second level is date and the third level is hour. I would like to have rows for all 24 possible hours (even if some are currently missing), but only for the already existing names and dates. The values in new rows can be filled with zeros.
So a simple example input would be:
>>> import pandas as pd
>>> df = pd.DataFrame([[1,1,1,3],[2,2,1,4], [3,3,2,5]], columns=['A', 'B', 'C','val'])
>>> df.set_index(['A', 'B', 'C'], inplace=True)
>>> df
val
A B C
1 1 1 3
2 2 1 4
3 3 2 5
if the required values for C are [1,2,3], the desired output would be:
val
A B C
1 1 1 3
2 0
3 0
2 2 1 4
2 0
3 0
3 3 1 0
2 5
3 0
I know how to achieve this using groupby and applying a defined function for each group, but I was wondering if there was a cleaner way of doing this with reindex (I couldn't make this one work for a MultiIndex case, but perhaps I'm missing something)
Use -
partial_indices = [ i[0:2] for i in df.index.values ]
C_reqd = [1, 2, 3]
final_indices = [j+(i,) for j in partial_indices for i in C_reqd]
index = pd.MultiIndex.from_tuples(final_indices, names=['A', 'B', 'C'])
df2 = pd.DataFrame(pd.Series(0, index), columns=['val'])
df2.update(df)
Output
df2
val
A B C
1 1 1 3.0
2 0.0
3 0.0
2 2 1 4.0
2 0.0
3 0.0
3 3 1 0.0
2 5.0
3 0.0
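Since the question asks about reindex specifically, here is a sketch of that route: build the full MultiIndex from the existing (A, B) pairs and the required C values, then reindex with fill_value=0. Unlike the update approach, this should keep val as an integer column. The names c_required, pairs and full_index are only for illustration:

```python
import pandas as pd

df = pd.DataFrame([[1, 1, 1, 3], [2, 2, 1, 4], [3, 3, 2, 5]],
                  columns=['A', 'B', 'C', 'val']).set_index(['A', 'B', 'C'])

c_required = [1, 2, 3]

# Existing (A, B) combinations, each listed once
pairs = df.index.droplevel('C').unique()

# Every existing (A, B) pair combined with every required C value
full_index = pd.MultiIndex.from_tuples(
    [(a, b, c) for a, b in pairs for c in c_required],
    names=df.index.names,
)

# Rows missing from df are created and filled with 0
out = df.reindex(full_index, fill_value=0)
print(out)
```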

Pandas: Delete duplicated items in a specific column

I have a pandas dataframe (here represented using Excel):
Now I would like to delete all duplicates (1) in a specific column (B).
How can I do it?
For this example, the result would look like this:
You can use duplicated to build a boolean mask and then set NaNs with loc, mask or numpy.where:
df.loc[df['B'].duplicated(), 'B'] = np.nan
df['B'] = df['B'].mask(df['B'].duplicated())
df['B'] = np.where(df['B'].duplicated(), np.nan,df['B'])
Alternatively, if you need to remove the duplicated rows based on column B:
df = df.drop_duplicates(subset=['B'])
Sample:
df = pd.DataFrame({
'B': [1,2,1,3],
'A':[1,5,7,9]
})
print (df)
A B
0 1 1
1 5 2
2 7 1
3 9 3
df.loc[df['B'].duplicated(), 'B'] = np.nan
print (df)
A B
0 1 1.0
1 5 2.0
2 7 NaN
3 9 3.0
df = df.drop_duplicates(subset=['B'])
print (df)
A B
0 1 1
1 5 2
3 9 3
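By default duplicated() leaves the first occurrence unmarked. If "delete all duplicates" is meant to include the first occurrence as well, the keep=False option marks every repeated value. A short sketch comparing the two on the sample data (the variable names are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 7, 9], 'B': [1, 2, 1, 3]})

# Default keep='first': only the second 1 is blanked out
first_kept = df['B'].mask(df['B'].duplicated())

# keep=False: every value that occurs more than once is blanked out
none_kept = df['B'].mask(df['B'].duplicated(keep=False))

print(pd.DataFrame({'first_kept': first_kept, 'none_kept': none_kept}))
```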

filter dataframe rows based on length of column values

I have a pandas dataframe as follows:
df = pd.DataFrame([ [1,2], [np.nan,1], ['test string1', 5]], columns=['A','B'] )
df
A B
0 1 2
1 NaN 1
2 test string1 5
I am using pandas 0.20. What is the most efficient way to remove any rows where 'any' of its column values has length > 10?
len('test string1')
12
So for the above e.g., I am expecting an output as follows:
df
A B
0 1 2
1 NaN 1
If based on column A
In [865]: df[~(df.A.str.len() > 10)]
Out[865]:
A B
0 1 2
1 NaN 1
If based on all columns
In [866]: df[~df.applymap(lambda x: len(str(x)) > 10).any(axis=1)]
Out[866]:
A B
0 1 2
1 NaN 1
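On newer pandas versions applymap is deprecated in favour of DataFrame.map (as far as I know, since 2.1). A sketch of the all-columns filter that picks whichever method is available; the elementwise and too_long names are only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [np.nan, 1], ['test string1', 5]],
                  columns=['A', 'B'])

# Use DataFrame.map where it exists, otherwise fall back to applymap
elementwise = df.map if hasattr(df, 'map') else df.applymap

# True for rows where any value's string form is longer than 10 characters
too_long = elementwise(lambda x: len(str(x)) > 10).any(axis=1)

out = df[~too_long]
print(out)
```

Note that casting with str() means NaN is measured as the three-character string 'nan', which is harmless for a threshold of 10.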
I had to cast to a string for Diego's answer to work:
df = df[df['A'].apply(lambda x: len(str(x)) <= 10)]
In [42]: df
Out[42]:
A B C D
0 1 2 2 2017-01-01
1 NaN 1 NaN 2017-01-02
2 test string1 5 test string1test string1 2017-01-03
In [43]: df.dtypes
Out[43]:
A object
B int64
C object
D datetime64[ns]
dtype: object
In [44]: df.loc[~df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)]
Out[44]:
A B C D
0 1 2 2 2017-01-01
1 NaN 1 NaN 2017-01-02
Explanation:
df.select_dtypes(['object']) selects only columns of object (str) dtype:
In [45]: df.select_dtypes(['object'])
Out[45]:
A C
0 1 2
1 NaN NaN
2 test string1 test string1test string1
In [46]: df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10))
Out[46]:
A C
0 False False
1 False False
2 True True
now we can "aggregate" it as follows:
In [47]: df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)
Out[47]:
0 False
1 False
2 True
dtype: bool
finally we can select only those rows where value is False:
In [48]: df.loc[~df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)]
Out[48]:
A B C D
0 1 2 2 2017-01-01
1 NaN 1 NaN 2017-01-02
Use the apply function of the series to keep only the rows whose value in A has length at most 10 (this assumes every value in A is a string):
df = df[df['A'].apply(lambda x: len(x) <= 10)]