I have a dataframe where one column is a list of groups each of my users belongs to. Something like:
index groups
0 ['a','b','c']
1 ['c']
2 ['b','c','e']
3 ['a','c']
4 ['b','e']
What I would like to do is create a set of dummy columns identifying which groups each user belongs to, in order to run some analyses:
index a b c d e
0 1 1 1 0 0
1 0 0 1 0 0
2 0 1 1 0 1
3 1 0 1 0 0
4 0 1 0 0 1
pd.get_dummies(df['groups'])
won't work because that just returns a column for each different list in my column.
The solution needs to be efficient as the dataframe will contain 500,000+ rows.
Using s for your df['groups']:
In [21]: s = pd.Series({0: ['a', 'b', 'c'], 1:['c'], 2: ['b', 'c', 'e'], 3: ['a', 'c'], 4: ['b', 'e'] })
In [22]: s
Out[22]:
0 [a, b, c]
1 [c]
2 [b, c, e]
3 [a, c]
4 [b, e]
dtype: object
This is a possible solution:
In [23]: pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)
Out[23]:
a b c e
0 1 1 1 0
1 0 0 1 0
2 0 1 1 1
3 1 0 1 0
4 0 1 0 1
The logic of this is:
.apply(Series) converts the series of lists to a dataframe
.stack() puts everything in one column again (creating a multi-level index)
pd.get_dummies( ) creates the dummies
.sum(level=0) merges the rows that belong to the same original row back into one (summing over the second level and keeping only the original level, level=0)
A slightly different equivalent is pd.get_dummies(s.apply(pd.Series), prefix='', prefix_sep='').sum(level=0, axis=1)
I don't know whether this will be efficient enough, but in any case, if performance is important, storing lists in a dataframe is not a very good idea.
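The steps listed above can be sketched one at a time, as follows. This is only a sketch: it swaps .sum(level=0) for .groupby(level=0).sum(), on the assumption that a recent pandas is in use (to my knowledge the level argument of sum was removed in pandas 2.0), and the intermediate variable names are purely illustrative.

```python
import pandas as pd

s = pd.Series({0: ['a', 'b', 'c'], 1: ['c'], 2: ['b', 'c', 'e'],
               3: ['a', 'c'], 4: ['b', 'e']})

# One column per list position; shorter lists are padded with NaN.
wide = s.apply(pd.Series)

# Back to a single column, indexed by (original row, list position).
stacked = wide.stack()

# One indicator column per distinct label (NaN entries get no column).
dummies = pd.get_dummies(stacked)

# Collapse the list-position level so each original row appears once.
result = dummies.groupby(level=0).sum().astype(int)
print(result)
```

The final astype(int) is there because, depending on the pandas version, get_dummies may return booleans or small unsigned integers.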
Very fast solution in case you have a large dataframe
Using sklearn.preprocessing.MultiLabelBinarizer
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
df = pd.DataFrame(
{'groups':
[['a','b','c'],
['c'],
['b','c','e'],
['a','c'],
['b','e']]
}, columns=['groups'])
s = df['groups']
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(s),columns=mlb.classes_, index=df.index)
Result:
a b c e
0 1 1 1 0
1 0 0 1 0
2 0 1 1 1
3 1 0 1 0
4 0 1 0 1
This worked for me, and the same approach has been suggested in other answers as well.
This is even faster:
pd.get_dummies(df['groups'].explode()).sum(level=0)
Using .explode() instead of .apply(pd.Series).stack()
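A minimal sketch of the explode approach, assuming a pandas version where .sum(level=0) is no longer available, so .groupby(level=0).sum() is used instead. The empty list in the sample data is my own addition, included to show what happens to users with no groups.

```python
import pandas as pd

df = pd.DataFrame({'groups': [['a', 'b', 'c'], ['c'], [], ['b', 'e']]})

# explode() repeats the index once per list element; an empty list becomes NaN.
exploded = df['groups'].explode()

# get_dummies skips NaN, so the empty-list row ends up with all zeros.
result = pd.get_dummies(exploded).groupby(level=0).sum().astype(int)
print(result)
```

Because the grouping is on the original index, the row for the empty list is kept rather than dropped.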
Comparing with the other solutions:
import timeit
import pandas as pd
setup = '''
import time
import pandas as pd
s = pd.Series({0:['a','b','c'],1:['c'],2:['b','c','e'],3:['a','c'],4:['b','e']})
df = s.rename('groups').to_frame()
'''
m1 = "pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)"
m2 = "df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')"
m3 = "pd.get_dummies(df['groups'].explode()).sum(level=0)"
times = {f"m{i+1}":min(timeit.Timer(m, setup=setup).repeat(7, 1000)) for i, m in enumerate([m1, m2, m3])}
pd.DataFrame([times],index=['ms'])
# m1 m2 m3
# ms 5.586517 3.821662 2.547167
Even though this question was answered, I have a faster solution:
df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')
And, in case you have empty groups or NaN, you could filter them out first:
df.loc[df.groups.str.len() > 0, 'groups'].apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')
How it works
Inside the lambda, x is your list, for example ['a', 'b', 'c']. So pd.Series will be as follows:
In [2]: pd.Series([1, 1, 1], index=['a', 'b', 'c'])
Out[2]:
a 1
b 1
c 1
dtype: int64
When all the pd.Series are combined, they become a pd.DataFrame and their indexes become columns; a label missing from a given Series shows up as NaN in that column, as you can see next:
In [4]: a = pd.Series([1, 1, 1], index=['a', 'b', 'c'])
In [5]: b = pd.Series([1, 1, 1], index=['a', 'b', 'd'])
In [6]: pd.DataFrame([a, b])
Out[6]:
a b c d
0 1.0 1.0 1.0 NaN
1 1.0 1.0 NaN 1.0
Now fillna fills those NaN with 0:
In [7]: pd.DataFrame([a, b]).fillna(0)
Out[7]:
a b c d
0 1.0 1.0 1.0 0.0
1 1.0 1.0 0.0 1.0
And downcast='infer' is to downcast from float to int:
In [11]: pd.DataFrame([a, b]).fillna(0, downcast='infer')
Out[11]:
a b c d
0 1 1 1 0
1 1 1 0 1
P.S.: Using .fillna(0, downcast='infer') is not strictly required.
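For reference, the whole idea can be sketched end to end as below. This variant fills and casts explicitly with .fillna(0).astype(int) instead of relying on downcast='infer'; that is my substitution, made on the assumption that newer pandas releases deprecate the downcast argument.

```python
import pandas as pd

df = pd.DataFrame({'groups': [['a', 'b', 'c'], ['c'], ['b', 'c', 'e'],
                              ['a', 'c'], ['b', 'e']]})

# Each list becomes a Series of ones indexed by its labels;
# pandas aligns those Series into columns, leaving NaN where a label is absent.
wide = df['groups'].apply(lambda labels: pd.Series(1, index=labels))

# NaN forces float columns, so fill with 0 and cast back to int explicitly.
result = wide.fillna(0).astype(int)
print(result)
```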
You can use explode and crosstab:
s = pd.Series([['a', 'b', 'c'], ['c'], ['b', 'c', 'e'], ['a', 'c'], ['b', 'e']])
s = s.explode()
pd.crosstab(s.index, s)
Output:
col_0 a b c e
row_0
0 1 1 1 0
1 0 0 1 0
2 0 1 1 1
3 1 0 1 0
4 0 1 0 1
You can use str.join to join the elements of each list in the series into a single string, and then use str.get_dummies:
out = df.join(df['groups'].str.join('|').str.get_dummies())
print(out)
groups a b c e
0 [a, b, c] 1 1 1 0
1 [c] 0 0 1 0
2 [b, c, e] 0 1 1 1
3 [a, c] 1 0 1 0
4 [b, e] 0 1 0 1
I have two dataframes of the same length, and I'd like to compare a specific column between them. For each row, whichever dataframe has the larger value in the first column should supply the value of its second column to a new dataframe.
See example. The first dataframe:
0 class
0 1.9 0
1 9.8 0
2 4.5 0
3 8.1 0
4 1.9 0
The second dataframe:
0 class
0 1.4 1
1 7.8 1
2 8.5 1
3 9.1 1
4 3.9 1
The new dataframe should look like:
class
0 0
1 0
2 1
3 1
4 1
Use numpy.where with DataFrame constructor:
df = pd.DataFrame({'class': np.where(df1[0] > df2[0], df1['class'], df2['class'])})
Or DataFrame.where:
df = df1[['class']].where(df1[0] > df2[0], df2[['class']])
print (df)
class
0 0
1 0
2 1
3 1
4 1
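A self-contained sketch of the numpy.where approach, with the two frames rebuilt from the question's sample data. The construction of df1 and df2 is my own assumption; in particular it assumes the first column is labelled with the integer 0.

```python
import numpy as np
import pandas as pd

# Sample frames rebuilt from the question (column label 0 assumed to be an int).
df1 = pd.DataFrame({0: [1.9, 9.8, 4.5, 8.1, 1.9], 'class': 0})
df2 = pd.DataFrame({0: [1.4, 7.8, 8.5, 9.1, 3.9], 'class': 1})

# Row by row: take df1's class where df1 has the larger value, else df2's.
result = pd.DataFrame(
    {'class': np.where(df1[0] > df2[0], df1['class'], df2['class'])})
print(result)
```

np.where works on the underlying arrays positionally, so it relies on both frames having the same length and row order.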
EDIT:
If there is another condition, use numpy.select and, if necessary, numpy.isclose:
print (df2)
0 class
0 1.4 1
1 7.8 1
2 8.5 1
3 9.1 1
4 1.9 1
masks = [df1[0] == df2[0], df1[0] > df2[0]]
# if you need to compare floats within some tolerance
#masks = [np.isclose(df1[0], df2[0]), df1[0] > df2[0]]
vals = ['not_determined', df1['class']]
df = pd.DataFrame({'class': np.select(masks, vals, df2['class'])})
print (df)
class
0 0
1 0
2 1
3 1
4 not_determined
Or:
masks = [df1[0] == df2[0], df1[0] > df2[0]]
vals = ['not_determined', 0]
df = pd.DataFrame({'class': np.select(masks, vals, 1)})
print (df)
class
0 0
1 0
2 1
3 1
4 not_determined
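A runnable sketch of the numpy.select variant with a tie case. It casts the class columns to object before mixing them with the 'not_determined' string; that cast is my own precaution, on the assumption that some NumPy versions refuse to combine integer and string choices.

```python
import numpy as np
import pandas as pd

# Sample frames rebuilt from the question; the last row is a tie (1.9 vs 1.9).
df1 = pd.DataFrame({0: [1.9, 9.8, 4.5, 8.1, 1.9], 'class': 0})
df2 = pd.DataFrame({0: [1.4, 7.8, 8.5, 9.1, 1.9], 'class': 1})

# Conditions are checked in order, so the tie test must come first.
tie = np.isclose(df1[0], df2[0])
df1_bigger = (df1[0] > df2[0]).to_numpy()

# Object dtype lets integers and the string label share one result array.
choices = ['not_determined', df1['class'].to_numpy(dtype=object)]
default = df2['class'].to_numpy(dtype=object)

result = pd.DataFrame(
    {'class': np.select([tie, df1_bigger], choices, default=default)})
print(result)
```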
An out-of-the-box solution:
df = np.sign(df1[0].sub(df2[0])).map({1:0, -1:1, 0:'not_determined'}).to_frame('class')
print (df)
class
0 0
1 0
2 1
3 1
4 not_determined
Since class is 0 and 1, you could try,
df1[0].lt(df2[0]).astype(int)
For generic solutions, check jezrael's answer.
Try this one:
>>> import numpy as np
>>> import pandas as pd
>>> df_1
0 class
0 1.9 0
1 9.8 0
2 4.5 0
3 8.1 0
4 1.9 0
>>> df_2
0 class
0 1.4 1
1 7.8 1
2 8.5 1
3 9.1 1
4 3.9 1
>>> df_3=pd.DataFrame()
>>> df_3["class"]=np.where(df_1["0"]>df_2["0"], df_1["class"], df_2["class"])
>>> df_3
class
0 0
1 0
2 1
3 1
4 1
The lambdas in the following code return the same Series, but the aggregation results are different. Why?
import pandas as pd
df=pd.DataFrame([1, 2])
print(df)
print(df.agg({0: lambda x: x.cumsum()}))
print(df.agg({0: lambda x: pd.Series([1, 3], name=0)}))
Which gives:
0
0 1
1 2
0
0 1
1 3
0
0 1
0 1 3
1 1 3
is_correct, question_id
t 1
t 1
f 1
f 1
t 2
t 2
Desired results:
correct_count, incorrect_count, question_id
2 2 1
2 0 2
This is what I have, but I can only get a correct count
df[df["is_correct"]].groupby("question_id")["question_id"].count()
You can use the pivot_table function for that:
In [28]: data = """\
....: is_correct question_id
....: t 1
....: t 1
....: f 1
....: f 1
....: t 2
....: t 2
....: """
In [29]: df = pd.read_csv(io.StringIO(data), delim_whitespace=True)
In [30]: df['count'] = 0
In [31]: df
Out[31]:
is_correct question_id count
0 t 1 0
1 t 1 0
2 f 1 0
3 f 1 0
4 t 2 0
5 t 2 0
In [32]: df.pivot_table(index='question_id', columns='is_correct',
....: values='count', aggfunc='count', fill_value=0)\
....: .reset_index()
Out[32]:
is_correct question_id f t
0 1 2 2
1 2 0 2
You can use a groupby after creating another column that you will use for counting:
df = pd.DataFrame({'is_correct':['t','t','f','f','t','t'],'question_id':[1,1,1,1,2,2]})
df['to_sum_up']=1
is_correct question_id to_sum_up
t 1 1
t 1 1
f 1 1
f 1 1
t 2 1
t 2 1
df2 = df.groupby(['question_id','is_correct'],as_index = False).sum()
Once you've made your groupby, you just have to rearrange your data so that it fits the columns you want:
df2['correct_count'] = df2.loc[df2['is_correct']=='t','to_sum_up']
df2['incorrect_count'] = df2.loc[df2['is_correct']=='f','to_sum_up']
Then in order to have a nice dataframe as output:
df2.loc[df2['correct_count'].isnull(),'correct_count'] = 0
df2.loc[df2['incorrect_count'].isnull(),'incorrect_count'] = 0
df2 = df2.groupby('question_id',as_index = False).max()
df2 = df2.drop(columns=['to_sum_up','is_correct'])
question_id correct_count incorrect_count
0 1 2 2
1 2 2 0
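A shorter route to the same table, sketched here as an alternative rather than as the approach above: group on both columns, count with size(), and unstack the is_correct level into columns. The renaming to correct_count / incorrect_count simply follows the question's desired output.

```python
import pandas as pd

df = pd.DataFrame({'is_correct': ['t', 't', 'f', 'f', 't', 't'],
                   'question_id': [1, 1, 1, 1, 2, 2]})

# Count rows per (question_id, is_correct) pair, then spread is_correct
# into columns; pairs that never occur are filled with 0.
counts = (df.groupby(['question_id', 'is_correct']).size()
            .unstack(fill_value=0)
            .rename(columns={'t': 'correct_count', 'f': 'incorrect_count'})
            .reset_index())
counts.columns.name = None  # drop the leftover 'is_correct' axis label
print(counts)
```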
I have a key-value dataframe:
pd.DataFrame(columns=['X','Y','val'],data= [['a','z',5],['b','g',3],['b','y',6],['e','r',9]])
> X Y val
0 a z 5
1 b g 3
2 b y 6
3 e r 9
Which I'd like to convert into a denser dataframe:
X z g y r
0 a 5 0 0 0
1 b 0 3 6 0
2 e 0 0 0 9
Before I resort to a pure-python I was wondering if there was a simple way to do this with pandas.
You can use get_dummies:
In [11]: dummies = pd.get_dummies(df['Y'])
In [12]: dummies
Out[12]:
g r y z
0 0 0 0 1
1 1 0 0 0
2 0 0 1 0
3 0 1 0 0
and then multiply by the val column:
In [13]: res = dummies.mul(df['val'], axis=0)
In [14]: res
Out[14]:
g r y z
0 0 0 0 5
1 3 0 0 0
2 0 0 6 0
3 0 9 0 0
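Put together as a script, the two steps above might look like the sketch below. Passing dtype=int to get_dummies and the final groupby on X (to collapse the two 'b' rows into one, as in the question's desired output) are my additions, not part of the transcript.

```python
import pandas as pd

df = pd.DataFrame(columns=['X', 'Y', 'val'],
                  data=[['a', 'z', 5], ['b', 'g', 3], ['b', 'y', 6], ['e', 'r', 9]])

# Integer indicators, so the multiplication below is plain integer arithmetic.
dummies = pd.get_dummies(df['Y'], dtype=int)

# Scale each row's indicator by that row's val.
res = dummies.mul(df['val'], axis=0)

# Collapse rows that share the same X into a single dense row.
dense = res.groupby(df['X']).sum()
print(dense)
```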
To fix the index, you can add X to the index by first applying set_index:
In [21]: df1 = df.set_index('X', append=True)
In [22]: df1
Out[22]:
Y val
X
0 a z 5
1 b g 3
2 b y 6
3 e r 9
In [23]: dummies = pd.get_dummies(df1['Y'])
In [24]: dummies.mul(df1['val'], axis=0)
Out[24]:
g r y z
X
0 a 0 0 0 5
1 b 3 0 0 0
2 b 0 0 6 0
3 e 0 9 0 0
If you wanted to do this pivot (you can also use pivot_table):
In [31]: df.pivot(index='X', columns='Y').fillna(0)
Out[31]:
val
Y g r y z
X
a 0 0 0 5
b 3 0 6 0
e 0 9 0 0
Perhaps you want to reset_index, to make X a column (I'm not sure whether that makes sense):
In [32]: df.pivot(index='X', columns='Y').fillna(0).reset_index()
Out[32]:
X val
Y g r y z
0 a 0 0 0 5
1 b 3 0 6 0
2 e 0 9 0 0
For completeness, the pivot_table:
In [33]: df.pivot_table('val', 'X', 'Y', fill_value=0)
Out[33]:
Y g r y z
X
a 0 0 0 5
b 3 0 6 0
e 0 9 0 0
In [34]: df.pivot_table('val', 'X', 'Y', fill_value=0).reset_index()
Out[34]:
Y X g r y z
0 a 0 0 0 5
1 b 3 0 6 0
2 e 0 9 0 0
Note: the columns axis is still named Y after resetting the index; I'm not sure whether this makes sense (it is easy to rectify via res.columns.name = None).
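As a script, the pivot_table route including the clean-up of the axis name could be sketched like this. Using keyword arguments and aggfunc='sum' (rather than the default mean) is my choice, made to keep integer values and to avoid depending on positional argument order.

```python
import pandas as pd

df = pd.DataFrame(columns=['X', 'Y', 'val'],
                  data=[['a', 'z', 5], ['b', 'g', 3], ['b', 'y', 6], ['e', 'r', 9]])

# One row per X, one column per Y; combinations that never occur become 0.
res = df.pivot_table(values='val', index='X', columns='Y',
                     aggfunc='sum', fill_value=0).reset_index()

# Remove the leftover 'Y' label on the columns axis.
res.columns.name = None
print(res)
```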
If you want something that feels more direct, something akin to DataFrame.lookup but for np.put might make sense.
def lookup_index(self, row_labels, col_labels):
    values = self.values
    ridx = self.index.get_indexer(row_labels)
    cidx = self.columns.get_indexer(col_labels)
    if (ridx == -1).any():
        raise ValueError('One or more row labels was not found')
    if (cidx == -1).any():
        raise ValueError('One or more column labels was not found')
    flat_index = ridx * len(self.columns) + cidx
    return flat_index
flat_index = lookup_index(df, vals.X, vals.Y)
np.put(df.values, flat_index, vals.val.values)
This assumes that df has the appropriate columns and index to hold the X/Y values. Here's an ipython notebook http://nbviewer.ipython.org/6454120
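A self-contained sketch of the same flat-index idea, under some assumptions of mine: the target grid is built as a plain NumPy array first (rather than writing into df.values, which may not be writable in every pandas version), and the row/column labels are taken from the unique X and Y values in order of appearance.

```python
import numpy as np
import pandas as pd

vals = pd.DataFrame(columns=['X', 'Y', 'val'],
                    data=[['a', 'z', 5], ['b', 'g', 3], ['b', 'y', 6], ['e', 'r', 9]])

# Row and column labels of the dense result, in order of first appearance.
rows = pd.Index(vals['X'].unique())
cols = pd.Index(vals['Y'].unique())

# Zero-filled target grid that np.put writes into.
grid = np.zeros((len(rows), len(cols)), dtype=int)

# Position of each (X, Y) pair in the flattened (row-major) grid.
flat_index = rows.get_indexer(vals['X']) * len(cols) + cols.get_indexer(vals['Y'])
np.put(grid, flat_index, vals['val'].to_numpy())

dense = pd.DataFrame(grid, index=rows, columns=cols)
print(dense)
```

If the same (X, Y) pair appeared more than once, the last value written would win, so this only suits data where each pair is unique.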