Pandas groupby with MultiIndex columns and different levels - pandas

I want to do a groupby on a MultiIndex dataframe, counting the occurrences for each column for every user2 in df:
>>> df
  user1 user2 count
                  0     1  2
                  a  x  d  a
0     2     6     0  1  0  0
1     4     6     0  0  0  3
2    21    76     2  0  1  0
3     5    18     0  0  0  0
Note that user1 and user2 are at the same level as count (side effect of merging).
Desired output:
  user2 count
            0     1  2
            a  x  d  a
0     6     0  1  0  1
1    76     1  0  0  0
3    18     0  0  0  0
I've tried
>>> df.groupby(['user2','count'])
but I get
ValueError: Grouper for 'count' not 1-dimensional
Code to generate the data (the merged frame is named final_df):
import pandas as pd
df = pd.DataFrame({'user1':[2,4,21,21],'user2':[6,6,76,76],'param1':[0,2,0,1],'param2':['x','a','a','d'],'count':[1,3,2,1]}, columns=['user1','user2','param1','param2','count'])
df = df.set_index(['user1','user2','param1','param2'])
df = df.unstack([2,3]).sort_index(axis=1).reset_index()
df2 = pd.DataFrame({'user1':[2,5,21],'user2':[6,18,76]})
df2.columns = pd.MultiIndex.from_product([df2.columns, [''],['']])
final_df = df2.merge(df, on=['user1','user2'], how='outer').fillna(0)

IIUC, you want:
final_df.where(final_df>0).groupby('user2').count().drop('user1', axis=1).reset_index()
Output:
  user2 count
            0     1  2
            a  x  d  a
0     6     0  1  0  1
1    18     0  0  0  0
2    76     1  0  1  0
To avoid dropping columns, select only 'count' and change the aggregation to sum:
final_df.where(final_df>0).groupby('user2').sum()[['count']].reset_index()
Output:
  user2 count
            0         1    2
            a    x    d    a
0     6   0.0  1.0  0.0  3.0
1    18   0.0  0.0  0.0  0.0
2    76   2.0  0.0  1.0  0.0
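To also avoid dropping groups where user2 is zero (where() turns those keys into NaN, and groupby skips NaN keys), mask only the count columns and group by the original user2 values: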
final_df[['count']].where(final_df[['count']]>0)\
.groupby(final_df.user2).sum().reset_index()
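The ValueError in the question arises because 'count' selects a whole block of columns rather than a single one, so it cannot act as a grouping key. The sketch below illustrates this on a small frame with the same three-level column layout. The frame is typed in by hand instead of being produced by the merge, which is an assumption made to keep the example short, and the variable names are illustrative.

```python
import pandas as pd

# Hand-built stand-in for final_df: three column levels, with user1/user2
# padded by empty strings on the lower levels (assumed layout).
columns = pd.MultiIndex.from_tuples([
    ('user1', '', ''), ('user2', '', ''),
    ('count', 0, 'a'), ('count', 0, 'x'), ('count', 1, 'd'), ('count', 2, 'a'),
])
final_df = pd.DataFrame(
    [[2, 6, 0, 1, 0, 0],
     [4, 6, 0, 0, 0, 3],
     [21, 76, 2, 0, 1, 0],
     [5, 18, 0, 0, 0, 0]],
    columns=columns,
)

# 'count' selects a 2-D block of four columns, so it is not a valid grouper.
count_block = final_df['count']
print(count_block.ndim)  # 2

# Group the "is positive" mask by the 1-D user2 column; summing booleans
# counts the positive cells in each group.
user2 = final_df[('user2', '', '')].rename('user2')
occurrences = count_block.gt(0).groupby(user2).sum()
print(occurrences)
```

Because the mask is built from the count block only, the grouping key is never touched, so a user2 value of zero would still form its own group.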

Related

How to create and populate pandas columns based on cell values

I have created a dataframe called df as follows:
import pandas as pd
d = {'feature1': [1, 22,45,78,78], 'feature2': [33, 2,2,65,65], 'feature3': [100, 2,359,87,2],}
df = pd.DataFrame(data=d)
print(df)
The dataframe looks like this:
   feature1  feature2  feature3
0         1        33       100
1        22         2         2
2        45         2       359
3        78        65        87
4        78        65         2
I want to create two new columns called Freq_1 and Freq_2 that count, for each record, how many times the number 1 and number 2 appear respectively. So, I'd like the resulting dataframe to look like this:
So, let's take a look at the column called Freq_1:
for the first record, it's equal to 1 because the number 1 appears only once across the whole first record;
for the other records, it's equal to 0 because the number 1 never appears.
Let's take a look now at the column called Freq_2:
for the first record, Freq_2 is equal to 0 because number 2 doesn't appear;
for second record, Freq_2 is equal to 2 because the number 2 appears twice;
and so on ...
How do I create the columns Freq_1 and Freq_2 in pandas?
Try this:
freq = {
i: df.eq(i).sum(axis=1) for i in range(10)
}
pd.concat([df, pd.DataFrame(freq).add_prefix("Freq_")], axis=1)
Result:
feature1 feature2 feature3 Freq_0 Freq_1 Freq_2 Freq_3 Freq_4 Freq_5 Freq_6 Freq_7 Freq_8 Freq_9
1 33 100 0 1 0 0 0 0 0 0 0 0
22 2 2 0 0 2 0 0 0 0 0 0 0
45 2 359 0 0 1 0 0 0 0 0 0 0
78 65 87 0 0 0 0 0 0 0 0 0 0
78 65 2 0 0 1 0 0 0 0 0 0 0
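If only the two requested columns are needed, the same eq/sum idea can be limited to the values 1 and 2. This is a minimal sketch; the names `features` and `value` are illustrative, and the feature columns are captured up front so that the new Freq_ columns are not counted themselves.

```python
import pandas as pd

df = pd.DataFrame({
    'feature1': [1, 22, 45, 78, 78],
    'feature2': [33, 2, 2, 65, 65],
    'feature3': [100, 2, 359, 87, 2],
})

# Remember the original columns before adding any Freq_ column.
features = df.columns.tolist()

for value in (1, 2):
    # eq() marks cells equal to `value`; summing across axis=1 counts them per row.
    df[f'Freq_{value}'] = df[features].eq(value).sum(axis=1)

print(df)
```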
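String pattern matching can be used once the columns are cast to strings. Note that str.count counts substring occurrences inside each cell (for example, '100' contains one '1') and max(axis=1) keeps the largest per-cell count rather than the number of matching cells, so this matches the desired counts on the sample data but is not an exact-match count in general.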
d = {'feature1': [1, 22,45,78,78], 'feature2': [33, 2,2,65,65], 'feature3': [100, 2,359,87,2],}
df = pd.DataFrame(data=d)
df = df.stack().astype(str).unstack()
Now we can iterate for each pattern that we are looking for:
useful_columns = df.columns
for pattern in ['1', '2']:
    df[f'freq_{pattern}'] = df[useful_columns].stack().str.count(pattern).unstack().max(axis=1)
Printing the output:
feature1 feature2 feature3 freq_1 freq_2
0 1 33 100 1.0 0.0
1 22 2 2 0.0 2.0
2 45 2 359 0.0 1.0
3 78 65 87 0.0 0.0
4 78 65 2 0.0 1.0
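One thing to watch with string matching is that str.count counts substrings, so the cell '22' contributes two hits for the pattern '2'. The sketch below contrasts a summed substring count with an exact comparison on the same data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'feature1': [1, 22, 45, 78, 78],
    'feature2': [33, 2, 2, 65, 65],
    'feature3': [100, 2, 359, 87, 2],
}).astype(str)

# Substring counting: every '2' character inside a cell is a hit.
substring_hits = df.stack().str.count('2').unstack().sum(axis=1)

# Exact matching: a cell counts only if it is the string '2' itself.
exact_hits = df.eq('2').sum(axis=1)

print(substring_hits.tolist())  # [0, 4, 1, 0, 1]
print(exact_hits.tolist())      # [0, 2, 1, 0, 1]
```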
We can do
s = df.where(df.isin([1,2])).stack()
out = df.join(pd.crosstab(s.index.get_level_values(0),s).add_prefix('Freq_')).fillna(0)
Out[299]:
feature1 feature2 feature3 Freq_1.0 Freq_2.0
0 1 33 100 1.0 0.0
1 22 2 2 0.0 2.0
2 45 2 359 0.0 1.0
3 78 65 87 0.0 0.0
4 78 65 2 0.0 1.0

How to compare 2 dataframes columns and add a value to a new dataframe based on the result

I have 2 dataframes of the same length, and I'd like to compare a specific column between them row by row. Whichever dataframe has the larger value in its first column should supply the value from its second column to a new dataframe.
See example. The first dataframe:
0 class
0 1.9 0
1 9.8 0
2 4.5 0
3 8.1 0
4 1.9 0
The second dataframe:
0 class
0 1.4 1
1 7.8 1
2 8.5 1
3 9.1 1
4 3.9 1
The new dataframe should look like:
class
0 0
1 0
2 1
3 1
4 1
Use numpy.where with DataFrame constructor:
df = pd.DataFrame({'class': np.where(df1[0] > df2[0], df1['class'], df2['class'])})
Or DataFrame.where:
df = df1[['class']].where(df1[0] > df2[0], df2[['class']])
print (df)
class
0 0
1 0
2 1
3 1
4 1
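For reference, here is a self-contained version of the numpy.where approach using the sample data from the question. It assumes the first column is labelled with the integer 0, as in the snippets above; if the frames were read from a file, the label may be the string "0" instead.

```python
import numpy as np
import pandas as pd

# Sample frames from the question; the integer column label 0 is an assumption.
df1 = pd.DataFrame({0: [1.9, 9.8, 4.5, 8.1, 1.9], 'class': [0, 0, 0, 0, 0]})
df2 = pd.DataFrame({0: [1.4, 7.8, 8.5, 9.1, 3.9], 'class': [1, 1, 1, 1, 1]})

# Take df1's class where df1 has the larger value, otherwise df2's class.
result = pd.DataFrame({
    'class': np.where(df1[0] > df2[0], df1['class'], df2['class'])
})
print(result)
```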
EDIT:
If there is another condition (for example, equal values), use numpy.select and, if necessary, numpy.isclose:
print (df2)
0 class
0 1.4 1
1 7.8 1
2 8.5 1
3 9.1 1
4 1.9 1
masks = [df1[0] == df2[0], df1[0] > df2[0]]
# if the floats need to be compared within a tolerance
#masks = [np.isclose(df1[0], df2[0]), df1[0] > df2[0]]
vals = ['not_determined', df1['class']]
df = pd.DataFrame({'class': np.select(masks, vals, df2['class'])})
print (df)
class
0 0
1 0
2 1
3 1
4 not_determined
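A possible variation on the tie handling, sketched here with numpy.isclose and Series.mask rather than numpy.select, so that strings and integers are never combined inside a single NumPy call. The data is the modified df2 shown above; the names `tie` and `winner` are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({0: [1.9, 9.8, 4.5, 8.1, 1.9], 'class': [0, 0, 0, 0, 0]})
df2 = pd.DataFrame({0: [1.4, 7.8, 8.5, 9.1, 1.9], 'class': [1, 1, 1, 1, 1]})

# Rows where both values are equal within floating-point tolerance.
tie = np.isclose(df1[0].to_numpy(), df2[0].to_numpy())

# Class of the frame with the larger value, ignoring ties for now.
winner = pd.Series(
    np.where(df1[0] > df2[0], df1['class'], df2['class']),
    index=df1.index,
)

# Switch to object dtype so the column can hold both integers and a label,
# then overwrite the tied rows.
result = winner.astype(object).mask(tie, 'not_determined').to_frame('class')
print(result)
```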
Or:
masks = [df1[0] == df2[0], df1[0] > df2[0]]
vals = ['not_determined', 0]
df = pd.DataFrame({'class': np.select(masks, vals, 1)})
print (df)
class
0 0
1 0
2 1
3 1
4 not_determined
An out-of-the-box solution:
df = np.sign(df1[0].sub(df2[0])).map({1:0, -1:1, 0:'not_determined'}).to_frame('class')
print (df)
class
0 0
1 0
2 1
3 1
4 not_determined
Since class is always 0 in the first dataframe and 1 in the second, you could try:
df1[0].lt(df2[0]).astype(int)
For generic solutions, check jezrael's answer.
Try this one:
>>> import numpy as np
>>> import pandas as pd
>>> df_1
0 class
0 1.9 0
1 9.8 0
2 4.5 0
3 8.1 0
4 1.9 0
>>> df_2
0 class
0 1.4 1
1 7.8 1
2 8.5 1
3 9.1 1
4 3.9 1
>>> df_3=pd.DataFrame()
>>> df_3["class"]=np.where(df_1["0"]>df_2["0"], df_1["class"], df_2["class"])
>>> df_3
class
0 0
1 0
2 1
3 1
4 1

pandas vectorize one valid signal in 5 rows

I want to find the valid signals in the dataframe. A signal is valid if there is no valid signal in the 5 rows preceding it.
The dataframe is like:
entry
0 0
1 1
2 0
3 0
4 1
5 0
6 0
7 0
8 1
9 0
10 0
11 0
12 0
13 0
14 0
The entry signal on row 4 is not valid because there is a signal on row 1. Every valid signal negates any signal in the following 5 rows.
I implemented this with an apply function that takes a parameter recording the row counter since the last signal.
The code is as follows:
import pandas as pd

def testfun(row, orderinfo):
    # Reset the counter once the blocking window has passed.
    if orderinfo['countrows'] > orderinfo['maxrows']:
        orderinfo['countrows'] = 0
    # Inside the blocking window: advance the counter and suppress any signal.
    if orderinfo['countrows'] > 0:
        orderinfo['countrows'] += 1
        row['entry'] = 0
    # A signal outside the window is valid and starts a new window.
    if row['entry'] == 1 and orderinfo['countrows'] == 0:
        orderinfo['countrows'] += 1
    return row

if __name__ == '__main__':
    df = pd.DataFrame({'entry':[0,1,0,0,1,0,0,0,1,0,0,0,0,0,0]})
    orderinfo = dict(countrows=0, maxrows=5)
    df = df.apply(lambda row: testfun(row, orderinfo), axis=1)
    print(df)
output is:
entry
0 0
1 1
2 0
3 0
4 0
5 0
6 0
7 0
8 1
9 0
10 0
11 0
12 0
13 0
14 0
But I am wondering if there is any vectorized way to do this? Because apply is not very efficient.
IIUC,
Use rolling with min_periods=1, require the window sum to be at most 1, and combine that with the entry column:
(df.entry.rolling(4, min_periods=1).sum().le(1) & df.entry).astype(int)
Out[595]:
0 0
1 1
2 0
3 0
4 0
5 0
6 0
7 0
8 1
9 0
10 0
11 0
12 0
13 0
14 0
Name: entry, dtype: int32
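The rolling expression reproduces the sample output, but it sums the raw signals over the current row and the three before it, so it is not guaranteed to agree with the apply version when suppressed signals sit close together. As a cross-check, the sequential rule can be sketched as a plain loop over a NumPy array. The function name `keep_valid_signals` and the `cooldown` parameter are hypothetical, and the rule assumed here is the one implemented by the apply code: a valid signal blocks the next 5 rows.

```python
import numpy as np
import pandas as pd

def keep_valid_signals(entry, cooldown=5):
    """Keep a signal only if no valid signal occurred in the previous `cooldown` rows."""
    values = entry.to_numpy()
    out = np.zeros_like(values)
    next_allowed = 0  # first position at which a signal may be valid again
    for i, value in enumerate(values):
        if value == 1 and i >= next_allowed:
            out[i] = 1
            next_allowed = i + cooldown + 1
    return pd.Series(out, index=entry.index, name=entry.name)

sample = pd.Series([0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0], name='entry')
valid = keep_valid_signals(sample)
print(valid.tolist())

# A case where the two approaches differ: the signal at position 7 is outside
# the 5-row block started at position 1, but the rolling window still sees
# the suppressed signal at position 4.
close = pd.Series([0, 1, 0, 0, 1, 0, 0, 1, 0], name='entry')
sequential = keep_valid_signals(close)
rolled = (close.rolling(4, min_periods=1).sum().le(1) & close.eq(1)).astype(int)
print(sequential.tolist())
print(rolled.tolist())
```

The loop is not vectorized, but it runs over a plain array rather than calling a Python function on every row through apply, which is typically much cheaper.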

converting some columns in groupby to multilevel in pandas

I have a groupby object that looks like the following dataframe:
df = pd.DataFrame({'user1':[2,4,21,21],'user2':[6,13,76,76],'param1':[0,2,0,1],'param2':['x','a','a','d'],'count':[1,3,2,1]}, columns=['user1','user2','param1','param2','count'])
df = df.set_index(['user1','user2','param1','param2'])
which gives
                           count
user1 user2 param1 param2
2     6     0      x           1
4     13    2      a           3
21    76    0      a           2
            1      d           1
I want to have something like this:
param1       0     1  2
param2       a  x  d  a
user1 user2
2     6      0  1  0  0
4     13     0  0  0  3
21    76     2  0  1  0
Use Series.unstack, then sort the columns with DataFrame.sort_index:
df = df['count'].unstack([2,3], fill_value=0).sort_index(axis=1)
print (df)
param1       0     1  2
param2       a  x  d  a
user1 user2
2     6      0  1  0  0
4     13     0  0  0  3
21    76     2  0  1  0
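The same reshaping can be written with level names instead of positions, which may read more clearly when the index has several levels. This is a minimal sketch on the question's data; the variable name `wide` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'user1': [2, 4, 21, 21],
    'user2': [6, 13, 76, 76],
    'param1': [0, 2, 0, 1],
    'param2': ['x', 'a', 'a', 'd'],
    'count': [1, 3, 2, 1],
})
df = df.set_index(['user1', 'user2', 'param1', 'param2'])

# Move the two param levels from the row index into the columns,
# fill missing combinations with 0, then sort the column labels.
wide = (
    df['count']
    .unstack(['param1', 'param2'], fill_value=0)
    .sort_index(axis=1)
)
print(wide)
```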

create new column based on other columns in pandas dataframe

What is the best way to create a set of new columns based on two other columns? (similar to a crosstab or SQL case statement)
This works but performance is very slow on large dataframes:
for label in labels:
    df[label + '_amt'] = df.apply(lambda row: row['amount'] if row['product'] == label else 0, axis=1)
You can use pivot_table
>>> df
amount product
0 6 b
1 3 c
2 3 a
3 7 a
4 7 a
>>> df.pivot_table(index=df.index, values='amount',
... columns='product', fill_value=0)
product a b c
0 0 6 0
1 0 0 3
2 3 0 0
3 7 0 0
4 7 0 0
or,
>>> for label in df['product'].unique():
... df[label + '_amt'] = (df['product'] == label) * df['amount']
...
>>> df
amount product b_amt c_amt a_amt
0 6 b 6 0 0
1 3 c 0 3 0
2 3 a 0 0 3
3 7 a 0 0 7
4 7 a 0 0 7
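A variation on the second approach, sketched under the assumption that the frame has the 'amount' and 'product' columns shown above: build all the new columns in a dictionary with Series.where and join them in one step, instead of assigning to the frame inside the loop. The names `amt_columns` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'amount': [6, 3, 3, 7, 7],
    'product': ['b', 'c', 'a', 'a', 'a'],
})

# For each product label, keep the amount where the row has that label, else 0.
amt_columns = {
    f"{label}_amt": df['amount'].where(df['product'].eq(label), 0)
    for label in df['product'].unique()
}

# Attach all new columns at once; they share df's index, so join aligns row by row.
result = df.join(pd.DataFrame(amt_columns))
print(result)
```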