`groupby` - `qcut` but with condition - pandas

I have a dataframe as follow:
key1 key2 val
0 a x 8
1 a x 6
2 a x 7
3 a x 4
4 a x 9
5 a x 1
6 a x 2
7 a x 3
8 a x 10
9 a x 5
10 a y 4
11 a y 9
12 a y 1
13 a y 2
14 b x 17
15 b x 15
16 b x 18
17 b x 19
18 b x 12
19 b x 20
20 b x 14
21 b x 13
22 b x 16
23 b x 11
24 b y 2
25 b y 3
26 b y 10
27 b y 5
28 b y 4
29 b y 24
30 b y 22
​
What I need to do is:
Access each group by key1
In each group of key1, I need to do qcut on observations that key2 == x
For those observation that is out of bin range, assign them to lowest and highest bins
According to the dataframe above, first group key1 = a is from indx=0-13. However, only the indx from 0-9 are used to create bins(threshold). The bins(threshold) is then applied from indx=0-13
Then for second group key1 = b is from indx=14-30. Only indx from 14-23 are used to creates bins(threshold). The bins(threshold) is then applied from indx=14-30.
However, from indx=24-28 and indx=29-30, they are out of bins range. Then for indx=24-28 assign to smallest bin range, indx=29-30 assign to the largest bin range.
The output looks like this:
key1 key2 val labels
0 a x 8 1
1 a x 6 1
2 a x 7 1
3 a x 4 0
4 a x 9 1
5 a x 1 0
6 a x 2 0
7 a x 3 0
8 a x 10 1
9 a x 5 0
10 a y 4 0
11 a y 9 1
12 a y 1 0
13 a y 2 0
14 b x 17 1
15 b x 15 0
16 b x 18 1
17 b x 19 1
18 b x 12 0
19 b x 20 1
20 b x 14 0
21 b x 13 0
22 b x 16 1
23 b x 11 0
24 b y 2 0
25 b y 3 0
26 b y 10 0
27 b y 5 0
28 b y 4 0
29 b y 24 1
30 b y 22 1
My solution: I creates a dict to contain bins as: (for simplicity, take qcut=2)
dict_bins = {}
key_unique = data['key1'].unique()
for k in key_unique:
sub = data[(data['key1'] == k) & (data['key2'] == 'x')].copy()
dict_bins[k] = pd.qcut(sub['val'], 2, labels=False, retbins=True )[1]
Then, I intend to use groupby with apply, but get stuck on accessing dict_bins
data['sort_key1'] = data.groupby(['key1'])['val'].apply(lambda g: --- stuck---)
Any other solution, or modification to my solution is appreciated.
Thank you

A first approach is to create a custom function:
def discretize(df):
bins = pd.qcut(df.loc[df['key2'] == 'x', 'val'], 2, labels=False, retbins=True)[1]
bins = [-np.inf] + bins[1:-1].tolist() + [np.inf]
return pd.cut(df['val'], bins, labels=False)
df['label'] = df.groupby('key1').apply(discretize).droplevel(0)
Output:
>>> df
key1 key2 val label
0 a x 8 1
1 a x 6 1
2 a x 7 1
3 a x 4 0
4 a x 9 1
5 a x 1 0
6 a x 2 0
7 a x 3 0
8 a x 10 1
9 a x 5 0
10 a y 4 0
11 a y 9 1
12 a y 1 0
13 a y 2 0
14 b x 17 1
15 b x 15 0
16 b x 18 1
17 b x 19 1
18 b x 12 0
19 b x 20 1
20 b x 14 0
21 b x 13 0
22 b x 16 1
23 b x 11 0
24 b y 2 0
25 b y 3 0
26 b y 10 0
27 b y 5 0
28 b y 4 0
29 b y 24 1
30 b y 22 1
You need to drop the first level of index to align indexes:
>>> df.groupby('key1').apply(discretize)
key1 # <- you have to drop this index level
a 0 1
1 1
2 1
3 0
4 1
5 0
6 0
7 0
8 1
9 0
10 0
11 1
12 0
13 0
b 14 1
15 0
16 1
17 1
18 0
19 1
20 0
21 0
22 1
23 0
24 0
25 0
26 0
27 0
28 0
29 1
30 1
Name: val, dtype: int64

Related

iteration calculation based on another dataframe

How to do iteration calculation as shown in df2 as desired output ?
any reference links for this > many thanks for helping
df1
a b c
0 1 0 5
1 9 9 2
2 2 2 8
3 6 3 0
4 6 1 7
df2 :
a b c
0 1 0 5 >> values from df1
1 19 18 9 >> values from (df1.iloc[1] * 2) + df2.iloc[0] *1)
2 23 22 25 >> values from (df1.iloc[2] * 2) + df2.iloc[1] *1)
3 35 28 25 >> values from (df1.iloc[3] * 2) + df2.iloc[2] *1)
4 47 30 39 >> values from (df1.iloc[4] * 2) + df2.iloc[3] *1)
IIUC, you can try:
df2 = df1.mul(2).cumsum().sub(df1.iloc[0])
Output:
a b c
0 1 0 5
1 19 18 9
2 23 22 25
3 35 28 25
4 47 30 39
more complex operation
If you want x[n] = x[n]*2 + x[n-1]*2, you need to iterate:
def process(s):
out = [s[0]]
for x in s[1:]:
out.append(x*2+out[-1]*3)
return out
df1.apply(process)
Output:
a b c
0 1 0 5
1 21 18 19
2 67 58 73
3 213 180 219
4 651 542 671

Pandas Groupby and divide the dataset into subgroups based on user input and label numbers to each subgroup

Here is my data:
ID Mnth Amt Flg
B 1 10 0
B 2 12 0
B 3 14 0
B 4 41 0
B 5 134 0
B 6 14 0
B 7 134 0
B 8 134 0
B 9 12 0
B 10 41 0
B 11 4 0
B 12 14 0
B 12 14 0
A 1 34 0
A 2 22 0
A 3 56 0
A 4 129 0
A 5 40 0
A 6 20 0
A 7 58 0
A 8 123 0
If I give 3 as input, my output should be:
ID Mnth Amt Flg Level_Flag
B 1 10 0 0
B 2 12 0 1
B 3 14 0 1
B 4 41 0 1
B 5 134 0 2
B 6 14 0 2
B 7 134 0 2
B 8 134 0 3
B 9 12 0 3
B 10 41 0 3
B 11 4 0 4
B 12 14 0 4
B 12 14 0 4
A 1 34 0 0
A 2 22 0 0
A 3 56 0 1
A 4 129 0 1
A 5 40 0 1
A 6 20 0 2
A 7 58 0 2
A 8 123 0 2
So basically I want to divide the data into subgroups with 3 rows in each subgroup from bottom up and label those subgroups as mentioned in level_flag column. I have IDs like A,C and so on. So I want to do this for each group of ID.Thanks in Advance.
Edit :- I want the same thing to be done after grouping it by ID
First we decide the unique numbers nums by dividing the length of your df by n. Then we repeat those numbers n times. Finally we reverse the array and chop it of at the length of df and reverse it one more time.
def create_flags(d, n):
nums = np.ceil(len(d) / n)
level_flag = np.repeat(np.arange(nums), n)[::-1][:len(d)][::-1]
return level_flag
df['Level_Flag'] = df.groupby('ID')['ID'].transform(lambda x: create_flags(x, 3))
ID Mnth Amt Flg Level_Flag
0 B 1 10 0 0.0
1 B 2 12 0 1.0
2 B 3 14 0 1.0
3 B 4 41 0 1.0
4 B 5 134 0 2.0
5 B 6 14 0 2.0
6 B 7 134 0 2.0
7 B 8 134 0 3.0
8 B 9 12 0 3.0
9 B 10 41 0 3.0
10 B 11 4 0 4.0
11 B 12 14 0 4.0
12 B 12 14 0 4.0
To remove the incomplete rows, use GroupBy.transform:
m = df.groupby(['ID', 'Level_Flag'])['Level_Flag'].transform('count').ge(3)
df = df[m]
ID Mnth Amt Flg Level_Flag
1 B 2 12 0 1.0
2 B 3 14 0 1.0
3 B 4 41 0 1.0
4 B 5 134 0 2.0
5 B 6 14 0 2.0
6 B 7 134 0 2.0
7 B 8 134 0 3.0
8 B 9 12 0 3.0
9 B 10 41 0 3.0
10 B 11 4 0 4.0
11 B 12 14 0 4.0
12 B 12 14 0 4.0

Select dataframe columns based on column values in Pandas

My dataframe looks like:
A B C D .... Y Z
0 5 12 14 4 2
3 6 15 10 1 30
2 10 20 12 5 15
I want to create another dataframe that only contains the columns with an average value greater than 10:
C D .... Z
12 14 2
15 10 30
20 12 15
Use:
df = df.loc[:, df.mean() > 10]
print (df)
C D Z
0 12 14 2
1 15 10 30
2 20 12 15
Detail:
print (df.mean())
A 1.666667
B 7.000000
C 15.666667
D 12.000000
Y 3.333333
Z 15.666667
dtype: float64
print (df.mean() > 10)
A False
B False
C True
D True
Y False
Z True
dtype: bool
Alternative:
print (df[df.columns[df.mean() > 10]])
C D Z
0 12 14 2
1 15 10 30
2 20 12 15
Detail:
print (df.columns[df.mean() > 10])
Index(['C', 'D', 'Z'], dtype='object')

Column values of multilevel indexed DataFrame are not properly updated

import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(30).reshape(6,5), index=[list('aaabbb'), list('XYZXYZ')])
print(df)
df.loc[pd.IndexSlice['a'], 3] /= 10
print(df)
From the above code I expected below table:
0 1 2 3 4
a X 0 1 2 0.3 4
Y 5 6 7 0.8 9
Z 10 11 12 0.13 14
b X 15 16 17 18 19
Y 20 21 22 23 24
Z 25 26 27 28 29
But the actual result is as below table:
0 1 2 3 4
a X 0 1 2 NaN 4
Y 5 6 7 NaN 9
Z 10 11 12 NaN 14
b X 15 16 17 18.0 19
Y 20 21 22 23.0 24
Z 25 26 27 28.0 29
What went wrong in the code?
Need specify second level by : for select all values:
df.loc[pd.IndexSlice['a', :], 3] /= 10
print(df)
0 1 2 3 4
a X 0 1 2 0.3 4
Y 5 6 7 0.8 9
Z 10 11 12 1.3 14
b X 15 16 17 18.0 19
Y 20 21 22 23.0 24
Z 25 26 27 28.0 29
Solution with slice:
df.loc[(slice('a'), slice(None)), 3] /= 10
print(df)
0 1 2 3 4
a X 0 1 2 0.3 4
Y 5 6 7 0.8 9
Z 10 11 12 1.3 14
b X 15 16 17 18.0 19
Y 20 21 22 23.0 24
Z 25 26 27 28.0 29

Element wise multiplication of each row

I have two DataFrame objects which I want to apply an element-wise multiplication on each row onto:
df_prob_wc.shape # (3505, 13)
df_prob_c.shape # (13, 1)
I thought I could do it with DataFrame.apply()
df_prob_wc.apply(lambda x: x.multiply(df_prob_c), axis=1)
which gives me:
TypeError: ("'int' object is not iterable", 'occurred at index $')
or with
df_prob_wc.apply(lambda x: x * df_prob_c, axis=1)
which gives me:
TypeError: 'int' object is not iterable
But it's not working.
However, I can do this:
df_prob_wc.apply(lambda x: x * np.asarray([1,2,3,4,5,6,7,8,9,10,11,12,13]), axis=1)
What am I doing wrong here?
It seems you need multiple by Series created with df_prob_c by iloc:
df_prob_wc = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df_prob_wc)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
df_prob_c = pd.DataFrame([[4,5,6,1,2,3]])
#for align data same columns in both df
df_prob_c.index = df_prob_wc.columns
print (df_prob_c)
0
A 4
B 5
C 6
D 1
E 2
F 3
print (df_prob_wc.shape)
(3, 6)
print (df_prob_c.shape)
(6, 1)
print (df_prob_c.iloc[:,0])
A 4
B 5
C 6
D 1
E 2
F 3
Name: 0, dtype: int64
print (df_prob_wc.mul(df_prob_c.iloc[:,0], axis=1))
A B C D E F
0 4 20 42 1 10 21
1 8 25 48 3 6 12
2 12 30 54 5 12 9
Another solution is multiple by numpy array, only need [:,0] for select:
print (df_prob_wc.mul(df_prob_c.values[:,0], axis=1))
A B C D E F
0 4 20 42 1 10 21
1 8 25 48 3 6 12
2 12 30 54 5 12 9
And another solution with DataFrame.squeeze:
print (df_prob_wc.mul(df_prob_c.squeeze(), axis=1))
A B C D E F
0 4 20 42 1 10 21
1 8 25 48 3 6 12
2 12 30 54 5 12 9