Replace a particular value based on condition using groupby in pandas

I have a dataframe as shown below
ID Sector Usage Price
1 A R 20
2 A C 100
3 A R 40
4 A R 1
5 A C 200
6 A C 1
7 A C 1
8 A R 1
1 B R 40
2 B C 200
3 B R 60
4 B R 1
5 B C 400
6 B C 1
7 B C 1
8 B R 1
From the above, I would like to replace Price=1 with the average Price for the same Sector and Usage combination, computed over the rows where Price is not 1.
Expected Output:
ID Sector Usage Price
1 A R 20
2 A C 100
3 A R 40
4 A R 30
5 A C 200
6 A C 150
7 A C 150
8 A R 30
1 B R 40
2 B C 200
3 B R 60
4 B R 50
5 B C 400
6 B C 300
7 B C 300
8 B R 50
For example, in row 4 (Sector=A, Usage=R, Price=1) the price has to be replaced by the average over Sector=A and Usage=R, i.e. (20+40)/2 = 30.
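For reference, the sample frame can be rebuilt from the table above (a minimal sketch; plain integer and string dtypes are assumed):
import pandas as pd

df = pd.DataFrame({
    'ID': list(range(1, 9)) * 2,
    'Sector': ['A'] * 8 + ['B'] * 8,
    'Usage': ['R', 'C', 'R', 'R', 'C', 'C', 'C', 'R'] * 2,
    'Price': [20, 100, 40, 1, 200, 1, 1, 1,
              40, 200, 60, 1, 400, 1, 1, 1],
})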

The idea is to first replace the 1s with missing values using Series.mask, and then use GroupBy.transform to get the per-group means used for the replacement:
import numpy as np

# flag the prices to replace, mask them to NaN, and take per-group means
m = df['Price'] == 1
s = df.assign(Price=df['Price'].mask(m)).groupby(['Sector','Usage'])['Price'].transform('mean')
df['Price'] = np.where(m, s, df['Price']).astype(int)
Or:
# mask the 1s to NaN, compute group means on the masked series, then fill them back
s = df['Price'].mask(df['Price'] == 1)
mean = df.assign(Price=s).groupby(['Sector','Usage'])['Price'].transform('mean')
df['Price'] = s.fillna(mean).astype(int)
print (df)
ID Sector Usage Price
0 1 A R 20
1 2 A C 100
2 3 A R 40
3 4 A R 30
4 5 A C 200
5 6 A C 150
6 7 A C 150
7 8 A R 30
8 1 B R 40
9 2 B C 200
10 3 B R 60
11 4 B R 50
12 5 B C 400
13 6 B C 300
14 7 B C 300
15 8 B R 50

Related

How to create a rolling unique count by group using pandas

I have a dataframe like the following:
group value
1 a
1 a
1 b
1 b
1 b
1 b
1 c
2 d
2 d
2 d
2 d
2 e
I want to create a column showing how many unique values have appeared so far within each group, like below:
group value group_value_id
1 a 1
1 a 1
1 b 2
1 b 2
1 b 2
1 b 2
1 c 3
2 d 1
2 d 1
2 d 1
2 d 1
2 e 2
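For reference, the sample frame can be rebuilt as (a minimal sketch):
import pandas as pd

df = pd.DataFrame({
    'group': [1] * 7 + [2] * 5,
    'value': list('aabbbbcdddde'),
})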
Use a custom lambda function with GroupBy.transform and factorize:
df['group_value_id'] = df.groupby('group')['value'].transform(lambda x: pd.factorize(x)[0]) + 1
print (df)
group value group_value_id
0 1 a 1
1 1 a 1
2 1 b 2
3 1 b 2
4 1 b 2
5 1 b 2
6 1 c 3
7 2 d 1
8 2 d 1
9 2 d 1
10 2 d 1
11 2 e 2
This is used because dense ranking fails on the non-numeric column:
df['group_value_id'] = df.groupby('group')['value'].rank('dense')
print (df)
DataError: No numeric types to aggregate
It can also be solved as:
df['group_val_id'] = (df.groupby('group')['value']
                        .apply(lambda x: x.astype('category').cat.codes + 1))
df
group value group_val_id
0 1 a 1
1 1 a 1
2 1 b 2
3 1 b 2
4 1 b 2
5 1 b 2
6 1 c 3
7 2 d 1
8 2 d 1
9 2 d 1
10 2 d 1
11 2 e 2

pandas explode dataframe by values of cell

I have a dataframe:
df = C1 C2 C3 from_time to_time
a b c 1 3
q t y 4 9
I want to explode it by the values of from_time and to_time, so it will be:
df = C1 C2 C3 time from_time to_time
a b c 1 1 3
a b c 2 1 3
a b c 3 1 3
q t y 4 4 9
q t y 5 4 9
...
What is the best way to do so?
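For reference, the sample frame can be rebuilt as (a minimal sketch):
import pandas as pd

df = pd.DataFrame({
    'C1': ['a', 'q'],
    'C2': ['b', 't'],
    'C3': ['c', 'y'],
    'from_time': [1, 4],
    'to_time': [3, 9],
})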
Use DataFrame.explode with ranges if the DataFrames are small:
# build a per-row range(from_time, to_time) and explode it into separate rows
df.insert(3, 'time', df.apply(lambda x: range(x.from_time, x.to_time + 1), axis=1))
df = df.explode('time')
print (df)
C1 C2 C3 time from_time to_time
0 a b c 1 1 3
0 a b c 2 1 3
0 a b c 3 1 3
1 q t y 4 4 9
1 q t y 5 4 9
1 q t y 6 4 9
1 q t y 7 4 9
1 q t y 8 4 9
1 q t y 9 4 9
For better performance, use Index.repeat with DataFrame.loc, and build the new column with GroupBy.cumcount as a counter per index value added to the from_time values:
# repeat each row (to_time - from_time + 1) times
df = df.loc[df.index.repeat(df.to_time.sub(df.from_time) + 1)]
# per-original-row counter, offset by from_time
df.insert(3, 'time', df.groupby(level=0).cumcount().add(df['from_time']))
print (df)
C1 C2 C3 time from_time to_time
0 a b c 1 1 3
0 a b c 2 1 3
0 a b c 3 1 3
1 q t y 4 4 9
1 q t y 5 4 9
1 q t y 6 4 9
1 q t y 7 4 9
1 q t y 8 4 9
1 q t y 9 4 9

Reorder pandas DataFrame based on a repetitive set of integers in index

I have a pandas dataframe that contains some columns, and I didn't find a way to order the rows as follows:
I need to order the dataframe by the label field I, but in sequential order, cycling through the groups.
Input
I category tags
1 A #25-74
1 B #26-170
0 C #29-106
2 A #18-109
3 B #26-86
2 A #26-108
2 C #30-125
1 B #28-145
0 B #29-93
0 D #21-102
1 F #26-108
2 F #30-125
3 A #28-145
3 D #29-93
0 B #21-102
Needed Order:
I category tags
0 C #29-106
1 B #25-74
2 F #18-109
3 C #26-86
0 B #29-93
1 D #26-170
2 B #26-108
3 B #28-145
0 C #21-102
1 D #28-145
2 A #30-125
3 A #29-93
0 B #21-102
1 A #26-108
2 C #30-125
I have searched for different ways to sort but couldn't find a way to sort using only pandas.
Any help is appreciated!
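For reference, the sample frame can be rebuilt as (a minimal sketch):
import pandas as pd

df = pd.DataFrame({
    'I': [1, 1, 0, 2, 3, 2, 2, 1, 0, 0, 1, 2, 3, 3, 0],
    'category': list('ABCABACBBDFFADB'),
    'tags': ['#25-74', '#26-170', '#29-106', '#18-109', '#26-86',
             '#26-108', '#30-125', '#28-145', '#29-93', '#21-102',
             '#26-108', '#30-125', '#28-145', '#29-93', '#21-102'],
})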
One idea: create a helper column with GroupBy.cumcount and sort with DataFrame.sort_values:
# number the occurrences within each I group, then sort by that counter first
df['a'] = df.groupby('I').cumcount()
df = df.sort_values(['a','I'])
print (df)
I category tags a
2 0 C #29-106 0
0 1 A #25-74 0
3 2 A #18-109 0
4 3 B #26-86 0
8 0 B #29-93 1
1 1 B #26-170 1
5 2 A #26-108 1
12 3 A #28-145 1
9 0 D #21-102 2
7 1 B #28-145 2
6 2 C #30-125 2
13 3 D #29-93 2
14 0 B #21-102 3
10 1 F #26-108 3
11 2 F #30-125 3
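If the helper column is not needed afterwards, it can be removed with df = df.drop(columns='a').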
Or first sort by column I and then change the order with Series.argsort and DataFrame.iloc:
df = df.sort_values('I')
df = df.iloc[df.groupby('I').cumcount().argsort()]
print (df)
I category tags
2 0 C #29-106
0 1 A #25-74
3 2 A #18-109
4 3 B #26-86
8 0 B #29-93
1 1 B #26-170
5 2 A #26-108
12 3 A #28-145
9 0 D #21-102
7 1 B #28-145
6 2 C #30-125
13 3 D #29-93
14 0 B #21-102
10 1 F #26-108
11 2 F #30-125

pandas groupby apply optimizing a loop

For the following data:
index bond stock investor_bond investor_stock
0 1 2 A B
1 1 2 A E
2 1 2 A F
3 1 2 B B
4 1 2 B E
5 1 2 B F
6 1 3 A A
7 1 3 A E
8 1 3 A G
9 1 3 B A
10 1 3 B E
11 1 3 B G
12 2 4 C F
13 2 4 C A
14 2 4 C C
15 2 5 B E
16 2 5 B B
17 2 5 B H
Bond 1 has two investors, A and B. Stock 2 has three investors: B, E, and F. For each (investor_bond, investor_stock) pair, we want to filter the row out if the two investors had ever invested in the same bond/stock.
For example, the pair (B,F) at index=5 should be filtered out because both of them invested in stock 2.
Sample output should be like:
index bond stock investor_bond investor_stock
11 1 3 B G
So far I have tried using two loops.
import pandas as pd

A1 = A1.groupby('bond').apply(lambda x: x[~x.investor_stock.isin(x.bond)]).reset_index(drop=True)
stock_list = A1.groupby(['bond','stock']).apply(lambda x: x.investor_stock.unique()).reset_index()
stock_list = stock_list.rename(columns={0:'s'})
stock_list = stock_list.groupby('bond').apply(lambda x: list(x.s)).reset_index()
stock_list = stock_list.rename(columns={0:'s'})
A1 = pd.merge(A1, stock_list, on='bond', how='left')
A1['in_out'] = False
for j in range(0, len(A1)):
    for i in range(0, len(A1.s[j])):
        A1['in_out'] = A1.in_out | (
            A1.investor_bond.isin(A1.s[j][i]) & A1.investor_stock.isin(A1.s[j][i]))
    print(j)
The loop is running forever due to the data size, and I am seeking a faster way.
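The Python-level double loop can be avoided by vectorizing the co-investment check: build one long (entity, investor) membership table from both the bond and stock sides, self-join it on the entity to enumerate every investor pair that ever shared a bond or a stock, and filter the rows against that set in a single pass. Below is a minimal sketch of that idea, not the asker's code; members, pairs, bad and keep are helper names introduced here, and the sample data is reconstructed first:
import pandas as pd

A1 = pd.DataFrame({
    'bond': [1]*12 + [2]*6,
    'stock': [2]*6 + [3]*6 + [4]*3 + [5]*3,
    'investor_bond': list('AAABBBAAABBBCCCBBB'),
    'investor_stock': list('BEFBEFAEGAEGFACEBH'),
})

# One long membership table: every (entity, investor) relation.
# Prefixes keep bond IDs and stock IDs from colliding.
bonds = A1[['bond', 'investor_bond']].set_axis(['ent', 'inv'], axis=1)
stocks = A1[['stock', 'investor_stock']].set_axis(['ent', 'inv'], axis=1)
bonds['ent'] = 'b' + bonds['ent'].astype(str)
stocks['ent'] = 's' + stocks['ent'].astype(str)
members = pd.concat([bonds, stocks]).drop_duplicates()

# Self-join on the entity: every (inv_x, inv_y) pair that co-invested somewhere
pairs = members.merge(members, on='ent')
bad = set(zip(pairs['inv_x'], pairs['inv_y']))

# Keep only rows whose (investor_bond, investor_stock) pair never co-invested
keep = [(b, s) not in bad
        for b, s in zip(A1['investor_bond'], A1['investor_stock'])]
print(A1[keep])
On this sample the result is the single row at index 11 (bond 1, stock 3, B, G), matching the expected output.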

How to do intersection match between 2 DataFrames in Pandas?

Assume there exist 2 DataFrames A and B like the following
A:
a A
b B
c C
B:
1 2
3 4
How to produce a DataFrame C like
a A 1 2
a A 3 4
b B 1 2
b B 3 4
c C 1 2
c C 3 4
Is there some function in Pandas can do this operation?
First, all values have to be unique in each DataFrame.
I think you need product:
from itertools import product
A = pd.DataFrame({'a':list('abc')})
B = pd.DataFrame({'a':[1,2]})
C = pd.DataFrame(list(product(A['a'], B['a'])))
print (C)
0 1
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2
Pure pandas solutions with MultiIndex.from_product:
mux = pd.MultiIndex.from_product([A['a'], B['a']])
C = pd.DataFrame(mux.values.tolist())
print (C)
0 1
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2
C = mux.to_frame().reset_index(drop=True)
print (C)
0 1
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2
Solution with a cross join via merge, with a helper column filled with the same scalar by assign:
df = pd.merge(A.assign(tmp=1), B.assign(tmp=1), on='tmp').drop('tmp', axis=1)
df.columns = ['a','b']
print (df)
a b
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2
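Note: on pandas 1.2 or newer, the helper column can be skipped entirely with the built-in cross merge:
df = pd.merge(A, B, how='cross')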
EDIT:
A = pd.DataFrame({'a':list('abc'), 'b':list('ABC')})
B = pd.DataFrame({'a':[1,3], 'c':[2,4]})
print (A)
a b
0 a A
1 b B
2 c C
print (B)
a c
0 1 2
1 3 4
C = pd.merge(A.assign(tmp=1), B.assign(tmp=1), on='tmp').drop('tmp', axis=1)
C.columns = list('abcd')
print (C)
a b c d
0 a A 1 2
1 a A 3 4
2 b B 1 2
3 b B 3 4
4 c C 1 2
5 c C 3 4