Create column values based on multiple conditions in other columns - pandas

I have a dataframe, df, with some columns describing units and their output in a given bin, as exemplified below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'bin_dir': pd.cut(np.rad2deg(np.random.vonmises(np.pi, 0.03, 100)) % 360, np.arange(0, 365, 5)),
                   'Unit': np.tile(np.arange(1, 11), 10),
                   'value': np.random.randn(100) * 1000 + 3600})
I now want to create a column col1 that has the value 1 when Unit is 1, 3, or 5 and bin_dir is (350,355], (355,360], (0,5], or (5,10], and the value 2 when Unit is 2, 4, or 9 and bin_dir is in those same bins.
How can one do that? In dplyr I would use mutate with nested ifelse statements.
It would be nice if the solution could be incorporated into a chained command :)
Thanks

You can use nested np.where():
import re
import numpy as np
import pandas as pd

In [50]: bins = re.findall(r'\(.*?\]', '(350, 355], (355, 360], (0, 5], (5, 10]')
    ...: bin_mask = df.bin_dir.isin(bins)
    ...: unit_mask1 = df.Unit.isin([1, 3, 5])
    ...: unit_mask2 = df.Unit.isin([2, 4, 9])
    ...:

In [51]: df.assign(col1=
    ...:     np.where(bin_mask & unit_mask1,
    ...:              1,
    ...:              np.where(bin_mask & unit_mask2, 2, np.nan)
    ...:     )
    ...: )
    ...:
Out[51]:
Unit bin_dir value col1
0 1 (195, 200] 1228.056261 NaN
1 2 (125, 130] 3246.052662 NaN
2 3 (150, 155] 3128.356490 NaN
3 4 (215, 220] 2900.812099 NaN
4 5 (110, 115] 4324.152904 NaN
5 6 (150, 155] 4783.110204 NaN
6 7 (240, 245] 4810.120258 NaN
7 8 (210, 215] 4307.576911 NaN
8 9 (15, 20] 3043.099987 NaN
9 10 (0, 5] 4633.435048 NaN
10 1 (145, 150] 3401.690163 NaN
11 2 (320, 325] 4224.314088 NaN
12 3 (350, 355] 4037.081806 1.0
13 4 (295, 300] 3096.652374 NaN
14 5 (235, 240] 4738.227922 NaN
15 6 (235, 240] 1973.561204 NaN
16 7 (270, 275] 3500.619163 NaN
17 8 (45, 50] 4234.621801 NaN
18 9 (255, 260] 4267.575087 NaN
19 10 (320, 325] 3031.733130 NaN
20 1 (235, 240] 3137.832272 NaN
21 2 (330, 335] 4113.654195 NaN
22 3 (265, 270] 3060.886390 NaN
23 4 (290, 295] 2836.105371 NaN
24 5 (255, 260] 2756.894839 NaN
.. ... ... ... ...
75 6 (325, 330] 2471.775169 NaN
76 7 (70, 75] 4463.964881 NaN
77 8 (110, 115] 5681.124294 NaN
78 9 (135, 140] 2500.650717 NaN
79 10 (225, 230] 2936.364153 NaN
80 1 (280, 285] 1138.591459 NaN
81 2 (250, 255] 3121.142300 NaN
82 3 (150, 155] 2991.257906 NaN
83 4 (160, 165] 3078.156743 NaN
84 5 (130, 135] 4335.076559 NaN
85 6 (85, 90] 4970.471290 NaN
86 7 (335, 340] 3207.906304 NaN
87 8 (350, 355] 3605.474926 NaN
88 9 (125, 130] 4922.963220 NaN
89 10 (60, 65] 3121.061944 NaN
90 1 (105, 110] 3092.191627 NaN
91 2 (0, 5] 3693.602055 2.0
92 3 (195, 200] 2291.508096 NaN
93 4 (40, 45] 4628.409801 NaN
94 5 (215, 220] 3327.321452 NaN
95 6 (110, 115] 4347.471046 NaN
96 7 (110, 115] 4494.707840 NaN
97 8 (110, 115] 3545.460851 NaN
98 9 (55, 60] 2831.042251 NaN
99 10 (30, 35] 3705.225870 NaN
[100 rows x 4 columns]
Of course you can do this without precomputed masks:
In [52]: df.assign(col1=
    ...:     np.where(df.bin_dir.isin(bins) & df.Unit.isin([1, 3, 5]),
    ...:              1,
    ...:              np.where(df.bin_dir.isin(bins) & df.Unit.isin([2, 4, 9]),
    ...:                       2,
    ...:                       np.nan
    ...:              )
    ...:     )
    ...: )
    ...:
Out[52]:
(output identical to Out[51] above)
but it will be a bit slower and it looks somewhat cumbersome.
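If more unit groups are added, nested np.where calls become hard to read. One possible alternative (a sketch, not part of the original answer) is np.select, which takes any number of condition/value pairs and still chains nicely with assign; it reuses bins, bin_mask, unit_mask1 and unit_mask2 defined above:
import numpy as np

# np.select checks the conditions in order and picks the matching value,
# falling back to the default (NaN) when none of them match
df.assign(col1=np.select([bin_mask & unit_mask1, bin_mask & unit_mask2],
                         [1, 2],
                         default=np.nan))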

Using a list comprehension:
bin_filt = ['(350, 355]', '(355, 360]', '(0, 5]', '(5, 10]']
# Creates a column 'col1'
df['col1'] = [1 if df['Unit'][i] in [1, 3, 5] and df['bin_dir'][i] in bin_filt else 0 for i in range(df.shape[0])]
# Creates a column 'col2'
df['col2'] = [2 if df['Unit'][i] in [2, 4, 9] and df['bin_dir'][i] in bin_filt else 0 for i in range(df.shape[0])]
# You can replace the value after 'else' in each list comprehension with whatever default you want instead

Related

Use condition in a dataframe to replace values in another dataframe with nan

I have a dataframe that contains concentration values for a set of samples as follows:
Sample  Ethanol  Acetone  Formaldehyde  Methane
A       20       20       20            20
A       30       23       20            nan
A       20       23       nan           nan
A       nan      20       nan           nan
B       21       46       87            54
B       23       74       nan           54
B       23       67       nan           53
B       23       nan      nan           33
C       23       nan      nan           66
C       22       nan      nan           88
C       22       nan      nan           90
C       22       nan      nan           88
I have a second dataframe that contains the proportion of concentration values that are not missing in the first dataframe:
Sample  Ethanol  Acetone  Formaldehyde  Methane
A       0.75     1        0.5           0.25
B       1        0.75     0.25          1
C       1        0        0             1
I would like to replace values in the first dataframe with nan when the corresponding proportion in the second dataframe is 0.5 or less. Hence, the resulting dataframe would look like the one below. Any help would be great!
Sample  Ethanol  Acetone  Formaldehyde  Methane
A       20       20       nan           nan
A       30       23       nan           nan
A       20       23       nan           nan
A       nan      20       nan           nan
B       21       46       nan           54
B       23       74       nan           54
B       23       67       nan           53
B       23       nan      nan           33
C       23       nan      nan           66
C       22       nan      nan           88
C       22       nan      nan           90
C       22       nan      nan           88
Is this what you are looking for?
>>> df2.set_index('Sample').mask(lambda x: x <= 0.5) \
.mul(df1.set_index('Sample')).reset_index()
Sample Ethanol Acetone Formaldehyde Methane
0 A 15.0 20.00 NaN NaN
1 A 22.5 23.00 NaN NaN
2 A 15.0 23.00 NaN NaN
3 A NaN 20.00 NaN NaN
4 B 21.0 34.50 NaN 54.0
5 B 23.0 55.50 NaN 54.0
6 B 23.0 50.25 NaN 53.0
7 B 23.0 NaN NaN 33.0
8 C 23.0 NaN NaN 66.0
9 C 22.0 NaN NaN 88.0
10 C 22.0 NaN NaN 90.0
11 C 22.0 NaN NaN 88.0
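Note that multiplying by the proportions scales the surviving values (15.0 instead of 20, 34.50 instead of 46, and so on). If you want to keep the original concentrations and only blank out the low-proportion entries, a where-based sketch (my own, assuming the frames are called df1 and df2 as above and df1 has a default RangeIndex) could look like this:
import pandas as pd

# True where the proportion of non-missing values is above 0.5
cond = df2.set_index('Sample').gt(0.5)
# broadcast that per-sample condition onto every row of df1
cond_rows = cond.reindex(df1['Sample']).reset_index(drop=True)

value_cols = ['Ethanol', 'Acetone', 'Formaldehyde', 'Methane']
out = df1.copy()
# keep the original value where the condition holds, NaN elsewhere
out[value_cols] = df1[value_cols].where(cond_rows[value_cols].to_numpy())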

Prevent pandas interpolate from extrapolating

I am trying to interpolate some data containing NaNs. I would like to fill runs of 1-3 consecutive NaNs, but I cannot figure out how to do so with pd.interpolate():
import numpy as np
import pandas as pd

data_chunk = np.array([np.nan, np.nan, np.nan, 4, 5, np.nan, np.nan, np.nan, np.nan, 10, np.nan, np.nan, np.nan, 14])
data_chunk = pd.DataFrame(data_chunk)[0]
print(data_chunk)
print(data_chunk.interpolate(method='linear', limit_direction='both', limit=3, limit_area='inside'))
Original data:
0 NaN
1 NaN
2 NaN
3 4.0
4 5.0
5 NaN
6 NaN
7 NaN
8 NaN
9 10.0
10 NaN
11 NaN
12 NaN
13 14.0
Attempt at interpolating:
0 NaN
1 NaN
2 NaN
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 9.0
9 10.0
10 11.0
11 12.0
12 13.0
13 14.0
Expected result:
0 NaN
1 NaN
2 NaN
3 4.0
4 5.0
5 NaN
6 NaN
7 NaN
8 NaN
9 10.0
10 11.0
11 12.0
12 13.0
13 14.0
Any help would be appreciated :)
Create a boolean mask to see which NaN-groups have fewer than 4 consecutive NaNs. (Here interpolated is assumed to be the unrestricted interpolation, e.g. interpolated = data_chunk.interpolate(method='linear').)
mask = (data_chunk.notnull() != data_chunk.shift().notnull()).cumsum().reset_index().groupby(0).transform('count') < 4
Select the interpolated values where mask is True, and otherwise keep the original values.
pd.concat([interpolated[mask.values[:, 0] == True], data_chunk[mask.values[:, 0] == False]]).sort_index()
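An alternative sketch (not from the original answer): measure the length of each NaN run directly and accept interpolated values only in runs of at most 3:
import numpy as np
import pandas as pd

s = data_chunk
run_id = s.notna().cumsum()                           # constant within each NaN run
run_len = s.isna().groupby(run_id).transform('sum')   # length of the run each value sits in
fill_ok = s.isna() & (run_len <= 3)                   # NaNs that belong to a short run
# take interpolated values only at those positions; leading NaNs stay NaN
result = s.where(~fill_ok, s.interpolate(method='linear', limit_area='inside'))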

Grouping by and applying lambda with condition for the first row - Pandas

I have a data frame with IDs and the choices that have been made by those IDs.
The alternatives (choices) set is a list of integers: [10, 20, 30, 40].
Note: it's important to use this list. Let's call it 'choice_list'.
This is the data frame:
ID Choice
1 10
1 30
1 10
2 40
2 40
2 40
3 20
3 40
3 10
I want to create a variable for each alternative: '10_Var', '20_Var', '30_Var', '40_Var'.
At the first row of each ID, if the first choice was '10', for example, the variable '10_Var' will get the value 0.6 (some parameter), and each of the other variables ('20_Var', '30_Var', '40_Var') will get the value (1 - 0.6) / 4.
The number 4 stands for the number of alternatives.
Expected result:
ID Choice 10_Var 20_Var 30_Var 40_Var
1 10 0.6 0.1 0.1 0.1
1 30
1 10
2 40 0.1 0.1 0.1 0.6
2 40
2 40
3 20 0.1 0.6 0.1 0.1
3 40
3 10
You can use np.where to do this. It is more efficient than df.where.
import numpy as np
import pandas as pd

df = pd.DataFrame([['1', 10], ['1', 30], ['1', 10], ['2', 40], ['2', 40], ['2', 40], ['3', 20], ['3', 40], ['3', 10]], columns=('ID', 'Choice'))
choices = np.unique(df.Choice)
for choice in choices:
    df[f"var_{choice}"] = np.where(df.Choice == choice, 0.6, (1 - 0.6) / 4)
df
Result
ID Choice var_10 var_20 var_30 var_40
0 1 10 0.6 0.1 0.1 0.1
1 1 30 0.1 0.1 0.6 0.1
2 1 10 0.6 0.1 0.1 0.1
3 2 40 0.1 0.1 0.1 0.6
4 2 40 0.1 0.1 0.1 0.6
5 2 40 0.1 0.1 0.1 0.6
6 3 20 0.1 0.6 0.1 0.1
7 3 40 0.1 0.1 0.1 0.6
8 3 10 0.6 0.1 0.1 0.1
Edit
To set values for the 1st row of each group only:
df = pd.DataFrame([['1', 10], ['1', 30], ['1', 10], ['2', 40], ['2', 40], ['2', 40], ['3', 20], ['3', 40], ['3', 10]], columns=('ID', 'Choice'))
df = df.set_index("ID")
## create a unique index for each row if not already present
df = df.reset_index()
choices = np.unique(df.Choice)
## get the positional index of the 1st row of each group
grouped = df.loc[df.reset_index().groupby("ID")["index"].first()]
## set the value for each new variable
for choice in choices:
    grouped[f"var_{choice}"] = np.where(grouped.Choice == choice, 0.6, (1 - 0.6) / 4)
pd.concat([df, grouped.iloc[:, -len(choices):]], axis=1)
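A possibly more compact variant of the same idea (a sketch, not part of the original answer): flag each ID's first row with groupby().cumcount() and write the values directly, leaving the remaining rows as NaN. choice_list and the 0.6 parameter are taken from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame([['1', 10], ['1', 30], ['1', 10], ['2', 40], ['2', 40], ['2', 40],
                   ['3', 20], ['3', 40], ['3', 10]], columns=('ID', 'Choice'))
choice_list = [10, 20, 30, 40]

first_row = df.groupby('ID').cumcount() == 0   # True only on each ID's first row
for choice in choice_list:
    df[f'{choice}_Var'] = np.where(first_row,
                                   np.where(df['Choice'] == choice, 0.6, (1 - 0.6) / len(choice_list)),
                                   np.nan)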
We can use insert to create the new columns based on the unique Choice values obtained through Series.unique. We can also create a mask so that np.where fills only the first row of each ID.
At the beginning, sort_values is used to sort the values based on ID. You can skip this step if your data frame is already sorted (like the one shown in the example):
df = df.sort_values('ID')
n = df['Choice'].nunique()
mask = df['ID'].ne(df['ID'].shift())
for choice in df['Choice'].sort_values(ascending=False).unique():
    df.insert(2, column=f'{choice}_Var', value=np.nan)
    df.loc[mask, f'{choice}_Var'] = np.where(df.loc[mask, 'Choice'].eq(choice), 0.6, 0.4 / n)
print(df)
ID Choice 10_Var 20_Var 30_Var 40_Var
0 1 10 0.6 0.1 0.1 0.1
1 1 30 NaN NaN NaN NaN
2 1 10 NaN NaN NaN NaN
3 2 40 0.1 0.1 0.1 0.6
4 2 40 NaN NaN NaN NaN
5 2 40 NaN NaN NaN NaN
6 3 20 0.1 0.6 0.1 0.1
7 3 40 NaN NaN NaN NaN
8 3 10 NaN NaN NaN NaN
A solution mixing numpy and pandas:
# choice_list = [10, 20, 30, 40], as defined in the question
rows = np.unique(df.ID.values, return_index=True)[1]
df1 = df.loc[rows].assign(val=0.6)
df2 = (pd.crosstab([df1.index, df1.ID, df1.Choice], df1.Choice, df1.val, aggfunc='first')
         .reindex(choice_list, axis=1)
         .fillna((1 - 0.6) / len(choice_list))
         .reset_index(level=[1, 2], drop=True))
pd.concat([df, df2], axis=1)
Out[217]:
ID Choice 10 20 30 40
0 1 10 0.6 0.1 0.1 0.1
1 1 30 NaN NaN NaN NaN
2 1 10 NaN NaN NaN NaN
3 2 40 0.1 0.1 0.1 0.6
4 2 40 NaN NaN NaN NaN
5 2 40 NaN NaN NaN NaN
6 3 20 0.1 0.6 0.1 0.1
7 3 40 NaN NaN NaN NaN
8 3 10 NaN NaN NaN NaN

pandas how to get row index satisfying certain condition in a vectorized way?

I have a timeseries dataframe containing market prices and order information. For every entry there is a corresponding stoploss. I want to find, for each entry order, the bar index in the dataframe at which its stoploss is triggered. If the market price >= stoploss, the stop is triggered, and I want to record which entry order that stop belongs to. Each entry is identified by its entry bar index. For example, the order with entry price 99 at bar 1 is entry order 1, entry price 98 at bar 2 is entry order 2, entry price 103 at bar 5 is entry order 5, etc.
The original dataframe is like:
entry price index entryprice stoploss
0 0 100 0 NaN NaN
1 1 99 1 99.0 102.0
2 1 98 2 98.0 101.0
3 0 100 3 NaN NaN
4 0 101 4 NaN NaN
5 1 103 5 103.0 106.0
6 0 105 6 NaN NaN
7 0 104 7 NaN NaN
8 0 106 8 NaN NaN
9 1 103 9 103.0 106.0
10 0 100 10 NaN NaN
11 0 104 11 NaN NaN
12 0 108 12 NaN NaN
13 0 110 13 NaN NaN
The code to build it is:
import pandas as pd

df = pd.DataFrame(
    {'price': [100, 99, 98, 100, 101, 103, 105, 104, 106, 103, 100, 104, 108, 110],
     'entry': [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]})
df['index'] = df.index
df['entryprice'] = df['price'].where(df.entry == 1)
df['stoploss'] = df['entryprice'] + 3
To find out where the stoploss is triggered for each order, I currently use apply. I define an outside variable, stoplist, which records all the not-yet-triggered stoploss orders together with their entry order indices. I then pass every row of the df to the function and compare the market price with the stops in stoplist; whenever the condition is met, I assign the entry order index to that row and remove the stop from stoplist.
The code looks like this:
def Stop(row, stoplist):
    output = None
    for i in range(len(stoplist) - 1, -1, -1):
        (ix, stop) = stoplist[i]
        if row['price'] >= stop:
            output = ix
            stoplist.pop(i)
    if row['stoploss'] is not None:
        stoplist.append((row['index'], row['stoploss']))
    return output

import pandas as pd

df = pd.DataFrame(
    {'price': [100, 99, 98, 100, 101, 103, 105, 104, 106, 103, 100, 104, 108, 110],
     'entry': [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]})
df['index'] = df.index
df['entryprice'] = df['price'].where(df.entry == 1)
df['stoploss'] = df['entryprice'] + 3

stoplist = []
df['stopix'] = df.apply(lambda row: Stop(row, stoplist), axis=1)
print(df)
The final output is:
entry price index entryprice stoploss stopix
0 0 100 0 NaN NaN NaN
1 1 99 1 99.0 102.0 NaN
2 1 98 2 98.0 101.0 NaN
3 0 100 3 NaN NaN NaN
4 0 101 4 NaN NaN 2.0
5 1 103 5 103.0 106.0 1.0
6 0 105 6 NaN NaN NaN
7 0 104 7 NaN NaN NaN
8 0 106 8 NaN NaN 5.0
9 1 103 9 103.0 106.0 NaN
10 0 100 10 NaN NaN NaN
11 0 104 11 NaN NaN NaN
12 0 108 12 NaN NaN 9.0
13 0 110 13 NaN NaN NaN
The last column, stopix, is what I want. The only problem with this solution is that apply is not very efficient, and I am wondering whether there is a vectorized way to do this. Any better solution that boosts performance would be helpful, because efficiency is critical to me.
Thanks
Here's my take:
import numpy as np

# mark the block started by each entry
blocks = df.stoploss.notna().cumsum()
# mark where the price is higher than or equal to the (forward-filled) stoploss
higher = df['stoploss'].ffill().le(df.price)
# group 'higher' by entry blocks
g = higher.groupby(blocks)
# index where the minimum of 'higher' occurs in each block
idx = g.transform('idxmin')
# place idx where the first 'higher' occurs within the block
df['stopix'] = np.where(g.cumsum().eq(1), idx, np.nan)
Output:
entry price index entryprice stoploss stopix
0 0 100 0 NaN NaN NaN
1 1 99 1 99.0 102.0 NaN
2 1 98 2 98.0 101.0 NaN
3 0 100 3 NaN NaN NaN
4 0 101 4 NaN NaN 2.0
5 1 103 5 103.0 106.0 NaN
6 0 105 6 NaN NaN NaN
7 0 104 7 NaN NaN NaN
8 0 106 8 NaN NaN 5.0
9 1 103 9 103.0 106.0 NaN
10 0 100 10 NaN NaN NaN
11 0 104 11 NaN NaN NaN
12 0 108 12 NaN NaN 9.0
13 0 110 13 NaN NaN NaN
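Since efficiency is the stated concern, here is one way you might compare the runtime of the two approaches on a larger frame (a hypothetical harness; the size and price model are arbitrary, and Stop is the function from the question):
import time
import numpy as np
import pandas as pd

# build a larger frame with the same columns as the question; increase n to stress-test
n = 10_000
rng = np.random.default_rng(0)
big = pd.DataFrame({'price': rng.normal(100, 5, n).round(1),
                    'entry': (rng.random(n) < 0.1).astype(int)})
big['index'] = big.index
big['entryprice'] = big['price'].where(big.entry == 1)
big['stoploss'] = big['entryprice'] + 3

t0 = time.perf_counter()
stoplist = []
big.apply(lambda row: Stop(row, stoplist), axis=1)         # apply-based version
t1 = time.perf_counter()

blocks = big.stoploss.notna().cumsum()                      # vectorized version
higher = big['stoploss'].ffill().le(big.price)
g = higher.groupby(blocks)
stopix = np.where(g.cumsum().eq(1), g.transform('idxmin'), np.nan)
t2 = time.perf_counter()

print(f"apply: {t1 - t0:.3f}s, vectorized: {t2 - t1:.3f}s")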

Pandas cut results in Nan values

I have the following column, with many missing values ('?'), in the store_data dataframe:
>>>store_data['trestbps']
0 140
1 130
2 132
3 142
4 110
5 120
6 150
7 180
8 120
9 160
10 126
11 140
12 110
13 ?
I replaced all missing values with -999:
store_data.replace('?', -999, inplace = True)
>>>store_data['trestbps']
0 140
1 130
2 132
3 142
4 110
5 120
6 150
7 180
8 120
9 160
10 126
11 140
12 110
13 -999
Now I want to bin the values. I used this code, but the output appears as all NaN:
trestbps = store_data['trestbps']
trestbps_bins = [-999,120,140,200]
store_data['trestbps'] = pd.cut(trestbps,trestbps_bins)
>>>store_data['trestbps']
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
The categories work fine when there are no missing values.
I want the output for rows 0-12 to be categorized and only row 13 to be marked as -999. How can I achieve this?
IIUC, you may do the following (df here is your store_data):
bins = [0, 120, 140, 200]  # set the bins
df.trestbps = pd.cut(df.trestbps, bins)  # do the cut
df.trestbps = df.trestbps.cat.add_categories(999)  # add 999 as a category
df.trestbps.fillna(999)  # fill NaN with 999
0 (120, 140]
1 (120, 140]
2 (120, 140]
3 (140, 200]
4 (0, 120]
5 (0, 120]
6 (140, 200]
7 (140, 200]
8 (0, 120]
9 (140, 200]
10 (120, 140]
11 (120, 140]
12 (0, 120]
13 999
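A possible reason the original cut produced all NaN is that trestbps may still be stored as strings (object dtype) after reading the data and replacing '?'. A hedged sketch that converts the column to numeric before cutting and keeps the missing row flagged as -999 (the column and frame names are taken from the question):
import pandas as pd

# convert to numeric first; '?' strings become NaN, and any -999 placeholder
# will simply fall outside the bins below
store_data['trestbps'] = pd.to_numeric(store_data['trestbps'], errors='coerce')

bins = [0, 120, 140, 200]
store_data['trestbps'] = pd.cut(store_data['trestbps'], bins)
# add -999 as an explicit category and use it for the missing row
store_data['trestbps'] = store_data['trestbps'].cat.add_categories(-999).fillna(-999)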