Local maximum identification within a set % (trigger True on the downside) - pandas

Looking to flag True when the current value in col1 drops below a set percent of the most recent local maximum in col1, such that there will be multiple such signals as new maxima are achieved and the current value again drops by the set percent (i.e. it resets automatically, with no threshold other than the percent). Note that the True flag should occur only on the downside, not the upside.
import numpy as np
import pandas as pd

percent = 0.7
df = pd.DataFrame({'col0':[1,2,3,4,5,6,7,8,9]
,'col1':[5,4.9,5.5,3.5,3.1,4.5,5.5,1.2,5.8]
,'col2':[3.5, 3.43, 3.85, 2.45, 2.17, 3.15, 3.85, 0.84, 4.06]
,'col3':[np.nan, 3.43, 3.85, 3.85, 3.85, 3.85, 3.85, 3.85, 4.06]
})
df['col2'] = df['col1'] * percent
df['col3'] = df['col2'].shift(-1).cummax().shift()
The current form of col3 generates a cummax, but the desired result will find the local maxima and have col4 flag True every time col1 breaches col3 to the downside. Here's one example of the resulting col3 and col4:
col0 col1 col2 col3 col4
0 1 5.0 3.50 NaN False
1 2 4.9 3.43 3.43 False
2 3 5.5 3.85 3.85 False
3 4 3.5 2.45 3.85 True
4 5 3.1 2.17 3.85 False
5 6 4.5 3.15 3.85 False
6 7 5.5 3.85 3.85 False
7 8 1.2 0.84 3.85 True
8 9 5.8 4.06 4.06 False
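No answer is attached to this first question in the thread, so here is a minimal sketch of one reading that reproduces the example table above: keep col3 as the running cummax of col2 and flag col4 only on the row where col1 first crosses below col3, re-arming once col1 recovers above it. The crossing/re-arm rule is an assumption on my part; the column names come from the question.
import numpy as np
import pandas as pd

percent = 0.7
df = pd.DataFrame({'col0': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'col1': [5, 4.9, 5.5, 3.5, 3.1, 4.5, 5.5, 1.2, 5.8]})
df['col2'] = df['col1'] * percent                   # percent of each value
df['col3'] = df['col2'].shift(-1).cummax().shift()  # running local-max threshold
# True only on the downside crossing: col1 is below col3 on this row
# but was not already below it on the previous row.
below = df['col1'] < df['col3']
df['col4'] = below & ~below.shift(fill_value=False)
print(df)
If every row spent below the threshold should also flag True, drop the "& ~below.shift(...)" term and use below directly.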

Related

One column as denominator and many as numerator - pandas

I have a data frame with many columns. I want col1 as the denominator and all the other columns as numerators. I have done this for just col2 (see the code below). I want to do this for all the other columns concisely.
df
Town col1 col2 col3 col4
A 8 7 5 2
B 8 4 2 3
C 8 5 8 5
here is my code for col2:
df['col2'] = df['col2'] / df['col1']
here is my result:
df
A 8 0.875000 1.0 5 2
B 8 0.500000 0.0 2 3
C 8 0.625000 1.0 8 5
I want to do the same with all the other columns (i.e. col3, col4, ...).
If this could be done with pivot_table that would be awesome.
Thanks for your help
Use df.iloc with df.div:
In [2084]: df.iloc[:, 2:] = df.iloc[:, 2:].div(df.col1, axis=0)
In [2085]: df
Out[2085]:
Town col1 col2 col3 col4
0 A 8 0.875 0.625 0.250
1 B 8 0.500 0.250 0.375
2 C 8 0.625 1.000 0.625
Or use df.filter and pd.concat with df.div:
In [2073]: x = df.filter(like='col').set_index('col1')
In [2078]: out = pd.concat([df.Town, x.div(x.index, axis=0).reset_index()], axis=1)
In [2079]: out
Out[2079]:
Town col1 col2 col3 col4
0 A 8 0.875 0.625 0.250
1 B 8 0.500 0.250 0.375
2 C 8 0.625 1.000 0.625
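If the numerator columns are easier to pick by name than by position, a small variation of the same df.div idea could look like this (a sketch; num_cols is a hypothetical name and the frame is rebuilt from the question's table):
import pandas as pd

df = pd.DataFrame({'Town': ['A', 'B', 'C'], 'col1': [8, 8, 8],
                   'col2': [7, 4, 5], 'col3': [5, 2, 8], 'col4': [2, 3, 5]})
num_cols = df.columns.difference(['Town', 'col1'])   # every column except Town and the denominator
df[num_cols] = df[num_cols].div(df['col1'], axis=0)  # divide row-wise by col1
print(df)
This keeps Town and col1 untouched and works for any number of colN columns.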

Data Imputation in Pandas Dataframe column

I have 2 tables which I am merging (left join) on a common column, but the other table does not have exact matches for every value, hence some of the merged values are blank. I want to fill the missing values using the closest ten. For example, I have these two dataframes:
d = {'col1': [1.31, 2.22,3.33,4.44,5.55,6.66], 'col2': ['010100', '010101','101011','110000','114000','120000']}
df1=pd.DataFrame(data=d)
d2 = {'col2': ['010100', '010102','010144','114218','121212','166110'],'col4': ['a','b','c','d','e','f']}
df2=pd.DataFrame(data=d2)
# df1
col1 col2
0 1.31 010100
1 2.22 010101
2 3.33 101011
3 4.44 110000
4 5.55 114000
5 6.66 120000
# df2
col2 col4
0 010100 a
1 010102 b
2 010144 c
3 114218 d
4 121212 e
5 166110 f
After left merging on col2,
I get:
df1.merge(df2,how='left',on='col2')
col1 col2 col4
0 1.31 010100 a
1 2.22 010101 NaN
2 3.33 101011 NaN
3 4.44 111100 NaN
4 5.55 114100 NaN
5 6.66 166100 NaN
Versus what I want: for all rows where col4 is NaN, my col2 value should first be converted to the closest ten and then matched against col2 of the other table; if there is a match, place col4 accordingly, if not then try the closest hundred, then the closest thousand, ten thousand, and so on.
Ideally my answer should be:
col1 col2 col4
0 1.31 010100 a
1 2.22 010101 a
2 3.33 101011 f
3 4.44 111100 d
4 5.55 114100 d
5 6.66 166100 f
Please help me in coding this
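No answer is recorded for this question in the thread. A minimal sketch of one literal reading of the procedure (round both the missing code and the known codes to the nearest 10, then 100, and so on, until something matches) is below; tiered_match and lookup are hypothetical names, and this follows the written procedure rather than guaranteeing every row of the expected table above.
import pandas as pd

d = {'col1': [1.31, 2.22, 3.33, 4.44, 5.55, 6.66],
     'col2': ['010100', '010101', '101011', '110000', '114000', '120000']}
df1 = pd.DataFrame(data=d)
d2 = {'col2': ['010100', '010102', '010144', '114218', '121212', '166110'],
      'col4': ['a', 'b', 'c', 'd', 'e', 'f']}
df2 = pd.DataFrame(data=d2)

# Known codes mapped to col4, keyed by the integer value of the zero-padded code.
lookup = {int(k): v for k, v in zip(df2['col2'], df2['col4'])}

def tiered_match(code):
    """Exact match first, then round both sides to the nearest 10, 100, ... until a hit."""
    value = int(code)
    if value in lookup:
        return lookup[value]
    for scale in (10, 100, 1_000, 10_000, 100_000):
        rounded = {round(k / scale) * scale: v for k, v in lookup.items()}  # ties keep the last code seen
        candidate = round(value / scale) * scale
        if candidate in rounded:
            return rounded[candidate]
    return pd.NA

merged = df1.merge(df2, how='left', on='col2')
missing = merged['col4'].isna()
merged.loc[missing, 'col4'] = merged.loc[missing, 'col2'].map(tiered_match)
print(merged)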

resample data within each group in pandas

I have a dataframe with different ids and possibly overlapping times at a time step of 0.4 seconds. I would like to resample the average speed for each id at a time step of 0.8 seconds.
time id speed
0 0.0 1 0
1 0.4 1 3
2 0.8 1 6
3 1.2 1 9
4 0.8 2 12
5 1.2 2 15
6 1.6 2 18
An example can be created by the following code
import numpy as np
import pandas as pd

x = np.hstack((np.array([1] * 10), np.array([3] * 15)))
a = np.arange(10) * 0.4
b = np.arange(15) * 0.4 + 2
t = np.hstack((a, b))
df = pd.DataFrame({"time": t, "id": x})
df["speed"] = np.arange(25) * 3
The time column is transferred to datetime type by
df["re_time"] = pd.to_datetime(df["time"], unit='s')
Try with groupby:
block_size = int(0.8 // 0.4)                         # number of 0.4 s rows per 0.8 s block
blocks = df.groupby('id').cumcount() // block_size   # block number within each id
df.groupby(['id', blocks]).agg({'time': 'first', 'speed': 'mean'})
Output:
time speed
id
1 0 0.0 1.5
1 0.8 7.5
2 1.6 13.5
3 2.4 19.5
4 3.2 25.5
3 0 2.0 31.5
1 2.8 37.5
2 3.6 43.5
3 4.4 49.5
4 5.2 55.5
5 6.0 61.5
6 6.8 67.5
7 7.6 72.0
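Since the question already builds a datetime column re_time, an alternative sketch is to resample inside the groupby. This is not the answer above, just a variant under the assumption that a recent pandas is available (origin='start', added in pandas 1.1, anchors each id's 0.8-second bins at that id's first timestamp so the bins line up with the block approach):
# Using the question's df and its re_time column.
out = (df.set_index('re_time')
         .groupby('id')['speed']
         .resample('800ms', origin='start')
         .mean()
         .reset_index())
print(out)
Without origin='start' the bins are anchored at the epoch, so an id that does not start on a 0.8-second boundary would be binned differently from the cumcount-block result.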

How to manipulate data in arrays using pandas (and resetting evaluations)

One contributor thinks a solution might be possible with groupby in combination with cummax.
I have a dataframe in which the max between prior value of col3 and current value of col2 is evaluated through a cummax function recently offered by Scott Boston (thanks!) as follows:
df['col3'] = df['col2'].shift(-1).cummax().shift()
The resulting dataframe is shown below. Also added is the desired logic that compares the latest value to a float-type setpoint.
result of operating cummax:
col0 col1 col2 col3
0 1 5.0 2.50 NaN
1 2 4.9 2.45 2.45
2 3 5.5 2.75 2.75
3 4 3.5 1.75 2.75
4 5 3.1 1.55 2.75
5 6 4.5 2.25 2.75
6 7 5.5 2.75 2.75
7 8 1.2 0.6 2.75
8 9 5.8 2.90 2.90
The desire is to flag True when col3 >= setpoint (2.71 in the above example), every time col3's most recent row reaches the setpoint.
The problem: the cummax solution does not reset when the setpoint is reached. I need a solution that resets the cummax calculation every time it breaches the setpoint. For example, in the table above, after the first True when col3 exceeds the setpoint (i.e. the col2 value is 2.75), there is a second time when it should satisfy the same condition, shown in the extended data table where I've deleted col3's value in row 4 to illustrate the need to 'reset' the cummax calc. In the if statement, I am using subscript [-1] to target the last row in the df (i.e. the most recent). Note: col2 = current value of col1 * constant1, where constant1 == 0.5.
Code tried so far (note that col3 is not resetting properly):
if self.constant is not None:
    setpoint = self.constant * (1 - self.temp)  # suppose setpoint == 2.71

df = pd.DataFrame({'col0':[1,2,3,4,5,6,7,8,9]
                  ,'col1':[5,4.9,5.5,3.5,3.1,4.5,5.5,1.2,5.8]
                  ,'col2':[2.5,2.45,2.75,1.75,1.55,2.25,2.75,0.6,2.9]
                  ,'col3':[np.nan,2.45,2.75,2.75,2.75,2.75,2.75,2.75,2.9]
                  })

if df['col3'][-1] >= setpoint:
    self.log('setpoint hit')
    return True
The cummax solution needs tweaking: col3 is supposed to be evaluated based on the values of col2 and col3, and once the setpoint (2.71 for col3) is breached, the next col3 value should reset to NaN and start a new cummax. The correct output for col3 should be [NaN, 2.45, 2.75, NaN, 1.55, 2.25, 2.75, NaN, 2.9], and it should return True again and again when the last row of col3 breaches the setpoint value 2.71.
Desired result of operating cummax with the additional tweaking for col3 (possibly with a groupby that references col2?): return True every time the setpoint is breached. Here's one example of the resulting col3:
col0 col1 col2 col3
0 1 5.0 2.50 NaN
1 2 4.9 2.45 2.45
2 3 5.5 2.75 2.75
3 4 3.5 1.75 NaN
4 5 3.1 1.55 1.55
5 6 4.5 2.25 2.25
6 7 5.5 2.75 2.75
7 8 1.2 0.60 NaN
8 9 5.8 2.90 2.90
Open to suggestions on whether NaN is returned on the row where the breach occurs or on the next row as shown above (the key desire is for the if statement to resolve True as soon as the setpoint is breached).
Try:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col0':[1,2,3,4,5,6,7,8,9]
,'col1':[5,4.9,5.5,3.5,3.1,4.5,5.5,1.2,5.8]
,'col2':[2.5,2.45,2.75,1.75,1.55,2.25,2.75,0.6,2.9]
,'col3':[np.nan,2.45,2.75,2.75,2.75,2.75,2.75,2.75,2.9]
})
threshold = 2.71
grp = df['col2'].ge(threshold).cumsum().shift().bfill()
df['col3'] = df['col2'].groupby(grp).transform(lambda x: x.shift(-1).cummax().shift())
print(df)
Output:
col0 col1 col2 col3
0 1 5.0 2.50 NaN
1 2 4.9 2.45 2.45
2 3 5.5 2.75 2.75
3 4 3.5 1.75 NaN
4 5 3.1 1.55 1.55
5 6 4.5 2.25 2.25
6 7 5.5 2.75 2.75
7 8 1.2 0.60 NaN
8 9 5.8 2.90 2.90
Details:
Create the grouping using greater-than-or-equal against the threshold, then apply the same shift/cummax logic to each group within the dataframe using groupby with transform.
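A small follow-up sketch (my addition, not part of the answer): once col3 resets as above, the question's if statement can test the most recent row directly. Note that df['col3'][-1] indexes by label and fails on a default RangeIndex, so iloc[-1] is used to address the last row positionally. This continues from the answer's df and threshold:
# Flag every row where the (resetting) running max reaches the setpoint,
# then test the most recent row, as the question's if statement intends.
df['breach'] = df['col3'].ge(threshold)
if df['breach'].iloc[-1]:
    print('setpoint hit')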

Grouping by and applying lambda with condition for the first row - Pandas

I have a data frame with IDs, and choices that have made by those IDs.
The alternatives (choices) set is a list of integers: [10, 20, 30, 40].
Note: it's important to use this list. Let's call it 'choice_list'.
This is the data frame:
ID Choice
1 10
1 30
1 10
2 40
2 40
2 40
3 20
3 40
3 10
I want to create a variable for each alternative: '10_Var', '20_Var', '30_Var', '40_Var'.
At the first row of each ID, if the first choice was '10' for example, then the variable '10_Var' will get the value 0.6 (some parameter), and each of the other variables ('20_Var', '30_Var', '40_Var') will get the value (1 - 0.6) / 4.
The number 4 stands for the number of alternatives.
Expected result:
ID Choice 10_Var 20_Var 30_Var 40_Var
1 10 0.6 0.1 0.1 0.1
1 30
1 10
2 40 0.1 0.1 0.1 0.6
2 40
2 40
3 20 0.1 0.6 0.1 0.1
3 40
3 10
You can use np.where to do this. It is more efficient than df.where.
df = pd.DataFrame([['1', 10], ['1', 30], ['1', 10], ['2', 40], ['2', 40], ['2', 40], ['3', 20], ['3', 40], ['3', 10]], columns=('ID', 'Choice'))
choices = np.unique(df.Choice)
for choice in choices:
    df[f"var_{choice}"] = np.where(df.Choice == choice, 0.6, (1 - 0.6) / 4)
df
Result
ID Choice var_10 var_20 var_30 var_40
0 1 10 0.6 0.1 0.1 0.1
1 1 30 0.1 0.1 0.6 0.1
2 1 10 0.6 0.1 0.1 0.1
3 2 40 0.1 0.1 0.1 0.6
4 2 40 0.1 0.1 0.1 0.6
5 2 40 0.1 0.1 0.1 0.6
6 3 20 0.1 0.6 0.1 0.1
7 3 40 0.1 0.1 0.1 0.6
8 3 10 0.6 0.1 0.1 0.1
Edit
To set values on the first row of each group only:
df = pd.DataFrame([['1', 10], ['1', 30], ['1', 10], ['2', 40], ['2', 40], ['2', 40], ['3', 20], ['3', 40], ['3', 10]], columns=('ID', 'Choice'))
df = df.set_index("ID")
## create unique index for each row if not already
df = df.reset_index()
choices = np.unique(df.Choice)
## get unique id of 1st row of each group
grouped = df.loc[df.reset_index().groupby("ID")["index"].first()]
## set value for each new variable
for choice in choices:
    grouped[f"var_{choice}"] = np.where(grouped.Choice == choice, 0.6, (1 - 0.6) / 4)
pd.concat([df, grouped.iloc[:, -len(choices):]], axis=1)
We can use insert to create the new columns based on the unique values of Choice obtained through Series.unique. We can also create a mask to fill only the first row of each ID using np.where.
At the beginning, sort_values is used to sort the values based on the ID. You can skip this step if your data frame is already sorted (like the one shown in the example):
df = df.sort_values('ID')
n = df['Choice'].nunique()
mask = df['ID'].ne(df['ID'].shift())
for choice in df['Choice'].sort_values(ascending=False).unique():
    df.insert(2, column=f'{choice}_Var', value=np.nan)
    df.loc[mask, f'{choice}_Var'] = np.where(df.loc[mask, 'Choice'].eq(choice), 0.6, 0.4/n)
print(df)
ID Choice 10_Var 20_Var 30_Var 40_Var
0 1 10 0.6 0.1 0.1 0.1
1 1 30 NaN NaN NaN NaN
2 1 10 NaN NaN NaN NaN
3 2 40 0.1 0.1 0.1 0.6
4 2 40 NaN NaN NaN NaN
5 2 40 NaN NaN NaN NaN
6 3 20 0.1 0.6 0.1 0.1
7 3 40 NaN NaN NaN NaN
8 3 10 NaN NaN NaN NaN
A mix of numpy and pandas solution:
choice_list = [10, 20, 30, 40]  # the alternatives set defined in the question
rows = np.unique(df.ID.values, return_index=1)[1]
df1 = df.loc[rows].assign(val=0.6)
df2 = (pd.crosstab([df1.index, df1.ID, df1.Choice], df1.Choice, df1.val, aggfunc='first')
.reindex(choice_list, axis=1)
.fillna((1-0.6)/len(choice_list)).reset_index(level=[1,2], drop=True))
pd.concat([df, df2], axis=1)
Out[217]:
ID Choice 10 20 30 40
0 1 10 0.6 0.1 0.1 0.1
1 1 30 NaN NaN NaN NaN
2 1 10 NaN NaN NaN NaN
3 2 40 0.1 0.1 0.1 0.6
4 2 40 NaN NaN NaN NaN
5 2 40 NaN NaN NaN NaN
6 3 20 0.1 0.6 0.1 0.1
7 3 40 NaN NaN NaN NaN
8 3 10 NaN NaN NaN NaN