How to manipulate data in arrays using pandas (and resetting evaluations) - pandas

One contributor has already suggested that a solution might be possible with groupby in combination with cummax.
I have a dataframe in which col3 tracks the running maximum of col2 (i.e. the max of the prior col3 value and the current col2 value), computed with a cummax expression recently offered by Scott Boston (thanks!) as follows:
df['col3'] = df['col2'].shift(-1).cummax().shift()
The resulting dataframe is shown below. I have also added the desired logic, which compares col2 to a setpoint (the result of a float calculation).
Result of operating cummax:
col0 col1 col2 col3
0 1 5.0 2.50 NaN
1 2 4.9 2.45 2.45
2 3 5.5 2.75 2.75
3 4 3.5 1.75 2.75
4 5 3.1 1.55 2.75
5 6 4.5 2.25 2.75
6 7 5.5 2.75 2.75
7 8 1.2 0.60 2.75
8 9 5.8 2.90 2.90
The desire is to flag True whenever col3 in the most recent row meets or exceeds the setpoint (2.71 in this example).
The problem: the cummax solution does not reset when the setpoint is reached. I need a solution that resets the cummax calculation every time it breaches the setpoint. For example, in the table above, after the first True when col3 exceeds the setpoint (i.e. the col2 value is 2.75), there is a second time when the same condition should be satisfied; this is shown in the extended data table below, where I have deleted col3's value in row 4 to illustrate the need to 'reset' the cummax calculation. In the if statement I index the last row of the df (i.e. the most recent). Note: col2 = current value of col1 * constant1, where constant1 == 0.5.
Code tried so far (note that col3 is not resetting properly):
import pandas as pd
import numpy as np

if self.constant is not None:
    setpoint = self.constant * (1 - self.temp)  # suppose setpoint == 2.71

df = pd.DataFrame({'col0': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'col1': [5, 4.9, 5.5, 3.5, 3.1, 4.5, 5.5, 1.2, 5.8],
                   'col2': [2.5, 2.45, 2.75, 1.75, 1.55, 2.25, 2.75, 0.6, 2.9],
                   'col3': [np.nan, 2.45, 2.75, 2.75, 2.75, 2.75, 2.75, 2.75, 2.9]})

# .iloc[-1] targets the last (most recent) row; plain df['col3'][-1] would raise a
# KeyError with this default integer index
if df['col3'].iloc[-1] >= setpoint:
    self.log('setpoint hit')
    return True
The cummax solution needs tweaking: col3 is supposed to evaluate based on the values of col2 and col3, and once the setpoint (2.71 for col3) is breached, the next col3 value should reset to NaN and start a new cummax. The correct output for col3 should be [NaN, 2.45, 2.75, NaN, 1.55, 2.25, 2.75, NaN, 2.9], and the check should return True again and again whenever the last row of col3 breaches the setpoint value of 2.71.
Desired result of operating cummax plus the additional tweaking for col3 (possibly with a groupby that references col2?): return True every time the setpoint is breached. Here's one example of the resulting col3:
col0 col1 col2 col3
0 1 5.0 2.50 NaN
1 2 4.9 2.45 2.45
2 3 5.5 2.75 2.75
3 4 3.5 1.75 NaN
4 5 3.1 1.55 1.55
5 6 4.5 2.25 2.25
6 7 5.5 2.75 2.75
7 8 1.2 0.60 NaN
8 9 5.8 2.90 2.90
Open to suggestions on whether NaN is returned on the row where the breach occurs or on the next row as shown above (the key desire is for the if statement to resolve True as soon as the setpoint is breached).

Try:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col0':[1,2,3,4,5,6,7,8,9]
,'col1':[5,4.9,5.5,3.5,3.1,4.5,5.5,1.2,5.8]
,'col2':[2.5,2.45,2.75,1.75,1.55,2.25,2.75,0.6,2.9]
,'col3':[np.nan,2.45,2.75,2.75,2.75,2.75,2.75,2.75,2.9]
})
threshold = 2.71
grp = df['col2'].ge(threshold).cumsum().shift().bfill()
df['col3'] = df['col2'].groupby(grp).transform(lambda x: x.shift(-1).cummax().shift())
print(df)
Output:
col0 col1 col2 col3
0 1 5.0 2.50 NaN
1 2 4.9 2.45 2.45
2 3 5.5 2.75 2.75
3 4 3.5 1.75 NaN
4 5 3.1 1.55 1.55
5 6 4.5 2.25 2.25
6 7 5.5 2.75 2.75
7 8 1.2 0.60 NaN
8 9 5.8 2.90 2.90
Details:
Create the grouping by flagging where col2 is greater than or equal to the threshold and taking a cumulative sum, then apply the same cummax logic within each group of the dataframe using groupby with transform.
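For reference, this is the intermediate grouping Series the answer builds on the example data; each breach of the threshold starts a new group on the following row:
print(df['col2'].ge(threshold).cumsum().shift().bfill().tolist())
# [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0]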

Related

Data Imputation in Pandas Dataframe column

I have 2 tables which I am merging (left join) on a common column, but the other table does not have exact matches for every value, so some of the merged column values are blank. I want to fill the missing values with the closest tenth. For example, I have these two dataframes:
d = {'col1': [1.31, 2.22,3.33,4.44,5.55,6.66], 'col2': ['010100', '010101','101011','110000','114000','120000']}
df1=pd.DataFrame(data=d)
d2 = {'col2': ['010100', '010102','010144','114218','121212','166110'],'col4': ['a','b','c','d','e','f']}
df2=pd.DataFrame(data=d2)
# df1
col1 col2
0 1.31 010100
1 2.22 010101
2 3.33 101011
3 4.44 110000
4 5.55 114000
5 6.66 120000
# df2
col2 col4
0 010100 a
1 010102 b
2 010144 c
3 114218 d
4 121212 e
5 166110 f
After left merging on col2,
I get:
df1.merge(df2,how='left',on='col2')
col1 col2 col4
0 1.31 010100 a
1 2.22 010101 NaN
2 3.33 101011 NaN
3 4.44 111100 NaN
4 5.55 114100 NaN
5 6.66 166100 NaN
Versus what I want: for all values where col4 is NaN, my col2 value should first be converted to the closest 10 and then matched against col2 of the other table; if there is a match, place col4 accordingly; if not, try the closest 100, then the closest thousand, ten thousand, and so on.
Ideally my answer should be:
col1 col2 col4
0 1.31 010100 a
1 2.22 010101 a
2 3.33 101011 f
3 4.44 111100 d
4 5.55 114100 d
5 6.66 166100 f
Please help me in coding this
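A minimal sketch of one reading of this requirement: after the left merge, fill each missing col4 with the value belonging to the numerically closest col2 code in df2. The nearest_col4 helper below is hypothetical (not from the question), and the tenth/hundred/thousand rounding cascade described above would refine which candidate wins; this is only an approximation.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': [1.31, 2.22, 3.33, 4.44, 5.55, 6.66],
                    'col2': ['010100', '010101', '101011', '110000', '114000', '120000']})
df2 = pd.DataFrame({'col2': ['010100', '010102', '010144', '114218', '121212', '166110'],
                    'col4': ['a', 'b', 'c', 'd', 'e', 'f']})

merged = df1.merge(df2, how='left', on='col2')
keys = df2['col2'].astype(int).to_numpy()

def nearest_col4(code):
    # hypothetical helper: return col4 of the df2 row whose numeric code is closest
    return df2['col4'].iloc[np.abs(keys - int(code)).argmin()]

# only fill the rows the exact merge left blank
mask = merged['col4'].isna()
merged.loc[mask, 'col4'] = merged.loc[mask, 'col2'].map(nearest_col4)
print(merged)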

Classify a value under certain conditions in pandas dataframe

I have this dataframe:
value limit_1 limit_2 limit_3 limit_4
10 2 3 7 10
11 5 6 11 13
2 0.3 0.9 2.01 2.99
I want to add another column called class that classifies the value column this way:
if value <= limit_1 then 1
if value > limit_1 and value <= limit_2 then 2
if value > limit_2 and value <= limit_3 then 3
if value > limit_3 then 4
to get this result:
value limit_1 limit_2 limit_3 limit_4 CLASS
10 2 3 7 10 4
11 5 6 11 13 3
2 0.3 0.9 2.01 2.99 3
I know I could get these ifs to work, but my dataframe has 2 million rows and I need the fastest way to perform this classification.
I tried the .cut function but the result was not what I expected/wanted.
Thanks
We can use the rank method over the column axis (axis=1):
df["CLASS"] = df.rank(axis=1, method="first").iloc[:, 0].astype(int)
value limit_1 limit_2 limit_3 limit_4 CLASS
0 10 2.0 3.0 7.00 10.00 4
1 11 5.0 6.0 11.00 13.00 3
2 2 0.3 0.9 2.01 2.99 3
We can use np.select:
import numpy as np
conditions = [df["value"]<df["limit_1"],
df["value"].between(df["limit_1"], df["limit_2"]),
df["value"].between(df["limit_2"], df["limit_3"]),
df["value"]>df["limit_3"]]
df["CLASS"] = np.select(conditions, [1,2,3,4])
>>> df
value limit_1 limit_2 limit_3 limit_4 CLASS
0 10 2.0 3.0 7.00 10.00 4
1 11 5.0 6.0 11.00 13.00 3
2 2 0.3 0.9 2.01 2.99 3
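Another vectorized option (not from the answers above, just a sketch on the same example frame): count how many of the first three limits each value strictly exceeds, which yields the same 1-4 classes under the rules listed in the question.
import pandas as pd

df = pd.DataFrame({'value': [10, 11, 2],
                   'limit_1': [2, 5, 0.3],
                   'limit_2': [3, 6, 0.9],
                   'limit_3': [7, 11, 2.01],
                   'limit_4': [10, 13, 2.99]})

# CLASS = 1 + number of limits strictly below the value
limits = df[['limit_1', 'limit_2', 'limit_3']].to_numpy()
df['CLASS'] = (df[['value']].to_numpy() > limits).sum(axis=1) + 1
print(df['CLASS'].tolist())  # [4, 3, 3]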

How to use SQL minus query equivalent between two dataframes properly

I have two dataframes, each having 1000 rows. The dataframes contain the same data; however, the rows are not in the same order. The following examples can be taken as truncated versions of the dataframes.
df1:
col1 col2 col3
1 2 3
2 3 4
5 6 6
8 9 9
df2:
col1 col2 col3
5 6 6
8 9 9
1 2 3
2 3 4
The dataframes don't have meaningful indices, and I expect an empty result when I apply an SQL MINUS-style query to them. I used the following, but did not obtain the result I expected. Is there any way to achieve my desired result?
df3 = df1.merge(df2.drop_duplicates(),how='right', indicator=True)
print(df3)
For instance, if I consider df1 as table1 and df2 as table2 and ran the following query in SQL Server, I would get an empty table back:
SELECT * FROM table1
EXCEPT
SELECT * FROM table2
Yes, you can use the indicator like this:
df1.merge(df2, how='left', indicator='ind').query('ind=="left_only"')
Where df1 is:
col1 col2 col3
0 1.0 2.0 3.0
1 2.0 3.0 4.0
2 5.0 6.0 6.0
3 8.0 9.0 9.0
4 10.0 10.0 10.0
and df2 is:
col1 col2 col3
0 5 6 6
1 8 9 9
2 1 2 3
3 2 3 4
Output:
col1 col2 col3 ind
4 10.0 10.0 10.0 left_only
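As a usage check, re-running the same query on the asker's original frames (identical rows, just reordered) should come back empty, mirroring SQL's EXCEPT:
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 5, 8], 'col2': [2, 3, 6, 9], 'col3': [3, 4, 6, 9]})
df2 = pd.DataFrame({'col1': [5, 8, 1, 2], 'col2': [6, 9, 2, 3], 'col3': [6, 9, 3, 4]})

out = df1.merge(df2, how='left', indicator='ind').query('ind == "left_only"')
print(out.empty)  # True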

Local maximum identification within a set % (trigger True on the downside)

Looking to identify (flag True) when the current value in col1 drops below a set percentage of the most recent local maximum of col1, such that there will be multiple such signals as new maxima are achieved and the current value again drops by the set percent (i.e. it resets automatically, with no threshold other than the percent). Note that the True flag should occur only on the downside, not the upside.
import numpy as np
import pandas as pd

percent = 0.7
df = pd.DataFrame({'col0':[1,2,3,4,5,6,7,8,9]
,'col1':[5,4.9,5.5,3.5,3.1,4.5,5.5,1.2,5.8]
,'col2':[3.5, 3.43, 3.85, 2.45, 2.17, 3.15, 3.85, 0.84, 4.06]
,'col3':[np.nan, 3.43, 3.85, 3.85, 3.85, 3.85, 3.85, 3.85, 4.06]
})
df['col2'] = df['col1'] * percent
df['col3'] = df['col2'].shift(-1).cummax().shift()
The current form of col3 generates the cummax, but the desired result will find local maxima to the downside, with col4 flagging True every time col1 breaches col3 to the downside. Here's one example of the resulting col3 and col4:
col0 col1 col2 col3 col4
0 1 5.0 3.50 NaN False
1 2 4.9 3.43 3.43 False
2 3 5.5 3.85 3.85 False
3 4 3.5 2.45 3.85 True
4 5 3.1 2.17 3.85 False
5 6 4.5 3.15 3.85 False
6 7 5.5 3.85 3.85 False
7 8 1.2 0.84 3.85 True
8 9 5.8 4.06 4.06 False
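A minimal sketch of one way to get col4: build on the same cummax col3 and flag only the downward crossing (col1 is below col3 now but was not on the prior row). The crossing rule is an assumption inferred from the example table above.
import pandas as pd

percent = 0.7
df = pd.DataFrame({'col0': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'col1': [5, 4.9, 5.5, 3.5, 3.1, 4.5, 5.5, 1.2, 5.8]})
df['col2'] = df['col1'] * percent
df['col3'] = df['col2'].shift(-1).cummax().shift()

# flag only the downward crossing: below col3 now, but not below on the prior row
below = df['col1'] < df['col3']
df['col4'] = below & ~below.shift(fill_value=False)
print(df)  # col4 is True on rows 3 and 7, matching the example table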

Pandas rolling function with specific numeric span?

As of Pandas 0.18.0, it is possible to have a variable rolling window size for time-series by specifying a time span. For example, the code for summation over a 2-second window in dataframe dft looks like this:
dft.rolling('2s').sum()
Is it possible to do the same with non-datetime spans?
For example, given a dataframe that looks like this:
A B
0 1 1
1 2 2
2 3 3
3 5 5
4 6 6
5 7 7
6 10 10
Is it possible to specify a window span of say 3 on column 'A' and have the sum of column 'B' calculated, so that the output looks something like:
A B
0 1 NaN
1 2 NaN
2 3 5
3 5 10
4 6 14
5 7 18
6 10 17
Not with rolling(). See the documentation for the window argument:
[A variable-sized window] is only valid for datetimelike indexes.
Full text:
window : int, or offset
Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size.
If its an offset then this will be the time period of each window. Each window will be a variable sized based on the observations included in the time-period. This is only valid for datetimelike indexes.
Here's a workaround if you're interested.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(10),
                   'B': np.arange(10, 20)},
                  index=[1, 2, 3, 5, 8, 9, 11, 14, 19, 20])
def var_window(df, size, min_periods=None):
"""Operates on the index."""
result = []
df = df.sort_index()
for i in df.index:
start = i - size + 1
res = df.loc[start:i].sum().tolist()
result.append(res)
result = pd.DataFrame(result, index=df.index)
if min_periods:
result.loc[:min_periods - 1] = np.nan
return result
print(var_window(df, size=3, min_periods=3))
0 1
1 NaN NaN
2 NaN NaN
3 3.0 33.0
5 5.0 25.0
8 4.0 14.0
9 9.0 29.0
11 11.0 31.0
14 7.0 17.0
19 8.0 18.0
20 17.0 37.0
Explanation: loop through the index. At each value, truncate the DataFrame to the trailing window size. Here 'size' is not a count, but rather a range as you have defined it.
In the above, at the index value of 8, you're summing the values of A for which the index is 8, 7, or 6 (i.e. >= 8 - 3 + 1 = 6). The only index value that falls within that range is 8, so the sum is simply the value from the original frame. Comparatively, for the index value of 11, the sum will include the values for 9 and 11 (5 + 6 = 11, the resulting sum for A).
Compare this with standard rolling ops:
print(df.rolling(window=3).sum())
A B
1 NaN NaN
2 NaN NaN
3 3.0 33.0
5 6.0 36.0
8 9.0 39.0
9 12.0 42.0
11 15.0 45.0
14 18.0 48.0
19 21.0 51.0
20 24.0 54.0
If I'm misinterpreting your question, let me know how. It's admittedly significantly slower:
%timeit df.rolling(window=3).sum()
1000 loops, best of 3: 627 µs per loop
%timeit var_window(df, size=3, min_periods=3)
100 loops, best of 3: 3.59 ms per loop