Restart cumulative sum in pandas DataFrame

I am trying to restart a cumulative sum in a pandas DataFrame every time the running total's absolute value exceeds 0.009. I could give you an excerpt of my attempts, but I assume they would just distract you. I have tried several things with np.where, but at a certain point the conditions start to overlap and pick out the wrong rows.
Column b is the desired output.
df = pd.DataFrame({'values': (49.925, 49.928, 49.945, 49.928, 49.925, 49.935, 49.938, 49.942, 49.931, 49.952)})
df['a'] = df['values'].diff()
values a b
0 49.925 NaN 0.000
1 49.928 0.003 0.003
2 49.945 0.017 0.020 (restart cumsum next row)
3 49.928 -0.017 -0.017 (restart cumsum next row)
4 49.925 -0.003 -0.003
5 49.935 0.010 0.007
6 49.938 0.003 0.010 (restart cumsum next row)
7 49.942 0.004 0.004
8 49.931 -0.011 -0.007
9 49.952 0.021 0.014 (restart cumsum next row)
So the actual objective is to restart the cumulative sum whenever its absolute value exceeds 0.009.

I couldn't solve this in a vectorized manner; however, applying a stateful function appears to work.
import pandas as pd

print(pd.__version__)

df = pd.DataFrame({'values': (49.925, 49.928, 49.945, 49.928, 49.925, 49.935, 49.938, 49.942, 49.931, 49.952)})
df['a'] = df['values'].diff()

accumulator = 0.0
reset = False

def myfunc(x):
    global accumulator, reset
    if reset:
        # the previous value exceeded the threshold, so restart here
        accumulator = 0.0
        reset = False
    accumulator += x
    if abs(accumulator) > .009:
        reset = True
    return accumulator

df['a'] = df['a'].fillna(0)  # the first diff is NaN
df['b'] = df['a'].apply(myfunc)
print(df)
Produces
0.24.2
values a b
0 49.925 0.000 0.000
1 49.928 0.003 0.003
2 49.945 0.017 0.020
3 49.928 -0.017 -0.017
4 49.925 -0.003 -0.003
5 49.935 0.010 0.007
6 49.938 0.003 0.010
7 49.942 0.004 0.004
8 49.931 -0.011 -0.007
9 49.952 0.021 0.014
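If the module-level globals feel fragile, the same stateful logic can be wrapped in a closure instead. A minimal sketch, equivalent in behaviour to myfunc above (make_accumulator is a hypothetical helper name, not from the original answer):

def make_accumulator(threshold=0.009):
    state = {'acc': 0.0}
    def step(x):
        # accumulate, keep the value to return, and reset
        # immediately if the threshold is exceeded
        state['acc'] += x
        out = state['acc']
        if abs(out) > threshold:
            state['acc'] = 0.0
        return out
    return step

df['b'] = df['a'].fillna(0).apply(make_accumulator())

Each call to make_accumulator() yields a fresh accumulator, so it can be reused safely on several columns.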

Related

How to remove rows so that the values in a column match a sequence

I'm looking for a more efficient method to deal with the following problem. I have a DataFrame with a column filled with values that range from 1 to 4, and I need to remove all the rows that do not follow the sequence (1-2-3-4-1-2-3-...).
This is what I have:
A B
12/2/2022 0.02 2
14/2/2022 0.01 1
15/2/2022 0.04 4
16/2/2022 -0.02 3
18/2/2022 -0.01 2
20/2/2022 0.04 1
21/2/2022 0.02 3
22/2/2022 -0.01 1
24/2/2022 0.04 4
26/2/2022 -0.02 2
27/2/2022 0.01 3
28/2/2022 0.04 1
01/3/2022 -0.02 3
03/3/2022 -0.01 2
05/3/2022 0.04 1
06/3/2022 0.02 3
08/3/2022 -0.01 1
10/3/2022 0.04 4
12/3/2022 -0.02 2
13/3/2022 0.01 3
15/3/2022 0.04 1
...
This is what I need:
A B
14/2/2022 0.01 1
18/2/2022 -0.01 2
21/2/2022 0.02 3
24/2/2022 0.04 4
28/2/2022 0.04 1
03/3/2022 -0.01 2
06/3/2022 0.02 3
10/3/2022 0.04 4
15/3/2022 0.04 1
...
Since the data frame is quite big I need some sort of NumPy-based operation to accomplish this, the more efficient the better. My solution is very ugly and inefficient; basically, I made four loops like the following to check every step of the sequence (4-1, 1-2, 2-3, 3-4):
df_len = len(df)
df_len2 = 0
while df_len != df_len2:
    df_len = len(df)
    df.loc[(df.B.shift(1) == 4) & (df.B != 1), 'B'] = 0
    df = df[df['B'] != 0]
    df_len2 = len(df)
By means of itertools.cycle (to define a cycled range) and an assignment expression (Python 3.8+):
from itertools import cycle
c_rng = cycle(range(1, 5)) # cycled range
start = next(c_rng) # starting point
df[[(v == start) and bool(start := next(c_rng)) for v in df.B]]
A B
14/2/2022 0.01 1
18/2/2022 -0.01 2
21/2/2022 0.02 3
24/2/2022 0.04 4
28/2/2022 0.04 1
03/3/2022 -0.01 2
06/3/2022 0.02 3
10/3/2022 0.04 4
15/3/2022 0.04 1
A simple improvement to speed this up is to not touch the dataframe within the loop, but just iterate over the values of B to construct a Boolean index, like this:
is_in_sequence = []
next_target = 1
for b in df.B:
    if b == next_target:
        is_in_sequence.append(True)
        next_target = next_target % 4 + 1
    else:
        is_in_sequence.append(False)
print(df[is_in_sequence])
A B
14/2/2022 0.01 1
18/2/2022 -0.01 2
21/2/2022 0.02 3
24/2/2022 0.04 4
28/2/2022 0.04 1
03/3/2022 -0.01 2
06/3/2022 0.02 3
10/3/2022 0.04 4
15/3/2022 0.04 1
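If plain-Python iteration over B is still too slow on a very large frame, the same mask construction can be compiled. A sketch assuming numba is installed (numba is my addition, not part of the answers above):

import numpy as np
from numba import njit

@njit
def sequence_mask(b):
    # build the Boolean keep-mask in compiled code
    mask = np.empty(len(b), dtype=np.bool_)
    target = 1
    for i in range(len(b)):
        if b[i] == target:
            mask[i] = True
            target = target % 4 + 1
        else:
            mask[i] = False
    return mask

print(df[sequence_mask(df.B.to_numpy())])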

Find first and last positive value of every season over 50 years

I've seen some similar questions but can't figure out how to handle my problem.
I have a dataset with everyday total snow values from 1970 until 2015.
Now I want to find out when the first and the last day with snow occurred,
and I want to do this for every season.
One season runs, for example, from 01.06.2000 to 30.5.2001; that season is then season 2000/2001.
I have already set my date column as the index (format year-month-day, e.g. 2006-04-24).
When I select a specific range with
df_s = df["2006-04-04" : "2006-04-15"]
I am able to find the first and last day with snow in this period with
firstsnow = df_s[df_s['Height'] > 0].head(1)
lastsnow = df_s[df_s['Height'] > 0].tail(1)
I now want to do this for the whole dataset, so that I can compare the seasons and see how the time of first snow has changed.
My DataFrame looks like this (here you see a selected period with values); Height is the snow height, Diff is the difference to the previous day. Height and Diff are float64.
Height Diff
Date
2006-04-04 0.000 NaN
2006-04-05 0.000 0.000
2006-04-06 0.000 0.000
2006-04-07 16.000 16.000
2006-04-08 6.000 -10.000
2006-04-09 0.001 -5.999
2006-04-10 0.000 -0.001
2006-04-11 0.000 0.000
2006-04-12 0.000 0.000
2006-04-13 0.000 0.000
2006-04-14 0.000 0.000
2006-04-15 0.000 0.000
(12, 2)
<class 'pandas.core.frame.DataFrame'>
I think I have to work with the groupby function, but I don't know how to apply it in this case.
You can use the trick of creating a new column that keeps only the positive values (and None otherwise), then use ffill and bfill to get the head and tail.
Sample data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['a1','a2','a3','a4','a5','b1','b2','b3','b4','b5'],
                   'gr': [1]*5 + [2]*5,
                   'val1': [None,-1,2,1,None,-1,4,7,3,-2]})
Input:
name gr val1
0 a1 1 NaN
1 a2 1 -1.0
2 a3 1 2.0
3 a4 1 1.0
4 a5 1 NaN
5 b1 2 -1.0
6 b2 2 4.0
7 b3 2 7.0
8 b4 2 3.0
9 b5 2 -2.0
Set the positive column, then ffill and bfill within each group:
df['positive'] = np.where(df['val1'] > 0, df['val1'], None)
df['positive'] = df.groupby('gr')['positive'].ffill()
df['positive'] = df.groupby('gr')['positive'].bfill()
Check result:
df.groupby('gr').head(1)
df.groupby('gr').tail(1)
name gr val1 positive
0 a1 1 NaN 2.0
5 b1 2 -1.0 4.0
name gr val1 positive
4 a5 1 NaN 1.0
9 b5 2 -2.0 3.0
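Translating that trick back to the seasonal question itself: a sketch, assuming the DatetimeIndex and Height column from the post, that labels each day with the starting year of its June-to-May season and then takes the first and last snowy day per season:

import numpy as np

snow = df[df['Height'] > 0]                  # keep only days with snow
# season label = the year of the season's June 1st start
season = np.where(snow.index.month >= 6, snow.index.year, snow.index.year - 1)
firstsnow = snow.groupby(season).head(1)     # first snowy day per season
lastsnow = snow.groupby(season).tail(1)      # last snowy day per season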

Comparison between two dataframes to find the highest difference

I have two dataframes, df1 and df2, both indexed by [i_batch, i_example].
The columns are different RMSE errors. I would like to find the [i_batch, i_example] pairs where df1 is much lower than df2, or the rows where df1 has less error than df2, based on the common [i_batch, i_example].
Note that a given [i_batch, i_example] may occur in only one of df1 and df2, but I need to consider only the [i_batch, i_example] pairs that exist in both.
df1 =
rmse_ACCELERATION rmse_CENTER_X rmse_CENTER_Y rmse_HEADING rmse_LENGTH rmse_TURN_RATE rmse_VELOCITY rmse_WIDTH
i_batch i_example
0 0.0 1.064 1.018 0.995 0.991 1.190 0.967 1.029 1.532
1 0.0 1.199 1.030 1.007 1.048 1.278 0.967 1.156 1.468
1.0 1.101 1.026 1.114 2.762 0.967 0.967 1.083 1.186
2 0.0 1.681 1.113 1.090 1.001 1.670 0.967 1.205 1.160
1.0 1.637 1.122 1.183 0.987 1.521 0.967 1.191 1.278
2.0 1.252 1.035 1.035 2.507 1.108 0.967 1.210 1.595
3 0.0 1.232 1.014 1.019 1.627 1.143 0.967 1.080 1.583
1.0 1.195 1.028 1.019 1.151 1.097 0.967 1.071 1.549
2.0 1.233 1.010 1.004 1.616 1.135 0.967 1.082 1.573
3.0 1.179 1.017 1.014 1.368 1.132 0.967 1.099 1.518
and
df2 =
rmse_ACCELERATION rmse_CENTER_X rmse_CENTER_Y rmse_HEADING rmse_LENGTH rmse_TURN_RATE rmse_VELOCITY rmse_WIDTH
i_batch i_example
1 0.0 0.071 0.034 0.048 0.114 0.006 1.309e-03 0.461 0.004
1.0 0.052 0.055 0.062 2.137 0.023 8.232e-04 0.357 0.011
2 0.0 1.665 0.156 0.178 0.112 0.070 3.751e-03 2.326 0.016
1.0 0.880 0.210 0.088 0.055 0.202 1.449e-03 0.899 0.047
2.0 0.199 0.072 0.078 1.686 0.010 6.240e-04 0.239 0.008
3 0.0 0.332 0.068 0.097 1.211 0.022 5.127e-04 0.167 0.016
1.0 0.252 0.075 0.070 0.368 0.013 5.295e-04 0.136 0.008
2.0 0.268 0.067 0.064 1.026 0.010 5.564e-04 0.175 0.010
3.0 0.171 0.051 0.054 0.473 0.011 4.150e-04 0.220 0.009
5 0.0 0.014 0.099 0.119 0.389 0.123 3.846e-04 0.313 0.037
For instance, how can I get the [i_batch, i_example] where `df1['rmse_ACCELERATION'] < df2['rmse_ACCELERATION']`?
Do a merge and then just filter according to your needs. Merging on the indices defaults to an inner join, so only [i_batch, i_example] pairs present in both frames survive:
df_merge = df_1.merge(df_2,
                      left_index=True,
                      right_index=True,
                      suffixes=('_1', '_2'))
df_merge[
    df_merge['rmse_ACCELERATION_1'] < df_merge['rmse_ACCELERATION_2']
].index
However, I don't see any records with the same [i_batch, i_example] in both dataframes that pass the condition.
Use .sub(), which aligns on the indices and subtracts the matches; pairs missing from either frame come out as NaN and fail the comparison, so only common pairs can pass:
df3 = df1.sub(df2)
df3[(df3 < 0).any(axis=1)]
Or go specific and search in df1 directly with:
df1[(df1.sub(df2) < 0).any(axis=1)]
rmse_ACCELERATION rmse_CENTER_X rmse_CENTER_Y \
i_batch i_example
2 0.0 0.016 0.957 0.912
rmse_HEADING rmse_LENGTH rmse_TURN_RATE rmse_VELOCITY \
i_batch i_example
2 0.0 0.889 1.6 0.963249 -1.121
rmse_WIDTH
i_batch i_example
2 0.0 1.144
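For the "highest difference" part of the title, the same alignment idea extends naturally. A sketch (my addition, not from the original answers) that restricts to the common [i_batch, i_example] pairs and ranks by how much lower df1 is:

diff = df2.sub(df1).dropna()   # positive where df1 is lower; non-common pairs drop out
print(diff['rmse_ACCELERATION'].sort_values(ascending=False).head())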

Fill missing values in DataFrame

I have a dataframe that is missing either two values in two columns, or one value in one column.
Date 30 45 60 90
0 2004-01-02 0.88 0.0 0.0 0.93
1 2004-01-05 0.88 0.0 0.0 0.91
...
20 2019-12-24 1.55 0 1.58 1.58
21 2019-12-26 1.59 0 1.60 1.58
I would like to compute all the zero values in the dataframe by some simple linear method. Here is the thing: if there is a value in the 60 column, use the average of the 60 and the 30 for the 45; otherwise use some simple method to compute both the 45 and the 60.
What is the pandas way to do this? [Prefer no loops]
EDIT 1
As per the suggestions in the comment, I tried
df.replace(0, np.nan, inplace=True)
df=df.interpolate(method='linear', limit_direction='forward', axis=0)
But the df still contains all the np.nan values.
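One thing worth checking, though this is my assumption rather than something stated in the post: with axis=0 the interpolation runs down each column, but the description ("use the average of the 60 and the 30 for the 45") points at interpolating across the maturity columns of each row, i.e. axis=1:

import numpy as np

cols = ['30', '45', '60', '90']   # assumes string column labels; use ints if stored that way
df[cols] = df[cols].replace(0, np.nan)
# interpolate across each row; with equally spaced positions the 45
# column becomes the average of its 30 and 60 neighbours
df[cols] = df[cols].interpolate(method='linear', axis=1, limit_direction='both')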

Kuwahara filter with performance issues

While implementing an edge-preserving filter similar to ImageJ's Kuwahara filter, which assigns each pixel the mean of the surrounding area with the smallest deviation, I'm struggling with performance issues.
Counterintuitively, calculating the means and deviations into separate matrices is fast compared to the final resorting step that compiles the output array. The ImageJ implementation referenced above also seems to spend about 70% of its total processing time on this step, though.
Given two arrays, means and stds, whose sizes are two kernel sizes p bigger than the output array res in each axis, I want to assign each pixel the mean of the area with the smallest deviation:
# vector to the middle of the surrounding area (approx.);
# integer division keeps the offsets usable as array indices
p2 = p // 2
# x and y components of the offset vector to each quadrant
index2quadrant = np.array([[1, 1, -1, -1], [1, -1, 1, -1]]) * p2
Iterate over all pixels of the output array of shape (asize, asize):
for col in np.arange(asize) + p:
    for row in np.arange(asize) + p:
Inside the loop, search for the minimum standard deviation in the four quadrants around the current coordinate, and use the corresponding index to assign the previously computed mean:
        minidx = np.argmin(stds[index2quadrant[0] + col, index2quadrant[1] + row])
        # assign the mean of the quadrant with the smallest deviation
        res[col - p, row - p] = means[index2quadrant[:, minidx][0] + col,
                                      index2quadrant[:, minidx][1] + row]
The Python profiler gives the following results for filtering a 1024x1024 array with an 8x8 pixel kernel:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 30.024 30.024 <string>:1(<module>)
1048576 2.459 0.000 4.832 0.000 fromnumeric.py:740(argmin)
1 23.720 23.720 30.024 30.024 kuwahara.py:4(kuwahara)
2 0.000 0.000 0.012 0.006 numeric.py:65(zeros_like)
2 0.000 0.000 0.000 0.000 {math.log}
1048576 2.373 0.000 2.373 0.000 {method 'argmin' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.012 0.006 0.012 0.006 {method 'fill' of 'numpy.ndarray' objects}
8256 0.189 0.000 0.189 0.000 {method 'mean' of 'numpy.ndarray' objects}
16512 0.524 0.000 0.524 0.000 {method 'reshape' of 'numpy.ndarray' objects}
8256 0.730 0.000 0.730 0.000 {method 'std' of 'numpy.ndarray' objects}
1042 0.012 0.000 0.012 0.000 {numpy.core.multiarray.arange}
1 0.000 0.000 0.000 0.000 {numpy.core.multiarray.array}
2 0.000 0.000 0.000 0.000 {numpy.core.multiarray.empty_like}
2 0.003 0.002 0.003 0.002 {numpy.core.multiarray.zeros}
8 0.002 0.000 0.002 0.000 {zip}
To me this gives little indication of where the time is lost (in NumPy?), since apart from argmin the listed totals seem negligible.
Do you have any suggestions on how to improve performance?
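One way to attack this (a sketch of a standard NumPy vectorization, not from the original post): precompute the four quadrant stds and means for every output pixel at once, then select per pixel with a single argmin over the quadrant axis. This trades memory (four full-size planes each for means and stds) for removing the million-iteration Python loop:

import numpy as np

# assumes means, stds, asize, p, p2 and index2quadrant as defined above
cols = np.arange(asize) + p
rows = np.arange(asize) + p
cc, rr = np.meshgrid(cols, rows, indexing='ij')

# quadrant planes: shape (4, asize, asize)
q_std = np.stack([stds[cc + dx, rr + dy]
                  for dx, dy in zip(index2quadrant[0], index2quadrant[1])])
q_mean = np.stack([means[cc + dx, rr + dy]
                   for dx, dy in zip(index2quadrant[0], index2quadrant[1])])

# per pixel, pick the mean of the quadrant with the smallest deviation
minidx = np.argmin(q_std, axis=0)
res = np.take_along_axis(q_mean, minidx[None, ...], axis=0)[0]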