rolling windows defined by backward cumulative sums - pandas

I have got a pandas DataFrame like this:
A B
0 3 ...
1 2
2 4
3 4
4 1
5 7
6 5
7 3
I would like to compute a rolling window along column A that sums its elements backwards from the current row until the sum reaches at least 10. The resulting windows should be:
A B window_indices
0 3 ... NA
1 2 NA
2 4 NA
3 4 --> [3,2,1]
4 1 [4,3,2,1]
5 7 [5,4,3]
6 5 [6,5]
7 3 [7,6,5]
Next, I want to compute some statistics on column B, something like that:
df.my_rolling(on='A', func='sum', threshold=10).B.mean()
I have got an idea: we could think of the elements of column A as seconds. Transform A in a datetime column and perform a standard rolling on it. But I don't know how to do that.

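This cannot be done with rolling, since the rolling window size is not fixed. A list comprehension works instead (it assumes the default integer index and that numpy is imported as np):
l = [[df.index[(df.A.loc[:x].iloc[::-1].cumsum()>=10).idxmax():x+1].tolist()[::-1]
      if (df.A.loc[:x].sum()>=10) else np.nan] for x in df.A.index]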
Out[46]:
[[nan],
[nan],
[nan],
[[3, 2, 1]],
[[4, 3, 2, 1]],
[[5, 4, 3]],
[[6, 5]],
[[7, 6, 5]]]
df['new'] = l
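For the statistics on column B, the windows can also be found without a Python-level loop. The following is only a sketch under some assumptions: column A is non-negative (so its cumulative sum never decreases), the index is the default one, and the B values are made up for illustration because the question elides them. The names `starts` and `B_mean` are hypothetical.

```python
import numpy as np
import pandas as pd

# A is taken from the question; B is invented for illustration only.
df = pd.DataFrame({"A": [3, 2, 4, 4, 1, 7, 5, 3],
                   "B": [10, 20, 30, 40, 50, 60, 70, 80]})
threshold = 10

a = df["A"].to_numpy()
# csum[k] is the sum of the first k elements of A, so csum[0] == 0.
csum = np.concatenate(([0], np.cumsum(a)))

# For row i the window starts at the largest j with
# csum[i + 1] - csum[j] >= threshold; -1 means the threshold is never reached.
starts = np.searchsorted(csum, csum[1:] - threshold, side="right") - 1

# Window mean of B computed from prefix sums of B.
b_csum = np.concatenate(([0], np.cumsum(df["B"].to_numpy())))
ends = np.arange(len(df)) + 1
valid = starts >= 0
safe_starts = np.where(valid, starts, 0)
df["B_mean"] = np.where(valid,
                        (b_csum[ends] - b_csum[safe_starts]) / (ends - safe_starts),
                        np.nan)
print(df)
```

With the sample A, `starts` marks rows 0 to 2 as having no window and reproduces the windows [3,2,1], [4,3,2,1], [5,4,3], [6,5] and [7,6,5] for the remaining rows.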

Related

How to do a conditional rolling mean in Pandas?

I have this data frame available. It has a timestamp for start, a timestamp for end and a duration column.
start end duration
1 5 4
2 5 3
3 4 1
4 6 2
5 9 4
6 7 1
7 10 3
I'd like to add a column 'rolling_mean' to the dataframe that calculates a rolling average on all previous rows (ordered by start) with this condition: only previous rows can be used for mean calculation where the event has already ended (so end date should be equal to or lower than the start date of the row for which the rolling mean is being calculated). So for row number 4, the rolling_mean is 1 because we look at all previous rows and only the previous one fulfills the condition of the event having ended.
This is the dataframe I'd like to get with a Pandas rolling mean:
start end duration rolling_mean
1 5 4 NaN
2 5 3 NaN
3 4 1 NaN
4 6 2 1
5 9 4 2.666667
6 7 1 2.500000
7 10 3 2.200000
Here is the code to reproduce my example:
d = [[1, 5],
[2, 5],
[3, 4],
[4, 6],
[5, 9],
[6, 7],
[7, 10]]
df = pd.DataFrame(d, columns=['start_time', 'end_time'])
df['duration'] = df.end_time - df.start_time
I've tried to merge the dataframe with itself to then filter out the irrelevant rows, but the data frame is too big to take this approach.
So I'm looking for a rolling mean but where I can specify the extra condition.
Does anyone have any ideas for this one?
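A for loop will do the job (with numpy imported as np):
rolling_mean = np.full(len(df), np.nan)
start, end, duration = df[["start_time", "end_time", "duration"]].to_numpy().T
for i in range(len(df)):
    # durations of earlier rows whose event has ended by the time row i starts
    matches = duration[:i][end[:i] <= start[i]]
    # test for emptiness; .any() would wrongly skip matches whose durations are all zero
    if matches.size:
        rolling_mean[i] = matches.mean()
df["rolling_mean"] = rolling_mean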
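Since the data frame is described as large, a vectorized variant may be worth trying. This is a sketch, not a drop-in replacement: it assumes every event ends strictly after it starts and that start times are strictly increasing, so any event that has ended by a row's start is automatically an earlier row. The names `n_ended` and `dur_csum` are hypothetical.

```python
import numpy as np
import pandas as pd

d = [[1, 5], [2, 5], [3, 4], [4, 6], [5, 9], [6, 7], [7, 10]]
df = pd.DataFrame(d, columns=["start_time", "end_time"])
df["duration"] = df.end_time - df.start_time

start = df["start_time"].to_numpy()
end = df["end_time"].to_numpy()
duration = df["duration"].to_numpy()

# Sort events by end time and build prefix sums of their durations.
order = np.argsort(end, kind="stable")
end_sorted = end[order]
dur_csum = np.concatenate(([0], np.cumsum(duration[order])))

# Number of events with end_time <= start_time of each row.
n_ended = np.searchsorted(end_sorted, start, side="right")

# Mean duration of the ended events; NaN where nothing has ended yet.
df["rolling_mean"] = np.where(n_ended > 0,
                              dur_csum[n_ended] / np.maximum(n_ended, 1),
                              np.nan)
print(df)
```

The sort costs O(n log n) and each lookup is a binary search, which avoids both the self-merge and the quadratic slicing of the loop.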

Find common values within groupby in pandas Dataframe based on two columns

I have following dataframe:
period symptoms recovery
1 4 2
1 5 2
1 6 2
2 3 1
2 5 2
2 8 4
2 12 6
3 4 2
3 5 2
3 6 3
3 8 5
4 5 2
4 8 4
4 12 6
I'm trying to find which df['period'] groups (1, 2, 3, 4) share the same pair of values in the two columns 'symptoms' and 'recovery'.
Result should be :
symptoms recovery period
5 2 [1, 2, 3, 4]
8 4 [2, 4]
where each pair of values from the two columns is listed with the periods in which it occurs, as a list or a column.
Am I approaching the problem in the wrong way? I appreciate your help.
I tried to turn each period into a dict and loop through it to find values, but that didn't work for me. I also tried to use groupby().apply(), but I'm not getting a meaningful data frame.
Tried sorting values based on 3 columns but couldn't get the common ones between each period section.
Last attempt :
df2 = df[['period', 'how_long', 'days_to_ex']].copy()
#s = df.groupby(["period", "symptoms", "recovery"]).size()
s = df.groupby(["symptoms", "recovery"]).size()
You were almost there:
from io import StringIO
import pandas as pd
# setup sample data
data = StringIO("""
period;symptoms;recovery
1;4;2
1;5;2
1;6;2
2;3;1
2;5;2
2;8;4
2;12;6
3;4;2
3;5;2
3;6;3
3;8;5
4;5;2
4;8;4
4;12;6
""")
df = pd.read_csv(data, sep=";")
# collect unique periods
df.groupby(['symptoms','recovery'])[['period']].agg(list).reset_index()
This gives
symptoms recovery period
0 3 1 [2]
1 4 2 [1, 3]
2 5 2 [1, 2, 3, 4]
3 6 2 [1]
4 6 3 [3]
5 8 4 [2, 4]
6 8 5 [3]
7 12 6 [2, 4]
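The table above also contains pairs that occur in a single period. If "common" is taken to mean pairs that appear in more than one period (an assumption; the expected output in the question does not pin this down), the lists can be filtered by length. A minimal sketch, with `periods` and `common` as hypothetical names:

```python
import pandas as pd

df = pd.DataFrame({
    "period":   [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4],
    "symptoms": [4, 5, 6, 3, 5, 8, 12, 4, 5, 6, 8, 5, 8, 12],
    "recovery": [2, 2, 2, 1, 2, 4, 6, 2, 2, 3, 5, 2, 4, 6],
})

# One sorted list of distinct periods per (symptoms, recovery) pair.
periods = (df.groupby(["symptoms", "recovery"])["period"]
             .agg(lambda s: sorted(set(s)))
             .reset_index())

# Keep only the pairs that occur in at least two periods.
common = periods[periods["period"].map(len) > 1].reset_index(drop=True)
print(common)
```

Raising the length threshold, for example to the total number of periods, would keep only the pairs present in every period.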

Find pattern in pandas dataframe, reorder it row-wise, and reset index

This is a multipart problem. I have found solutions for each separate part, but when I try to combine these solutions, I don't get the outcome I want.
Let's say this is my dataframe:
df = pd.DataFrame(list(zip([1, 3, 6, 7, 7, 8, 4], [6, 7, 7, 9, 5, 3, 1])), columns = ['Values', 'Vals'])
df
Values Vals
0 1 6
1 3 7
2 6 7
3 7 9
4 7 5
5 8 3
6 4 1
Let's say I want to find the pattern [6, 7, 7] in the 'Values' column.
I can use a modified version of the second solution given here:
Pandas: How to find a particular pattern in a dataframe column?
pattern = [6, 7, 7]
pat_i = [df[i-len(pattern):i] # Get the index
for i in range(len(pattern), len(df)) # for each 3 consequent elements
if all(df['Values'][i-len(pattern):i] == pattern)] # if the pattern matched
pat_i
[ Values Vals
2 6 7
3 7 9
4 7 5]
The only way I've found to narrow this down to just index values is the following:
pat_i = [df.index[i-len(pattern):i] # Get the index
for i in range(len(pattern), len(df)) # for each 3 consequent elements
if all(df['Values'][i-len(pattern):i] == pattern)] # if the pattern matched
pat_i
[RangeIndex(start=2, stop=5, step=1)]
Once I've found the pattern, what I want to do, within the original dataframe, is reorder the pattern to [7, 7, 6], moving the entire associated rows as I do this. In other words, going by the index, I want to get output that looks like this:
df.reindex([0, 1, 3, 4, 2, 5, 6])
Values Vals
0 1 6
1 3 7
3 7 9
4 7 5
2 6 7
5 8 3
6 4 1
Then, finally, I want to reset the index so that the values in all the columns stay in their new, re-ordered place:
Values Vals
0 1 6
1 3 7
2 7 9
3 7 5
4 6 7
5 8 3
6 4 1
In order to use pat_i as a basis for re-ordering, I've tried to modify the second solution given here:
Python Pandas: How to move one row to the first row of a Dataframe?
target_row = 2
# Move target row to first element of list.
idx = [target_row] + [i for i in range(len(df)) if i != target_row]
However, I can't figure out how to exploit the pat_i RangeIndex object to use it with this code. The solution, when I find it, will be applied to hundreds of dataframes, each one of which will contain the [6, 7, 7] pattern that needs to be re-ordered in one place, but not the same place in each dataframe.
Any help appreciated...and I'm sure there must be an elegant, pythonic way of doing this, as it seems like it should be a common enough challenge. Thank you.
I just sort of rewrote your code. I held the first and last indexes to the side, reordered the indexes of interest, and put everything together in a new index. Then I just use the new index to reorder the data.
import pandas as pd
df = pd.DataFrame(list(zip([1, 3, 6, 7, 7, 8, 4], [6, 7, 7, 9, 5, 3, 1])), columns = ['Values', 'Vals'])
pattern = [6, 7, 7]
new_order = [1, 2, 0] # new order of pattern
for i in list(df[df['Values'] == pattern[0]].index):
    # compare as plain lists so a too-short slice near the end simply fails to match
    if df['Values'][i:i+len(pattern)].tolist() == pattern:
        pat_i = df[i:i+len(pattern)]
        break # the pattern occurs in one place per dataframe
front_ind = list(range(0, pat_i.index[0]))
back_ind = list(range(pat_i.index[-1]+1, len(df)))
pat_ind = [pat_i.index[k] for k in new_order]
new_ind = front_ind + pat_ind + back_ind
df = df.loc[new_ind].reset_index(drop=True)
df
Out[82]:
Values Vals
0 1 6
1 3 7
2 7 9
3 7 5
4 6 7
5 8 3
6 4 1
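Because the same reordering has to run over hundreds of dataframes, it may help to wrap the steps in a function. The sketch below is one possible way to do that, not the answer's own code: `reorder_pattern` is a hypothetical helper, it reorders only the first occurrence of the pattern, and it returns an unchanged copy when the pattern is absent.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def reorder_pattern(df, column, pattern, new_order):
    """Reorder the rows of the first occurrence of `pattern` in `column`."""
    values = df[column].to_numpy()
    n = len(pattern)
    if len(values) < n:
        return df.copy()
    # Each row of `windows` is one run of n consecutive values.
    windows = sliding_window_view(values, n)
    hits = np.flatnonzero((windows == np.asarray(pattern)).all(axis=1))
    if hits.size == 0:
        return df.copy()
    start = hits[0]
    # Row positions in their new order: only the matched block is permuted.
    positions = np.arange(len(df))
    positions[start:start + n] = start + np.asarray(new_order)
    return df.iloc[positions].reset_index(drop=True)

df = pd.DataFrame({"Values": [1, 3, 6, 7, 7, 8, 4],
                   "Vals": [6, 7, 7, 9, 5, 3, 1]})
result = reorder_pattern(df, "Values", [6, 7, 7], [1, 2, 0])
print(result)
```

Working with positions and `iloc` means the function does not depend on the dataframe having a default integer index.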

Reshaping column values into rows with Identifier column at the end

I have Power measurements from different sensors, i.e. A1_Pin, A2_Pin and so on. These measurements are recorded in a file as columns, and each row has a unique timestamp.
df1 = pd.DataFrame({'DateTime': ['12/12/2019', '12/13/2019', '12/14/2019',
'12/15/2019', '12/16/2019'],
'A1_Pin': [2, 8, 8, 3, 9],
'A2_Pin': [1, 2, 3, 4, 5],
'A3_Pin': [85, 36, 78, 32, 75]})
I want to reshape the table so that each row corresponds to one sensor reading. The last column indicates the sensor ID to which the row data belongs.
The final table should look like:
df2 = pd.DataFrame({'DateTime': ['12/12/2019', '12/12/2019', '12/12/2019',
'12/13/2019', '12/13/2019','12/13/2019', '12/14/2019', '12/14/2019',
'12/14/2019', '12/15/2019','12/15/2019', '12/15/2019', '12/16/2019',
'12/16/2019', '12/16/2019'],
'Power': [2, 1, 85,8, 2, 36, 8,3,78, 3, 4, 32, 9, 5, 75],
'ModID': ['A1_PiN','A2_PiN','A3_PiN','A1_PiN','A2_PiN','A3_PiN',
'A1_PiN','A2_PiN','A3_PiN','A1_PiN','A2_PiN','A3_PiN',
'A1_PiN','A2_PiN','A3_PiN']})
I have tried Groupby, Melt, Reshape, Stack and loops but could not get this to work. Could anyone help? Thanks.
With stack you were on the right track. You need to call set_index first and reset_index afterwards, like this:
df2 = df1.set_index('DateTime').stack().reset_index(name='Power')\
.rename(columns={'level_1':'ModID'}) #to fit the names your expected output
And you get:
print (df2)
DateTime ModID Power
0 12/12/2019 A1_Pin 2
1 12/12/2019 A2_Pin 1
2 12/12/2019 A3_Pin 85
3 12/13/2019 A1_Pin 8
4 12/13/2019 A2_Pin 2
5 12/13/2019 A3_Pin 36
6 12/14/2019 A1_Pin 8
7 12/14/2019 A2_Pin 3
8 12/14/2019 A3_Pin 78
9 12/15/2019 A1_Pin 3
10 12/15/2019 A2_Pin 4
11 12/15/2019 A3_Pin 32
12 12/16/2019 A1_Pin 9
13 12/16/2019 A2_Pin 5
14 12/16/2019 A3_Pin 75
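Since melt was among the things tried, here is a sketch of how it could produce the same rows with the columns in the order the question asks for. It relies on the `ignore_index` parameter of `melt` (available in pandas 1.1 and later) to keep the original row labels, so a stable sort on them restores the date-by-date order.

```python
import pandas as pd

df1 = pd.DataFrame({'DateTime': ['12/12/2019', '12/13/2019', '12/14/2019',
                                 '12/15/2019', '12/16/2019'],
                    'A1_Pin': [2, 8, 8, 3, 9],
                    'A2_Pin': [1, 2, 3, 4, 5],
                    'A3_Pin': [85, 36, 78, 32, 75]})

df2 = (df1.melt(id_vars='DateTime', var_name='ModID', value_name='Power',
                ignore_index=False)      # keep the original row labels
          .sort_index(kind='mergesort')  # stable: sensors stay in column order
          .reset_index(drop=True)
          [['DateTime', 'Power', 'ModID']])
print(df2)
```

The final column selection only reorders the columns so that the sensor ID comes last.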
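I'd try something like this (note that unstack orders the rows by sensor rather than by date, and reset_index leaves the default column names level_0 and 0):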
df1.set_index('DateTime').unstack().reset_index()

Aggregating a time series in Pandas given a window size

Let's say I have this data:
a = pandas.Series([1,2,3,4,5,6,7,8])
a
Out[313]:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
dtype: int64
I would like to aggregate the data by grouping it n rows at a time and summing each group. So if n=2 the new series would look like {3,7,11,15}.
Try this:
In [39]: a.groupby(a.index//2).sum()
Out[39]:
0 3
1 7
2 11
3 15
dtype: int64
In [41]: a.index//2
Out[41]: Int64Index([0, 0, 1, 1, 2, 2, 3, 3], dtype='int64')
And for n=3:
In [42]: n=3
In [43]: a.groupby(a.index//n).sum()
Out[43]:
0 6
1 15
2 15
dtype: int64
In [44]: a.index//n
Out[44]: Int64Index([0, 0, 0, 1, 1, 1, 2, 2], dtype='int64')
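The `a.index//n` trick relies on the series having the default 0, 1, 2, ... index. For a series with any other index, a positional counter can play the same role. A minimal sketch, where the letter index is only an illustrative assumption:

```python
import numpy as np
import pandas as pd

# Same values as above, but with a non-numeric index.
a = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=list("abcdefgh"))
n = 3

# Group label for each row, based on its position rather than its index label.
groups = np.arange(len(a)) // n
sums = a.groupby(groups).sum()
print(sums)
```

The last group simply holds the leftover rows when the length is not a multiple of n.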
You can also use a pandas rolling sum. If n is your interval:
sums = list(a.rolling(n).sum()[n-1::n])
# Optional: also add the sum of the leftover rows of an incomplete final group
rem = len(a)%n
if rem != 0:
    sums.append(a[-rem:].sum())
The first line sums every complete group of n rows. If the length of the data is not a multiple of n, the optional part appends the sum of the remaining rows (whether you want that depends on your preference).
For example, in the above case with n=3, you may want either {6, 15, 15} or just {6, 15}. The code above gives the former; skipping the optional part gives just {6, 15}.