Pandas rolling function with specific numeric span? - pandas

As of Pandas 0.18.0, it is possible to have a variable rolling window size for time-series by specifying a time span. For example, the code for summation over a 2-second window in dataframe dft looks like this:
dft.rolling('2s').sum()
Is it possible to do the same with non-datetime spans?
For example, given a dataframe that looks like this:
    A   B
0   1   1
1   2   2
2   3   3
3   5   5
4   6   6
5   7   7
6  10  10
Is it possible to specify a window span of say 3 on column 'A' and have the sum of column 'B' calculated, so that the output looks something like:
    A    B
0   1  NaN
1   2  NaN
2   3    5
3   5   10
4   6   14
5   7   18
6  10   17

Not with rolling(). See the documentation for the window argument:
[A variable-sized window] is only valid for datetimelike indexes.
Full text:
window : int, or offset
Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size.
If it's an offset then this will be the time period of each window. Each window will be of variable size based on the observations included in the time period. This is only valid for datetimelike indexes.
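One possible way to sidestep the restriction (a sketch, not part of the original answer: it assumes the span column is monotonic, can be treated as a count of seconds, and that your pandas is new enough to support closed=) is to fake a datetimelike index with to_timedelta so that the offset form of rolling() applies:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 5, 6, 7, 10],
                   'B': [1, 2, 3, 5, 6, 7, 10]})

# pretend the numeric span column 'A' is a number of seconds
t = df.set_index(pd.to_timedelta(df['A'], unit='s'))
# closed='both' keeps both window edges, i.e. rows with A in [a - 3, a]
print(t['B'].rolling('3s', closed='both').sum())

Note that min_periods defaults to 1 for offset windows, so the leading rows come out as numbers rather than NaN.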

Here's a workaround if you're interested.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(10),
                   'B': np.arange(10, 20)},
                  index=[1, 2, 3, 5, 8, 9, 11, 14, 19, 20])
def var_window(df, size, min_periods=None):
    """Operates on the index."""
    result = []
    df = df.sort_index()
    for i in df.index:
        # trailing window over index labels: [i - size + 1, i]
        start = i - size + 1
        res = df.loc[start:i].sum().tolist()
        result.append(res)
    result = pd.DataFrame(result, index=df.index)
    if min_periods:
        # null out rows whose index label falls below min_periods
        result.loc[:min_periods - 1] = np.nan
    return result
print(var_window(df, size=3, min_periods=3))
       0     1
1    NaN   NaN
2    NaN   NaN
3    3.0  33.0
5    5.0  25.0
8    4.0  14.0
9    9.0  29.0
11  11.0  31.0
14   7.0  17.0
19   8.0  18.0
20  17.0  37.0
Explanation: loop through the index. At each value, truncate the DataFrame to the trailing window size. Here 'size' is not a count, but rather a range as you have defined it.
In the above, at the index value of 8, you're summing the values of A for which the index is 6, 7, or 8 (i.e. >= 8 - 3 + 1). The only index value that falls within that range is 8, so the sum is simply the value from the original frame. Comparatively, for the index value of 11, the sum will include the values for 9 and 11 (5 + 6 = 11, the resulting sum for A).
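To see the label-based slice at a single step (a tiny illustration using the frame defined above):

# at i = 8 with size = 3: start = 8 - 3 + 1 = 6, and df.loc[6:8]
# keeps only the row labelled 8, since labels 6 and 7 are absent
print(df.loc[6:8])
#    A   B
# 8  4  14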
Compare this with standard rolling ops:
print(df.rolling(window=3).sum())
       A     B
1    NaN   NaN
2    NaN   NaN
3    3.0  33.0
5    6.0  36.0
8    9.0  39.0
9   12.0  42.0
11  15.0  45.0
14  18.0  48.0
19  21.0  51.0
20  24.0  54.0
If I'm misinterpreting your question, let me know how. This approach is admittedly significantly slower:
%timeit df.rolling(window=3).sum()
1000 loops, best of 3: 627 µs per loop
%timeit var_window(df, size=3, min_periods=3)
100 loops, best of 3: 3.59 ms per loop
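If speed matters, the loop can be vectorized: since the index is sorted, each window's start position can be found with numpy.searchsorted, and the window sums recovered from a cumulative sum. A sketch with the same semantics as var_window (var_window_fast is a hypothetical name, not part of the original answer):

import numpy as np
import pandas as pd

def var_window_fast(df, size, min_periods=None):
    df = df.sort_index()
    idx = df.index.to_numpy()
    csum = df.to_numpy(dtype=float).cumsum(axis=0)
    # position of the first row whose label is >= label - size + 1
    start = np.searchsorted(idx, idx - size + 1, side='left')
    prev = np.zeros_like(csum)
    prev[start > 0] = csum[start[start > 0] - 1]
    out = pd.DataFrame(csum - prev, index=df.index)
    if min_periods:
        out.loc[:min_periods - 1] = np.nan
    return out

print(var_window_fast(df, size=3, min_periods=3))  # same table as above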

Related

Random sampling from a dataframe

I want to generate a 2x6 dataframe which represents a rack. Half of this dataframe is filled with storage items and the other half with retrieval items.
What I want to do is randomly choose half of these 12 items and say that they are storage and the others are retrieval.
How can I choose randomly?
I tried random.sample, but this chooses random columns. Actually I want to choose random items individually.
Assuming this input:
   0  1  2  3   4   5
0  0  1  2  3   4   5
1  6  7  8  9  10  11
You can craft a random numpy array to select/mask half of the values:
import numpy as np

a = np.repeat([True, False], df.size // 2)  # half True, half False
np.random.shuffle(a)                        # random order
a = a.reshape(df.shape)                     # same shape as the frame
Then select your two groups:
df.mask(a)
     0    1    2    3   4     5
0  NaN  NaN  NaN  3.0   4   NaN
1  6.0  NaN  8.0  NaN  10  11.0
df.where(a)
     0  1    2    3    4    5
0  0.0  1  2.0  NaN  NaN  5.0
1  NaN  7  NaN  9.0  NaN  NaN
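If the end goal is to tag each slot rather than extract its value, the same mask can be turned into a frame of labels (a sketch; the 'storage'/'retrieval' names are taken from the question):

import numpy as np
import pandas as pd

# True slots become storage, False slots retrieval
labels = pd.DataFrame(np.where(a, 'storage', 'retrieval'),
                      index=df.index, columns=df.columns)
print(labels)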
If you simply want 6 random elements, use numpy.random.choice:
np.random.choice(df.to_numpy().ravel(), 6, replace=False)
Example:
array([ 4, 5, 11, 7, 8, 3])

Pandas: Get rolling mean with an add operation in between

My Pandas df is like:
ID  delta  price
 1   -2.0    4
 2    2.0    5
 3   -3.0    3
 4    0.8  NaN
 5    0.9  NaN
 6   -2.3  NaN
 7    2.8  NaN
 8    1.0  NaN
 9    1.0  NaN
10    1.0  NaN
11    1.0  NaN
12    1.0  NaN
Pandas already has a robust built-in mean calculation method. I need to use it slightly differently.
In my df, the price at row 4 would be the sum of (a) the rolling mean of price over rows 1, 2, 3 and (b) the delta at row 4.
Once this is computed, I would move to row 5: (a) the rolling mean of price over rows 2, 3, 4 plus (b) the delta at row 5 gives the price at row 5, and so on.
I can iterate over rows to get this, but my actual dataframe is quite big and iterating over rows would slow things down. Any better way to achieve this?
I do not think pandas has a method that can use the previously calculated value in the next calculation, so we fill the missing rows one at a time:
n = 3
for x in df.index[df.price.isna()]:
    # sum of the previous n prices (price at x is still NaN, so it adds nothing)
    # plus the current delta, averaged over the 4 values involved
    df.loc[x, 'price'] = (df.loc[x - n:x, 'price'].sum() + df.loc[x, 'delta']) / 4
df
Out[150]:
    ID  delta     price
0    1   -2.0  4.000000
1    2    2.0  5.000000
2    3   -3.0  3.000000
3    4    0.8  3.200000
4    5    0.9  3.025000
5    6   -2.3  1.731250
6    7    2.8  2.689062
7    8    1.0  2.111328
8    9    1.0  1.882910
9   10    1.0  1.920825
10  11    1.0  1.728766
11  12    1.0  1.633125
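The recurrence is inherently sequential (each filled price depends on previously filled prices), so it cannot be fully vectorized; but the per-row cost can be cut by looping over plain numpy arrays instead of indexing into the DataFrame on every step. A sketch of the same calculation (it assumes, as above, that the first n prices are known, and relies on the default RangeIndex so positions line up with the labels used above):

import numpy as np

n = 3
price = df['price'].to_numpy(dtype=float, copy=True)
delta = df['delta'].to_numpy(dtype=float)
for i in range(n, len(price)):
    if np.isnan(price[i]):
        # previous n prices plus the current delta, averaged over n + 1 values
        price[i] = (price[i - n:i].sum() + delta[i]) / (n + 1)
df['price'] = price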

if (columnArow1 = columnArow2, columnBrow2, "") Excel if(logic_test, [value_if_true], [value_if_false]) how can I write this in Python?

I would like to translate some Excel logic into Python (pandas).
I have filtered with df.loc[df.Activity_Mailbox.isnull()]; now the NaN values must be calculated using
if (columnArow1 = columnArow2, columnBrow2, "")
This formula is written in Excel terms.
Please provide some demo data next time, like in your other question :-)
If I understand you correctly, your data looks like:
df = pd.DataFrame({"A":[1,2,3,np.nan,5,np.nan],
"B":[10,11,12,13,14,15]})
df
     A   B
0  1.0  10
1  2.0  11
2  3.0  12
3  NaN  13
4  5.0  14
5  NaN  15
And now you want to fill the NaN values with the value from the other column. This can easily be done with:
df["A"] = df["A"].fillna(df["B"])
Output:
df
      A   B
0   1.0  10
1   2.0  11
2   3.0  12
3  13.0  13
4   5.0  14
5  15.0  15
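For reference, a more literal translation of Excel's if(logic_test, [value_if_true], [value_if_false]) is numpy.where. A sketch of the formula from the title (the shift(-1) is an assumption: comparing column A of one row with column A of the next row, and taking column B from that next row):

import numpy as np

# IF(A_row1 = A_row2, B_row2, "")
df["C"] = np.where(df["A"] == df["A"].shift(-1), df["B"].shift(-1), "")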

pandas aggregating frames by largest common column denominator and filling missing values

I have been struggling with this issue for a bit, and even though I assume there are some workarounds, I would love to know if there is an elegant way to achieve this result:
import pandas as pd
import numpy as np

data = np.array([
    [1, 10],
    [2, 12],
    [4, 13],
    [5, 14],
    [8, 15]])
df1 = pd.DataFrame(data=data, index=range(0, 5), columns=['x', 'a'])

data = np.array([
    [2, 100, 101],
    [3, 120, 122],
    [4, 130, 132],
    [7, 140, 142],
    [9, 150, 151],
    [12, 160, 152]])
df2 = pd.DataFrame(data=data, index=range(0, 6), columns=['x', 'b', 'c'])
Now I would like to have a data frame that concatenates those two and fills the missing values with the previous value, or the first value otherwise. Both data frames can have different sizes; what we are interested in here is the unique column x.
That would be my desired output frame df_result, where x is the aggregated set of unique 'x' values from the two frames:
    x   a    b    c
0   1  10  100  101
1   2  12  100  101
2   3  12  120  122
3   4  13  130  132
4   5  14  130  132
5   7  14  140  142
6   8  15  140  142
7   9  15  150  151
8  12  15  160  152
Any help or hint would be much appreciated, thank you very much.
You can simply use a merge operation on the two dataframes, then apply a sort, a forward fill, and a backward fill for the null values:
df1.merge(df2,on='x',how='outer').sort_values('x').ffill().bfill()
Out:
    x     a      b      c
0   1  10.0  100.0  101.0
1   2  12.0  100.0  101.0
5   3  12.0  120.0  122.0
2   4  13.0  130.0  132.0
3   5  14.0  130.0  132.0
6   7  14.0  140.0  142.0
4   8  15.0  140.0  142.0
7   9  15.0  150.0  151.0
8  12  15.0  160.0  152.0
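To also get the clean 0..n index from the desired output (the merge result keeps the pre-sort row labels, as visible above), tack a reset_index onto the same pipeline:

df_result = (df1.merge(df2, on='x', how='outer')
                .sort_values('x')
                .ffill()
                .bfill()
                .reset_index(drop=True))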

Interpolate proportionally with duplicate index

I have a table like
df = pd.DataFrame([1, np.nan, 3, 1, np.nan, 3, 50, np.nan, 52], index=[7, 8, 9, 7, 12, 27, 7, 8, 9])
index  values
7      1
8      NaN
9      3
7      1
12     NaN
27     3
7      50
8      NaN
9      52
Rows are correctly sorted. However, index here is not ordered, and has duplicates by design.
How to interpolate values here proportionally to index (method="index")?
If I try to interpolate using the index, the resulting Series is messed up because of the duplicate index:
df.interpolate(method='index'):
index  values  desired  actual
7      1       1        1
8      NaN     2        2
9      3       3        3
7      1       1        1
12     NaN     1.5      52     <-- wat
27     3       3        3
7      50      50       50
8      NaN     51       1.1    <-- wat
9      52      52       52
If not reproducible: Pandas 0.23.3, Numpy: 1.14.5, Python: 3.6.5
Try grouping the dataframe based on the index:
df.groupby(df.index.to_series().diff().lt(0).cumsum())\
  .apply(lambda x: x.interpolate(method='index'))
Output:
0
7 1.0
8 2.0
9 3.0
7 1.0
12 1.5
27 3.0
7 50.0
8 51.0
9 52.0
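To see why the grouping key works: diff().lt(0) flags every position where the index value drops, and cumsum() turns those flags into a running group number, so each monotonic run of the index becomes its own group:

print(df.index.to_series().diff().lt(0).cumsum().tolist())
# [0, 0, 0, 1, 1, 1, 2, 2, 2]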
A more complicated way, if you have a situation like the one mentioned above in Scott's comment:
# note: this assumes the index has been moved into a column, e.g. via
# df = df.reset_index(), with columns named 'index' and 'values'
np.where(df['values'].isnull(),
         df['values'].shift()
         + (df['values'].shift(-1) - df['values'].shift())
         * (df['index'] - df['index'].shift())
         / (df['index'].shift(-1) - df['index'].shift()),
         df['values'])
Out[219]: array([ 1. , 2. , 3. , 1. , 1.5, 3. , 50. , 51. , 52. ])
This checks the distance of each null value from the two valid values around it, and fills the value proportionally to the index difference.
Tolerance: only one missing value between two valid values.