Interpolate proportionally with duplicate index - pandas

I have a table like
df = pd.DataFrame([1, np.nan, 3, 1, np.nan, 3, 50, np.nan, 52],
                  index=[7, 8, 9, 7, 12, 27, 7, 8, 9])
index values
7 1
8 NaN
9 3
7 1
12 NaN
27 3
7 50
8 NaN
9 52
Rows are correctly sorted. However, the index here is not ordered, and it has duplicates by design.
How can I interpolate the values here proportionally to the index (method="index")?
If I try to interpolate using the index, the resulting Series is messed up because of the duplicate index:
df.interpolate(method='index'):
index values desired actual
7 1 1 1
8 NaN 2 2
9 3 3 3
7 1 1 1
12 NaN 1.5 52 <-- wat
27 3 3 3
7 50 50 50
8 NaN 51 1.1 <-- wat
9 52 52 52
In case it's not reproducible: pandas 0.23.3, NumPy 1.14.5, Python 3.6.5.

Try grouping the dataframe based on where the index restarts:
df.groupby(df.index.to_series().diff().lt(0).cumsum())\
  .apply(lambda x: x.interpolate(method='index'))
Output:
0
7 1.0
8 2.0
9 3.0
7 1.0
12 1.5
27 3.0
7 50.0
8 51.0
9 52.0
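To see what the grouper does, inspect the intermediate series (g is my name for it; a new group starts wherever the index value drops):
g = df.index.to_series().diff().lt(0).cumsum()
print(g.tolist())
# [0, 0, 0, 1, 1, 1, 2, 2, 2]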

A more complicated way, if you have a situation like the one mentioned in Scott's comment above:
# assumes the index has been moved into a column,
# e.g. df = df.rename(columns={0: 'values'}).reset_index()
np.where(df['values'].isnull(),
         df['values'].shift() + (df['values'].shift(-1) - df['values'].shift())
         * (df['index'] - df['index'].shift())
         / (df['index'].shift(-1) - df['index'].shift()),
         df['values'])
Out[219]: array([ 1. ,  2. ,  3. ,  1. ,  1.5,  3. , 50. , 51. , 52. ])
This finds each null value between two valid values and fills it proportionally to the difference in index.
Limitation: it only handles a single missing value between two valid values.
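Written out with named intermediates (the variable names are mine), it is just linear interpolation between the neighbouring valid rows, again assuming 'index' and 'values' are ordinary columns:
x0, x1 = df['index'].shift(), df['index'].shift(-1)    # surrounding index values
y0, y1 = df['values'].shift(), df['values'].shift(-1)  # surrounding valid values
filled = y0 + (y1 - y0) * (df['index'] - x0) / (x1 - x0)
df['values'] = df['values'].fillna(filled)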

Related

python rolling product on non-adjacent row

I would like to calculate a rolling product over non-adjacent rows, such as the product of the values in every fifth row, as shown in the photo (the result in a blue cell is the product of the data in the blue cells, etc.).
The best way I can do now is the following;
temp = pd.DataFrame([range(20)]).transpose()
df = temp.copy()
df['shift1'] = temp.shift(5)
df['shift2'] = temp.shift(10)
df['shift3'] = temp.shift(15)
result = df.product(axis=1)
However, this looks cumbersome, as I want to change the row step dynamically.
Can anyone tell me if there is a better way to do this?
Thank you
You can use groupby.cumprod/groupby.prod with the row position modulo 5 as grouper:
import numpy as np
m = pd.Series(np.arange(len(df)) % 5, index=df.index)  # a Series, so .duplicated works below
# option 1: cumulative product within each group
df['result'] = df.groupby(m)['data'].cumprod()
# option 2: only fill the last row of each group
df.loc[~m.duplicated(keep='last'), 'result2'] = df.groupby(m)['data'].cumprod()
# or
# df.loc[~m.duplicated(keep='last'),
#        'result2'] = df.groupby(m)['data'].prod().to_numpy()
output:
data result result2
0 0 0 NaN
1 1 1 NaN
2 2 2 NaN
3 3 3 NaN
4 4 4 NaN
5 5 0 NaN
6 6 6 NaN
7 7 14 NaN
8 8 24 NaN
9 9 36 NaN
10 10 0 NaN
11 11 66 NaN
12 12 168 NaN
13 13 312 NaN
14 14 504 NaN
15 15 0 0.0
16 16 1056 1056.0
17 17 2856 2856.0
18 18 5616 5616.0
19 19 9576 9576.0
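Since you wanted to change the row step dynamically, the grouper parametrizes easily (a small helper of my own, not a pandas built-in; equivalent to option 1 above for step=5):
def rolling_product_step(df, col, step):
    """Cumulative product over every `step`-th row."""
    m = pd.Series(np.arange(len(df)) % step, index=df.index)
    return df.groupby(m)[col].cumprod()

df['result'] = rolling_product_step(df, 'data', 5)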

Random sampling from a dataframe

I want to generate a 2x6 dataframe which represents a rack. Half of this dataframe is filled with storage items and the other half with retrieval items.
What I want to do is randomly choose half of these 12 items and designate them as storage, the others as retrieval.
How can I choose randomly?
I tried random.sample, but this chooses random columns. I actually want to choose random items individually.
Assuming this input:
0 1 2 3 4 5
0 0 1 2 3 4 5
1 6 7 8 9 10 11
You can craft a random numpy array to select/mask half of the values:
a = np.repeat([True, False], df.size // 2)  # exactly half True, half False
np.random.shuffle(a)
a = a.reshape(df.shape)
Then select your two groups:
df.mask(a)
0 1 2 3 4 5
0 NaN NaN NaN 3.0 4 NaN
1 6.0 NaN 8.0 NaN 10 11.0
df.where(a)
0 1 2 3 4 5
0 0.0 1 2.0 NaN NaN 5.0
1 NaN 7 NaN 9.0 NaN NaN
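If you'd rather keep a single frame and just label each cell, the same mask maps straight to the two categories (a sketch; the 'storage'/'retrieval' names come from the question):
labels = pd.DataFrame(np.where(a, 'storage', 'retrieval'),
                      index=df.index, columns=df.columns)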
If you simply want 6 random elements, use numpy.random.choice:
np.random.choice(df.to_numpy().ravel(), 6, replace=False)
Example:
array([ 4, 5, 11, 7, 8, 3])

Pandas: Get rolling mean with an add operation in between

My Pandas df is like:
ID  delta  price
1   -2.0   4.0
2    2.0   5.0
3   -3.0   3.0
4    0.8   NaN
5    0.9   NaN
6   -2.3   NaN
7    2.8   NaN
8    1.0   NaN
9    1.0   NaN
10   1.0   NaN
11   1.0   NaN
12   1.0   NaN
Pandas already has a robust built-in mean calculation method. I need to use it slightly differently.
In my df, the price at row 4 would be the sum of (a) the rolling mean of the price in rows 1, 2, 3 and (b) the delta at row 4.
Once this is computed, I would move to row 5: (a) the rolling mean of the price in rows 2, 3, 4 plus (b) the delta at row 5 gives the price at row 5, and so on.
I can iterate over rows to get this, but my actual dataframe is quite big, and iterating over rows would slow things down. Is there a better way to achieve this?
I do not think pandas has a method that can feed a previously calculated value into the next calculation, so we fill the missing prices with a loop:
n = 3
for x in df.index[df.price.isna()]:
    # .loc slicing is inclusive, so x-n:x spans four rows; the still-NaN price at x is skipped by sum()
    df.loc[x, 'price'] = (df.loc[x-n:x, 'price'].sum() + df.loc[x, 'delta']) / 4
df
Out[150]:
ID delta price
0 1 -2.0 4.000000
1 2 2.0 5.000000
2 3 -3.0 3.000000
3 4 0.8 3.200000
4 5 0.9 3.025000
5 6 -2.3 1.731250
6 7 2.8 2.689062
7 8 1.0 2.111328
8 9 1.0 1.882910
9 10 1.0 1.920825
10 11 1.0 1.728766
11 12 1.0 1.633125
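The recursion cannot be fully vectorized, since each price depends on previously computed prices, but you can at least run the loop over plain NumPy arrays, which is usually much faster than repeated .loc indexing (a sketch with the same arithmetic as above):
import numpy as np
price = df['price'].to_numpy(dtype=float)
delta = df['delta'].to_numpy(dtype=float)
n = 3
for i in np.flatnonzero(np.isnan(price)):
    # nansum skips the not-yet-filled value at position i
    price[i] = (np.nansum(price[max(0, i - n):i + 1]) + delta[i]) / 4
df['price'] = price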

Pandas rolling function with specific numeric span?

As of Pandas 0.18.0, it is possible to have a variable rolling window size for time-series by specifying a time span. For example, the code for summation over a 2-second window in dataframe dft looks like this:
dft.rolling('2s').sum()
Is it possible to do the same with non-datetime spans?
For example, given a dataframe that looks like this:
A B
0 1 1
1 2 2
2 3 3
3 5 5
4 6 6
5 7 7
6 10 10
Is it possible to specify a window span of say 3 on column 'A' and have the sum of column 'B' calculated, so that the output looks something like:
A B
0 1 NaN
1 2 NaN
2 3 5
3 5 10
4 6 14
5 7 18
6 10 17
Not with rolling(). See the documentation for the window argument:
[A variable-sized window] is only valid for datetimelike indexes.
Full text:
window : int, or offset
Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size.
If it's an offset, then this will be the time period of each window. Each window will be variably sized based on the observations included in the time period. This is only valid for datetimelike indexes.
Here's a workaround if you're interested.
df = pd.DataFrame({'A': np.arange(10),
                   'B': np.arange(10, 20)},
                  index=[1, 2, 3, 5, 8, 9, 11, 14, 19, 20])
def var_window(df, size, min_periods=None):
    """Trailing window defined on the index values rather than a row count."""
    result = []
    df = df.sort_index()
    for i in df.index:
        start = i - size + 1
        res = df.loc[start:i].sum().tolist()
        result.append(res)
    result = pd.DataFrame(result, index=df.index)
    if min_periods:
        result.loc[:min_periods - 1] = np.nan
    return result

print(var_window(df, size=3, min_periods=3))
0 1
1 NaN NaN
2 NaN NaN
3 3.0 33.0
5 5.0 25.0
8 4.0 14.0
9 9.0 29.0
11 11.0 31.0
14 7.0 17.0
19 8.0 18.0
20 17.0 37.0
Explanation: loop through the index. At each value, truncate the DataFrame to the trailing window size. Here 'size' is not a count, but rather a range as you have defined it.
In the above, at the index value of 8, you're summing the values of A for which the index is 8, 7, or 6 (i.e. >= 8 - 3 + 1). The only index value that falls within that range is 8, so the sum is simply the value from the original frame. Comparatively, for the index value of 11, the sum will include the values for 9 and 11 (5 + 6 = 11, the resulting sum for A).
Compare this with standard rolling ops:
print(df.rolling(window=3).sum())
A B
1 NaN NaN
2 NaN NaN
3 3.0 33.0
5 6.0 36.0
8 9.0 39.0
9 12.0 42.0
11 15.0 45.0
14 18.0 48.0
19 21.0 51.0
20 24.0 54.0
If I'm misinterpreting your question, let me know how. It's admittedly significantly slower:
%timeit df.rolling(window=3).sum()
1000 loops, best of 3: 627 µs per loop
%timeit var_window(df, size=3, min_periods=3)
100 loops, best of 3: 3.59 ms per loop
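For what it's worth, later pandas versions (1.0+) added pandas.api.indexers.BaseIndexer, which lets rolling() use custom window bounds. A sketch of a trailing numeric span (SpanIndexer is my own class; its min_periods handling differs from var_window above):
import numpy as np
from pandas.api.indexers import BaseIndexer

class SpanIndexer(BaseIndexer):
    # window = all rows whose (sorted) index value lies within `span` of the current row's value
    def get_window_bounds(self, num_values=0, min_periods=None,
                          center=None, closed=None, step=None):
        idx = np.asarray(self.index_values)
        end = np.arange(1, num_values + 1, dtype=np.int64)
        start = np.searchsorted(idx, idx - self.span + 1, side='left').astype(np.int64)
        return start, end

df.rolling(SpanIndexer(index_values=df.index, span=3), min_periods=1).sum()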

easy multidimensional numpy ndarray to pandas dataframe method?

Having a 4-D numpy.ndarray, e.g.
myarr = np.random.rand(10, 4, 3, 2)
dims = {'time': range(1, 11), 'sub': range(1, 5),
        'cond': ['A', 'B', 'C'], 'measure': ['meas1', 'meas2']}
But with possibly more dimensions. How can I create a pandas.DataFrame with a MultiIndex, just by passing the dimensions as indexes, without further manual adjustments (reshaping the ndarray into 2D shape)?
I can't wrap my head around the reshaping, not even really in 3 dimensions quite yet, so I'm searching for an 'automatic' method if possible.
What would be a function to which to pass the column/row indexes and create a dataframe? Something like:
df=nd2df(myarr,dim2row=[0,1],dim2col=[2,3],rowlab=['time','sub'],collab=['cond','measure'])
And end up with something like:
meas1 meas2
A B C A B C
sub time
1 1
2
3
.
.
2 1
2
...
If it is not possible/feasible to do this in an automated way, an explanation that is less terse than the MultiIndex manual would be appreciated.
I can't even get it right when I don't care about the order of the dimensions, e.g. I would expect this to work:
a = np.arange(24).reshape((3, 2, 2, 2))
iterables = [[1, 2, 3], [1, 2], ['m1', 'm2'], ['A', 'B']]
index = pd.MultiIndex.from_product(iterables, names=['time', 'sub', 'meas', 'cond'])
pd.DataFrame(a.reshape(2*3*1, 2*2), index)
gives:
ValueError: Shape of passed values is (4, 6), indices imply (4, 24)
You're getting the error because you've reshaped the ndarray as 6x4 while applying an index that is meant to capture all dimensions along a single axis (24 rows). The following is a setup to get the pet example working:
a=np.arange(24).reshape((3,2,2,2))
iterables=[[1,2,3],[1,2],['m1','m2'],['A','B']]
index = pd.MultiIndex.from_product(iterables, names=['time','sub','meas','cond'])
pd.DataFrame(a.reshape(24, 1),index=index)
Solution
Here's a generic DataFrame creator that should get the job done:
def produce_df(rows, columns, row_names=None, column_names=None):
    """rows is a list of lists that will be used to build a MultiIndex
    columns is a list of lists that will be used to build a MultiIndex"""
    row_index = pd.MultiIndex.from_product(rows, names=row_names)
    col_index = pd.MultiIndex.from_product(columns, names=column_names)
    return pd.DataFrame(index=row_index, columns=col_index)
Demonstration
Without named index levels
produce_df([['a', 'b'], ['c', 'd']], [['1', '2'], ['3', '4']])
1 2
3 4 3 4
a c NaN NaN NaN NaN
d NaN NaN NaN NaN
b c NaN NaN NaN NaN
d NaN NaN NaN NaN
With named index levels
produce_df([['a', 'b'], ['c', 'd']], [['1', '2'], ['3', '4']],
           row_names=['alpha1', 'alpha2'], column_names=['number1', 'number2'])
number1 1 2
number2 3 4 3 4
alpha1 alpha2
a c NaN NaN NaN NaN
d NaN NaN NaN NaN
b c NaN NaN NaN NaN
d NaN NaN NaN NaN
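To actually get an ndarray's values into such a frame, reshape it to the frame's 2-D shape; a C-order reshape lines up with from_product, which iterates the last level fastest (a sketch reusing the example array a from above):
df = produce_df([[1, 2, 3], [1, 2]], [['m1', 'm2'], ['A', 'B']],
                row_names=['time', 'sub'], column_names=['meas', 'cond'])
df.loc[:, :] = a.reshape(df.shape)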
From the structure of your data,
names=['sub','time','measure','cond'] #ind1,ind2,col1,col2
labels=[[1,2,3],[1,2],['meas1','meas2'],list('ABC')]
A straightforward way to your goal:
index = pd.MultiIndex.from_product(labels, names=names)
data = np.arange(index.size)  # or myarr.flatten()
df = pd.DataFrame(data, index=index)
df22 = df.reset_index().pivot_table(values=0, index=names[:2], columns=names[2:])
"""
measure meas1 meas2
cond A B C A B C
sub time
1 1 0 1 2 3 4 5
2 6 7 8 9 10 11
2 1 12 13 14 15 16 17
2 18 19 20 21 22 23
3 1 24 25 26 27 28 29
2 30 31 32 33 34 35
"""
I still don't know how to do it directly, but here is an easy-to-follow, step-by-step way:
# Create 4D-array
a=np.arange(24).reshape((3,2,2,2))
# Set only one row index
rowiter=[[1,2,3]]
row_ind=pd.MultiIndex.from_product(rowiter, names=[u'time'])
# Put the rest of the dimensions into columns
coliter=[[1,2],['m1','m2'],['A','B']]
col_ind=pd.MultiIndex.from_product(coliter, names=[u'sub',u'meas',u'cond'])
ncols = np.prod([len(c) for c in coliter])
b=pd.DataFrame(a.reshape(len(rowiter[0]),ncols),index=row_ind,columns=col_ind)
print(b)
# Reshape columns to rows as pleased:
b=b.stack('sub')
# Switch levels and order in rows (level goes from inner to outer);
# sortlevel was removed in newer pandas, so use sort_index instead:
c = b.swaplevel(0, 1, axis=0).sort_index(level=0, axis=0)
To check the correct assignment of dimensions:
print(a[:,0,0,0])
[ 0 8 16]
print(a[0,:,0,0])
[0 4]
print(a[0,0,:,0])
[0 2]
print(b)
meas m1 m2
cond A B A B
time sub
1 1 0 1 2 3
2 4 5 6 7
2 1 8 9 10 11
2 12 13 14 15
3 1 16 17 18 19
2 20 21 22 23
print(c)
meas m1 m2
cond A B A B
sub time
1 1 0 1 2 3
2 8 9 10 11
3 16 17 18 19
2 1 4 5 6 7
2 12 13 14 15
3 20 21 22 23
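Finally, a generic helper in the spirit of the nd2df the question asks for can be built from transpose + reshape + from_product (my sketch; dim2row/dim2col are axis numbers, labels holds the level values per axis, and it relies on the C-order reshape matching from_product, which iterates the last level fastest):
def nd2df(arr, dim2row, dim2col, rowlab, collab, labels):
    # move the row axes in front of the column axes, then flatten each side
    arr2 = arr.transpose(list(dim2row) + list(dim2col))
    row_index = pd.MultiIndex.from_product([labels[d] for d in dim2row], names=rowlab)
    col_index = pd.MultiIndex.from_product([labels[d] for d in dim2col], names=collab)
    return pd.DataFrame(arr2.reshape(len(row_index), len(col_index)),
                        index=row_index, columns=col_index)

df = nd2df(myarr, dim2row=[0, 1], dim2col=[2, 3],
           rowlab=['time', 'sub'], collab=['cond', 'measure'],
           labels=[range(1, 11), range(1, 5), ['A', 'B', 'C'], ['meas1', 'meas2']])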