Rolling second highest in a pandas dataframe

I am trying to find the highest and second highest values in a rolling window.
I can get the highest using
df['B'] = df['A'].rolling(window=3).max()
But how do I get the second highest, please?
Such that df['C'] will display as per below:
 A   B  C
 1
 6
 5   6  5
 4   6  5
12  12  5

Generic n-highest values in rolling/sliding windows
Here's one approach using np.lib.stride_tricks.as_strided to create sliding windows, which lets us pick any generic N-th highest value in each window -
import numpy as np

# https://stackoverflow.com/a/40085052/ @Divakar
def strided_app(a, L, S):  # Window len = L, Stride len/stepsize = S
    nrows = ((a.size - L) // S) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S * n, n))

# Return N-highest nums in rolling windows of length W off array ar
def N_highest(ar, W, N=1):
    # ar : Input array
    # W  : Window length
    # N  : Get us the N-highest in sliding windows
    A2D = strided_app(ar, W, 1)
    idx = np.argpartition(A2D, -N, axis=1)[:, -N]  # column index of the N-th highest per window
    return A2D[np.arange(len(idx)), idx]
Sample runs -
In [634]: a = np.array([1,6,5,4,12]) # input array
In [635]: N_highest(a, W=3, N=1) # highest in W=3
Out[635]: array([ 6, 6, 12])
In [636]: N_highest(a, W=3, N=2) # second highest
Out[636]: array([5, 5, 5])
In [637]: N_highest(a, W=3, N=3) # third highest
Out[637]: array([1, 4, 4])
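As a small aside (not part of the original answer), np.partition can fetch the same values directly, skipping the index bookkeeping:
import numpy as np

a = np.array([1, 6, 5, 4, 12])
W, N = 3, 2

# partition each length-W window so position -N holds the N-th highest, then take that column
second_highest = np.partition(strided_app(a, W, 1), -N, axis=1)[:, -N]
# second_highest -> array([5, 5, 5])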
Another, shorter way based on strides would be direct sorting, like so -
np.sort(strided_app(ar, W, 1), axis=1)[:, -N]
Solving our case
Hence, to solve our case, we need to concatenate NaNs with the result from the above-mentioned function, like so -
W = 3
df['C'] = np.r_[ [np.nan]*(W-1), N_highest(df.A.values, W=W, N=2)]
Based on direct sorting, we would have -
df['C'] = np.r_[ [np.nan]*(W-1), np.sort(strided_app(df.A.values, W, 1), axis=1)[:, -2]]
Sample run -
In [578]: df
Out[578]:
A
0 1
1 6
2 5
3 4
4 3 # <== Different from given sample, for variety
In [619]: W = 3
In [620]: df['C'] = np.r_[ [np.nan]*(W-1), N_highest(df.A.values, W=W, N=2)]
In [621]: df
Out[621]:
A C
0 1 NaN
1 6 NaN
2 5 5.0
3 4 5.0
4 3 4.0 # <== Second highest from the last group off : [5,4,3]
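For completeness, a pandas-only sketch of the same second-highest column, using rolling().apply. It is slower than the strided approach, and raw=True assumes a reasonably recent pandas:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 6, 5, 4, 12]})

# sort each length-3 window and take the second value from the end
df['C'] = df['A'].rolling(window=3).apply(lambda w: np.sort(w)[-2], raw=True)
# the first W-1 rows come out as NaN automatically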

Related

Replace values inside list by generic numbers to group and reference for statistical computing

I usually use "${:,.2f}".format(prices) to format prices with thousands separators and two decimals, but what I'm looking for is different: I want to change the values so they are grouped, and then reference them by mode:
Let say I have this list:
0 34,123.45
1 34,456.78
2 34,567.89
3 33,222.22
4 30,123.45
And the replace function will turn the list to:
0 34,500.00
1 34,500.00
2 34,500.00
3 33,200.00
4 30,100.00
This way, when I use stats.mode(prices_rounded), it will show as a result:
Mode Value = 34500.00
Mode Count = 3
Is there a conversion function already available that does the job? I did search for days without luck...
EDIT - WORKING CODE:
import numpy as np
from scipy import stats

#create list
df3 = df_array
print('########## df3: ',df3)
#convert to float
df4 = df3.astype(float)
print('########## df4: ',df4)
#convert list to string
#df5 = ''.join(map(str, df4))
#print('########## df5: ',df5)
#round values
df6 = np.round(df4 /100) * 100
print('######df6',df6)
#get mode stats
df7 = stats.mode(df6)
print('######df7',df7)
#get mode value
df8 = df7[0][0]
print('######df8',df8)
#convert to integer
df9 = int(df8)
print('######df9',df9)
This is exactly what I wanted, thanks!
You can use:
>>> sr
0 34123.45 # <- why 34500.00?
1 34456.78
2 34567.89 # <- why 34500.00?
3 33222.22
4 30123.45
dtype: float64
>>> np.round(sr / 100) * 100
0 34100.0
1 34500.0
2 34600.0
3 33200.0
4 30100.0
dtype: float64
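Putting the rounding and the mode together, a minimal sketch (the sample prices and the SciPy-version handling are my assumptions, not part of the original answer):
import numpy as np
import pandas as pd
from scipy import stats

prices = pd.Series([34123.45, 34456.78, 34567.89, 33222.22, 30123.45])

# round each price to the nearest 100, then take the most frequent rounded value
rounded = np.round(prices / 100) * 100

# scipy >= 1.11 returns scalars, older versions return length-1 arrays;
# np.atleast_1d handles both
res = stats.mode(rounded)
mode_value = np.atleast_1d(res.mode)[0]
mode_count = np.atleast_1d(res.count)[0]
print("Mode Value = {:,.2f}".format(mode_value))
print("Mode Count = {}".format(mode_count))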

Pandas check that a list is is_monotonic_increasing but with specific step

Let's say that we have these columns in a df:
A B C
0 1 0 1
1 2 2 2
2 3 4 4
3 4 6 6
4 5 8 8
I know that I can check whether a specific column is monotonically increasing with
df['A'].is_monotonic_increasing.
I was wondering if there is a way to check/validate that the monotonic increase has a specific step. Let me explain: I would like, for example, to check that df['A'] is monotonic increasing with step 1 and that df['B'] is monotonic increasing with step 2.
Is there a way to check that?
I don't think there's a function for that. We can build a two-line function:
import numpy as np

def step_incr(series, step=1):
    tmp = np.arange(len(series)) * step
    return series.eq(series.iloc[0] + tmp).all()
step_incr(df['A'], step=1)
# True
step_incr(df['B'], step=1)
# False
Another way to check is looking at the values of differences:
def is_step(series):
    uniques = series.diff().iloc[1:].unique()
    if len(uniques) == 1:
        return True, uniques[0]
    return False, None
is_step(df['A'])
# (True, 1.0)
is_step(df['B'])
# (True, 2.0)
is_step(df['C'])
# (False, None)
One liner to get all columns at once:
df.diff().iloc[1] * (df.diff().nunique() == 1)
Output:
A 1.0
B 2.0
C 0.0
Name: 1, dtype: float64
The output is the constant step size for each column, or 0 if the column does not increase by a constant step.
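A related sketch (the function name is mine) that validates a specific required step directly from the differences:
import pandas as pd

def has_step(series, step):
    # True when every consecutive difference equals the required step
    return (series.diff().iloc[1:] == step).all()

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [0, 2, 4, 6, 8],
                   'C': [1, 2, 4, 6, 8]})
print(has_step(df['A'], 1))  # True
print(has_step(df['B'], 2))  # True
print(has_step(df['C'], 2))  # False, the first step is 1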

How to use the diff() function in pandas but enter the difference values in a new column?

I have a dataframe df:
df
x-value
frame
1 15
2 20
3 19
How can I get:
df
x-value delta-x
frame
1 15 0
2 20 5
3 19 -1
Not to say there is anything wrong with what @Wen posted as a comment, but I want to post a more complete answer.
The Problem
There are 3 things going on that need to be addressed:
Calculating the values that are the differences from one row to the next.
Handling the fact that the "difference" will have one fewer value than the original length of the dataframe, so we'll have to fill in a value for the missing bit.
Assigning the result to a new column.
Option #1
The most natural way to do the diff would be to use pd.Series.diff (as @Wen suggested). But in order to produce the stated results, which are integers, I recommend using the pd.Series.fillna parameter downcast='infer'. Finally, I don't like editing the dataframe unless there is a need for it, so I use pd.DataFrame.assign to produce a new dataframe that is a copy of the old one with a new column attached.
df.assign(**{'delta-x': df['x-value'].diff().fillna(0, downcast='infer')})
x-value delta-x
frame
1 15 0
2 20 5
3 19 -1
Option #2
Similar to #1 but I'll use numpy.diff to preserve int type in addition to picking up some performance.
df.assign(**{'delta-x': np.append(0, np.diff(df['x-value'].values))})
x-value delta-x
frame
1 15 0
2 20 5
3 19 -1
Testing
from timeit import timeit

pir1 = lambda d: d.assign(**{'delta-x': d['x-value'].diff().fillna(0, downcast='infer')})
pir2 = lambda d: d.assign(**{'delta-x': np.append(0, np.diff(d['x-value'].values))})

res = pd.DataFrame(
    index=[10, 300, 1000, 3000, 10000, 30000],
    columns=['pir1', 'pir2'], dtype=float)

for i in res.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=1000)
res.plot(loglog=True)
res.div(res.min(1), 0)
pir1 pir2
10 2.069498 1.0
300 2.123017 1.0
1000 2.397373 1.0
3000 2.804214 1.0
10000 4.559525 1.0
30000 7.058344 1.0
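As a side note, on recent pandas versions where the downcast keyword to fillna is being deprecated, a hedged equivalent of Option #1 is an explicit cast:
import pandas as pd

df = pd.DataFrame({'x-value': [15, 20, 19]},
                  index=pd.Index([1, 2, 3], name='frame'))

# same integer result as Option #1, without downcast='infer'
out = df.assign(**{'delta-x': df['x-value'].diff().fillna(0).astype(int)})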

Add column of .75 quantile based off groupby

I have a df with the date as index and also a column called scores. Now I want to keep the df as it is but add a column which gives the 0.7 quantile of the scores for that day. The quantile method needs to be midpoint, and the result should be rounded to the nearest whole number.
I've outlined one approach you could take, below.
Note that to round a value to the nearest whole number you should use Python's built-in round() function. See round() in the Python documentation for details.
import pandas as pd
import numpy as np
# set random seed for reproducibility
np.random.seed(748)
# initialize base example dataframe
df = pd.DataFrame({"date":np.arange(10),
"score":np.random.uniform(size=10)})
duplicate_dates = np.random.choice(df.index, 5)
df_dup = pd.DataFrame({"date":np.random.choice(df.index, 5),
"score":np.random.uniform(size=5)})
# finish compiling example data
df = df.append(df_dup, ignore_index=True)
# calculate 0.7 quantile result with specified parameters
result = df.groupby("date").quantile(q=0.7, axis=0, interpolation='midpoint')
# print resulting dataframe
# contains one unique 0.7 quantile value per date
print(result)
"""
0.7 score
date
0 0.585087
1 0.476404
2 0.426252
3 0.363376
4 0.165013
5 0.927199
6 0.575510
7 0.576636
8 0.831572
9 0.932183
"""
# to apply the resulting quantile information to
# a new column in our original dataframe `df`
# we can apply a dictionary to our "date" column
# create dictionary
mapping = result.to_dict()["score"]
# apply to `df` to produce desired new column
df["quantile_0.7"] = [mapping[x] for x in df["date"]]
print(df)
"""
date score quantile_0.7
0 0 0.920895 0.585087
1 1 0.476404 0.476404
2 2 0.380771 0.426252
3 3 0.363376 0.363376
4 4 0.165013 0.165013
5 5 0.927199 0.927199
6 6 0.340008 0.575510
7 7 0.695818 0.576636
8 8 0.831572 0.831572
9 9 0.932183 0.932183
10 7 0.457455 0.576636
11 6 0.650666 0.575510
12 6 0.500353 0.575510
13 0 0.249280 0.585087
14 2 0.471733 0.426252
"""

Python 3: handling numpy arrays and export via openpyxl

I am working with an array consisting of several lists. Of each sublist, I want to take the mean and the standard deviation, and write them to an Excel sheet.
The code I have does its job, but it gives me a headache as I feel I'm not using Python efficiently at all, especially in step (2), where I use numpy in a step-by-step manner. Also, I don't get why I have to do the modification in step (3) in order to bring the data ("total") into a form that I can feed to the openpyxl writer ("total_list"). I would appreciate any help in making it more elegant; here is my code:
import numpy as np
from openpyxl import Workbook
from itertools import chain

# (1) Make up sample array:
arr = [[1,1,3], [3,4,2], [4,4,5], [6,6,5]]

# (2) Make up lists containing average values and std. deviations
avg = []
dev = []
for i in arr:
    avg.append(np.mean(i))
    dev.append(np.std(i))

# (3) Make an alternating list (avg 1, dev 1, avg 2, dev 2, ...)
total = chain.from_iterable( zip( avg, dev ) )

# (4) Make an alternative list that can be fed to the xlsx writer
total_list = []
for i in total:
    total_list.append(i)

# Write to Excel file
wb = Workbook()
ws = wb.active
ws.append(total_list)
wb.save("temp.xlsx")
I would like to have the format shown in the attached picture. It is important that all data are in one row.
Improvements on the numpy code:
In [272]: arr = [[1,1,3], [3,4,2], [4,4,5], [6,6,5]]
Make an array from this list. This isn't required since np.mean does it under the covers, but it should help visualize the action.
In [273]: arr = np.array(arr)
In [274]: arr
Out[274]:
array([[1, 1, 3],
[3, 4, 2],
[4, 4, 5],
[6, 6, 5]])
Now calculate mean and std for the whole array; use axis=1 to act on rows, so you don't have to iterate over the sublists of arr.
In [277]: m=np.mean(arr, axis=1)
In [278]: s=np.std(arr, axis=1)
In [279]: m
Out[279]: array([ 1.66666667, 3. , 4.33333333, 5.66666667])
In [280]: s
Out[280]: array([ 0.94280904, 0.81649658, 0.47140452, 0.47140452])
There are various ways of turning these 2 arrays into the interleaved array. One is to stack them vertically, and then transpose. This is the numpy answer to the list zip(*...) trick.
In [281]: data=np.vstack([m,s])
In [282]: data
Out[282]:
array([[ 1.66666667, 3. , 4.33333333, 5.66666667],
[ 0.94280904, 0.81649658, 0.47140452, 0.47140452]])
In [283]: data=data.T.ravel()
In [284]: data
Out[284]:
array([ 1.66666667, 0.94280904, 3. , 0.81649658, 4.33333333,
0.47140452, 5.66666667, 0.47140452])
I don't have openpyxl, but I can write a csv with savetxt:
In [296]: np.savetxt('test.txt',[data],fmt='%f', delimiter=',',header='#mean1 std1 ...')
In [297]: cat test.txt
# #mean1 std1 ...
1.666667,0.942809,3.000000,0.816497,4.333333,0.471405,5.666667,0.471405
I used [data] because data, as calculated, is 1d, and savetxt would save that as a column; savetxt iterates on the 'rows' of its input.
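If openpyxl is installed, the same interleaved row can be written directly, mirroring the OP's writer code (a small sketch, untested here since openpyxl was unavailable above):
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
# one row: avg1, dev1, avg2, dev2, ...; cast to plain floats for the writer
ws.append([float(x) for x in data])
wb.save("temp.xlsx")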
I would use the Pandas module, as it can do all the mentioned tasks pretty easily:
import pandas as pd
df = pd.DataFrame(arr)
In [250]: df
Out[250]:
0 1 2
0 1 1 3
1 3 4 2
2 4 4 5
3 6 6 5
In [251]: df.T
Out[251]:
0 1 2 3
0 1 3 4 6
1 1 4 4 6
2 3 2 5 5
In [252]: df.T.mean()
Out[252]:
0 1.666667
1 3.000000
2 4.333333
3 5.666667
dtype: float64
In [253]: df.T.std(ddof=0)
Out[253]:
0 0.942809
1 0.816497
2 0.471405
3 0.471405
dtype: float64
You can also easily save your DataFrame as an Excel file:
df.to_excel(r'/path/to/file.xlsx', index=False)
Altogether:
In [260]: df['avg'] = df.mean(axis=1)
In [261]: df['dev'] = df[[0, 1, 2]].std(axis=1, ddof=0)  # use only the original columns
In [262]: df
Out[262]:
   0  1  2       avg       dev
0  1  1  3  1.666667  0.942809
1  3  4  2  3.000000  0.816497
2  4  4  5  4.333333  0.471405
3  6  6  5  5.666667  0.471405
In [263]: df.to_excel('d:/temp/result.xlsx', index=False)
result.xlsx: