rank data over a rolling window in pandas DataFrame - pandas

I am new to Python and the Pandas library, so apologies if this is a trivial question. I am trying to rank a Timeseries over a rolling window of N days. I know there is a rank function but this function ranks the data over the entire timeseries. I don't seem to be able to find a rolling rank function.
Here is an example of what I am trying to do:
A
01-01-2013 100
02-01-2013 85
03-01-2013 110
04-01-2013 60
05-01-2013 20
06-01-2013 40
If I wanted to rank the data over a rolling window of 3 days, the answer should be:
Ranked_A
01-01-2013 NaN
02-01-2013 NaN
03-01-2013 1
04-01-2013 3
05-01-2013 3
06-01-2013 2
Is there a built-in function in Python that can do this? Any suggestion?
Many thanks.

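If you want to use the pandas built-in rank method (which gives you some additional semantics, such as the ascending option), you can create a simple function wrapper for it
def rank(array):
    s = pd.Series(array)
    # rank of the last element within the window (largest value = rank 1)
    return s.rank(ascending=False).iloc[-1]
that can then be used as a custom rolling-window function (pd.rolling_apply has since been removed from pandas; rolling(...).apply is the equivalent call).
df['A'].rolling(3).apply(rank, raw=True)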
which outputs
Date
01-01-2013 NaN
02-01-2013 NaN
03-01-2013 1
04-01-2013 3
05-01-2013 3
06-01-2013 2
(I'm assuming the df data structure from Rutger's answer)

You can write a custom function for a rolling window in pandas. Using numpy's argsort() in that function can give you the rank within the window:
import pandas as pd
from io import StringIO
testdata = StringIO("""
Date,A
01-01-2013,100
02-01-2013,85
03-01-2013,110
04-01-2013,60
05-01-2013,20
06-01-2013,40""")
df = pd.read_csv(testdata, index_col='Date')
# rank (1 = largest) of the last value within each window
rollrank = lambda data: data.size - data.argsort().argsort()[-1]
df['rank'] = df['A'].rolling(3).apply(rollrank, raw=True)
print(df)
results in:
A rank
Date
01-01-2013 100 NaN
02-01-2013 85 NaN
03-01-2013 110 1
04-01-2013 60 3
05-01-2013 20 3
06-01-2013 40 2
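To see why the double argsort() yields a rank, here is a minimal sketch on a single window taken from the example data (the variable names are only illustrative). The first argsort() gives the indices that would sort the window; applying argsort() again turns those indices into each element's 0-based ascending rank.

```python
import numpy as np

# One 3-day window from the example: 02-01, 03-01 and 04-01
window = np.array([85, 110, 60])

order = window.argsort()           # indices that sort the window ascending
ascending_rank = order.argsort()   # 0-based ascending rank of each element

# Convert the last element's ascending rank into a descending, 1-based rank
descending_rank_of_last = window.size - ascending_rank[-1]
print(order, ascending_rank, descending_rank_of_last)
```

Note that this approach gives tied values distinct ranks, whereas Series.rank averages them by default.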

Related

How can I compute a rolling sum using groupby in pandas?

I'm working on a fun side project and would like to compute a moving sum for number of wins for NBA teams over 2 year periods. Consider the sample pandas dataframe below,
pd.DataFrame({'Team':['Hawks','Hawks','Hawks','Hawks','Hawks'], 'Season':[1970,1971,1972,1973,1974],'Wins':[40,34,30,46,42]})
I would ideally like to compute the sum of the number of wins between 1970 and 1971, 1971 and 1972, 1972 and 1973, etc. An inefficient way would be to use a loop. Is there a way to do this using the .groupby function?
This is a little bit of a hack, but you could group by df['Season'] // 2 * 2, which means dividing by two, taking a floor operation, then multiplying by two again. The effect is to round each year to a multiple of two.
df_sum = pd.DataFrame(df.groupby(['Team', df['Season'] // 2 * 2])['Wins'].sum()).reset_index()
Output:
Team Season Wins
0 Hawks 1970 74
1 Hawks 1972 76
2 Hawks 1974 42
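As a quick illustration of the arithmetic, this small sketch (plain Python, using the seasons from the question) shows which bucket each year falls into. The buckets do not overlap, so this gives totals per two-year block rather than a sum for every consecutive pair of seasons.

```python
seasons = [1970, 1971, 1972, 1973, 1974]

# Floor-divide by two, then multiply by two: each year maps to the even year at or below it
buckets = [season // 2 * 2 for season in seasons]
print(buckets)
```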
If the years are already ordered within each team, you can call rolling directly on the groupby object. For example:
import pandas as pd
df = pd.DataFrame({'Team':['Hawks','Hawks','Hawks','Hawks','Hawks'], 'Season':[1970,1971,1972,1973,1974],'Wins':[40,34,30,46,42]})
res = df.groupby('Team')['Wins'].rolling(2).sum()
print(res)
Out:
Team
Hawks 0 NaN
1 74.0
2 64.0
3 76.0
4 88.0
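The result above is indexed by (Team, original row label). If you want the rolling sum as a column next to the original data, one possible sketch is to drop the Team level so the values align with the original index again. The column name Wins_2yr is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Hawks', 'Hawks', 'Hawks', 'Hawks', 'Hawks'],
    'Season': [1970, 1971, 1972, 1973, 1974],
    'Wins': [40, 34, 30, 46, 42],
})

rolled = df.groupby('Team')['Wins'].rolling(2).sum()

# Drop the group level so the result is indexed like df, then assign by alignment
df['Wins_2yr'] = rolled.reset_index(level=0, drop=True)
print(df)
```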

Understanding the method apply() in Pandas series and dataframe

I am trying to understand how the method apply() can be used with series and dataframes.
As shown below, when the np.max() function is used with the apply() method with the dataframe it is returning the max value for each column. But when used with the series, it is just returning the series. My expectation was that it would return the max value of the series. That is, the result would be similar to series.max(). Why is apply() performing differently on series and on dataframes?
import pandas as pd
import numpy as np
my_df = pd.DataFrame(np.random.randint(10, size=(4,3)), columns = list('ABC'))
my_df
Output:
A B C
0 2 4 7
1 9 6 6
2 4 4 8
3 8 8 1
df_max = my_df.apply(np.max)
df_max
Output:
A 9
B 8
C 8
dtype: int32
se_max = my_df['A'].apply(np.max)
se_max
Output:
0 2
1 9
2 4
3 8
Name: A, dtype: int32
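By default, apply works along the first dimension of the object. In a DataFrame, that means the function is called once per column and receives the whole column as a Series, so np.max returns one maximum per column. A Series has only one dimension, so apply calls the function once per element; np.max of a single value is just that value, which is why you get the original Series back. To get the maximum of the Series, use my_df['A'].max().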
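One way to check this yourself is to pass a function that reports what it receives. The sketch below uses a small fixed DataFrame instead of random data, and the helper name is_series is only illustrative.

```python
import pandas as pd

my_df = pd.DataFrame({'A': [2, 9, 4, 8], 'B': [4, 6, 4, 8], 'C': [7, 6, 8, 1]})

def is_series(x):
    # True when apply() hands over a whole Series, False for a single value
    return isinstance(x, pd.Series)

frame_inputs = my_df.apply(is_series)        # called once per column
series_inputs = my_df['A'].apply(is_series)  # called once per element
print(frame_inputs)
print(series_inputs)
print(my_df['A'].max())
```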

New column based on values from other columns in python

I have a dataframe df which looks like this
min  max  value
3    9    7
3    4    10
4    4    4
4    10   3
I want to create a new column df['accuracy'] which tells me the accuracy if the df['value'] is in between df['min'] and df['max'] such that the new dataframe looks like
min  max  value  Accuracy
3    9    7      Accurate
3    4    10     Not Accurate
4    4    4      Accurate
4    10   3      Not Accurate
Use the apply() method of pandas with axis=1, so that the function receives one row at a time:
def accurate(row):
    # inclusive bounds: min <= value <= max
    if row['min'] <= row['value'] <= row['max']:
        return 'Accurate'
    return 'Not Accurate'
df['Accuracy'] = df.apply(accurate, axis=1)
print(df)
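For larger frames, a vectorized alternative may be worth considering. This is a different technique from the row-wise apply above: a sketch using Series.between (inclusive on both ends by default) together with numpy.where, with the question's data rebuilt by hand.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'min': [3, 3, 4, 4],
    'max': [9, 4, 4, 10],
    'value': [7, 10, 4, 3],
})

# Boolean mask: True where min <= value <= max, compared row by row
in_range = df['value'].between(df['min'], df['max'])
df['Accuracy'] = np.where(in_range, 'Accurate', 'Not Accurate')
print(df)
```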

Panda: multiindex vs groupby [duplicate]

So I learned that I can use DataFrame.groupby without having a MultiIndex to do subsampling/cross-sections.
On the other hand, when I have a MultiIndex on a DataFrame, I still need to use DataFrame.groupby to do sub-sampling/cross-sections.
So what is a MultiIndex good for apart from the quite helpful and pretty display of the hierarchies when printing?
Hierarchical indexing (also referred to as “multi-level” indexing) was introduced in the pandas 0.4 release.
This opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure (DataFrame), for example.
Imagine constructing a dataframe using MultiIndex like this:-
import pandas as pd
import numpy as np
arrays = [['one','one','one','two','two','two'],[1,2,3,1,2,3]]
df = pd.DataFrame(np.random.randn(6,2),index=pd.MultiIndex.from_tuples(list(zip(*arrays))),columns=['A','B'])
df # This is the dataframe we have generated
A B
one 1 -0.732470 -0.313871
2 -0.031109 -2.068794
3 1.520652 0.471764
two 1 -0.101713 -1.204458
2 0.958008 -0.455419
3 -0.191702 -0.915983
This df is simply a data structure of two dimensions
df.ndim
2
But we can imagine it, looking at the output, as a 3 dimensional data structure.
one with 1 with data -0.732470 -0.313871.
one with 2 with data -0.031109 -2.068794.
one with 3 with data 1.520652 0.471764.
A.k.a.: "effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure"
This is not just a "pretty display". It has the benefit of easy retrieval of data since we now have a hierarchical index.
For example.
In [44]: df.loc["one"]
Out[44]:
A B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3 1.520652 0.471764
will give us a new data frame only for the group of data belonging to "one".
And we can narrow down our data selection further by doing this:-
In [45]: df.loc["one"].loc[1]
Out[45]:
A -0.732470
B -0.313871
Name: 1
And of course, if we want a specific value, here's an example:-
In [46]: df.loc["one"].loc[1]["A"]
Out[46]: -0.73247029752040727
So if we have even more indexes (besides the 2 indexes shown in the example above), we can essentially drill down and select the data set we are really interested in without a need for groupby.
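The same drill-down can also be written as a single .loc lookup by passing a tuple of index labels, which avoids chaining. A minimal sketch, using deterministic values in place of the random data so the results are easy to follow:

```python
import numpy as np
import pandas as pd

arrays = [['one', 'one', 'one', 'two', 'two', 'two'], [1, 2, 3, 1, 2, 3]]
index = pd.MultiIndex.from_arrays(arrays)
df = pd.DataFrame(np.arange(12.0).reshape(6, 2), index=index, columns=['A', 'B'])

group = df.loc['one']            # every row under the outer label "one"
row = df.loc[('two', 2)]         # one row, selected by both index levels
value = df.loc[('two', 2), 'A']  # a single value
print(group)
print(row)
print(value)
```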
We can even grab a cross-section (either rows or columns) from our dataframe...
By rows:-
In [47]: df.xs('one')
Out[47]:
A B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3 1.520652 0.471764
By columns:-
In [48]: df.xs('B', axis=1)
Out[48]:
one 1 -0.313871
2 -2.068794
3 0.471764
two 1 -1.204458
2 -0.455419
3 -0.915983
Name: B
Great post by @Calvin Cheng, but thought I'd take a stab at this as well.
When to use a MultiIndex:
When a single column’s value isn’t enough to uniquely identify a row.
When data is logically hierarchical - meaning that it has multiple dimensions or “levels.”
Why (your core question) - at least these are the biggest benefits IMO:
Easy manipulation via stack() and unstack()
Easy math when there are multiple column levels
Syntactic sugar for slicing/filtering
Example:
                                                     Dollars Units
Date       Store   Category Subcategory UPC EAN
2018-07-10 Store 1 Alcohol  Liqour      80480280024   154.77     7
           Store 2 Alcohol  Liqour      80480280024    82.08     4
           Store 3 Alcohol  Liqour      80480280024   259.38     9
           Store 1 Alcohol  Liquor      80432400630   477.68    14
                                        674545000001  139.68     4
           Store 2 Alcohol  Liquor      80432400630   203.88     6
                                        674545000001  377.13    13
           Store 3 Alcohol  Liquor      80432400630   239.19     7
                                        674545000001  432.32    14
           Store 1 Beer     Ales        94922755711    65.17     7
                                        702770082018  174.44    14
                                        736920111112   50.70     5
           Store 2 Beer     Ales        94922755711   129.60    12
                                        702770082018  107.40    10
                                        736920111112   59.65     5
           Store 3 Beer     Ales        94922755711   154.00    14
                                        702770082018  137.40    10
                                        736920111112  107.88    12
           Store 1 Beer     Lagers      702770081011  156.24    12
           Store 2 Beer     Lagers      702770081011  137.06    11
           Store 3 Beer     Lagers      702770081011  119.52     8
1) If we want to easily compare sales across stores, we can use df.unstack('Store') to line everything up side-by-side:
                                             Dollars                 Units
Store                                        Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date       Category Subcategory UPC EAN
2018-07-10 Alcohol  Liqour      80480280024   154.77   82.08  259.38       7       4       9
                    Liquor      80432400630   477.68  203.88  239.19      14       6       7
                                674545000001  139.68  377.13  432.32       4      13      14
           Beer     Ales        94922755711    65.17  129.60  154.00       7      12      14
                                702770082018  174.44  107.40  137.40      14      10      10
                                736920111112   50.70   59.65  107.88       5       5      12
                    Lagers      702770081011  156.24  137.06  119.52      12      11       8
2) We can also easily do math on multiple columns. For example, df['Dollars'] / df['Units'] will then divide each store's dollars by its units, for every store without multiple operations:
Store                                        Store 1 Store 2 Store 3
Date       Category Subcategory UPC EAN
2018-07-10 Alcohol  Liqour      80480280024    22.11   20.52   28.82
                    Liquor      80432400630    34.12   33.98   34.17
                                674545000001   34.92   29.01   30.88
           Beer     Ales        94922755711     9.31   10.80   11.00
                                702770082018   12.46   10.74   13.74
                                736920111112   10.14   11.93    8.99
                    Lagers      702770081011   13.02   12.46   14.94
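Points 1) and 2) can be reproduced on a cut-down version of the table. The sketch below keeps only two index levels and four rows (the figures are taken from the Ales and Lagers rows above); the reduced layout is an assumption made to keep the example short.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['Ales', 'Lagers'], ['Store 1', 'Store 2']],
    names=['Subcategory', 'Store'],
)
df = pd.DataFrame(
    {'Dollars': [65.17, 129.60, 156.24, 137.06], 'Units': [7, 12, 12, 11]},
    index=index,
)

# Move the Store level from the rows into the columns
wide = df.unstack('Store')

# Each top-level column selects a block with one column per store,
# so a single division yields the price per unit for every store
price = wide['Dollars'] / wide['Units']
print(price.round(2))
```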
3) If we then want to filter to just specific rows, instead of using the
df[(df[col1] == val1) & (df[col2] == val2) & (df[col3] == val3)]
format, we can use .xs or .query (yes, these work for regular dfs too, but they are not very useful there). The syntax would instead be:
df.xs((val1, val2, val3), level=(col1, col2, col3))
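Here is a small sketch comparing the two styles side by side. The three-level index and its values are invented for illustration; only the .xs call shape comes from the line above.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('Store 1', 'Beer', 111), ('Store 1', 'Beer', 222),
     ('Store 2', 'Beer', 111), ('Store 2', 'Wine', 333)],
    names=['Store', 'Category', 'UPC'],
)
df = pd.DataFrame({'Units': [7, 14, 12, 9]}, index=index)

# MultiIndex style: select on two levels at once; those levels are dropped from the result
by_xs = df.xs(('Store 1', 'Beer'), level=('Store', 'Category'))

# Plain-column style: the same selection written as a boolean mask
flat = df.reset_index()
by_mask = flat[(flat['Store'] == 'Store 1') & (flat['Category'] == 'Beer')]
print(by_xs)
print(by_mask)
```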
More examples can be found in this tutorial notebook I put together.
The alternative to using a multiindex is to store your data using multiple columns of a dataframe. One would expect multiindex to provide a performance boost over naive column storage, but as of Pandas v 1.1.4, that appears not to be the case.
Timings
import numpy as np
import pandas as pd
np.random.seed(2020)
inv = pd.DataFrame({
'store_id': np.random.choice(10000, size=10**7),
'product_id': np.random.choice(1000, size=10**7),
'stock': np.random.choice(100, size=10**7),
})
# Create a DataFrame with a multiindex
inv_multi = inv.groupby(['store_id', 'product_id'])[['stock']].agg('sum')
print(inv_multi)
stock
store_id product_id
0 2 48
4 18
5 58
7 149
8 158
... ...
9999 992 132
995 121
996 105
998 99
999 16
[6321869 rows x 1 columns]
# Create a DataFrame without a multiindex
inv_cols = inv_multi.reset_index()
print(inv_cols)
store_id product_id stock
0 0 2 48
1 0 4 18
2 0 5 58
3 0 7 149
4 0 8 158
... ... ... ...
6321864 9999 992 132
6321865 9999 995 121
6321866 9999 996 105
6321867 9999 998 99
6321868 9999 999 16
[6321869 rows x 3 columns]
%%timeit
inv_multi.xs(key=100, level='store_id')
10 loops, best of 3: 20.2 ms per loop
%%timeit
inv_cols.loc[inv_cols.store_id == 100]
The slowest run took 8.79 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 11.5 ms per loop
%%timeit
inv_multi.xs(key=100, level='product_id')
100 loops, best of 3: 9.08 ms per loop
%%timeit
inv_cols.loc[inv_cols.product_id == 100]
100 loops, best of 3: 12.2 ms per loop
%%timeit
inv_multi.xs(key=(100, 100), level=('store_id', 'product_id'))
10 loops, best of 3: 29.8 ms per loop
%%timeit
inv_cols.loc[(inv_cols.store_id == 100) & (inv_cols.product_id == 100)]
10 loops, best of 3: 28.8 ms per loop
Conclusion
The benefits from using a MultiIndex are about syntactic sugar, self-documenting data, and small conveniences from functions like unstack() as mentioned in @ZaxR's answer. Performance is not a benefit, which seems like a real missed opportunity.
Based on the comment on this answer it seems the experiment was flawed. Here is my attempt at a correct experiment.
Timings
import pandas as pd
import numpy as np
from timeit import timeit
random_data = np.random.randn(16, 4)
multiindex_lists = [["A", "B", "C", "D"], [1, 2, 3, 4]]
multiindex = pd.MultiIndex.from_product(multiindex_lists)
dfm = pd.DataFrame(random_data, multiindex)
df = dfm.reset_index()
print("dfm:\n", dfm, "\n")
print("df\n", df, "\n")
dfm_selection = dfm.loc[("B", 4), 3]
print("dfm_selection:", dfm_selection, type(dfm_selection))
df_selection = df[(df["level_0"] == "B") & (df["level_1"] == 4)][3].iat[0]
print("df_selection: ", df_selection, type(df_selection), "\n")
print("dfm_selection timeit:",
      timeit(lambda: dfm.loc[("B", 4), 3], number=int(1e6)))
print("df_selection timeit: ",
      timeit(
          lambda: df[(df["level_0"] == "B") & (df["level_1"] == 4)][3].iat[0],
          number=int(1e6)))
dfm:
0 1 2 3
A 1 -1.055128 -0.845019 -2.853027 0.521738
2 0.397804 0.385045 -0.121294 -0.696215
3 -0.551836 -0.666953 -0.956578 1.929732
4 -0.154780 1.778150 0.183104 -0.013989
B 1 -0.315476 0.564419 0.492496 -1.052432
2 -0.695300 0.085265 0.701724 -0.974168
3 -0.879915 -0.206499 1.597701 1.294885
4 0.653261 0.279641 -0.800613 1.050241
C 1 1.004199 -1.377520 -0.672913 1.491793
2 -0.453452 0.367264 -0.002362 0.411193
3 2.271958 0.240864 -0.923934 -0.572957
4 0.737893 -0.523488 0.485497 -2.371977
D 1 1.133661 -0.584973 -0.713320 -0.656315
2 -1.173231 -0.490667 0.634677 1.711015
3 -0.050371 -0.175644 0.124797 0.703672
4 1.349595 0.122202 -1.498178 0.013391
df
level_0 level_1 0 1 2 3
0 A 1 -1.055128 -0.845019 -2.853027 0.521738
1 A 2 0.397804 0.385045 -0.121294 -0.696215
2 A 3 -0.551836 -0.666953 -0.956578 1.929732
3 A 4 -0.154780 1.778150 0.183104 -0.013989
4 B 1 -0.315476 0.564419 0.492496 -1.052432
5 B 2 -0.695300 0.085265 0.701724 -0.974168
6 B 3 -0.879915 -0.206499 1.597701 1.294885
7 B 4 0.653261 0.279641 -0.800613 1.050241
8 C 1 1.004199 -1.377520 -0.672913 1.491793
9 C 2 -0.453452 0.367264 -0.002362 0.411193
10 C 3 2.271958 0.240864 -0.923934 -0.572957
11 C 4 0.737893 -0.523488 0.485497 -2.371977
12 D 1 1.133661 -0.584973 -0.713320 -0.656315
13 D 2 -1.173231 -0.490667 0.634677 1.711015
14 D 3 -0.050371 -0.175644 0.124797 0.703672
15 D 4 1.349595 0.122202 -1.498178 0.013391
dfm_selection: 1.0502406808918188 <class 'numpy.float64'>
df_selection: 1.0502406808918188 <class 'numpy.float64'>
dfm_selection timeit: 63.92458086000079
df_selection timeit: 450.4555013199997
Conclusion
MultiIndex single-value retrieval is over 7 times faster than conventional dataframe single-value retrieval.
The syntax for MultiIndex retrieval is much cleaner.

How to unite several results of a dataframe columns describe() into one dataframe?

I am applying describe() to several columns of my dataframe, for example:
raw_data.groupby("user_id").size().describe()
raw_data.groupby("business_id").size().describe()
And several more, because I want to find out how many data points there are per user on average, at the median, etc.
My question is: each of those calls returns something that seems to be unstructured output. Is there an easy way to combine them all into a single new dataframe whose columns will be [count,mean,std,min,25%,50%,75%,max] and whose index will be the various columns described?
Thanks!
I might simply build a new DataFrame manually. If you have
>>> raw_data
user_id business_id data
0 10 1 5
1 20 10 6
2 20 100 7
3 30 100 8
Then the results of groupby(smth).size().describe() are just another Series:
>>> raw_data.groupby("user_id").size().describe()
count 3.000000
mean 1.333333
std 0.577350
min 1.000000
25% 1.000000
50% 1.000000
75% 1.500000
max 2.000000
dtype: float64
>>> type(_)
<class 'pandas.core.series.Series'>
and so:
>>> descrs = ((col, raw_data.groupby(col).size().describe()) for col in raw_data)
>>> pd.DataFrame.from_items(descrs).T
count mean std min 25% 50% 75% max
user_id 3 1.333333 0.57735 1 1 1 1.5 2
business_id 3 1.333333 0.57735 1 1 1 1.5 2
data 4 1.000000 0.00000 1 1 1 1.0 1
Instead of from_items I could have passed a dictionary, e.g.
pd.DataFrame({col: raw_data.groupby(col).size().describe() for col in raw_data}).T, but this way the column order is preserved without having to think about it.
If you don't want all the columns, instead of for col in raw_data, you could define columns_to_describe = ["user_id", "business_id"] etc and use for col in columns_to_describe, or use for col in raw_data if col.endswith("_id"), or whatever you like.
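DataFrame.from_items was removed in pandas 1.0, so on current versions the dictionary form is the one to use. Since Python 3.7 dictionaries keep insertion order, so the column order is preserved that way as well. A self-contained sketch with the sample data from above:

```python
import pandas as pd

raw_data = pd.DataFrame({
    'user_id': [10, 20, 20, 30],
    'business_id': [1, 10, 100, 100],
    'data': [5, 6, 7, 8],
})

# One describe() Series per column, collected in a dict, then transposed
# so that each described column becomes a row
summary = pd.DataFrame(
    {col: raw_data.groupby(col).size().describe() for col in raw_data}
).T
print(summary)
```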