Apply rolling function to groupby over several columns - pandas
I'd like to apply rolling functions to a dataframe grouped by two columns with repeated date entries. Specifically, with both "freq" and "window" as datetime values, not simply ints.
In principle, I'm try to combine the methods from How to apply rolling functions in a group by object in pandas and pandas rolling sum of last five minutes.
Input
Here is a sample of the data, with one id=33 although we expect several id's.
X = [{'date': '2017-02-05', 'id': 33, 'item': 'A', 'points': 20},
{'date': '2017-02-05', 'id': 33, 'item': 'B', 'points': 10},
{'date': '2017-02-06', 'id': 33, 'item': 'B', 'points': 10},
{'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
{'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
{'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
{'date': '2017-02-13', 'id': 33, 'item': 'A', 'points': 4}]
# df = pd.DataFrame(X) and reindex df to pd.to_datetime(df['date'])
df
id item points
date
2017-02-05 33 A 20
2017-02-05 33 B 10
2017-02-06 33 B 10
2017-02-11 33 A 1
2017-02-11 33 A 1
2017-02-11 33 A 1
2017-02-13 33 A 4
Goal
Sample each 'id' every 2 days (freq='2d') and return the sum of total points for each item over the previous three days (window='3D'), end-date inclusive
Desired Output
id A B
date
2017-02-05 33 20 10
2017-02-07 33 20 30
2017-02-09 33 0 10
2017-02-11 33 3 0
2017-02-13 33 7 0
E.g. on the right-inclusive end-date 2017-02-13, we sample the 3-day period 2017-02-11 to 2017-02-13. In this period, id=33 had a sum of A points equal to 1+1+1+4 = 7
Attempts
An attempt of groupby with a pd.rolling_sum as follows didn't work, due to repeated dates
df.groupby(['id', 'item'])['points'].apply(pd.rolling_sum, freq='4D', window=3)
ValueError: cannot reindex from a duplicate axis
Also note that from the documentation http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.rolling_apply.html 'window' is an int representing the size sample period, not the number of days to sample.
We can also try resampling and using last, however the desired look-back of 3 days doesn't seem to be used
df.groupby(['id', 'item'])['points'].resample('2D', label='right', closed='right').\
apply(lambda x: x.last('3D').sum())
id item date
33 A 2017-02-05 20
2017-02-07 0
2017-02-09 0
2017-02-11 3
2017-02-13 4
B 2017-02-05 10
2017-02-07 10
Of course,setting up a loop over unique id's ID, selecting df_id = df[df['id']==ID], and summing over the periods does work but is computationally-intensive and doesn't exploit groupby's nice vectorization.
Thanks to #jezrael for good suggestions so far
Notes
Pandas version = 0.20.1
I'm a little confused as to why the documentation on rolling() here:https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html
suggests that the "window" parameter can be in an int or offset but on attempting df.rolling(window='3D',...) I getraise ValueError("window must be an integer")
It appears that the above documentation is not consistent with the latest code for rolling's window from ./core/window.py :
https://github.com/pandas-dev/pandas/blob/master/pandas/core/window.py
elif not is_integer(self.window):
raise ValueError("window must be an integer")
It's easiest to handle resample and rolling with date frequencies when we have a single level datetime index.
However, I can't pivot/unstack appropriately without dealing with duplicate A/Bs so I groupby and sum
I unstack one level date so I can fill_value=0. Currently, I can't fill_value=0 when I unstack more than one level at a time. I make up for it with a transpose T
Now that I've got a single level in the index, I reindex with a date range from the min to max values in the index
Finally, I do a rolling 3 day sum and resample that result every 2 days with resample
I clean this up with a bit of renaming indices and one more pivot.
s = df.set_index(['id', 'item'], append=True).points
s = s.groupby(level=['date', 'id', 'item']).sum()
d = s.unstack('date', fill_value=0).T
tidx = pd.date_range(d.index.min(), d.index.max())
d = d.reindex(tidx, fill_value=0)
d1 = d.rolling('3D').sum().resample('2D').first().astype(d.dtypes).stack(0)
d1 = d1.rename_axis(['date', 'id']).rename_axis(None, 1)
print(d1)
A B
date id
2017-02-05 33 20 10
2017-02-07 33 20 20
2017-02-09 33 0 0
2017-02-11 33 3 0
2017-02-13 33 7 0
df = pd.DataFrame(X)
# group sum by day
df = df.groupby(['date', 'id', 'item'])['points'].sum().reset_index().sort_values(['date', 'id', 'item'])
# convert index to datetime index
df = df.set_index('date')
df.index = DatetimeIndex(df.index)
# rolloing sum by 3D
df['pointsum'] = df.groupby(['id', 'item']).transform(lambda x: x.rolling(window='3D').sum())
# reshape dataframe
df = df.reset_index().set_index(['date', 'id', 'item'])['pointsum'].unstack().reset_index().set_index('date').fillna(0)
df
Related
Pandas equivalent of partition by [duplicate]
Trying to create a new column from the groupby calculation. In the code below, I get the correct calculated values for each date (see group below) but when I try to create a new column (df['Data4']) with it I get NaN. So I am trying to create a new column in the dataframe with the sum of Data3 for the all dates and apply that to each date row. For example, 2015-05-08 is in 2 rows (total is 50+5 = 55) and in this new column I would like to have 55 in both of the rows. import pandas as pd import numpy as np from pandas import DataFrame df = pd.DataFrame({ 'Date' : ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 'Sym' : ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 'Data2': [11, 8, 10, 15, 110, 60, 100, 40], 'Data3': [5, 8, 6, 1, 50, 100, 60, 120] }) group = df['Data3'].groupby(df['Date']).sum() df['Data4'] = group
You want to use transform this will return a Series with the index aligned to the df so you can then add it as a new column: In [74]: df = pd.DataFrame({'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 'Data2': [11, 8, 10, 15, 110, 60, 100, 40],'Data3': [5, 8, 6, 1, 50, 100, 60, 120]}) df['Data4'] = df['Data3'].groupby(df['Date']).transform('sum') df Out[74]: Data2 Data3 Date Sym Data4 0 11 5 2015-05-08 aapl 55 1 8 8 2015-05-07 aapl 108 2 10 6 2015-05-06 aapl 66 3 15 1 2015-05-05 aapl 121 4 110 50 2015-05-08 aaww 55 5 60 100 2015-05-07 aaww 108 6 100 60 2015-05-06 aaww 66 7 40 120 2015-05-05 aaww 121
How do I create a new column with Groupby().Sum()? There are two ways - one straightforward and the other slightly more interesting. Everybody's Favorite: GroupBy.transform() with 'sum' #Ed Chum's answer can be simplified, a bit. Call DataFrame.groupby rather than Series.groupby. This results in simpler syntax. # The setup. df[['Date', 'Data3']] Date Data3 0 2015-05-08 5 1 2015-05-07 8 2 2015-05-06 6 3 2015-05-05 1 4 2015-05-08 50 5 2015-05-07 100 6 2015-05-06 60 7 2015-05-05 120 df.groupby('Date')['Data3'].transform('sum') 0 55 1 108 2 66 3 121 4 55 5 108 6 66 7 121 Name: Data3, dtype: int64 It's a tad faster, df2 = pd.concat([df] * 12345) %timeit df2['Data3'].groupby(df['Date']).transform('sum') %timeit df2.groupby('Date')['Data3'].transform('sum') 10.4 ms ± 367 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 8.58 ms ± 559 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) Unconventional, but Worth your Consideration: GroupBy.sum() + Series.map() I stumbled upon an interesting idiosyncrasy in the API. From what I tell, you can reproduce this on any major version over 0.20 (I tested this on 0.23 and 0.24). It seems like you consistently can shave off a few milliseconds of the time taken by transform if you instead use a direct function of GroupBy and broadcast it using map: df.Date.map(df.groupby('Date')['Data3'].sum()) 0 55 1 108 2 66 3 121 4 55 5 108 6 66 7 121 Name: Date, dtype: int64 Compare with df.groupby('Date')['Data3'].transform('sum') 0 55 1 108 2 66 3 121 4 55 5 108 6 66 7 121 Name: Data3, dtype: int64 My tests show that map is a bit faster if you can afford to use the direct GroupBy function (such as mean, min, max, first, etc). It is more or less faster for most general situations upto around ~200 thousand records. After that, the performance really depends on the data. (Left: v0.23, Right: v0.24) Nice alternative to know, and better if you have smaller frames with smaller numbers of groups. . . but I would recommend transform as a first choice. Thought this was worth sharing anyway. Benchmarking code, for reference: import perfplot perfplot.show( setup=lambda n: pd.DataFrame({'A': np.random.choice(n//10, n), 'B': np.ones(n)}), kernels=[ lambda df: df.groupby('A')['B'].transform('sum'), lambda df: df.A.map(df.groupby('A')['B'].sum()), ], labels=['GroupBy.transform', 'GroupBy.sum + map'], n_range=[2**k for k in range(5, 20)], xlabel='N', logy=True, logx=True )
I suggest in general to use the more powerful apply, with which you can write your queries in single expressions even for more complicated uses, such as defining a new column whose values are defined are defined as operations on groups, and that can have also different values within the same group! This is more general than the simple case of defining a column with the same value for every group (like sum in this question, which varies by group by is the same within the same group). Simple case (new column with same value within a group, different across groups): # I'm assuming the name of your dataframe is something long, like # `my_data_frame`, to show the power of being able to write your # data processing in a single expression without multiple statements and # multiple references to your long name, which is the normal style # that the pandas API naturally makes you adopt, but which make the # code often verbose, sparse, and a pain to generalize or refactor my_data_frame = pd.DataFrame({ 'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 'Data2': [11, 8, 10, 15, 110, 60, 100, 40], 'Data3': [5, 8, 6, 1, 50, 100, 60, 120]}) (my_data_frame # create groups by 'Date' .groupby(['Date']) # for every small Group DataFrame `gdf` with the same 'Date', do: # assign a new column 'Data4' to it, with the value being # the sum of 'Data3' for the small dataframe `gdf` .apply(lambda gdf: gdf.assign(Data4=lambda gdf: gdf['Data3'].sum())) # after groupby operations, the variable(s) you grouped by on # are set as indices. In this case, 'Date' was set as an additional # level for the (multi)index. But it is still also present as a # column. Thus, we drop it from the index: .droplevel(0) ) ### OR # We don't even need to define a variable for our dataframe. # We can chain everything in one expression (pd .DataFrame({ 'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 'Data2': [11, 8, 10, 15, 110, 60, 100, 40], 'Data3': [5, 8, 6, 1, 50, 100, 60, 120]}) .groupby(['Date']) .apply(lambda gdf: gdf.assign(Data4=lambda gdf: gdf['Data3'].sum())) .droplevel(0) ) Out: Date Sym Data2 Data3 Data4 3 2015-05-05 aapl 15 1 121 7 2015-05-05 aaww 40 120 121 2 2015-05-06 aapl 10 6 66 6 2015-05-06 aaww 100 60 66 1 2015-05-07 aapl 8 8 108 5 2015-05-07 aaww 60 100 108 0 2015-05-08 aapl 11 5 55 4 2015-05-08 aaww 110 50 55 (Why are the python expression within parentheses? So that we don't need to sprinkle our code with backslashes all over the place, and we can put comments within our expression code to describe every step.) What is powerful about this? It's that it is harnessing the full power of the "split-apply-combine paradigm". It is allowing you to think in terms of "splitting your dataframe into blocks" and "running arbitrary operations on those blocks" without reducing/aggregating, i.e., without reducing the number of rows. (And without writing explicit, verbose loops and resorting to expensive joins or concatenations to glue the results back.) Let's consider a more complex example. One in which you have multiple time series of data in your dataframe. You have a column that represents a kind of product, a column that has timestamps, and a column that contains the number of items sold for that product at some time of the year. You would like to group by product and obtain a new column, that contains the cumulative total for the items that are sold for each category. We want a column that, within every "block" with the same product, is still a time series, and is monotonically increasing (only within a block). How can we do this? With groupby + apply! (pd .DataFrame({ 'Date': ['2021-03-11','2021-03-12','2021-03-13','2021-03-11','2021-03-12','2021-03-13'], 'Product': ['shirt','shirt','shirt','shoes','shoes','shoes'], 'ItemsSold': [300, 400, 234, 80, 10, 120], }) .groupby(['Product']) .apply(lambda gdf: (gdf # sort by date within a group .sort_values('Date') # create new column .assign(CumulativeItemsSold=lambda df: df['ItemsSold'].cumsum()))) .droplevel(0) ) Out: Date Product ItemsSold CumulativeItemsSold 0 2021-03-11 shirt 300 300 1 2021-03-12 shirt 400 700 2 2021-03-13 shirt 234 934 3 2021-03-11 shoes 80 80 4 2021-03-12 shoes 10 90 5 2021-03-13 shoes 120 210 Another advantage of this method? It works even if we have to group by multiple fields! For example, if we had a 'Color' field for our products, and we wanted the cumulative series grouped by (Product, Color), we can: (pd .DataFrame({ 'Date': ['2021-03-11','2021-03-12','2021-03-13','2021-03-11','2021-03-12','2021-03-13', '2021-03-11','2021-03-12','2021-03-13','2021-03-11','2021-03-12','2021-03-13'], 'Product': ['shirt','shirt','shirt','shoes','shoes','shoes', 'shirt','shirt','shirt','shoes','shoes','shoes'], 'Color': ['yellow','yellow','yellow','yellow','yellow','yellow', 'blue','blue','blue','blue','blue','blue'], # new! 'ItemsSold': [300, 400, 234, 80, 10, 120, 123, 84, 923, 0, 220, 94], }) .groupby(['Product', 'Color']) # We group by 2 fields now .apply(lambda gdf: (gdf .sort_values('Date') .assign(CumulativeItemsSold=lambda df: df['ItemsSold'].cumsum()))) .droplevel([0,1]) # We drop 2 levels now Out: Date Product Color ItemsSold CumulativeItemsSold 6 2021-03-11 shirt blue 123 123 7 2021-03-12 shirt blue 84 207 8 2021-03-13 shirt blue 923 1130 0 2021-03-11 shirt yellow 300 300 1 2021-03-12 shirt yellow 400 700 2 2021-03-13 shirt yellow 234 934 9 2021-03-11 shoes blue 0 0 10 2021-03-12 shoes blue 220 220 11 2021-03-13 shoes blue 94 314 3 2021-03-11 shoes yellow 80 80 4 2021-03-12 shoes yellow 10 90 5 2021-03-13 shoes yellow 120 210 (This possibility of easily extending to grouping over multiple fields is the reason why I like to put the arguments of groupby always in a list, even if it's a single name, like 'Product' in the previous example.) And you can do all of this synthetically in a single expression. (Sure, if python's lambdas were a bit nicer to look at, it would look even nicer.) Why did I go over a general case? Because this is one of the first SO questions that pops up when googling for things like "pandas new column groupby". Additional thoughts on the API for this kind of operation Adding columns based on arbitrary computations made on groups is much like the nice idiom of defining new column using aggregations over Windows in SparkSQL. For example, you can think of this (it's Scala code, but the equivalent in PySpark looks practically the same): val byDepName = Window.partitionBy('depName) empsalary.withColumn("avg", avg('salary) over byDepName) as something like (using pandas in the way we have seen above): empsalary = pd.DataFrame(...some dataframe...) (empsalary # our `Window.partitionBy('depName)` .groupby(['depName']) # our 'withColumn("avg", avg('salary) over byDepName) .apply(lambda gdf: gdf.assign(avg=lambda df: df['salary'].mean())) .droplevel(0) ) (Notice how much synthetic and nicer the Spark example is. The pandas equivalent looks a bit clunky. The pandas API doesn't make writing these kinds of "fluent" operations easy). This idiom in turns comes from SQL's Window Functions, which the PostgreSQL documentation gives a very nice definition of: (emphasis mine) A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result. And gives a beautiful SQL one-liner example: (ranking within groups) SELECT depname, empno, salary, rank() OVER (PARTITION BY depname ORDER BY salary DESC) FROM empsalary; depname empno salary rank develop 8 6000 1 develop 10 5200 2 develop 11 5200 2 develop 9 4500 4 develop 7 4200 5 personnel 2 3900 1 personnel 5 3500 2 sales 1 5000 1 sales 4 4800 2 sales 3 4800 2 Last thing: you might also be interested in pandas' pipe, which is similar to apply but works a bit differently and gives the internal operations a bigger scope to work on. See here for more
df = pd.DataFrame({ 'Date' : ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 'Sym' : ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 'Data2': [11, 8, 10, 15, 110, 60, 100, 40], 'Data3': [5, 8, 6, 1, 50, 100, 60, 120] }) print(pd.pivot_table(data=df,index='Date',columns='Sym', aggfunc={'Data2':'sum','Data3':'sum'})) output Data2 Data3 Sym aapl aaww aapl aaww Date 2015-05-05 15 40 1 120 2015-05-06 10 100 6 60 2015-05-07 8 60 8 100 2015-05-08 11 110 5 50
Sorting columns in pandas dataframe while automating reports
I am working on an automation task and my dataframe columns are as shown below Defined Discharge Bin Apr-20 Jan-20 Mar-20 May-20 Grand Total 2-4 min 1 1 4-6 min 5 1 6 6-8 min 5 7 2 14 I want to sort the columns starting from Jan-20. The problem here is that the columns automatically get sorted according to alphabetical order. Sorting can be done manually but since I'm working on an automation task I need to ensure that each month when we feed the data the columns should automatically get sorted according to the months of the year.
Try this: import pandas as pd df = pd.DataFrame(data={'Defined Discharge Bin':['2-4 min', '4-6 min','6-8 min'], 'Apr-20':['', '', ''], 'Jan-20':['', 5, 5], 'Mar-20':['', '', 7], 'May-20':[1, 1, 2], 'Grand Total':[1, 6, 14]}) cols_exclude = ['Defined Discharge Bin', 'Grand Total'] cols_date = [c for c in df.columns.tolist() if c not in cols_exclude] cols_sorted = sorted(cols_date, key=lambda x: pd.to_datetime(x, format='%b-%y')) df = df[cols_exclude[0:1] + cols_sorted + cols_exclude[-1:]] print(df) Output: Defined Discharge Bin Jan-20 Mar-20 Apr-20 May-20 Grand Total 0 2-4 min 1 1 1 4-6 min 5 1 6 2 6-8 min 5 7 2 14
Sorting date values in a dataframe doesn't work
I have the column 'Created At' in this form: The date is in this format: '%d/%m/%Y' -> day, month, year obj = {'Created At': ['01/01/2017', '01/02/2017', '02/01/2017', '02/02/2017', '03/01/2017', '03/02/2017','04/01/2017' ], 'Text': [1, 70,14,17,84,76,32]} df = pd.DataFrame(data=obj) I did it, but dosen't work: df.sort_values(by='Created At', inplace=True) It seems that it sorts only the days and disregards the month. What do I do?
It does sort it properly: your dates are strings here. Strings are sorted lexicographically. So that means that only if the first character is the same, it will look at the second character, etc. You therefore might want to convert the column first to datetime objects: df['Created At'] = pd.to_datetime(df['Created At'], format='%d/%m/%Y') then we can sort the dataframe, and obtain: >>> df.sort_values(by='Created At', inplace=True) >>> df Created At Text 0 2017-01-01 1 2 2017-01-02 14 4 2017-01-03 84 6 2017-01-04 32 1 2017-02-01 70 3 2017-02-02 17 5 2017-02-03 76
Pandas Flatten a Complex Multi-level column dataframe
I initially had a dataframe with column ID and Date, i wanted to find the first and last Date entry for every ID. Therefore i applied an aggregation function: df.groupby('ID').agg({'Date':['first','last']}) I have a dataframe in the following form: print(df.columns) >> MultiIndex(levels=[['Date', 'ID', 'difference'], ['first', 'last', '']], labels=[[1, 0, 0, 2], [2, 0, 1, 2]]) I want to flatten this dataframe such that i get the dataframe in the following manner: I tried using df.reset_index(level=[0]) and also used df.unstack() but couldn't get the desired result. Any leads regarding on how to solve this problem?
I think you need change aggregate function for avoid MultiIndex in columns with specify column for aggregate and list of aggregating functions: rng = pd.date_range('2017-04-03', periods=10) df = pd.DataFrame({'Date': rng, 'id': [23] * 5 + [35] * 5}) print (df) Date id 0 2017-04-03 23 1 2017-04-04 23 2 2017-04-05 23 3 2017-04-06 23 4 2017-04-07 23 5 2017-04-08 35 6 2017-04-09 35 7 2017-04-10 35 8 2017-04-11 35 9 2017-04-12 35 df1 = df.groupby('id')['Date'].agg(['first','last']).reset_index() print (df1) id first last 0 23 2017-04-03 2017-04-07 1 35 2017-04-08 2017-04-12
Slicing a dataframe from unstack()
This is a dataframe, df that emerged using a unstack() Index 0 7 21 22 23 June 89 0 3 5 2 July 30 0 2 5 4 August 20 8 5 5 5 I tried to slice a portion of the dataframe using df2=df.loc[: , :'21'] I tried to slice a portion of the dataframe using df2=df.loc[: , :'21'] But I have this error: keyError: '21'
Error mean there is no column 21 as string. Check it by: #integers columns print (df.columns.tolist()) [0, 7, 21, 22, 23] #strings columns #print (df.columns.tolist()) #['0', '7', '21', '22', '23'] You can use for Series integer 21: df2 = df[21] print (df2) Index June 3 July 2 August 5 Name: 21, dtype: int64 And for one columns DataFrame use double []: df2 = df[[21]] print (df2) 21 Index June 3 July 2 August 5 EDIT: Another problem should be trailing whitespaces: print (df.columns.tolist()) ['0', '7', ' 21 ', '22', '23'] and for remove need str.strip: df.columns = df.columns.str.strip() print (df.columns.tolist()) ['0', '7', '21', '22', '23'] EDIT1: For filter columns with values less as 24 with bad data also use to_numeric with errors='coerce' for NaNs for unparseable values: df2 = df.loc[:, pd.to_numeric(df.columns, errors='coerce') < 24]