How can I compute a rolling sum using groupby in pandas?

I'm working on a fun side project and would like to compute a moving sum of the number of wins for NBA teams over 2-year periods. Consider the sample pandas DataFrame below:
pd.DataFrame({'Team':['Hawks','Hawks','Hawks','Hawks','Hawks'], 'Season':[1970,1971,1972,1973,1974],'Wins':[40,34,30,46,42]})
I would ideally like to compute the sum of the number of wins between 1970 and 1971, 1971 and 1972, 1972 and 1973, etc. An inefficient way would be to use a loop; is there a way to do this using the .groupby function?

This is a little bit of a hack, but you could group by df['Season'] // 2 * 2, i.e. floor-divide by two and then multiply by two again, which rounds each year down to a multiple of two. Note that this produces non-overlapping two-year bins (1970-71, 1972-73, ...), not the overlapping windows in your example.
df_sum = pd.DataFrame(df.groupby(['Team', df['Season'] // 2 * 2])['Wins'].sum()).reset_index()
Output:
Team Season Wins
0 Hawks 1970 74
1 Hawks 1972 76
2 Hawks 1974 42

If the seasons are already in order for each team, you can simply combine groupby with rolling. For example:
import pandas as pd
df = pd.DataFrame({'Team':['Hawks','Hawks','Hawks','Hawks','Hawks'], 'Season':[1970,1971,1972,1973,1974],'Wins':[40,34,30,46,42]})
res = df.groupby('Team')['Wins'].rolling(2).sum()
print(res)
Out:
Team
Hawks 0 NaN
1 74.0
2 64.0
3 76.0
4 88.0
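To get the rolling sums back as a column of the original frame (rather than the MultiIndexed Series shown above), one sketch is to drop the group level of the result's index before assigning, using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Hawks'] * 5,
                   'Season': [1970, 1971, 1972, 1973, 1974],
                   'Wins': [40, 34, 30, 46, 42]})

# The grouped rolling sum comes back indexed by ('Team', original index);
# dropping the group level realigns it with df's own index.
df['Wins_2yr'] = (df.groupby('Team')['Wins']
                    .rolling(2).sum()
                    .reset_index(level=0, drop=True))
```

The first season of each team has no preceding year, so its rolling sum is NaN.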

Related

using groupby for datetime values in pandas

I'm using this code in order to groupby my data by year
df = pd.read_csv('../input/companies-info-wikipedia-2021/sparql_2021-11-03_22-25-45Z.csv')
df_duplicate_name = df[df.duplicated(['name'])]
df = df.drop_duplicates(subset='name').reset_index()
df = df.drop(['a','type','index'],axis=1).reset_index()
df = df[~df['foundation'].str.contains('[A-Za-z]', na=False)]
df = df.drop([140,214,220])
df['foundation'] = df['foundation'].fillna(0)
df['foundation'] = pd.to_datetime(df['foundation'])
df['foundation'] = df['foundation'].dt.year
df = df.groupby('foundation')
But as a result it does not group it by foundation values:
0 0 Deutsche EuroShop AG 1999 http://dbpedia.org/resource/Germany Investment in shopping centers http://dbpedia.org/resource/Real_property 4 2.964E9 1.25E9 2.241E8 8.04E7
1 1 Industry of Machinery and Tractors 1996 http://dbpedia.org/resource/Belgrade http://dbpedia.org/resource/Tractors http://dbpedia.org/resource/Agribusiness 4 4.648E7 0.0 30000.0 -€0.47 million
2 2 TelexFree Inc. 2012 http://dbpedia.org/resource/Massachusetts 99 http://dbpedia.org/resource/Multi-level_marketing 7 did not disclose did not disclose did not disclose did not disclose
3 3 (prev. Common Cents Communications Inc.) 2012 http://dbpedia.org/resource/United_States 99 http://dbpedia.org/resource/Multi-level_marketing 7 did not disclose did not disclose did not disclose did not disclose
4 4 Bionor Holding AS 1993 http://dbpedia.org/resource/Oslo http://dbpedia.org/resource/Health_care http://dbpedia.org/resource/Biotechnology 18 NOK 253 395 million NOK 203 320 million 1.09499E8 NOK 49 020 million
... ... ... ... ... ... ... ... ... ... ... ...
255 255 Ageas SA/NV 1990 http://dbpedia.org/resource/Belgium http://dbpedia.org/resource/Insurance http://dbpedia.org/resource/Financial_services 45000 1.0872E11 1.348E10 1.112E10 9.792E8
256 256 Sharp Corporation 1912 http://dbpedia.org/resource/Japan Televisions, audiovisual, home appliances, inf... http://dbpedia.org/resource/Consumer_electronics 52876 NaN NaN NaN NaN
257 257 Erste Group Bank AG 2008 Vienna, Austria Retail and commercial banking, investment and ... http://dbpedia.org/resource/Financial_services 47230 2.71983E11 1.96E10 6.772E9 1187000.0
258 258 Manulife Financial Corporation 1887 200 Asset management, Commercial banking, Commerci... http://dbpedia.org/resource/Financial_services 34000 750300000000 47200000000 39000000000 4800000000
259 259 BP plc 1909 London, England, UK http://dbpedia.org/resource/Natural_gas http://dbpedia.org/resource/Petroleum_industry
I also tried with making it again pd.to_datetime and sorting by dt.year - but still unsuccessful.
Column names:
Index(['index', 'name', 'foundation', 'location', 'products', 'sector',
'employee', 'assets', 'equity', 'revenue', 'profit'],
dtype='object')
@Ruslan you simply need to use a "sorting" command, not a "groupby". You can achieve this in one of two ways:
myDF.sort_values(by='column_name', ascending=True, inplace=True)
or, in case you need to set your column as the index, you would do this:
myDF = myDF.set_index('column_name')
myDF = myDF.sort_index(ascending=True)
GroupBy is a totally different command: it is used to apply actions after you group values by some criterion, such as finding the sum, average, min, or max of the grouped values.
pandas.DataFrame.sort_values
pandas.DataFrame.groupby
I think you're misunderstanding how groupby() works.
You can't do df = df.groupby('foundation'). groupby() does not return a new DataFrame; it returns a GroupBy object, which is essentially just a mapping from each grouped-by value to a DataFrame containing the rows that share that value in the specified column.
You can, for example, print how many rows are in each group with the following code:
groups = df.groupby('foundation')
for val, sub_df in groups:
    print(f'{val}: {sub_df.shape[0]} rows')
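Equivalently, applying an aggregation is what turns the GroupBy back into a regular pandas object. A minimal sketch with made-up stand-in data (only the two columns that matter here, not the real companies CSV):

```python
import pandas as pd

# Hypothetical stand-in for the companies data.
df = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                   'foundation': [1999, 2012, 2012, 1993]})

# size() aggregates each group, yielding e.g. the number of
# companies founded in each year as an ordinary Series.
counts = df.groupby('foundation').size()
```

Here `counts` is indexed by foundation year, which is likely what the grouping was meant to produce.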

why is groupby in pandas not displaying

I have a df like:
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
'Parrot', 'Parrot','Elephant','Elephant','Elephant'],
'Max Speed': [380, 370, 24, 26,5,7,3]})
I would like to group by Animal.
if I do in a notebook:
a = df.groupby(['Animal'])
display(a)
I get:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f945bdd7b80>
I expected something like a table grouped by animal. What I ultimately want to do is sort the df by number of animal appearances (Elephant 3, Falcon 2, etc.).
That's because you are not using any aggregate function after the groupby:
a = df.groupby(['Animal'])
display(a)
Rectified -
a = df.groupby(['Animal']).count()
display(a)
Now, after using an aggregate function such as count(), sum(), etc. (optionally followed by sort_values()), you will see the groupby results.
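For the asker's ultimate goal (sorting the rows by how often each animal appears), one sketch is to count appearances with value_counts() and use that as the sort key; note the key= argument of sort_values requires pandas 1.1+:

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot',
                              'Elephant', 'Elephant', 'Elephant'],
                   'Max Speed': [380, 370, 24, 26, 5, 7, 3]})

# Appearances per animal: Elephant 3, Falcon 2, Parrot 2.
counts = df['Animal'].value_counts()

# Sort the rows so the most frequent animals come first.
df_sorted = df.sort_values('Animal', key=lambda s: s.map(counts),
                           ascending=False)
```

Mapping each row's animal to its count lets sort_values order whole rows by frequency without an explicit merge.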
You need to check DataFrame.groupby:
Group DataFrame using a mapper or by a Series of columns.
So it is not for removing duplicated values in a column, but for aggregation.
If you need to remove the duplicated values by setting them to an empty string, use:
df.loc[df['Animal'].duplicated(), 'Animal'] = ''
print (df)
Animal Max Speed
0 Falcon 380
1 370
2 Parrot 24
3 26
4 Elephant 5
5 7
6 3
If you need groupby:
for i, g in df.groupby(['Animal']):
    print(g)
Animal Max Speed
4 Elephant 5
5 Elephant 7
6 Elephant 3
Animal Max Speed
0 Falcon 380
1 Falcon 370
Animal Max Speed
2 Parrot 24
3 Parrot 26
The groupby object requires an action, like a max or a min. This will result in two things:
A regular pandas data frame
The grouping key appearing once
You clearly expect both of the Falcon entries to remain, so you don't actually want a groupby. If you want the entries with repeated animal values hidden, you would do that by setting the Animal column as the index. I say that because your input data frame is already in the order you want to display.
Use mask:
>>> df.assign(Animal=df['Animal'].mask(df['Animal'].duplicated(), ''))
Animal Max Speed
0 Falcon 380
1 370
2 Parrot 24
3 26
4 Elephant 5
5 7
6 3
>>>
Or as index:
df.assign(Animal=df['Animal'].mask(df['Animal'].duplicated(), '')).set_index('Animal')
Max Speed
Animal
Falcon 380
370
Parrot 24
26
Elephant 5
7
3
>>>

pandas - how to vectorize group-by calculations instead of iterating

Here is a code snippet that simulates the problem I am facing: I am using iteration on large datasets.
import pandas as pd
import numpy as np

df = pd.DataFrame({'grp': np.random.choice([1, 2, 3, 4, 5], 500),
                   'col1': np.arange(0, 500),
                   'col2': np.random.randint(0, 10, 500),
                   'col3': np.nan})
for index, row in df.iterrows():
    # based on group label, get last 3 values to calculate mean
    d = df.iloc[0:index].groupby('grp')
    try:
        dgrp_sum = d.get_group(row.grp).col2.tail(3).mean()
    except KeyError:
        dgrp_sum = 999
    # after getting the last 3 values of the group before the current row,
    # multiply by the other columns
    df.at[index, 'col3'] = dgrp_sum * row.col1 * row.col2
If I want to speed it up with vectorized operations, how do I convert this code?
You are basically calculating a moving average within every group, which means you can group the dataframe by "grp" and calculate a rolling mean. At the end you multiply the columns in each row, because that part does not depend on the group.
df["col3"] = df.groupby("grp").col2.rolling(3, min_periods=1).mean().reset_index(0,drop=True)
df["col3"] = df[["col1", "col2", "col3"]].product(axis=1)
Note: in your code, each calculated mean excludes the current row, so it corresponds to the rolling mean of the previous row; that's probably why you have the try block.
# Skipping last product gives only mean
# np.random.seed(1234)
# print(df[df["grp"] == 2])
grp col1 col2 iter mask
4 2 4 6 999.000000 6.000000
5 2 5 0 6.000000 3.000000
6 2 6 9 3.000000 5.000000
17 2 17 1 5.000000 3.333333
27 2 27 9 3.333333 6.333333
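To reproduce the iterative semantics exactly (the mean of the previous three values in the group, excluding the current row, with 999 when there is no history yet), one sketch is to shift within each group before rolling; transform keeps the result aligned with the original index:

```python
import pandas as pd
import numpy as np

np.random.seed(1234)
df = pd.DataFrame({'grp': np.random.choice([1, 2, 3, 4, 5], 500),
                   'col1': np.arange(0, 500),
                   'col2': np.random.randint(0, 10, 500)})

# shift(1) excludes the current row, matching "last 3 values before this row";
# each group's first row has no history and falls back to 999.
prev_mean = df.groupby('grp')['col2'].transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean())
df['col3'] = prev_mean.fillna(999) * df['col1'] * df['col2']
```

This avoids the try/except entirely, since the missing-history case surfaces as NaN rather than a KeyError.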

Calculating win percentage for individual teams based on pandas df

Have a pandas df like shown below
HomeTeam AwayTeam Winner
Warriors Cavaliers 1
Pistons Rockets 0
Warriors Rockets 1
Heat Warriors 0
The winning team (home or away) is represented by the binary outcome in the Winner column. I want to calculate winning percentage overall for each team. How would I go about doing this?
Expected output would be something like
Team Win%
Warriors 85%
Heat 22%
....
We can use np.choose to select the winning team, then perform .value_counts() on both the winners and on all teams that played, and compute the ratio with:
np.choose(
df['Winner'], [df['HomeTeam'], df['AwayTeam']]
).value_counts().div(
df[['HomeTeam', 'AwayTeam']].stack().value_counts()
).fillna(0)
Here we thus use np.choose to select the winning teams and perform a value count; next we .stack() the HomeTeam and AwayTeam columns to obtain a series of all the teams that played, and use .value_counts() to calculate how many times each team played. In case a team never appears on the winning side, the division results in a NaN; we solve that by filling these values with 0.
With the given sample data, we obtain:
>>> np.choose(df['Winner'], [df['HomeTeam'], df['AwayTeam']]).value_counts().div(df[['HomeTeam', 'AwayTeam']].stack().value_counts()).fillna(0)
Cavaliers 1.0
Heat 1.0
Pistons 1.0
Rockets 0.5
Warriors 0.0
dtype: float64
Here the Cavaliers, Heat and Pistons won all their matches, the Rockets won half of their matches, and the Warriors did not win any match.
Use numpy.where to get the winning team and count occurrences with Series.value_counts. Then count how often each team appears across both columns (stack the columns and apply Series.value_counts again), divide with Series.div, multiply by 100, and finally convert the Series to a DataFrame with Series.rename_axis and Series.reset_index:
winners = pd.Series(np.where(df['Winner'], df['AwayTeam'], df['HomeTeam'])).value_counts()
all_teams = df[['HomeTeam', 'AwayTeam']].stack().value_counts()
df1 = winners.div(all_teams, fill_value=0).mul(100).rename_axis('Team').reset_index(name='Win%')
print (df1)
Team Win%
0 Cavaliers 100.0
1 Heat 100.0
2 Pistons 100.0
3 Rockets 50.0
4 Warriors 0.0
Details:
print (winners)
Rockets 1
Heat 1
Cavaliers 1
Pistons 1
dtype: int64
print (all_teams)
Warriors 3
Rockets 2
Heat 1
Cavaliers 1
Pistons 1
dtype: int64

rank data over a rolling window in pandas DataFrame

I am new to Python and the Pandas library, so apologies if this is a trivial question. I am trying to rank a time series over a rolling window of N days. I know there is a rank function, but it ranks the data over the entire time series; I can't seem to find a rolling rank function.
Here is an example of what I am trying to do:
A
01-01-2013 100
02-01-2013 85
03-01-2013 110
04-01-2013 60
05-01-2013 20
06-01-2013 40
If I wanted to rank the data over a rolling window of 3 days, the answer should be:
Ranked_A
01-01-2013 NaN
02-01-2013 NaN
03-01-2013 1
04-01-2013 3
05-01-2013 3
06-01-2013 2
Is there a built-in function in Python that can do this? Any suggestion?
Many thanks.
If you want to use the Pandas built-in rank method (with its additional semantics, such as the ascending option), you can create a simple function wrapper for it
def rank(array):
    s = pd.Series(array)
    return s.rank(ascending=False)[len(s) - 1]
that can then be used as a custom rolling-window function. The old pd.rolling_apply has since been removed from pandas; its replacement is the .rolling(...).apply(...) method:
df['A'].rolling(3).apply(rank)
which outputs
Date
01-01-2013 NaN
02-01-2013 NaN
03-01-2013 1
04-01-2013 3
05-01-2013 3
06-01-2013 2
(I'm assuming the df data structure from Rutger's answer)
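On recent pandas (1.4 and later), rolling objects have a built-in rank method, so the wrapper is no longer needed; a sketch with the same data:

```python
import pandas as pd

s = pd.Series([100, 85, 110, 60, 20, 40],
              index=['01-01-2013', '02-01-2013', '03-01-2013',
                     '04-01-2013', '05-01-2013', '06-01-2013'])

# ascending=False ranks the largest value in each 3-day window as 1;
# each entry is the rank of that day's value within its own window.
ranked = s.rolling(3).rank(ascending=False)
```

The first two entries are NaN because those windows are incomplete, matching the expected output above.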
You can write a custom function for a rolling window in Pandas. Using numpy's argsort() in that function can give you the rank within the window:
import pandas as pd
from io import StringIO

testdata = StringIO("""
Date,A
01-01-2013,100
02-01-2013,85
03-01-2013,110
04-01-2013,60
05-01-2013,20
06-01-2013,40""")
df = pd.read_csv(testdata, index_col=['Date'])
rollrank = lambda data: data.size - data.argsort().argsort()[-1]
df['rank'] = df['A'].rolling(3).apply(rollrank, raw=True)
print(df)
results in:
A rank
Date
01-01-2013 100 NaN
02-01-2013 85 NaN
03-01-2013 110 1
04-01-2013 60 3
05-01-2013 20 3
06-01-2013 40 2