Calculating win percentage for individual teams based on pandas df

I have a pandas df like the one shown below:
HomeTeam   AwayTeam   Winner
Warriors   Cavaliers  1
Pistons    Rockets    0
Warriors   Rockets    1
Heat       Warriors   0
The winning team (home or away) is indicated by the binary outcome in the Winner column. I want to calculate the overall winning percentage for each team. How would I go about doing this?
Expected output would be something like:
Team      Win%
Warriors  85%
Heat      22%
...
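For reference, the sample frame can be reconstructed as follows (a minimal sketch, assuming exactly these columns):
import pandas as pd
import numpy as np

df = pd.DataFrame({'HomeTeam': ['Warriors', 'Pistons', 'Warriors', 'Heat'],
                   'AwayTeam': ['Cavaliers', 'Rockets', 'Rockets', 'Warriors'],
                   'Winner': [1, 0, 1, 0]})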

We can use np.choose to select the winning team, run .value_counts() on both the winners and on all teams that played, and then calculate the ratio with:
np.choose(
    df['Winner'], [df['HomeTeam'], df['AwayTeam']]
).value_counts().div(
    df[['HomeTeam', 'AwayTeam']].stack().value_counts()
).fillna(0)
Here we use np.choose to select the winning team of each match and count the wins per team with a value count. Next we .stack() the HomeTeam and AwayTeam columns to obtain a series of all the teams that played, and use .value_counts() to count how many matches each team played. If a team never appears on the winning side, the division yields NaN; we fix that by filling in 0 for these values.
With the given sample data, we obtain:
>>> np.choose(df['Winner'], [df['HomeTeam'], df['AwayTeam']]).value_counts().div(df[['HomeTeam', 'AwayTeam']].stack().value_counts()).fillna(0)
Cavaliers 1.0
Heat 1.0
Pistons 1.0
Rockets 0.5
Warriors 0.0
dtype: float64
Here the Cavaliers, Heat and Pistons won all their matches, the Rockets won half of their matches, and the Warriors did not win any match.
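If the exact Team/Win% layout from the question is wanted, the same ratio Series can be scaled and formatted; a minimal sketch (the variable names and the percent formatting are assumptions, not part of the original answer):
ratios = np.choose(
    df['Winner'], [df['HomeTeam'], df['AwayTeam']]
).value_counts().div(
    df[['HomeTeam', 'AwayTeam']].stack().value_counts()
).fillna(0)
# scale to percent and render as strings like '50%'
win_pct = ratios.mul(100).map('{:.0f}%'.format).rename_axis('Team').reset_index(name='Win%')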

Use numpy.where to get the winning team, count the wins with Series.value_counts, and divide by the number of matches each team played, obtained by stacking both team columns and applying Series.value_counts again. For the division use Series.div, then multiply by 100, and finally convert the Series to a DataFrame with Series.rename_axis and Series.reset_index:
winners = pd.Series(np.where(df['Winner'], df['AwayTeam'], df['HomeTeam'])).value_counts()
all_teams = df[['HomeTeam', 'AwayTeam']].stack().value_counts()
df1 = winners.div(all_teams, fill_value=0).mul(100).rename_axis('Team').reset_index(name='Win%')
print (df1)
        Team   Win%
0  Cavaliers  100.0
1       Heat  100.0
2    Pistons  100.0
3    Rockets   50.0
4   Warriors    0.0
Details:
print (winners)
Rockets 1
Heat 1
Cavaliers 1
Pistons 1
dtype: int64
print (all_teams)
Warriors 3
Rockets 2
Heat 1
Cavaliers 1
Pistons 1
dtype: int64

Related

How can I compute a rolling sum using groupby in pandas?

I'm working on a fun side project and would like to compute a moving sum of the number of wins for NBA teams over 2-year periods. Consider the sample pandas dataframe below:
pd.DataFrame({'Team':['Hawks','Hawks','Hawks','Hawks','Hawks'], 'Season':[1970,1971,1972,1973,1974],'Wins':[40,34,30,46,42]})
I would ideally like to compute the sum of the number of wins between 1970 and 1971, 1971 and 1972, 1972 and 1973, etc. An inefficient way would be to use a loop; is there a way to do this using the .groupby function?
This is a little bit of a hack, but you could group by df['Season'] // 2 * 2, i.e. floor-dividing by two and then multiplying by two again. The effect is to round each season down to a multiple of two, so the seasons fall into non-overlapping two-year bins (1970-71, 1972-73, ...) rather than an overlapping rolling window.
df_sum = pd.DataFrame(df.groupby(['Team', df['Season'] // 2 * 2])['Wins'].sum()).reset_index()
Output:
    Team  Season  Wins
0  Hawks    1970    74
1  Hawks    1972    76
2  Hawks    1974    42
If the seasons are ordered within each team, you can simply use rolling combined with groupby. For example:
import pandas as pd
df = pd.DataFrame({'Team':['Hawks','Hawks','Hawks','Hawks','Hawks'], 'Season':[1970,1971,1972,1973,1974],'Wins':[40,34,30,46,42]})
res = df.groupby('Team')['Wins'].rolling(2).sum()
print(res)
Out:
Team
Hawks  0     NaN
       1    74.0
       2    64.0
       3    76.0
       4    88.0
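To attach the rolling sum back to the original frame as a regular column (the column name 'Wins2' is a hypothetical choice), the extra Team index level added by the groupby has to be dropped first; a short sketch:
# align the grouped rolling result back with df's original index
df['Wins2'] = (df.groupby('Team')['Wins']
                 .rolling(2).sum()
                 .reset_index(level=0, drop=True))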

pandas - how to vectorize group by calculations instead of iteration

Here is a code snippet to simulate the problem I am facing. I am using iteration on large datasets:
import pandas as pd
import numpy as np

df = pd.DataFrame({'grp': np.random.choice([1,2,3,4,5], 500),
                   'col1': np.arange(0, 500),
                   'col2': np.random.randint(0, 10, 500),
                   'col3': np.nan})

for index, row in df.iterrows():
    # based on group label, get the last 3 values to calculate a mean
    d = df.iloc[0:index].groupby('grp')
    try:
        dgrp_sum = d.get_group(row.grp).col2.tail(3).mean()
    except KeyError:
        dgrp_sum = 999
    # after getting the last 3 values of the group relative to the
    # current row, multiply by the other columns
    df.at[index, 'col3'] = dgrp_sum * row.col1 * row.col2
If I want to speed it up with vectorized operations, how do I convert this code?
You are basically calculating a moving average within every group, which means you can group the dataframe by "grp" and compute a rolling mean. At the end you multiply the columns in each row, because that part does not depend on the group.
df["col3"] = df.groupby("grp").col2.rolling(3, min_periods=1).mean().reset_index(0,drop=True)
df["col3"] = df[["col1", "col2", "col3"]].product(axis=1)
Note: in your code each mean is computed from the rows before the current one, so it effectively lands one row later; that is probably why you have the try block. See the sketch after the output below for an exact replication of that behaviour.
# Skipping the last product gives only the mean
# np.random.seed(1234)
# print(df[df["grp"] == 2])
    grp  col1  col2        iter      mask
4     2     4     6  999.000000  6.000000
5     2     5     0    6.000000  3.000000
6     2     6     9    3.000000  5.000000
17    2    17     1    5.000000  3.333333
27    2    27     9    3.333333  6.333333
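To mirror the loop's semantics exactly (mean of up to three previous rows of the group, 999 where no previous row exists), a hedged sketch:
# within each group, take the rolling mean of the *previous* values
# (shift moves the mean down one row), then fall back to 999
prev_mean = (df.groupby("grp")["col2"]
               .transform(lambda s: s.rolling(3, min_periods=1).mean().shift()))
df["col3"] = prev_mean.fillna(999) * df["col1"] * df["col2"]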

Creating a new DataFrame out of 2 existing DataFrames with values coming from DataFrame 1?

I have 2 DataFrames.
DF1:
   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy
DF2:
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931
My new DataFrame should look like this.
DF_new:
userId  Toy Story  Jumanji  Grumpier Old Men  Waiting to Exhale  Father of the Bride Part II
1       4.0
2
3
4
The values will be the ratings of the individual user for each movie.
The movie titles are now the columns.
The userIds are now the rows.
I think it could work by joining via the movieId, but I'm not sure how to do this exactly so that I still have the movie names attached to the movieId.
Does anybody have an idea?
The problem consists of essentially 2 parts:
How to reshape df2, the sole table the user ratings come from, into the desired format. pd.DataFrame.pivot_table is the standard way to go.
The rest is about mapping the movieIDs to their names. This can be easily done by direct substitution on df.columns.
In addition, if movies receiving no ratings were to be listed as well, just insert the missing movieIDs directly before name substitution mentioned previously.
Code
import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    data={
        "movieId": [1, 2, 3, 4, 5],
        "title": ["toy story (1995)",
                  "Jumanji (1995)",
                  "Grumpier 33 (1995)",  # shortened for printing
                  "Waiting 44 (1995)",
                  "Father 55 (1995)"],
    }
)
# to better demonstrate the correctness, 2 distinct user ids were used.
df2 = pd.DataFrame(
    data={
        "userId": [1, 1, 1, 2, 2],
        "movieId": [1, 2, 2, 3, 5],
        "rating": [4, 5, 4, 5, 4],
    }
)

# 1. Produce the main table
df_new = df2.pivot_table(index=["userId"], columns=["movieId"], values="rating")
print(df_new)  # already pretty close
Out[17]:
movieId    1    2    3    5
userId
1        4.0  4.5  NaN  NaN
2        NaN  NaN  5.0  4.0
# 2. map movie IDs to titles
# name lookup dataset
df_names = df1[["movieId", "title"]].set_index("movieId")
# strip the last 7 characters containing the year
# (assumes consistent formatting in df1)
df_names["title"] = df_names["title"].apply(lambda s: s[:-7])

# (optional) fill unrated movies with NaN columns, then order the
# columns like df1
for movie_id in df_names.index.values:
    if movie_id not in df_new.columns.values:
        df_new[movie_id] = np.nan
df_new = df_new[df_names.index.values]

# replace IDs with titles
df_new.columns = df_names.loc[df_new.columns, "title"].values
Result
df_new
Out[16]:
        toy story  Jumanji  Grumpier 33  Waiting 44  Father 55
userId
1             4.0      4.5          NaN         NaN        NaN
2             NaN      NaN          5.0         NaN        4.0
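As a side note, an alternative sketch (not from the answer above): merge the titles into the ratings first and pivot on the title directly. This assumes the titles are unique:
df_alt = (df2.merge(df1[['movieId', 'title']], on='movieId', how='left')
             .pivot_table(index='userId', columns='title', values='rating'))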

Binary operation broadcasting across multiindex

Can anyone explain why broadcasting across a multiindexed Series doesn't work? Might it be a bug in pandas (0.12.0)?
x = pd.DataFrame({'year': [1,1,1,1,2,2,2,2],
                  'country': ['A','A','B','B','A','A','B','B'],
                  'prod': [1,2,1,2,1,2,1,2],
                  'val': [10,20,15,25,20,30,25,35]})
x = x.set_index(['year','country','prod']).squeeze()

y = pd.DataFrame({'year': [1,1,2,2], 'prod': [1,2,1,2],
                  'mul': [10,0.1,20,0.2]})
y = y.set_index(['year','prod']).squeeze()
From the description of matching/broadcasting behavior from the pandas docs I would expect to be able to multiply x and y and have the values of y broadcast across each country, giving:
>>> x.mul(y, level=['year','prod'])
year  country  prod
1     A        1       100.0
               2         2.0
      B        1       150.0
               2         2.5
2     A        1       400.0
               2         6.0
      B        1       500.0
               2         7.0
But instead, I get:
Exception: Join on level between two MultiIndex objects is ambiguous
(Note that this is a variation on the theme of this question.)
As discussed by me and @jreback in the issue opened to deal with this, a nice workaround to the problem involves doing the following:
1. Move the non-matching index level(s) to columns using unstack
2. Perform the multiplication/division
3. Put the non-matching index level(s) back using stack
4. Make sure the index levels are in the same order as they were before.
Here's how it works:
In [112]: x.unstack('country').mul(y, axis=0).stack('country').reorder_levels(x.index.names)
Out[112]:
year  country  prod
1     A        1       100.0
      B        1       150.0
      A        2         2.0
      B        2         2.5
2     A        1       400.0
      B        1       500.0
      A        2         6.0
      B        2         7.0
dtype: float64
I think that's rather good, and should be pretty efficient.
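As an aside: newer pandas versions can often align Series arithmetic on shared MultiIndex levels automatically, so the plain product may just work; that is worth verifying on your version, with the workaround kept as a fallback (a sketch, not a guarantee for any specific release):
try:
    res = x * y  # may align on the common 'year' and 'prod' levels
except Exception:
    res = (x.unstack('country').mul(y, axis=0)
            .stack('country').reorder_levels(x.index.names))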

How to unite several results of a dataframe columns describe() into one dataframe?

I am applying describe() to several columns of my dataframe, for example:
raw_data.groupby("user_id").size().describe()
raw_data.groupby("business_id").size().describe()
And several more, because I want to find out how many data points there are per user on average/median/etc.
My question is: each of those calls returns what seems to be unstructured output. Is there an easy way to combine them all into a single new dataframe whose columns will be [count, mean, std, min, 25%, 50%, 75%, max] and whose index will be the various columns described?
Thanks!
I might simply build a new DataFrame manually. If you have
>>> raw_data
   user_id  business_id  data
0       10            1     5
1       20           10     6
2       20          100     7
3       30          100     8
Then the results of groupby(smth).size().describe() are just another Series:
>>> raw_data.groupby("user_id").size().describe()
count 3.000000
mean 1.333333
std 0.577350
min 1.000000
25% 1.000000
50% 1.000000
75% 1.500000
max 2.000000
dtype: float64
>>> type(_)
<class 'pandas.core.series.Series'>
and so:
>>> descrs = ((col, raw_data.groupby(col).size().describe()) for col in raw_data)
>>> pd.DataFrame.from_items(descrs).T
             count      mean      std  min  25%  50%  75%  max
user_id          3  1.333333  0.57735    1    1    1  1.5    2
business_id      3  1.333333  0.57735    1    1    1  1.5    2
data             4  1.000000  0.00000    1    1    1  1.0    1
Instead of from_items I could have passed a dictionary, e.g. pd.DataFrame({col: raw_data.groupby(col).size().describe() for col in raw_data}).T, but with from_items the column order is preserved without having to think about it. (On Python 3.7+ dictionaries preserve insertion order anyway, and from_items has since been removed from pandas; see the sketch below.)
If you don't want all the columns, instead of for col in raw_data, you could define columns_to_describe = ["user_id", "business_id"] etc and use for col in columns_to_describe, or use for col in raw_data if col.endswith("_id"), or whatever you like.
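One modern-pandas caveat worth adding: pd.DataFrame.from_items was deprecated in pandas 0.23 and removed in 1.0. A sketch of an order-preserving equivalent using pd.concat (the variable name summary is a hypothetical choice):
summary = pd.concat(
    [raw_data.groupby(col).size().describe() for col in raw_data],
    axis=1,
    keys=list(raw_data),  # keeps the original column order
).T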