Aggregate data based on values appearing in two columns interchangeably? - pandas

home_team_name away_team_name home_ppg_per_odds_pre_game away_ppg_per_odds_pre_game
0 Manchester United Tottenham Hotspur 3.310000 4.840000
1 AFC Bournemouth Aston Villa 0.666667 3.230000
2 Norwich City Crystal Palace 0.666667 13.820000
3 Leicester City Sunderland 4.733333 3.330000
4 Everton Watford 0.583333 2.386667
5 Chelsea Manchester United 1.890000 3.330000
The home_ppg_per_odds_pre_game and away_ppg_per_odds_pre_game are basically the same metric. The former reprsents the value of this metric for the home_team, while the latter represents this metric for the away team. I want a mean of this metric for each team and that is regardless whether the team is playing home or away. In the example df you Manchester United as home_team_name in zero and as away_team_name in 5. I want the mean for Manchester United that includes all this examples.
df.groupby("home_team_name")["home_ppg_per_odds_pre_game"].mean()
This will only bring me the mean for the occasion when the team is playing home, but I want both home and away.

Since the two metrics are the same, you can append the home and away team metrics, like this:
data_df = pd.concat([df.loc[:,('home_team_name','home_ppg_per_odds_pre_game')], df.loc[:,('away_team_name','away_ppg_per_odds_pre_game')].rename(columns={'away_team_name':'home_team_name','away_ppg_per_odds_pre_game':'home_ppg_per_odds_pre_game'})])
Then you can use groupby to get the means:
data_df.groupby('home_team_name')['home_ppg_per_odds_pre_game'].mean().reset_index()

Related

How to change index name in multiindex groupby object with condition?

I need to change 0 level index ('Product Group') of pandas groupby object, based on conditions (sum of related values in column 'Sales').
Since code is very long and some files are needed, I`ll copy output.
the last string of code is:
tdk_regions = tdk[['Region', 'Sales', 'Product Group']].groupby(['Product Group', 'Region']).sum()
###The output will be like this
Product Group Region Sales
ALUMINUM & FILM CAPACITORS BG America 7.425599e+07
China 2.249969e+08
Europe 2.404613e+08
India 6.034134e+07
Japan 7.667371e+06
... ... ...
TEMPERATURE&PRESSURE SENSORS BG Europe 1.308471e+08
India 3.077273e+06
Japan 2.851744e+07
Korea 1.309189e+06
OSEAN 1.258075e+07
Try MultiIndex.rename:
df.index.rename("New Name", level=0, inplace=True)
print(df)
Prints:
Sales
New Name Region
ALUMINUM & FILM CAPACITORS BG America 74255990.0
China 224996900.0
Europe 240461300.0
India 60341340.0
Japan 7667371.0

How to collapse multiple unique observations into one and find a mean?

Data: https://www.dropbox.com/s/c2yef22u96dd3s5/female_mentions_centrality_1.xlsx?dl=0
Data set screenshot:
I have a data set which looks like the picture above. It has multiple (unique) observations for the same Movie Name. For example, there are 3 unique observations for the movie Aan Milo Sajna and 2 for Aap Ke Saath.
I want that wherever there are multiple observations for a given Movie Name, they get collapsed into a single observation such that each variable value is the mean of the multiple observations.
For example, see below.
Transformed data set screenshot:
The Movie Names that had single observations remain untouched. But the three observations for Aan Milo Sajna and the 2 observations for Aap Ke Sath get collapsed into single observations. And each of the variable values is changed to the mean of the multiple observations as shown in the picture.
How can I accomplish this?
df_mean = df.groupby('MOVIE NAME').agg(np.mean).reset_index()
MOVIE NAME FEMALE MENTIONS TOTAL FEMALE CENTRALITY FEMALE COUNT AVERAGE FEMALE CENTRALITY
0 1920 19.000 258.417 140.500 1.669
1 100 Days 18.600 435.320 153.000 3.427
2 13B 2.333 74.289 23.333 1.259
3 1920 London 14.500 926.183 152.500 3.118
4 1942: A Love Story 11.000 398.500 78.000 5.109
... ... ... ... ... ...
2029 Zindagi 5.000 119.667 45.667 2.506
2030 Zindagi Na Milegi Dobara 13.000 265.750 135.000 1.865
2031 Zindagi Tere Naam 2.500 57.500 21.250 3.689
2032 Zubeidaa 0.000 1260.122 101.000 14.421
2033 Zulmi 1.000 5.333 4.000 1.333

Pandas retrieve a value based on index

Been trying to search for this but somehow can't seem to find the right answer.
Given the following simple dataframe:
country continent population
0 UK Europe 111111
1 Spain Europe 222222
2 Malaysia Asia 333333
3 USA America 444444
How can I retrieve the country value if I have a condition WHERE an index value is given? For example, If I am given an index value of 2, I should return Malaysia.
Edit: Forget to mention that the input index value comes from a variable (think of it as a user select a particular row and the selected row provide an index value variable).
Thank you.
df.iloc[2]['country']
iloc is used for selection by position, see pandas.DataFrame.iloc documentation for further options.
index = 2
print(df.iloc[index]['country'])
Malaysia

I need to get the min, max, avg of an Array List using Java

The values come from a Junit test case or from a scanner, so the method has to work in all scenarios. The array List looks something like this.
Utah 5
Nevada 6
California 12
Oregon 8
Utah 9
California 10
Nevada 4
Nevada 4
Oregon 17
California 6
I need to be able to calculate the average of, let's say, Utah. I know how to do something that finds the average. My biggest problem is knowing how to only get values from the names of Utah rather than just getting all of them.
names=states values= numbers categories=what
you are calculating
here is the start of the method:
public static ArrayList<Double> summarizeData (ArrayList<String> names, ArrayList<Double> values ,ArrayList<String> categories, int operation)

vba loop through all the pivot fields of a pivot table and return specified values

I have a dataset whose entries has 5 different attributes and one value. For example, I have a height of 5000 people. For each person I have his hair color, eye color, his nationality, the city he were born and the name of his mother (the 5 dimensions).
No/Eye Color/Hair Color/Nationality/Hometown/Mother's Name/Height
Blue Blond Swiss Zürich Nicole 184
Blue Brown English York Ruby 164
Brown Brown French Paris Sophie 154
etc..
So there are 5 dimensions. The data is set dynamically, so the number of categories in each dimensions can vary. I sought to compute the average height of people depending on whether I want to include some dimensions or not (from 1 to 5). For example I wanted the retrieve:
The average height of French and Blue eyed people. Next day only the people born in London. And the week after, the Swiss, blue-eyed, red-haired, born in Geneva and whose mother is called Nicole.
So I create a pivot table with the Eye Color as Row labels, Hair Color as Column labels, the average height as the Data and the last 3 dimensions as Market Filters. This allowed me see all the possible and desired combinations of average height that my data implies.
Now my goal is:
I want to create a Macro that goes through all the possible combinations that my dimensions entails (i.e 2^5-1=31) and store in a vector all the combination of height average that are above a certain value, e.g. 190. And then It could print on a worksheet.
I was thinking on using some booleans arrays vector and For-Each-Next structure, but I must say that I fail to picture how to implement it.
Any ideas?
Thanks for the time and help!