How to collapse multiple unique observations into one and find a mean? - dataframe

Data: https://www.dropbox.com/s/c2yef22u96dd3s5/female_mentions_centrality_1.xlsx?dl=0
Data set screenshot:
I have a data set which looks like the picture above. It has multiple (unique) observations for the same Movie Name. For example, there are 3 unique observations for the movie Aan Milo Sajna and 2 for Aap Ke Saath.
Wherever there are multiple observations for a given Movie Name, I want them collapsed into a single observation in which each variable value is the mean of the multiple observations.
For example, see below.
Transformed data set screenshot:
The Movie Names that had single observations remain untouched. But the 3 observations for Aan Milo Sajna and the 2 observations for Aap Ke Saath get collapsed into single observations, and each of the variable values is changed to the mean of the multiple observations, as shown in the picture.
How can I accomplish this?

df_mean = df.groupby('MOVIE NAME').mean().reset_index()
      MOVIE NAME                FEMALE MENTIONS  TOTAL FEMALE CENTRALITY  FEMALE COUNT  AVERAGE FEMALE CENTRALITY
0     1920                               19.000                  258.417       140.500                      1.669
1     100 Days                           18.600                  435.320       153.000                      3.427
2     13B                                 2.333                   74.289        23.333                      1.259
3     1920 London                        14.500                  926.183       152.500                      3.118
4     1942: A Love Story                 11.000                  398.500        78.000                      5.109
...   ...                                   ...                      ...           ...                        ...
2029  Zindagi                             5.000                  119.667        45.667                      2.506
2030  Zindagi Na Milegi Dobara           13.000                  265.750       135.000                      1.865
2031  Zindagi Tere Naam                   2.500                   57.500        21.250                      3.689
2032  Zubeidaa                            0.000                 1260.122       101.000                     14.421
2033  Zulmi                               1.000                    5.333         4.000                      1.333
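As a minimal self-contained sketch of the same idea: the frame below is made up for illustration (only two of the numeric columns, with invented values), so it is not the linked spreadsheet. Grouping by the movie name and taking the mean collapses repeated titles into one row and leaves single-observation titles unchanged.

```python
import pandas as pd

# Hypothetical sample: invented values, only two of the numeric columns.
df = pd.DataFrame({
    "MOVIE NAME": ["Aan Milo Sajna", "Aan Milo Sajna", "Aan Milo Sajna",
                   "Aap Ke Saath", "Aap Ke Saath", "Zulmi"],
    "FEMALE MENTIONS": [3, 6, 9, 2, 4, 1],
    "FEMALE COUNT": [10, 20, 30, 5, 15, 4],
})

# One row per movie; every numeric column becomes the mean of its group.
df_mean = df.groupby("MOVIE NAME", as_index=False).mean()
print(df_mean)
```

If the real data also has non-numeric columns, they would need to be either added to the grouping keys or dropped before taking the mean.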

Related

Style specific rows in multiindex dataframe

I have a pandas dataframe that looks like:
Year                              2019                       2020
Decision                       Applied Admitted        %  Applied Admitted        %
Class          Residency
Freshmen       Resident         1143.0    918.0     80.3   1094.0   1003.0     91.7
               Non-Resident     1371.0   1048.0     76.4   1223.0   1090.0     89.1
               Total            2514.0   1966.0     78.2   2317.0   2093.0     90.3
Transfer       Resident          404.0    358.0     88.6    406.0    354.0     87.2
               Non-Resident      371.0    313.0     84.4    356.0    288.0     80.9
               Total             775.0    671.0     86.6    762.0    642.0     84.3
Grad/Postbacc  Total             418.0    311.0     74.4    374.0    282.0     75.4
Grand          Total            3707.0   2948.0     79.5   3453.0   3017.0     87.4
note: Full MWE is in this question.
I'm trying to italicize the total rows (here that's rows 3,6,7,8) and bold the grand total row (row 8) in a way that doesn't rely on actual row numbers.
I can do that with:
df_totals.style.apply(lambda x:["font-style: italic;"]*len(x),subset=((slice(None),"Total"),))\
.applymap_index(lambda x:"font-style: italic;" if x in ("Grand","Total") else "")
That just seems super unpythonic, ugly, and unmaintainable to me, especially the call to applymap_index. Is there a more fluent way of doing this?
The first part can be simplified with Styler.set_properties. The second part is fine in my opinion; the only small change, following the example in the Styler.applymap_index documentation, is to return None instead of an empty string:
(df_totals.style.set_properties(**{'font-style': 'italic'}, subset=((slice(None),"Total")))
    .applymap_index(lambda x:"font-style: italic;" if x in ("Grand","Total") else None))

Can I use two keys from one of the data sets and one key from another dataset when I merge two data sets?

I would like to have an index for countries. But I have two columns of country names. One column is for the origin of the FDI and the other one is for the destination of the FDI.
origin destination FDI
US UK 120
ITA US 90
TR SPA 40
This is the other data set I will use.
Country Index
ITA 0
UK 1
TR 0
SPA 1
Should I merge the second data set with the first one twice, changing the key each time? Or is there a better way of doing that?
The expected output is unclear, but you can map as many columns as you want:
mapper = df2.set_index('Country')['Index']
df1[['new_origin', 'new_destination']] = (df1[['origin', 'destination']]
.apply(lambda s: s.map(mapper))
)
Or with join:
out = df1.join(df1.drop(columns='FDI')
.apply(lambda s: s.map(mapper))
.add_prefix('new_'))
output:
origin destination FDI new_origin new_destination
0 US UK 120 NaN 1.0
1 ITA US 90 0.0 NaN
2 TR SPA 40 0.0 1.0
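For comparison, the merge-twice approach mentioned in the question can be sketched as below. It assumes the two sample frames from the question; the column names new_origin and new_destination are illustrative choices, and the result should match the output of the map-based version.

```python
import pandas as pd

# Sample frames taken from the question.
df1 = pd.DataFrame({
    "origin": ["US", "ITA", "TR"],
    "destination": ["UK", "US", "SPA"],
    "FDI": [120, 90, 40],
})
df2 = pd.DataFrame({
    "Country": ["ITA", "UK", "TR", "SPA"],
    "Index": [0, 1, 0, 1],
})

# Merge the lookup table once per key column, renaming it each time so the
# join key and the resulting column line up with the column being looked up.
out = (
    df1
    .merge(df2.rename(columns={"Country": "origin", "Index": "new_origin"}),
           on="origin", how="left")
    .merge(df2.rename(columns={"Country": "destination", "Index": "new_destination"}),
           on="destination", how="left")
)
print(out)
```

Both approaches leave NaN where a country (here US) is missing from the lookup table. The map version avoids the repeated renaming, which is probably why it scales better to many columns.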

Averaging dataframes with many string columns and display back all the columns

I have struggled with this even after looking at the various past answers to no avail.
My data consists of numeric and non-numeric columns. I'd like to average the numeric columns and display the data on the GUI together with the information in the non-numeric columns. The non-numeric columns hold info such as names, rollno, and stream, while the numeric columns contain students' marks for various subjects. It works well with one dataframe but fails when I combine two or more dataframes: it returns only the averages of the numeric columns and leaves the non-numeric columns undisplayed. Below is one of the versions I've tried so far.
df=pd.concat((df3,df5))
dfs =df.groupby(df.index,level=0).mean()
headers = list(dfs)
self.marks_table.setRowCount(dfs.shape[0])
self.marks_table.setColumnCount(dfs.shape[1])
self.marks_table.setHorizontalHeaderLabels(headers)
df_array = dfs.values
for row in range(dfs.shape[0]):
    for col in range(dfs.shape[1]):
        self.marks_table.setItem(row, col,QTableWidgetItem(str(df_array[row,col])))
A working code should return averages in something like this
STREAM ADM NAME KCPE ENG KIS
0 EAGLE 663 FLOYCE ATI 250 43 5
1 EAGLE 664 VERONICA 252 32 33
2 EAGLE 665 MACREEN A 341 23 23
3 EAGLE 666 BRIDGIT 286 23 2
Rather than
ADM KCPE ENG KIS
0 663.0 250.0 27.5 18.5
1 664.0 252.0 26.5 33.0
2 665.0 341.0 17.5 22.5
3 666.0 286.0 38.5 23.5
Sample data
Df1 = pd.DataFrame({
'STREAM':[NORTH,SOUTH],
'ADM':[437,238,439],
'NAME':[JAMES,MARK,PETER],
'KCPE':[233,168,349],
'ENG':[70,28,79],
'KIS':[37,82,79],
'MAT':[67,38,29]})
Df2 = pd.DataFrame({
'STREAM':[NORTH,SOUTH],
'ADM':[437,238,439],
'NAME':[JAMES,MARK,PETER],
'KCPE':[233,168,349],
'ENG':[40,12,56],
'KIS':[33,43,43],
'MAT':[22,58,23]})
Your question is not clear, so I am guessing its intent from the content. Your sample dataframes were not well formed, so I have modified them by adding a stream called 'CENTRAL':
Df1 = pd.DataFrame({'STREAM':['NORTH','SOUTH', 'CENTRAL'],'ADM':[437,238,439], 'NAME':['JAMES','MARK','PETER'],'KCPE':[233,168,349],'ENG':[70,28,79],'KIS':[37,82,79],'MAT':[67,38,29]})
Df2 = pd.DataFrame({ 'STREAM':['NORTH','SOUTH','CENTRAL'],'ADM':[437,238,439], 'NAME':['JAMES','MARK','PETER'],'KCPE':[233,168,349],'ENG':[40,12,56],'KIS':[33,43,43],'MAT':[22,58,23]})
I have assumed you want to combine the two dataframes and find the average:
df3 = pd.concat([Df2, Df1])
df3.groupby(['STREAM','ADM','NAME'],as_index=False).mean()
Outcome
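A self-contained sketch of that approach, using the modified sample frames from this answer; the variable names combined and averaged are illustrative. It uses pd.concat because DataFrame.append was removed in pandas 2.0, and mean because the stated goal is an average. Grouping by the non-numeric columns keeps them in the result next to the averaged marks.

```python
import pandas as pd

Df1 = pd.DataFrame({
    "STREAM": ["NORTH", "SOUTH", "CENTRAL"],
    "ADM": [437, 238, 439],
    "NAME": ["JAMES", "MARK", "PETER"],
    "KCPE": [233, 168, 349],
    "ENG": [70, 28, 79],
    "KIS": [37, 82, 79],
    "MAT": [67, 38, 29],
})
Df2 = pd.DataFrame({
    "STREAM": ["NORTH", "SOUTH", "CENTRAL"],
    "ADM": [437, 238, 439],
    "NAME": ["JAMES", "MARK", "PETER"],
    "KCPE": [233, 168, 349],
    "ENG": [40, 12, 56],
    "KIS": [33, 43, 43],
    "MAT": [22, 58, 23],
})

# Stack the two frames, then average the marks per student.
combined = pd.concat([Df1, Df2], ignore_index=True)
averaged = combined.groupby(["STREAM", "ADM", "NAME"], as_index=False).mean()
print(averaged)
```

The resulting frame has one row per student with STREAM, ADM and NAME intact, so it can be passed to the table-filling loop from the question in place of dfs.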

MDX Aggregate Dimensions to filter

I'm new to mdx and need your help:
[Item].[Segment] [Country].[World] [Measures].[Periodic]
1 Region A 150
2 Region B 60
3 Region C 1400
4 Region D 20
I have two dimensions, Segment and World. If I take only World, I get no values. What I want is to combine the two dimensions into one dimension at the segment level, as follows:
[Item].[Segment] [Measures].[Periodic]
1 150
2 60
3 1400
4 20
Would an aggregation be useful in this case?
Thanks in advance!
The structure is as follows:
Cube_Structure
--> I need to combine both dimensions, Segment and World, so that a single dimension on the rows shows the values for the segments only!

SSRS 2008 display multiple columns of data without a new line

I am creating a report in SSRS 2008 with MS SQL Server 2008 R2. I have data based on the Aggregate value of Medical condition and the level of severity.
Outcome Response Adult Youth Total
BMI GOOD 70 0 70
BMI MONITOR 230 0 230
BMI PROBLEM! 10 0 10
LDL GOOD 5 0 5
LDL MONITOR 4 0 4
LDL PROBLEM! 2 0 2
I need to display the data based on the Response like:
       BMI    BMI      BMI
       GOOD   MONITOR  PROBLEM!
Total  70     230      10
Youth  0      0        0
Adult  70     230      10
       LDL    LDL      LDL
       GOOD   MONITOR  PROBLEM!
Total  5      4        2
Youth  0      0        0
Adult  5      4        2
I first tried to use SSRS to do the grouping based on the Outcome and then the Response, but I got each response on a separate row of data, whereas I need all Responses for an Outcome on a single line. I now believe that a pivot would work, but all the examples I have seen pivot one column of data using another. Is it possible to pivot multiple columns of data based on a single column?
With your existing Dataset you could do something similar to the following:
Create a List item, and change the Details grouping to be based on Outcome:
In the List cell, add a new Matrix with one Column Group based on Response:
You'll note that since you have individual columns for Total, Youth, Adult, you need to add grand total rows to display each group.
The end result is pretty close to your requirements:
For your underlying data, to help with report development it might be useful to have the Total, Youth, Adult as unpivoted columns, but it's not a big deal if the groups are fairly static.