Averaging dataframes with many string columns and displaying back all the columns - pandas

I have struggled with this even after looking at the various past answers, to no avail.
My data consists of both numeric and non-numeric columns. I'd like to average the numeric columns and display my data on the GUI together with the information in the non-numeric columns. The non-numeric columns hold information such as names, rollno and stream, while the numeric columns contain students' marks for various subjects. It works well when dealing with one dataframe, but fails when I combine two or more dataframes: it returns only the average of the numeric columns and displays that, leaving the non-numeric columns out. Below is one of the pieces of code I've tried so far.
df = pd.concat((df3, df5))
dfs = df.groupby(df.index, level=0).mean()
headers = list(dfs)
self.marks_table.setRowCount(dfs.shape[0])
self.marks_table.setColumnCount(dfs.shape[1])
self.marks_table.setHorizontalHeaderLabels(headers)
df_array = dfs.values
for row in range(dfs.shape[0]):
    for col in range(dfs.shape[1]):
        self.marks_table.setItem(row, col, QTableWidgetItem(str(df_array[row, col])))
Working code should return the averages together with the string columns, something like this:
STREAM ADM NAME KCPE ENG KIS
0 EAGLE 663 FLOYCE ATI 250 43 5
1 EAGLE 664 VERONICA 252 32 33
2 EAGLE 665 MACREEN A 341 23 23
3 EAGLE 666 BRIDGIT 286 23 2
Rather than
ADM KCPE ENG KIS
0 663.0 250.0 27.5 18.5
1 664.0 252.0 26.5 33.0
2 665.0 341.0 17.5 22.5
3 666.0 286.0 38.5 23.5
Sample data
Df1 = pd.DataFrame({
'STREAM':[NORTH,SOUTH],
'ADM':[437,238,439],
'NAME':[JAMES,MARK,PETER],
'KCPE':[233,168,349],
'ENG':[70,28,79],
'KIS':[37,82,79],
'MAT':[67,38,29]})
Df2 = pd.DataFrame({
'STREAM':[NORTH,SOUTH],
'ADM':[437,238,439],
'NAME':[JAMES,MARK,PETER],
'KCPE':[233,168,349],
'ENG':[40,12,56],
'KIS':[33,43,43],
'MAT':[22,58,23]})

Your question is not clear; however, I am guessing the intent from the content. I have modified your dataframes, which were not well formed, by adding a stream called 'CENTRAL', see
Df1 = pd.DataFrame({'STREAM': ['NORTH', 'SOUTH', 'CENTRAL'], 'ADM': [437, 238, 439],
                    'NAME': ['JAMES', 'MARK', 'PETER'], 'KCPE': [233, 168, 349],
                    'ENG': [70, 28, 79], 'KIS': [37, 82, 79], 'MAT': [67, 38, 29]})
Df2 = pd.DataFrame({'STREAM': ['NORTH', 'SOUTH', 'CENTRAL'], 'ADM': [437, 238, 439],
                    'NAME': ['JAMES', 'MARK', 'PETER'], 'KCPE': [233, 168, 349],
                    'ENG': [40, 12, 56], 'KIS': [33, 43, 43], 'MAT': [22, 58, 23]})
I have assumed you want to merge the two dataframes and find the average:
df3 = pd.concat([Df2, Df1])   # DataFrame.append is deprecated/removed in recent pandas; concat does the same job
df3.groupby(['STREAM', 'ADM', 'NAME'], as_index=False).mean()   # mean() rather than sum(), since we want averages
Outcome:
    STREAM  ADM   NAME   KCPE   ENG   KIS   MAT
0  CENTRAL  439  PETER  349.0  67.5  61.0  26.0
1    NORTH  437  JAMES  233.0  55.0  35.0  44.5
2    SOUTH  238   MARK  168.0  20.0  62.5  48.0
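To get that grouped result (string columns included) back onto the GUI, a minimal sketch along the lines of the loop in the question could look like the following; it assumes the same self.marks_table widget and QTableWidgetItem import used above.

import pandas as pd

df = pd.concat([Df1, Df2])
# Grouping on the string columns keeps them in the result instead of dropping them.
dfs = df.groupby(['STREAM', 'ADM', 'NAME'], as_index=False).mean()

headers = list(dfs.columns)
self.marks_table.setRowCount(dfs.shape[0])
self.marks_table.setColumnCount(dfs.shape[1])
self.marks_table.setHorizontalHeaderLabels(headers)

for row in range(dfs.shape[0]):
    for col in range(dfs.shape[1]):
        self.marks_table.setItem(row, col, QTableWidgetItem(str(dfs.iat[row, col])))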

Related

Search values in a Pandas DataFrame with values from another DataFrame

I have 2 dataframes.
df_dora

      content         feature               id
1     cyber hygien    risk management        1
2     cyber risk      risk management        2
...   ...             ...                  ...
59    intellig share  information sharing   63
60    inform share    information sharing   64
df_corpus

        content                                           id                                meta.name          meta._split_id
0       market grow cyber attack...                       56a2a2e28954537131a4aa734f49e361  14_Group_AG_2021                0
1       sec form file index                               7aedfd4df02687d3dff9897c925da508  14_Group_AG_2021                1
...     ...                                               ...                               ...                           ...
213769  cyber secur alert parent compani fina...          ab10325601597f203f3f0af7aa647112  17_La_Banque_2021            8581
213770  intellig share statement parent compani fina...   6af5687ac31849d19d2048e0b2ca472d  17_La_Banque_2021            8582
I am trying to extract a count of each term listed in df_dora.content within df_corpus.content, grouped by df_corpus['meta.name'].
I tried to use isin
df = df_corpus[df_corpus.content.isin(df_dora.content)]
len(df)
Returns only 17 rows
        content      id                                meta.name                  meta._split_id
41474   incid        a4c478e0fad1b9775c05e01d871b3aaf  3_Agricole_2021                     10185
68690   oper risk    2e5139d82c242c89523110cc1110647a  10_Banking_Group_PLC_2021            5525
...     ...          ...                               ...                                   ...
99259   risk report  a84eefb9a4772d13eb67f2d6ae5215cb  31_Building_Society_2021             4820
105662  risk manag   e8050be841fedb6dd10599e8b4892a9f  43_Bank_SA_2021                       131
df_corpus.loc[df_corpus.content.isin(df_dora.content), 'content'].tolist()
also returns 17 rows
If I search for two of the terms that exist in df_dora directly in df_corpus,
resiliency_term = df_corpus.loc[df_corpus['content'].str.contains("cyber risk|inform share", case=False)]
print(resiliency_term)
I get 243 rows (which matches what was in the original file.)
So, given the above, my question is: how do I extract a count of each term listed in df_dora.content within df_corpus.content, grouped by df_corpus['meta.name']?
Thanks in advance for any help.
unique_vals = '|'.join(df_dora.content.unique())   # regex alternation of every df_dora term
# count every occurrence of each term, per company
df_corpus.groupby('meta.name').apply(lambda x: x.content.str.findall(unique_vals).explode().value_counts())
Output given your four lines of each:
17_La_Banque_2021 intellig share 1
Name: content, dtype: int64
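If you need the counts as a term-by-company table rather than a stacked Series, a sketch of one way to do it follows; the names term_pattern and term are just illustrative, not part of your data.

term_pattern = '|'.join(df_dora.content.unique())

counts = (
    df_corpus.assign(term=df_corpus.content.str.findall(term_pattern))  # list of matched terms per row
             .explode('term')                                           # one row per match
             .dropna(subset=['term'])                                   # drop rows with no match
             .groupby(['meta.name', 'term'])
             .size()
             .unstack(fill_value=0)                                     # companies as rows, terms as columns
)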

Sum duplicate bigrams in dataframe

I currently have a data frame that contains values such as:
Bigram Frequency
0 (ice, cream) 23
1 (cream, sandwich) 21
2 (google, android) 19
3 (galaxy, nexus) 14
4 (android, google) 12
There are values in there that I want to merge, like (google, android) and (android, google); there are others like (ice, cream) and (cream, sandwich), but that's a different problem.
In order to sum up the duplicates I tried to do this:
def remove_duplicates(ngrams):
    return {" ".join(sorted(key.split(" "))): ngrams[key] for key in ngrams}

freq_all_tw_pos_bg['Word'] = freq_all_tw_pos_bg['Word'].apply(remove_duplicates)
I looked around and found similar exercises marked as correct answers, but when I try this I get:
TypeError: tuple indices must be integers or slices, not str
That makes sense, but when I tried converting the tuples to strings they got shuffled in a weird way, so I wonder: am I missing something that should be easier?
EDIT:
The input is the first table I show: a list of bigrams, some of which are repeated because the words in them are reversed, i.e. (google, android) vs (android, google).
I want the same kind of output (a dataframe with the bigrams), but with the frequencies of the reversed pairs summed. If I take the list above and process it, it should output:
Bigram Frequency
0 (ice, cream) 23
1 (cream, sandwich) 21
2 (google, android) 31
3 (galaxy, nexus) 14
4 (apple, iPhone) 6
Notice how it "merged" (google, android) and (android, google) and also summed up the frequencies.
If the values are tuples, sort each tuple:
freq_all_tw_pos_bg['Bigram'] = freq_all_tw_pos_bg['Bigram'].apply(lambda x:tuple(sorted(x)))
print (freq_all_tw_pos_bg)
Bigram Frequency
0 (cream, ice) 23
1 (cream, sandwich) 21
2 (android, google) 19
3 (galaxy, nexus) 14
4 (android, google) 12
And then aggregate sum:
df = freq_all_tw_pos_bg.groupby('Bigram', as_index=False)['Frequency'].sum()
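For reference, here is a small end-to-end sketch built from the bigrams shown in the question, running both steps together (the expected result is shown in the comments):

import pandas as pd

freq_all_tw_pos_bg = pd.DataFrame({
    'Bigram': [('ice', 'cream'), ('cream', 'sandwich'),
               ('google', 'android'), ('galaxy', 'nexus'),
               ('android', 'google')],
    'Frequency': [23, 21, 19, 14, 12],
})

# Normalise each bigram so reversed pairs compare equal ...
freq_all_tw_pos_bg['Bigram'] = freq_all_tw_pos_bg['Bigram'].apply(lambda x: tuple(sorted(x)))
# ... then sum the frequencies of the now-identical keys.
df = freq_all_tw_pos_bg.groupby('Bigram', as_index=False)['Frequency'].sum()
print(df)
#               Bigram  Frequency
# 0  (android, google)         31
# 1       (cream, ice)         23
# 2  (cream, sandwich)         21
# 3    (galaxy, nexus)         14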

Pivoting with groupby?

I wonder if you can help me to find a solution for the following problem. Given a data frame df1 like this
d1 = {'L': ['aaa', 'bbb', 'ccc', 'aaa', 'bbb', 'ddd'],
      'w': [1, 5, 9, 13, 17, 21],
      'x': [2, 6, 10, 14, 18, 22],
      'y': [3, 7, 11, 15, 19, 23],
      'z': [4, 8, 12, 16, 20, 24]}
df1 = pd.DataFrame(d1)
and two dictionaries to define grouping over columns and rows
dctRowGroups={'aaa':'A','bbb':'B','ccc':'A','ddd':'B'}
dctColGroups={'w':'ALPHA','x':'BETA','y':'ALPHA','z':'BETA'}
I wanted to aggregate over columns as a first step. Applying
g2=df1.groupby(dctColGroups,axis=1)
g2.sum()
results in
ALPHA BETA
0 4 6
1 12 14
2 20 22
3 28 30
4 36 38
5 44 46
but I wanted to keep the 'L' column for the next step, the row-wise aggregation, i.e. the result should be a dataframe df2 more like this:
L ALPHA BETA
0 aaa 4 6
1 bbb 12 14
2 ccc 20 22
3 aaa 28 30
4 bbb 36 38
5 ddd 44 46
What do I need to code to make this happen?
As a next step, I want to aggregate df2 over the rows using the dctRowGroups dictionary
g3=df2.groupby(dctRowGroups,axis=0)
g3.sum()
to get a final result like this:
ALPHA BETA
L
A 52 58
B 92 98
In what way can I do all these steps in as few lines of code as possible?
Appreciate your advice on this.
Thanks a lot
Willfried.
You can do it as follows.
First create df2 and insert the 'L' column using the insert() method:
df2=df1.groupby(dctColGroups,axis=1).sum()
df2.insert(0,'L',df1['L']) #use this only when the order matters
# OR (use either insert or assign, not both)
df2=df2.assign(L=df1['L']) #otherwise use this
Finally use the assign(), map() and groupby() methods:
result=df2.assign(L=df2['L'].map(dctRowGroups)).groupby('L').sum()
Outputs:
df2:
L ALPHA BETA
0 aaa 4 6
1 bbb 12 14
2 ccc 20 22
3 aaa 28 30
4 bbb 36 38
5 ddd 44 46
result:
ALPHA BETA
L
A 52 58
B 92 98
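If you want it in as few lines as possible, the same two steps can be chained into one expression; this is just a sketch equivalent to the code above (note that groupby(..., axis=1) is deprecated in pandas 2.x):

result = (
    df1.groupby(dctColGroups, axis=1).sum()       # w/y -> ALPHA, x/z -> BETA ('L' is dropped here)
       .assign(L=df1['L'].map(dctRowGroups))      # bring 'L' back, already mapped to A/B
       .groupby('L').sum()                        # aaa/ccc -> A, bbb/ddd -> B
)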

How to collapse multiple unique observations into one and find a mean?

Data: https://www.dropbox.com/s/c2yef22u96dd3s5/female_mentions_centrality_1.xlsx?dl=0
Data set screenshot:
I have a data set which looks like the picture above. It has multiple (unique) observations for the same Movie Name. For example, there are 3 unique observations for the movie Aan Milo Sajna and 2 for Aap Ke Saath.
I want that wherever there are multiple observations for a given Movie Name, they get collapsed into a single observation such that each variable value is the mean of the multiple observations.
For example, see below.
Transformed data set screenshot:
The movie names that had single observations remain untouched. But the three observations for Aan Milo Sajna and the two observations for Aap Ke Saath get collapsed into single observations, and each of the variable values is changed to the mean of the multiple observations, as shown in the picture.
How can I accomplish this?
df_mean = df.groupby('MOVIE NAME').agg(np.mean).reset_index()   # needs numpy imported as np; plain .mean() works too
MOVIE NAME FEMALE MENTIONS TOTAL FEMALE CENTRALITY FEMALE COUNT AVERAGE FEMALE CENTRALITY
0 1920 19.000 258.417 140.500 1.669
1 100 Days 18.600 435.320 153.000 3.427
2 13B 2.333 74.289 23.333 1.259
3 1920 London 14.500 926.183 152.500 3.118
4 1942: A Love Story 11.000 398.500 78.000 5.109
... ... ... ... ... ...
2029 Zindagi 5.000 119.667 45.667 2.506
2030 Zindagi Na Milegi Dobara 13.000 265.750 135.000 1.865
2031 Zindagi Tere Naam 2.500 57.500 21.250 3.689
2032 Zubeidaa 0.000 1260.122 101.000 14.421
2033 Zulmi 1.000 5.333 4.000 1.333
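As a quick sanity check, a toy frame with the two movie names from the question collapses the same way; the numbers below are made up purely for illustration.

import pandas as pd

toy = pd.DataFrame({
    'MOVIE NAME': ['Aan Milo Sajna', 'Aan Milo Sajna', 'Aan Milo Sajna',
                   'Aap Ke Saath', 'Aap Ke Saath'],
    'FEMALE MENTIONS': [2, 4, 6, 1, 3],          # hypothetical values
})

print(toy.groupby('MOVIE NAME', as_index=False).mean())
#        MOVIE NAME  FEMALE MENTIONS
# 0  Aan Milo Sajna              4.0
# 1    Aap Ke Saath              2.0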

Creating new column based on condition and extracting respective value from other column. Pandas Dataframe

I am relatively new to this field and am working with a data set to find meaningful insights into customer behavior. My dataset looks like:
customerId week first_trip_week rides
0 156 44 36 2
1 164 44 38 6
2 224 42 36 5
3 224 43 36 4
4 224 44 36 5
What I want to do is create new columns week 44, week 43 and week 42, and have the values from the "rides" column filled into the rows for the respective customerId. This is in the hope that I can eventually also make customerId my index and get breakdowns for the different weeks. Help would be greatly appreciated!
Thank you!!
If I'm understanding you correctly, you want to create new columns in the same dataframe for weeks 44, 43 and 42 with the correct values for each customerId, and NaN for those that don't have one. If your original dataframe has all the user data, I would first filter for the rows that have the correct week number:
week42DF = dataset.loc[dataset['week']==42,['customerId','rides']].rename(columns={'rides':'week42Rides'})
getting only the rides and customerId and renaming the former here to make things a little easier for us. Then left join the old dataframe and the new one on customerId
dataset = pd.merge(dataset,week42DF,how='left',on='customerId')
The users that are missing from week42DF will have NaN in the week42Rides column of the merged dataset, which you can then replace with zeros using the .fillna(0) method. Do this for each week you require; a rough sketch of that loop is shown below.
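A rough sketch of the per-week loop (the week42Rides-style column names are just illustrative, and dataset is assumed to hold all the rows shown above):

import pandas as pd

for wk in (42, 43, 44):
    wkDF = (dataset.loc[dataset['week'] == wk, ['customerId', 'rides']]
                   .rename(columns={'rides': f'week{wk}Rides'}))
    dataset = pd.merge(dataset, wkDF, how='left', on='customerId')
    dataset[f'week{wk}Rides'] = dataset[f'week{wk}Rides'].fillna(0)

# If customerId should end up as the index with one column per week,
# a pivot does the whole thing in one step:
rides_by_week = dataset.pivot_table(index='customerId', columns='week',
                                    values='rides', fill_value=0)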
See Pandas' documentation on merge and the more general concatenate for more info.