Can I use two keys from one of the data sets and one key from another dataset when I merge two data sets? - pandas

I would like to have an index for countries. But I have two columns of country names. One column is for the origin of the FDI and the other one is for the destination of the FDI.
origin  destination  FDI
US      UK           120
ITA     US           90
TR      SPA          40
This is the other data set I will use.
Country  Index
ITA      0
UK       1
TR       0
SPA      1
Should I merge the second data set with the first one twice, changing the key each time, or is there a better way of doing that?

The expected output is unclear, but you can map as many columns as you want:
mapper = df2.set_index('Country')['Index']
df1[['new_origin', 'new_destination']] = (df1[['origin', 'destination']]
.apply(lambda s: s.map(mapper))
)
Or with join:
out = df1.join(df1.drop(columns='FDI')
.apply(lambda s: s.map(mapper))
.add_prefix('new_'))
output:
origin destination FDI new_origin new_destination
0 US UK 120 NaN 1.0
1 ITA US 90 0.0 NaN
2 TR SPA 40 0.0 1.0
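For comparison, here is a minimal sketch of the "merge twice" approach the question asks about. It rebuilds df1 and df2 from the tables in the question; the new_origin and new_destination column names follow the answer above, and the loop is only one possible way to write it.

```python
import pandas as pd

# Data reproduced from the question
df1 = pd.DataFrame({
    "origin": ["US", "ITA", "TR"],
    "destination": ["UK", "US", "SPA"],
    "FDI": [120, 90, 40],
})
df2 = pd.DataFrame({
    "Country": ["ITA", "UK", "TR", "SPA"],
    "Index": [0, 1, 0, 1],
})

out = df1.copy()
for col in ["origin", "destination"]:
    # Rename the lookup columns so the key matches the column being looked up
    lookup = df2.rename(columns={"Country": col, "Index": f"new_{col}"})
    # A left merge keeps every row of df1; countries missing from df2 become NaN
    out = out.merge(lookup, on=col, how="left")

print(out)
```

Both approaches should give the same values here. The map version avoids renaming the lookup table for each key, which is why it tends to be shorter when several columns share one lookup.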

Related

Changing column name and its values at the same time

Pandas help!
I have a specific column like this,
Mpg
0 18
1 17
2 19
3 21
4 16
5 15
Mpg is miles per gallon.
Now I need to rename that 'Mpg' column to 'litre per 100 km' and convert its values to litres per 100 km at the same time. Any help? Thanks beforehand.
-Tom
I managed to change the name of the column, but I could not do both simultaneously.
Use pop to return and delete the column at the same time, and rdiv to perform the conversion (litres per 100 km ≈ 235.15 / mpg; the more precise factor for US gallons is 235.214583):
df['litre per 100 km'] = df.pop('Mpg').rdiv(235.15)
If you want to insert the column in the same position:
df.insert(df.columns.get_loc('Mpg'), 'litre per 100 km',
df.pop('Mpg').rdiv(235.15))
Output:
litre per 100 km
0 13.063889
1 13.832353
2 12.376316
3 11.197619
4 14.696875
5 15.676667
An alternative to pop would be to store the result in another dataframe. This way you can perform the two steps at the same time. In my code below, I first reproduce your dataframe, then store the constant for conversion and perform it on all entries using the apply method.
import pandas as pd

df = pd.DataFrame({'Mpg':[18,17,19,21,16,15]})
cc = 235.214583 # constant for conversion from mpg to L/100km
df2 = pd.DataFrame()
df2['litre per 100 km'] = df['Mpg'].apply(lambda x: cc/x)
print(df2)
The output of this code is:
litre per 100 km
0 13.067477
1 13.836152
2 12.379715
3 11.200694
4 14.700911
5 15.680972
as expected.
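Since the conversion is plain arithmetic, it can also be written without pop or apply. The following is a sketch, assuming US gallons (factor 235.214583); the constant name MPG_TO_L_PER_100KM is made up for illustration.

```python
import pandas as pd

# Assumed conversion factor for US gallons: L/100km = 235.214583 / mpg
MPG_TO_L_PER_100KM = 235.214583

df = pd.DataFrame({"Mpg": [18, 17, 19, 21, 16, 15]})

# Divide the constant by every value, then rename the column in the same chain
df = (MPG_TO_L_PER_100KM / df[["Mpg"]]).rename(columns={"Mpg": "litre per 100 km"})

print(df)
```

Selecting with df[["Mpg"]] keeps a one-column DataFrame, so rename can change the header and the column keeps its position.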

Search values in a Pandas DataFrame with values from another DataFrame

I have 2 dataframes.
df_dora
     content         feature              id
1    cyber hygien    risk management      1
2    cyber risk      risk management      2
...  ...             ...                  ...
59   intellig share  information sharing  63
60   inform share    information sharing  64
df_corpus
        content                                          id                                meta.name          meta._split_id
0       market grow cyber attack...                      56a2a2e28954537131a4aa734f49e361  14_Group_AG_2021   0
1       sec form file index                              7aedfd4df02687d3dff9897c925da508  14_Group_AG_2021   1
...     ...                                              ...                               ...                ...
213769  cyber secur alert parent compani fina...         ab10325601597f203f3f0af7aa647112  17_La_Banque_2021  8581
213770  intellig share statement parent compani fina...  6af5687ac31849d19d2048e0b2ca472d  17_La_Banque_2021  8582
I am trying to extract a count of each term listed in df_dora.content within df_corpus.content, grouped by df_corpus['meta.name'].
I tried to use isin
df = df_corpus[df_corpus.content.isin(df_dora.content)]
len(df)
Returns only 17 rows
        content      id                                meta.name                  meta
41474   incid        a4c478e0fad1b9775c05e01d871b3aaf  3_Agricole_2021            10185
68690   oper risk    2e5139d82c242c89523110cc1110647a  10_Banking_Group_PLC_2021  5525
...     ...          ...                               ...                        ...
99259   risk report  a84eefb9a4772d13eb67f2d6ae5215cb  31_Building_Society_2021   4820
105662  risk manag   e8050be841fedb6dd10599e8b4892a9f  43_Bank_SA_2021            131
df_corpus.loc[df_corpus.content.isin(df_dora.content), 'content'].tolist()
also returns 17 rows
If I search for 2 of the terms that exist in df_dora directly in df_corpus:
resiliency_term = df_corpus.loc[df_corpus['content'].str.contains("cyber risk|inform share", case=False)]
print(resiliency_term)
I get 243 rows (which matches what was in the original file).
So, given the above, my question is: how do I extract a count of each term listed in df_dora.content within df_corpus.content, grouped by df_corpus['meta.name']?
Thanks in advance for any help.
unique_vals = '|'.join(df_dora.content.unique())
df_corpus.groupby('meta.name').apply(lambda x: x.content.str.findall(unique_vals).explode().value_counts())
Output given your four lines of each:
17_La_Banque_2021 intellig share 1
Name: content, dtype: int64
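The isin attempt in the question only finds rows whose whole content equals a term, which is why it returns so few rows; findall looks for the terms inside longer strings. Below is a self-contained sketch of the same idea that returns one row per document and one column per term. The sample rows are invented for illustration (only the column names and document names come from the question), and re.escape is added on the assumption that terms might contain regex metacharacters.

```python
import re
import pandas as pd

# Illustrative stand-ins for the real dataframes (contents are made up)
df_dora = pd.DataFrame({"content": ["cyber risk", "inform share", "intellig share"]})
df_corpus = pd.DataFrame({
    "content": [
        "market grow cyber risk and cyber risk report",
        "sec form file index",
        "intellig share statement parent compani",
        "inform share cyber risk",
    ],
    "meta.name": [
        "14_Group_AG_2021", "14_Group_AG_2021",
        "17_La_Banque_2021", "17_La_Banque_2021",
    ],
})

# One alternation pattern; escaping keeps each term literal
pattern = "|".join(re.escape(term) for term in df_dora["content"].unique())

# One row per match; rows with no match explode to NaN and are dropped
matches = (
    df_corpus.assign(term=df_corpus["content"].str.findall(pattern))
    .explode("term")
    .dropna(subset=["term"])
)

# Count matches per document and term, with 0 where a term never occurs
counts = matches.groupby(["meta.name", "term"]).size().unstack(fill_value=0)
print(counts)
```

Note that an alternation pattern matches left to right without overlap, so if one term is a prefix of another, only one of them is counted at a given position.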

Fill nan's in dataframe after filtering column by names

Can anyone please tell me what the right approach is here to filter (and fill NaN) based on another column's value? Thanks a lot.
Related link: How to fill dataframe's empty/nan cell with conditional column mean
df
ID Name Industry Expenses
1 Treslam Financial Services 734545
2 Rednimdox Construction nan
3 Lamtone IT Services 567678
4 Stripfind Financial Services nan
5 Openjocon Construction 8678957
6 Villadox Construction 5675676
7 Sumzoomit Construction 231244
8 Abcd Construction nan
9 Stripfind Financial Services nan
df_mean_expenses = (df.groupby(['Industry'], as_index = False)['Expenses']).mean()
df_mean_expenses
Industry Expenses
0 Construction 554433.11
1 Financial Services 2362818.48
2 IT Services 149153.46
In order to replace the Expenses NaNs in the Construction rows with the Construction mean (from df_mean_expenses), I tried two approaches:
1.
df.loc[df['Expenses'].isna(),['Expenses']][df['Industry'] == 'Construction'] = df_mean_expenses.loc[df_mean_expenses['Industry'] == 'Construction',['Expenses']].values
.. returns Error: Item wrong length 500 instead of 3!
2.
df['Expenses'][np.isnan(df['Expenses'])][df['Industry'] == 'Construction'] = df_mean_expenses.loc[df_mean_expenses['Industry'] == 'Construction',['Expenses']].values
.. this runs but does not add values to the df.
Expected output:
df
ID Name Industry Expenses
1 Treslam Financial Services 734545
2 Rednimdox Construction 554433.11
3 Lamtone IT Services 567678
4 Stripfind Financial Services nan
5 Openjocon Construction 8678957
6 Villadox Construction 5675676
7 Sumzoomit Construction 231244
8 Abcd Construction 554433.11
9 Stripfind Financial Services nan
Try with transform
df_mean_expenses = df.groupby('Industry')['Expenses'].transform('mean')
df['Expenses'] = df['Expenses'].fillna(df_mean_expenses[df['Industry']=='Construction'])
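The second attempt in the question runs without changing df because chained indexing assigns into a temporary copy. A single .loc call with a combined boolean mask avoids that. Here is a runnable sketch using the sample rows from the question; the names group_mean and mask are made up for illustration, and the resulting mean comes from these nine rows only, so it differs from the figure shown in the question.

```python
import numpy as np
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "ID": range(1, 10),
    "Name": ["Treslam", "Rednimdox", "Lamtone", "Stripfind", "Openjocon",
             "Villadox", "Sumzoomit", "Abcd", "Stripfind"],
    "Industry": ["Financial Services", "Construction", "IT Services",
                 "Financial Services", "Construction", "Construction",
                 "Construction", "Construction", "Financial Services"],
    "Expenses": [734545, np.nan, 567678, np.nan, 8678957,
                 5675676, 231244, np.nan, np.nan],
})

# Per-row mean of the row's own industry (NaNs are skipped by mean)
group_mean = df.groupby("Industry")["Expenses"].transform("mean")

# Only the missing Construction rows are selected
mask = df["Industry"].eq("Construction") & df["Expenses"].isna()

# One .loc assignment writes into df itself rather than into a copy
df.loc[mask, "Expenses"] = group_mean[mask]
print(df)
```

Dropping the industry condition from mask would fill every industry with its own mean in the same way.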

Averaging dataframes with many string columns and display back all the columns

I have struggled with this even after looking at the various past answers to no avail.
My data consists of numeric and non-numeric columns. I'd like to average the numeric columns and display my data on the GUI together with the information in the non-numeric columns. The non-numeric columns hold info such as names, rollno and stream, while the numeric columns contain students' marks for various subjects. It works well with a single dataframe but fails when I combine two or more dataframes: it returns only the average of the numeric columns and displays that, leaving the non-numeric columns undisplayed. Below is one of the versions I've tried so far.
df=pd.concat((df3,df5))
dfs =df.groupby(df.index,level=0).mean()
headers = list(dfs)
self.marks_table.setRowCount(dfs.shape[0])
self.marks_table.setColumnCount(dfs.shape[1])
self.marks_table.setHorizontalHeaderLabels(headers)
df_array = dfs.values
for row in range(dfs.shape[0]):
    for col in range(dfs.shape[1]):
        self.marks_table.setItem(row, col, QTableWidgetItem(str(df_array[row, col])))
A working code should return averages in something like this
STREAM ADM NAME KCPE ENG KIS
0 EAGLE 663 FLOYCE ATI 250 43 5
1 EAGLE 664 VERONICA 252 32 33
2 EAGLE 665 MACREEN A 341 23 23
3 EAGLE 666 BRIDGIT 286 23 2
Rather than
ADM KCPE ENG KIS
0 663.0 250.0 27.5 18.5
1 664.0 252.0 26.5 33.0
2 665.0 341.0 17.5 22.5
3 666.0 286.0 38.5 23.5
Sample data
Df1 = pd.DataFrame({
'STREAM':[NORTH,SOUTH],
'ADM':[437,238,439],
'NAME':[JAMES,MARK,PETER],
'KCPE':[233,168,349],
'ENG':[70,28,79],
'KIS':[37,82,79],
'MAT':[67,38,29]})
Df2 = pd.DataFrame({
'STREAM':[NORTH,SOUTH],
'ADM':[437,238,439],
'NAME':[JAMES,MARK,PETER],
'KCPE':[233,168,349],
'ENG':[40,12,56],
'KIS':[33,43,43],
'MAT':[22,58,23]})
Your question is not clear, so I am guessing the intent from its content. Your sample dataframes were malformed, so I have modified them by adding a stream called 'CENTRAL':
Df1 = pd.DataFrame({'STREAM':['NORTH','SOUTH', 'CENTRAL'],'ADM':[437,238,439], 'NAME':['JAMES','MARK','PETER'],'KCPE':[233,168,349],'ENG':[70,28,79],'KIS':[37,82,79],'MAT':[67,38,29]})
Df2 = pd.DataFrame({ 'STREAM':['NORTH','SOUTH','CENTRAL'],'ADM':[437,238,439], 'NAME':['JAMES','MARK','PETER'],'KCPE':[233,168,349],'ENG':[40,12,56],'KIS':[33,43,43],'MAT':[22,58,23]})
I have assumed you want to combine the two dataframes and find the average:
df3 = pd.concat([Df2, Df1])
df3.groupby(['STREAM','ADM','NAME'], as_index=False).mean()
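As a self-contained sketch of keeping the string columns while averaging the marks: group on the non-numeric (identifying) columns so they survive as columns, and take the mean of the rest. This uses the modified sample data from this answer and assumes that STREAM, ADM and NAME together identify a student; the name id_cols is made up for illustration.

```python
import pandas as pd

Df1 = pd.DataFrame({
    "STREAM": ["NORTH", "SOUTH", "CENTRAL"], "ADM": [437, 238, 439],
    "NAME": ["JAMES", "MARK", "PETER"], "KCPE": [233, 168, 349],
    "ENG": [70, 28, 79], "KIS": [37, 82, 79], "MAT": [67, 38, 29],
})
Df2 = pd.DataFrame({
    "STREAM": ["NORTH", "SOUTH", "CENTRAL"], "ADM": [437, 238, 439],
    "NAME": ["JAMES", "MARK", "PETER"], "KCPE": [233, 168, 349],
    "ENG": [40, 12, 56], "KIS": [33, 43, 43], "MAT": [22, 58, 23],
})

# Columns that identify a student (assumption) and must be kept as-is
id_cols = ["STREAM", "ADM", "NAME"]

# Stack both frames, then average the remaining numeric columns per student
out = (
    pd.concat([Df1, Df2], ignore_index=True)
    .groupby(id_cols, as_index=False)
    .mean()
)
print(out)
```

Because as_index=False keeps the grouping keys as ordinary columns, the result can be passed to the table-filling loop in the question unchanged.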

Replacing NaN values with group mean

I have a dataframe made of countries, years and many other features. There are many years for a single country.
country year population..... etc.
1 2000 5000
1 2001 NaN
1 2002 4800
2 2000
Now there are many NaNs in the dataframe.
I want to replace each NaN for a specific country, in every column, with that country's average for the column.
So, for example, for the NaN in the population column for country 1, year 2001, I want to use the average population of country 1 over all the years = (5000+4800)/2.
I am currently using the groupby().mean() method to find the means for each country, but I am running into the following difficulties:
1- Some means come out as NaN when I know for sure there is a value for them. Why is that?
2- How can I access specific values in the groupby result? In other words, how can I replace every NaN with its correct average?
Thanks a lot.
Using combine_first with groupby mean
df.combine_first(df.groupby('country').transform('mean'))
Or
df.fillna(df.groupby('country').transform('mean'))
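A small runnable sketch of the fillna variant, using made-up numbers shaped like the question's example (country 1 has populations 5000, NaN, 4800). Restricting the fill to an explicit list of columns, here called value_cols, is an assumption added for clarity and is not required by the answer above.

```python
import numpy as np
import pandas as pd

# Illustrative data; only the first three rows mirror the question
df = pd.DataFrame({
    "country": [1, 1, 1, 2, 2],
    "year": [2000, 2001, 2002, 2000, 2001],
    "population": [5000, np.nan, 4800, 3000, np.nan],
})

# Columns whose NaNs should be filled (hypothetical selection)
value_cols = ["population"]

# transform('mean') returns a frame aligned with df, one group mean per row
group_means = df.groupby("country")[value_cols].transform("mean")
df[value_cols] = df[value_cols].fillna(group_means)
print(df)
```

On the first difficulty in the question: a group mean is itself NaN when every value in that group is missing, so those cells stay NaN after filling. Whether that explains the asker's case cannot be told from the data shown.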