Pandas bringing in data from another dataframe - pandas

I am trying to bring data from a dataframe that is a mapping table into another dataframe using the code below, but I get the error 'x' is not defined. What am I doing wrong?
Note: for values not in the mapping table (China/CN) I would just like the value to be blank or NaN. If there are values in the mapping table that are not in my data, I don't want to include them.
import pandas as pd

languages = {'Language': ["English", "German", "French", "Spanish"],
             'countryCode': ["EN", "DE", "FR", "ES"]}
countries = {'Country': ["Australia", "Argentina", "Mexico", "Algeria", "China"],
             'countryCode': ["EN", "ES", "ES", "FR", "CN"]}

language_map = pd.DataFrame(languages)
data = pd.DataFrame(countries)

def language_converter(x):
    return language_map.query(f"countryCode=='{x}'")['Language'].values[0]

data['Language'] = data['countryCode'].apply(language_converter(x))

Use pandas.DataFrame.merge:
data.merge(language_map, how='left')
Output:
Country countryCode Language
0 Australia EN English
1 Argentina ES Spanish
2 Mexico ES Spanish
3 Algeria FR French
4 China CN NaN
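With no on= argument, merge keys on the columns the two frames have in common, which here is just countryCode; spelling the key out is equivalent:

data.merge(language_map, how='left', on='countryCode')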

.apply accepts a callable object, but you've passed language_converter(x), which calls the function immediately, with an x that is not defined at that point, instead of handing the function itself to apply.
A valid usage is: .apply(language_converter).
But then you'll hit another error, IndexError: index 0 is out of bounds for axis 0 with size 0, because some country codes (such as CN) are not found in the mapping table, which breaks the indexing .values[0].
If you proceed with your starting approach, a valid version would look as below:

import numpy as np

def language_converter(x):
    # .values is an empty array when the code is missing from the mapping table
    lang = language_map[language_map["countryCode"] == x]['Language'].values
    return lang[0] if lang.size > 0 else np.nan

data['Language'] = data['countryCode'].apply(language_converter)
print(data)
Country countryCode Language
0 Australia EN English
1 Argentina ES Spanish
2 Mexico ES Spanish
3 Algeria FR French
4 China CN NaN
But instead of defining and applying language_converter, it's much simpler and more straightforward to map the country codes directly with just:
data['Language'] = data['countryCode'].map(language_map.set_index("countryCode")['Language'])
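An equivalent way to build the lookup, if you prefer a plain dict (the name code_to_lang is just illustrative), with the same NaN behavior for codes missing from the mapping such as CN:

code_to_lang = dict(zip(language_map['countryCode'], language_map['Language']))
data['Language'] = data['countryCode'].map(code_to_lang)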

Related

Python - compare multiple columns, search list of keywords in column and compare with another, in two dataframes to generate a third resultset

I have two very different dataframes.
df1 looks like this:
Region  Entity  Desk                           Function                     Key
US      AAA     Top class, Desk1, Mike's team  Writing, advising            Unique_1
US      AAA     team beta, Blue rats, Tom      task_a, task_d               Unique_2
EMEA    ZZZ     Delta one                      Forecasts, month-end, Sales  Unique_3
JPN     XYZ     Admin                          task1, task_b, task_g        Unique_4
df2 looks like this:
Region  Entity  Desk                                  Function                             ID
EMEA    ZZZ     Equity, delta one                     Sales, sweet talking, schmoozing     A_01
US      AAA     Desk 1, A team, Top class             Writing,calling,listening, advising  A_02
US      AAA     Desk 2, Ninjas, 2nd team, Tom's team  Secret, private                      A_03
EMEA    DDD     Equity, Private Equity                task1, task2, task3, task4           A_04
JPN     XXX     Admin, Secretaries                    task_a, task_b                       A_05
df2 is a much larger recordset than df1.
Both Desk and Function in each of the dataframes were free-text fields and allowed the input of rubbish data. I am trying to build a new recordset from these dataframes using the following criteria:
where -
df1['Region'] == df2['Region']
AND
df1['Entity'] == df2['Entity']
AND
any of the phrases within df1['Desk'] can be matched to any of the phrases within df2['Desk']
AND
any of the phrases within df1['Function'] can be matched to any of the phrases within df2['Function'].
I need the ultimate output to look something like this:
df2.Id  df1.Key   MATCH
A_02    Unique_1  Exact
        Unique_2  No match
A_01    Unique_3  Exact
        Unique_4  No match
I am really struggling with this. I have both dataframes, but I cannot work out how to loop through df1 and match the columns in df2 as specified above. I've tried merging the dataframes, using np.where, and brute-force looping, but nothing works. The tricky bit is matching the Desk and Function columns.
Any ideas?
IIUC, one option is to use a cross merge:

def cross_match(df1, df2, col):
    df = df1.merge(df2, how="cross")
    colx, coly = f"{col}_x", f"{col}_y"
    # lower-case both free-text fields and split them into lists of phrases
    df[[colx, coly]] = df[[colx, coly]].apply(lambda x: x.str.lower()
                                                         .str.split(r"\s*,\s*"))
    # a row matches when any phrase from df1 appears among the phrases from df2
    df["MATCH"] = (pd.Series([any(w in sent for w in lst)
                              for lst, sent in zip(df[colx], df[coly])])
                   .map({True: "Exact"}))
    return df.query("MATCH == 'Exact'")

desk, func = cross_match(df1, df2, "Desk"), cross_match(df1, df2, "Function")

out = (
    pd.merge(desk, func,
             left_on=["Region_x", "Entity_x", "ID"],
             right_on=["Region_y", "Entity_y", "ID"],
             suffixes=("", "_"))
      .set_index("Key")
      .reindex(df1["Key"].unique())
      .fillna({"MATCH": "No match"})
      .reset_index()[["ID", "Key", "MATCH"]]
)
Disclaimer: this approach may get incredibly slow on huge datasets (df1, df2), since the cross merge materializes every pair of rows.
Output:
print(out)
ID Key MATCH
0 A_02 Unique_1 Exact
1 NaN Unique_2 No match
2 A_01 Unique_3 Exact
3 NaN Unique_4 No match
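
For completeness, a sketch reconstructing the two sample frames used above, with values transcribed from the question's tables, so the answer can be run end to end:

import pandas as pd

df1 = pd.DataFrame({
    'Region': ['US', 'US', 'EMEA', 'JPN'],
    'Entity': ['AAA', 'AAA', 'ZZZ', 'XYZ'],
    'Desk': ["Top class, Desk1, Mike's team", 'team beta, Blue rats, Tom',
             'Delta one', 'Admin'],
    'Function': ['Writing, advising', 'task_a, task_d',
                 'Forecasts, month-end, Sales', 'task1, task_b, task_g'],
    'Key': ['Unique_1', 'Unique_2', 'Unique_3', 'Unique_4'],
})

df2 = pd.DataFrame({
    'Region': ['EMEA', 'US', 'US', 'EMEA', 'JPN'],
    'Entity': ['ZZZ', 'AAA', 'AAA', 'DDD', 'XXX'],
    'Desk': ['Equity, delta one', 'Desk 1, A team, Top class',
             "Desk 2, Ninjas, 2nd team, Tom's team",
             'Equity, Private Equity', 'Admin, Secretaries'],
    'Function': ['Sales, sweet talking, schmoozing',
                 'Writing,calling,listening, advising',
                 'Secret, private', 'task1, task2, task3, task4',
                 'task_a, task_b'],
    'ID': ['A_01', 'A_02', 'A_03', 'A_04', 'A_05'],
})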

Pandas: Make a column with (1,2,3) if string of another column-Row value starts with ("A","B","C")

I have a dataframe with filenames and classifications; these are predictions from a network, and I want to map them to integers so I can evaluate the network's predictions.
My dataframe is:
Filename Class
GHT347 Europe
GHT568 lONDON
GHT78 Europe
HJU US
HJI lONDON
HJK US
KLO Europe
KLU lONDON
KLP lONDON
KLY1 lONDON
KL34 US
The true prediction should be:
GHT -- Europe
HJU -- US
KL -- London
I want to map GHT and Europe to 1, US and HJ to 0, KL and London to 2, by adding two additional columns, Prediction and Actual:
Actual Prediction
1 1
1 2
The pandas str.startswith method returns True or False, but here I want three values. Can anyone guide me?
I cannot fully understand what you want, but I can give you some tips.
Use regular expressions:

import numpy as np

df['Actual'] = np.nan
df.loc[df.Filename.str.contains('^GHT.*') & (df.Class == 'Europe'), 'Actual'] = 1
df.loc[df.Filename.str.contains('^HJ.*') & (df.Class == 'US'), 'Actual'] = 0

and so on.
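If you want to set all three values in one shot, one option is numpy.select, which picks the first matching condition and falls back to a default. A minimal sketch, with a hypothetical df built to match the question's layout:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Filename': ['GHT347', 'HJU', 'KLU'],
                   'Class': ['Europe', 'US', 'lONDON']})

# Actual comes from the filename prefix, Prediction from the Class label;
# rows matching no condition get the default value
df['Actual'] = np.select(
    [df['Filename'].str.startswith('GHT'),
     df['Filename'].str.startswith('HJ'),
     df['Filename'].str.startswith('KL')],
    [1, 0, 2], default=-1)
df['Prediction'] = np.select(
    [df['Class'] == 'Europe', df['Class'] == 'US', df['Class'] == 'lONDON'],
    [1, 0, 2], default=-1)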
You can set column values to anything you like, based on the values of one or more other columns. This toy example shows one way to do it:
import pandas as pd

row1list = ['GHT347', 'Europe']
row2list = ['GHT568', 'lONDON']
row3list = ['KLU', 'lONDON']
df = pd.DataFrame([row1list, row2list, row3list],
                  columns=['Filename', 'Class'])

df['Actual'] = -1       # start with a value you will ignore
df['Prediction'] = -1
df.loc[df['Filename'].str.startswith('GHT') & (df['Class'] == 'Europe'), 'Actual'] = 1
df.loc[df['Filename'].str.startswith('KL') & (df['Class'] == 'lONDON'), 'Prediction'] = 2
print(df)
# Filename Class Actual Prediction
# 0 GHT347 Europe 1 -1
# 1 GHT568 lONDON -1 -1
# 2 KLU lONDON -1 2

Applying a function to a list of columns of a dataframe?

I scraped this table from this URL:
"https://www.patriotsoftware.com/blog/accounting/average-cost-living-by-state/"
Which looks like this:
State Annual Mean Wage (All Occupations) Median Monthly Rent Value of a Dollar
0 Alabama $44,930 $998 $1.15
1 Alaska $59,290 $1,748 $0.95
2 Arizona $50,930 $1,356 $1.04
3 Arkansas $42,690 $953 $1.15
4 California $61,290 $2,518 $0.87
And then I wrote this function to help me turn the strings into ints:
def money_string_to_int(s):
    return int(s.replace(",", "").replace("$", ""))

money_string_to_int("$1,23")
My function works when I apply it to only one column. I found this answer about using a function on multiple columns: How to apply a function to multiple columns in Pandas. But my code below does not work, and it produces no errors:
ls = ['Annual Mean Wage (All Occupations)', 'Median Monthly Rent',
      'Value of a Dollar']
ppe_table[ls] = ppe_table[ls].apply(money_string_to_int)
Let's try:

df.set_index('State').apply(lambda x: x.str.replace('[$,]', '', regex=True).astype(float)).reset_index()
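The original ppe_table[ls].apply(money_string_to_int) appears to do nothing, most likely because DataFrame.apply hands the function a whole column (a Series) rather than individual strings, so s.replace resolves to Series.replace instead of str.replace. An element-wise sketch that keeps your converter, assuming the frame is named ppe_table as in the question:

ls = ['Annual Mean Wage (All Occupations)', 'Median Monthly Rent',
      'Value of a Dollar']
# map() applies the converter to every cell of each selected column
ppe_table[ls] = ppe_table[ls].apply(lambda col: col.map(money_string_to_int))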

How to do this operation in pandas

I have a data frame that contains a country column. Unfortunately the country names are all in capital letters, and I need them as ISO3166_1_Alpha_3 codes;
as an example, United States of America is going to be U.S.A,
United Kingdom is going to be U.K, and so on.
Fortunately I found a data frame on the internet that contains 2 important columns: the first is the country name and the second is the ISO3166_1_Alpha_3 code.
You can find the data frame on this website:
https://datahub.io/JohnSnowLabs/iso-3166-country-codes-itu-dialing-codes-iso-4217-currency-codes
So I wrote this code:
data_geo = pd.read_excel("tab0.xlsx")  # the data frame with all the capitalized country names
country_iso = pd.read_csv(r"https://datahub.io/JohnSnowLabs/iso-3166-country-codes-itu-dialing-codes-iso-4217-currency-codes/r/iso-3166-country-codes-itu-dialing-codes-iso-4217-currency-codes-csv.csv",
                          usecols=['Official_Name_English', 'ISO3166_1_Alpha_3'])
s = pd.Series(data_geo.countery_name_e).str.lower().str.title()  # lower-cases the names, then capitalizes the first character of each word
y = pd.Series([])
Now I want to make a loop: wherever a value of s equals an Official_Name_English value, I want to append the corresponding ISO3166_1_Alpha_3 value from country_iso to the y series. If the country name isn't in that list, append NaN.
These are 20 rows of s:
['Diffrent Countries', 'Germany', 'Other Countries', 'Syria',
'Jordan', 'Yemen', 'Sudan', 'Somalia', 'Australia',
'Other Countries', 'Syria', 'Lebanon', 'Jordan', 'Yemen', 'Qatar',
'Sudan', 'Ethiopia', 'Djibouti', 'Somalia', 'Botswana Land']
Do you know how I can do this?
You could try map:

data_geo = pd.read_excel("tab0.xlsx")
country_iso = pd.read_csv(r"https://datahub.io/JohnSnowLabs/iso-3166-country-codes-itu-dialing-codes-iso-4217-currency-codes/r/iso-3166-country-codes-itu-dialing-codes-iso-4217-currency-codes-csv.csv",
                          usecols=['Official_Name_English', 'ISO3166_1_Alpha_3'])
s = pd.Series(data_geo.countery_name_e).str.lower().str.title()

mapper = (country_iso.drop_duplicates('Official_Name_English')
                     .dropna(subset=['Official_Name_English'])
                     .set_index('Official_Name_English')['ISO3166_1_Alpha_3'])
y = s.map(mapper)  # map the title-cased names; the raw all-caps column would not match
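Since map leaves names that are missing from the mapping (e.g. 'Other Countries') as NaN, which is what was asked for, the result can be attached straight back; the column name below is just illustrative:

data_geo['country_iso3'] = y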

Pandas: Unable to change value of a cell using while loop

I am trying to use a while loop to read through all the rows of my file and edit the value of a particular cell when a condition is met.
My logic works just fine when I am reading data from an Excel file, but the same logic does not work when I am reading from a CSV file.
Here is my logic for reading from the Excel file:
df = pd.read_excel('Energy Indicators.xls', 'Energy', index_col=None, na_values=['NA'],
                   skiprows=15, skipfooter=38, header=1, parse_cols='C:F')
df = df.rename(columns={'Unnamed: 0': 'Country', 'Renewable Electricity Production': '% Renewable'})
df = df.drop(0, axis=0)

i = 0
while i != len(df):
    if df.iloc[i]['Country'] == "Ukraine18":
        print(df.iloc[i]['Country'])
        df.iloc[i]['Country'] = 'Ukraine'
        print(df.iloc[i]['Country'])
    i += 1
df
The result I get is:
Ukraine18
Ukraine
But when I read a CSV file:
df = pd.read_csv('world_bank.csv', skiprows=4)
df = df.rename(columns={'Country Name': 'Country'})

i = 0
while i != len(df):
    if df.iloc[i]['Country'] == "Aruba":
        print(df.iloc[i]['Country'])
        df.iloc[i]['Country'] = "Arb"
        print(df.iloc[i]['Country'])
    i += 1
df
The result I get is:
Aruba
Aruba
Can someone please help? What am I doing wrong with my CSV file?
@Anna Iliukovich-Strakovskaia, @msr_003, you guys are right! I changed my code to df['ColumnName'][i], and it worked with the CSV file. But it is not working with the Excel file now.
So it seems that with data read from a CSV file, df['ColumnName'][i] works correctly,
but with data read from an Excel file, df.iloc[i]['ColumnName'] works correctly.
At this point I have no clue why there should be a difference, because I am not working with the data 'within' the files; I am working on data that was read from these files into a dataframe. Once the data is in the dataframe, the source shouldn't have any influence, I think.
Anyway, thank you for your help!!
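For what it's worth, a likely explanation: both df.iloc[i]['Country'] = ... and df['Country'][i] = ... are chained assignments. They first extract an intermediate object and then write into it, and whether that write reaches the original frame depends on whether the intermediate is a view or a copy, which can vary with the frame's dtypes and hence with how it was loaded. A sketch of the reliable single-step idioms:

# Vectorized: one .loc call, always writes into df itself
df.loc[df['Country'] == 'Aruba', 'Country'] = 'Arb'

# Or, keeping an explicit loop, address row and column in one indexer
for i in range(len(df)):
    if df.iloc[i]['Country'] == 'Aruba':
        df.iloc[i, df.columns.get_loc('Country')] = 'Arb'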
Generally I modify values as below.
testdf = pd.read_csv("sam.csv")
testdf
ExportIndex AgentName Country
0 1 Prince United Kingdom
1 2 Nit United Kingdom
2 3 Akhil United Kingdom
3 4 Ruiva United Kingdom
4 5 Niraj United Kingdom
5 6 Nitin United States
i = 0
while i != len(testdf):
    if testdf['AgentName'][i] == 'Nit':
        testdf['AgentName'][i] = 'Nitesh'
    i += 1
testdf
ExportIndex AgentName Country
0 1 Prince United Kingdom
1 2 Nitesh United Kingdom
2 3 Akhil United Kingdom
3 4 Ruiva United Kingdom
4 5 Niraj United Kingdom
5 6 Nitin United States
But I'm not sure what's wrong with your approach.