I am trying to bring data from one dataframe, which is a mapping table, into another dataframe using the code below, but I get the error NameError: name 'x' is not defined. What am I doing wrong, please?
Note: for values not in the mapping table (China/CN) I would just like the value to be blank or NaN. If there are values in the mapping table that are not in my data, I don't want to include them.
import pandas as pd

languages = {'Language': ["English", "German", "French", "Spanish"],
             'countryCode': ["EN", "DE", "FR", "ES"]}
countries = {'Country': ["Australia", "Argentina", "Mexico", "Algeria", "China"],
             'countryCode': ["EN", "ES", "ES", "FR", "CN"]}

language_map = pd.DataFrame(languages)
data = pd.DataFrame(countries)

def language_converter(x):
    return language_map.query(f"countryCode=='{x}'")['Language'].values[0]

data['Language'] = data['countryCode'].apply(language_converter(x))
Use pandas.DataFrame.merge with a left join: it keeps every row of data (so unmapped codes like CN come out as NaN) and ignores mapping rows that never appear in your data:
data.merge(language_map, how='left')
Output:
Country countryCode Language
0 Australia EN English
1 Argentina ES Spanish
2 Mexico ES Spanish
3 Algeria FR French
4 China CN NaN
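If the two frames ever share more columns than the key, it's safer to name the join column explicitly; a minimal variant of the same call:

data.merge(language_map, on='countryCode', how='left')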
.apply expects a callable object, but you've passed language_converter(x), which calls the function immediately, and at that point x is not defined (apply hasn't supplied it yet), hence the NameError.
A valid usage is: .apply(language_converter).
But then you'll hit another error, IndexError: index 0 is out of bounds for axis 0 with size 0, because some country codes (like CN) are not found in the mapping table, which breaks the .values[0] indexing.
If you want to proceed with your original approach, a valid version would look like this:

import numpy as np

def language_converter(x):
    lang = language_map[language_map["countryCode"] == x]['Language'].values
    return lang[0] if lang.size > 0 else np.nan

data['Language'] = data['countryCode'].apply(language_converter)
print(data)
Country countryCode Language
0 Australia EN English
1 Argentina ES Spanish
2 Mexico ES Spanish
3 Algeria FR French
4 China CN NaN
But instead of defining and applying language_converter, it's much simpler and more straightforward to map the country codes directly:
data['Language'] = data['countryCode'].map(language_map.set_index("countryCode")['Language'])
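And if you'd rather have blanks than NaN for unmatched codes like CN (as mentioned in the question), you can chain fillna onto the same call:

data['Language'] = data['countryCode'].map(language_map.set_index("countryCode")['Language']).fillna('')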
FIRST TIME POSTING:
I'm preparing data for arules' read.transactions() and need to compress unique Invoice data (500k+ cases) so that each unique Invoice and its associated info fits on a single line like this:
Invoice001,CustomerID,Country,StockCodeXYZ,StockCode123
Invoice002...etc
However, the data reads in with the Invoice repeated for each StockCode, like this:
Invoice001,CustomerID,Country,StockCodeXYZ
Invoice001,CustomerID,Country,StockCode123
Invoice002....etc
I've been trying pivot_wider() and then unite(), but that generates 285M+ mostly-NULL cells in a list, which I'm having a hard time resolving and am unable to write to CSV or read into arules. I've also tried keep(~!is.null(.)), discard(is.null), and compact() without success, and am open to any method that achieves the desired outcome above.
However, I feel like I should be able to solve this with arules' built-in read.transactions() function, but I'm getting various errors as I try different things there too.
The data is open source from the University of California, Irvine, and can be found here: https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx
Any help would be greatly appreciated.
library(readxl)
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
destfile <- "Online_20Retail.xlsx"
curl::curl_download(url, destfile)
Online_20Retail <- read_excel(destfile)
trans <- read.transactions(????????????)
Note that one invoice, "573585", has over 1,000 items, so it will generate a corresponding number of columns if you only take the stock numbers from the invoice items; we still end up with a bit over 1,000 columns.
library(dplyr)

Online_20Retail %>%
    dplyr::transmute(new = paste0(InvoiceNo, ", ",
                                  CustomerID, ", ",
                                  Country, ", "),
                     StockCode) %>%
    dplyr::group_by(new) %>%
    dplyr::summarise(output = paste(StockCode, collapse = ", ")) %>%
    dplyr::transmute(mystring = paste0(new, output))

# you might want to append "%>% dplyr::pull(mystring)" to the last line above to get a vector instead of a tibble/dataframe
# A tibble: 25,900 x 1
mystring
<chr>
1 536365, 17850, United Kingdom, 85123A, 71053, 84406B, 84029G, 84029E, 22752, 21730
2 536366, 17850, United Kingdom, 22633, 22632
3 536367, 13047, United Kingdom, 84879, 22745, 22748, 22749, 22310, 84969, 22623, 22622, 21754, 21755, 21777, 48187
4 536368, 13047, United Kingdom, 22960, 22913, 22912, 22914
5 536369, 13047, United Kingdom, 21756
6 536370, 12583, France, 22728, 22727, 22726, 21724, 21883, 10002, 21791, 21035, 22326, 22629, 22659, 22631, 22661, 21731, 22900, 21913, 22540, 22~
7 536371, 13748, United Kingdom, 22086
8 536372, 17850, United Kingdom, 22632, 22633
9 536373, 17850, United Kingdom, 85123A, 71053, 84406B, 20679, 37370, 21871, 21071, 21068, 82483, 82486, 82482, 82494L, 84029G, 84029E, 22752, 217~
10 536374, 15100, United Kingdom, 21258
# ... with 25,890 more rows
I have a dataframe with filenames and classifications; these are predictions from a network, and I want to map them to integers to evaluate the network's predictions.
My dataframe is:
Filename   Class
GHT347 Europe
GHT568 lONDON
GHT78 Europe
HJU US
HJI lONDON
HJK US
KLO Europe
KLU lONDON
KLP lONDON
KLY1 lONDON
KL34 US
The true classes should be:
GHT -- Europe
HJ -- US
KL -- London
I want to map GHT and Europe to 1, HJ and US to 0, and KL and London to 2, by adding two additional columns, Prediction and Actual:
Actual Prediction
1 1
1 2
pandas' str.startswith method returns True or False, but here I want three values. Can anyone guide me?
I can't fully understand what you want, but I can give you some tips. Use regular expressions:

import numpy as np

df['Actual'] = np.nan
df.loc[(df.Filename.str.contains('^GHT.*')) & (df.Class == 'Europe'), 'Actual'] = 1
df.loc[(df.Filename.str.contains('^HJ.*')) & (df.Class == 'US'), 'Actual'] = 0

and so on.
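For completeness, the remaining rule in the same style would presumably be (assuming, per the question, that KL filenames classed as lONDON map to 2):

df.loc[(df.Filename.str.contains('^KL.*')) & (df.Class == 'lONDON'), 'Actual'] = 2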
You can set column values to anything you like, based on the values of one or more other columns. This toy example shows one way to do it:
import pandas as pd

row1list = ['GHT347', 'Europe']
row2list = ['GHT568', 'lONDON']
row3list = ['KLU', 'lONDON']
df = pd.DataFrame([row1list, row2list, row3list],
                  columns=['Filename', 'Class'])

df['Actual'] = -1  # start with a value you will ignore
df['Prediction'] = -1

df.loc[(df['Filename'].str.startswith('GHT')) & (df['Class'] == 'Europe'), 'Actual'] = 1
df.loc[(df['Filename'].str.startswith('KL')) & (df['Class'] == 'lONDON'), 'Prediction'] = 2
print(df)
# Filename Class Actual Prediction
# 0 GHT347 Europe 1 -1
# 1 GHT568 lONDON -1 -1
# 2 KLU lONDON -1 2
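If you'd rather assign all three values in one pass instead of one .loc per rule, numpy.select is one option; a sketch, assuming the mapping GHT/Europe -> 1, HJ/US -> 0, KL/lONDON -> 2 from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Filename': ['GHT347', 'HJU', 'KLU'],
                   'Class': ['Europe', 'US', 'lONDON']})

# Actual comes from the filename prefix, Prediction from the network's Class column
codes = [1, 0, 2]
prefix_conds = [df['Filename'].str.startswith('GHT'),
                df['Filename'].str.startswith('HJ'),
                df['Filename'].str.startswith('KL')]
class_conds = [df['Class'].eq('Europe'),
               df['Class'].eq('US'),
               df['Class'].eq('lONDON')]

df['Actual'] = np.select(prefix_conds, codes, default=-1)
df['Prediction'] = np.select(class_conds, codes, default=-1)
print(df)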
I have data on countries' trade with one another. I have split the main file by month, giving 12 CSV files for the year 2019. A sample of the January CSV is provided below:
reporter partner year month trade
0 Albania Argentina 2019 01 515256
1 Albania Australia 2019 01 398336
2 Albania Austria 2019 01 7664503
3 Albania Bahrain 2019 01 400
4 Albania Bangladesh 2019 01 653907
5 Zimbabwe Zambia 2019 01 79569855
I want to build a network for every month and print the number of nodes of each network. I can do it the hard (stupid) way, like so:
import pandas as pd
import networkx as nx

df01 = pd.read_csv('012019.csv')
df02 = pd.read_csv('022019.csv')
df03 = pd.read_csv('032019.csv')

df1 = df01[['reporter', 'partner', 'trade']]
df2 = df02[['reporter', 'partner', 'trade']]
df3 = df03[['reporter', 'partner', 'trade']]

G1 = nx.Graph()
G1 = nx.from_pandas_edgelist(df1, 'reporter', 'partner', edge_attr='trade')
G1.number_of_nodes()
and so on for the next networks.
My question is: how can I use a for loop to read the files, convert them to networks, and report the number of nodes of each network?
I tried this, but nothing is reported:
for f in glob.glob('.csv'):
    df = pd.read_csv(f)
    df1 = df[['reporter', 'partner', 'trade']]
    G = nx.from_pandas_edgelist(df1, 'reporter', 'partner', edge_attr='trade')
    G.number_of_nodes()
Thanks.
Edit:
OK, so I managed to do the above with code like the following:
for files in glob.glob('/home/user/VMShared/network/2nd/*.csv'):
    df = pd.read_csv(files)
    df1 = df[['reporter', 'partner', 'import']]
    G = nx.Graph()
    G = nx.from_pandas_edgelist(df1, 'reporter', 'partner', edge_attr='import')
    nx.write_graphml_lxml(G, "/home/user/VMShared/network/2nd/*.graphml")
The problem I now face is how to write separate files: all I get from this is a single file titled *.graphml. How can I get a graphml file for every input file? Also, having each output file take its name from the input file would be a plus.
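One way to get one .graphml per input file is to derive the output name from the input path instead of hard-coding it; a sketch, assuming the same directory and columns as above:

import glob
import os

import networkx as nx
import pandas as pd

for f in glob.glob('/home/user/VMShared/network/2nd/*.csv'):
    df = pd.read_csv(f)[['reporter', 'partner', 'import']]
    G = nx.from_pandas_edgelist(df, 'reporter', 'partner', edge_attr='import')
    # swap the .csv extension for .graphml, keeping the file's base name
    out = os.path.splitext(f)[0] + '.graphml'
    nx.write_graphml_lxml(G, out)
    print(out, G.number_of_nodes())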
I have a data frame that contains a country column. Unfortunately the country names are all capitalized and I need them as ISO3166_1_Alpha_3 codes.
As an example, United States of America is going to be U.S.A, United Kingdom is going to be U.K, and so on.
Fortunately, I found a data frame on the internet that contains two important columns: the first is the country name and the second is the ISO3166_1_Alpha_3 code. You can find it on this website:
https://datahub.io/JohnSnowLabs/iso-3166-country-codes-itu-dialing-codes-iso-4217-currency-codes
So I wrote this code:

import pandas as pd

data_geo = pd.read_excel("tab0.xlsx")  # the data frame that contains all the capitalized country names
country_iso = pd.read_csv(r"https://datahub.io/JohnSnowLabs/iso-3166-country-codes-itu-dialing-codes-iso-4217-currency-codes/r/iso-3166-country-codes-itu-dialing-codes-iso-4217-currency-codes-csv.csv",
                          usecols=['Official_Name_English', 'ISO3166_1_Alpha_3'])
s = pd.Series(data_geo.countery_name_e).str.lower().str.title()  # lowercase the names, then capitalize the first letter of each word
y = pd.Series([])
Now I want to make a loop: whenever a value of s equals an Official_Name_English entry, I want to append the corresponding country_iso ISO3166_1_Alpha_3 value to the y series; if the country name isn't in the list, append NaN.
These are 20 rows of s:
['Diffrent Countries', 'Germany', 'Other Countries', 'Syria',
'Jordan', 'Yemen', 'Sudan', 'Somalia', 'Australia',
'Other Countries', 'Syria', 'Lebanon', 'Jordan', 'Yemen', 'Qatar',
'Sudan', 'Ethiopia', 'Djibouti', 'Somalia', 'Botswana Land']
Do you know how I can do this?
You could try map:
data_geo = pd.read_excel("tab0.xlsx")
country_iso = pd.read_csv(r"https://datahub.io/JohnSnowLabs/iso-3166-country-codes-itu-dialing-codes-iso-4217-currency-codes/r/iso-3166-country-codes-itu-dialing-codes-iso-4217-currency-codes-csv.csv",
                          usecols=['Official_Name_English', 'ISO3166_1_Alpha_3'])
s = pd.Series(data_geo.countery_name_e).str.lower().str.title()

mapper = (country_iso.drop_duplicates('Official_Name_English')
                     .dropna(subset=['Official_Name_English'])
                     .set_index('Official_Name_English')['ISO3166_1_Alpha_3'])

y = s.map(mapper)  # map the title-cased names; anything not found becomes NaN
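Note that only names that exactly match Official_Name_English will resolve; entries like 'Other Countries', or short forms such as 'Syria' whose official English name differs, will come back as NaN. You can list the unmatched names with something like:

print(s[y.isna()].unique())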