How to map country names to ISO 3166-1 alpha-3 codes in pandas

I have a data frame that contains a country column. Unfortunately the country names are all capitalized, and I need them converted to ISO3166_1_Alpha_3 codes:
for example, United States of America becomes USA
and United Kingdom becomes GBR, and so on.
Fortunately, I found a data frame on the internet that contains the two columns I need: the country name (Official_Name_English) and the code (ISO3166_1_Alpha_3).
You can find the data frame on this website:
https://datahub.io/JohnSnowLabs/iso-3166-country-codes-itu-dialing-codes-iso-4217-currency-codes
So I wrote this code:
data_geo = pd.read_excel("tab0.xlsx")  # the data frame with the all-caps country names
country_iso = pd.read_csv(r"https://datahub.io/JohnSnowLabs/iso-3166-country-codes-itu-dialing-codes-iso-4217-currency-codes/r/iso-3166-country-codes-itu-dialing-codes-iso-4217-currency-codes-csv.csv",
                          usecols=['Official_Name_English', 'ISO3166_1_Alpha_3'])
s = pd.Series(data_geo.countery_name_e).str.lower().str.title()  # lower-case the names, then capitalize the first letter of each word
y = pd.Series([], dtype=object)
Now I want a loop: wherever a value in s equals an Official_Name_English value, append that row's ISO3166_1_Alpha_3 from country_iso to the y series. If the country name isn't in the list, append NaN.
These are 20 rows of s:
['Diffrent Countries', 'Germany', 'Other Countries', 'Syria',
'Jordan', 'Yemen', 'Sudan', 'Somalia', 'Australia',
'Other Countries', 'Syria', 'Lebanon', 'Jordan', 'Yemen', 'Qatar',
'Sudan', 'Ethiopia', 'Djibouti', 'Somalia', 'Botswana Land']
Do you know how I can do this?

You could try map:
data_geo = pd.read_excel("tab0.xlsx")
country_iso = pd.read_csv(r"https://datahub.io/JohnSnowLabs/iso-3166-country-codes-itu-dialing-codes-iso-4217-currency-codes/r/iso-3166-country-codes-itu-dialing-codes-iso-4217-currency-codes-csv.csv",
                          usecols=['Official_Name_English', 'ISO3166_1_Alpha_3'])
s = pd.Series(data_geo.countery_name_e).str.lower().str.title()
mapper = (country_iso.drop_duplicates('Official_Name_English')
                     .dropna(subset=['Official_Name_English'])
                     .set_index('Official_Name_English')['ISO3166_1_Alpha_3'])
y = s.map(mapper)  # map the title-cased names; unmatched names become NaN
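For reference, here is a minimal, self-contained sketch of this approach with a made-up three-row lookup table standing in for the DataHub CSV (the names and codes below are illustrative, not the real file): names missing from the lookup come back as NaN automatically, which is exactly the behavior asked for.

```python
import pandas as pd

# Hypothetical lookup table standing in for the DataHub CSV
country_iso = pd.DataFrame({
    'Official_Name_English': ['Germany', 'Jordan', 'Qatar'],
    'ISO3166_1_Alpha_3': ['DEU', 'JOR', 'QAT'],
})

# Lower-case then title-case the all-caps names, as in the question
s = pd.Series(['GERMANY', 'JORDAN', 'Other Countries']).str.lower().str.title()

mapper = country_iso.set_index('Official_Name_English')['ISO3166_1_Alpha_3']
y = s.map(mapper)  # names missing from the lookup map to NaN
print(y.tolist())  # ['DEU', 'JOR', nan]
```

No explicit loop is needed: Series.map does the lookup row by row and fills NaN for anything it cannot find.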

Related

Compress Large Data in R into csv without NULLS or LIST

FIRST TIME POSTING:
I'm preparing data for arules() read.transactions and need to compress unique Invoice data (500k+ cases) so that each unique Invoice and its associated info fits on a single line like this:
Invoice001,CustomerID,Country,StockCodeXYZ,StockCode123
Invoice002...etc
However, the data reads in repeating the Invoice for each StockCode like this:
Invoice001,CustomerID,Country,StockCodeXYZ
Invoice001,CustomerID,Country,StockCode123
Invoice002....etc
I've been trying pivot_wider() and then unite(), but that generates 285M+ mostly-NULL cells in a list-column, which I'm having a hard time resolving and cannot write to csv or read into arules. I've also tried keep(~!is.null(.)), discard(is.null), and compact() without success, and I'm open to any method that achieves the desired outcome above.
However, I feel like I should be able to solve this with the built-in arules read.transactions() function, but I'm getting various errors as I try different things there too.
The data is open source from the University of California, Irvine, and found here: https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx
Any help would be greatly appreciated.
library(readxl)
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
destfile <- "Online_20Retail.xlsx"
curl::curl_download(url, destfile)
Online_20Retail <- read_excel(destfile)
trans <- read.transactions(????????????)
Note that this one invoice, "573585", has over 1,000 items, so it will generate a corresponding number of columns if you only take the stock numbers from the invoice items; we still end up with a bit over 1,000 columns.
library(dplyr)
Online_20Retail %>%
  dplyr::transmute(new = paste0(InvoiceNo, ", ",
                                CustomerID, ", ",
                                Country, ", "),
                   StockCode) %>%
  dplyr::group_by(new) %>%
  dplyr::summarise(output = paste(StockCode, collapse = ", ")) %>%
  dplyr::transmute(mystring = paste0(new, output))
# you might want to put "%>% dplyr::pull(mystring)" at the ending of the line above to get a vector not tibble/dataframe
# A tibble: 25,900 x 1
mystring
<chr>
1 536365, 17850, United Kingdom, 85123A, 71053, 84406B, 84029G, 84029E, 22752, 21730
2 536366, 17850, United Kingdom, 22633, 22632
3 536367, 13047, United Kingdom, 84879, 22745, 22748, 22749, 22310, 84969, 22623, 22622, 21754, 21755, 21777, 48187
4 536368, 13047, United Kingdom, 22960, 22913, 22912, 22914
5 536369, 13047, United Kingdom, 21756
6 536370, 12583, France, 22728, 22727, 22726, 21724, 21883, 10002, 21791, 21035, 22326, 22629, 22659, 22631, 22661, 21731, 22900, 21913, 22540, 22~
7 536371, 13748, United Kingdom, 22086
8 536372, 17850, United Kingdom, 22632, 22633
9 536373, 17850, United Kingdom, 85123A, 71053, 84406B, 20679, 37370, 21871, 21071, 21068, 82483, 82486, 82482, 82494L, 84029G, 84029E, 22752, 217~
10 536374, 15100, United Kingdom, 21258
# ... with 25,890 more rows

Compare two Dataframes based on column value(String, Substring) and update another column value

Given two dataframes, df1 and df2, where df1's Name column is a partial match for df2's Name column: on a partial match of the Name values, compare the Price values of both data frames, and if the price is the same, set the Flag column in df1 to 'Delete'.
df1

Name                   Price         Flag
VENTILLA HOME FARR     662324.21     Delete
VENTILLA HOME FARR     -277961.62
VENTILLA HOME FARR     776011.5
VARAMANT METRO PLANET  662324.21
VARAMANT METRO PLANET  55555.21      Delete
VARAMANT METRO PLANET  267117.5499
FANTHOM STREET LLB     83265.2799
FANTHOM STREET LLB     -444452.96    Delete
FANTHOM STREET LLB     267117.5499
df2
my_dict = {'VT METRO PLANET ': 267117.5499, 'VENTILLA HOME FA ': -277961.62, 'FANTHOM STREET ': 83265.2799}
df2 = pd.DataFrame(list(my_dict.items()),columns = ['Name','Price'])
Expected Output
Any help would be appreciated
The solution I share here is based on sets: if the Name in dataframe 1 shares more than one word with the Name in dataframe 2, and their Price values are also equal, we set the Flag column in dataframe 1 to 'Delete'; otherwise we leave it empty.
Here is the code:
def check(row):
    df1_words = set(word.lower() for word in row.Name.split(' '))
    for df2_name, df2_price in df2[['Name', 'Price']].values:
        df2_words = set(word.lower() for word in df2_name.split(' '))
        # more than one shared word plus an exact price match
        if len(df1_words.intersection(df2_words)) > 1 and row.Price == df2_price:
            return 'Delete'
    return ''

df1["Flag"] = df1.apply(check, axis=1)
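As a check, here is a self-contained sketch of the same set-based idea on a cut-down version of the sample data (only two df1 rows, with prices taken from the tables above):

```python
import pandas as pd

df2 = pd.DataFrame({'Name': ['VT METRO PLANET', 'VENTILLA HOME FA'],
                    'Price': [267117.5499, -277961.62]})
df1 = pd.DataFrame({'Name': ['VENTILLA HOME FARR', 'VENTILLA HOME FARR'],
                    'Price': [-277961.62, 776011.5],
                    'Flag': ['', '']})

def check(row):
    df1_words = set(word.lower() for word in row.Name.split())
    for df2_name, df2_price in df2[['Name', 'Price']].values:
        df2_words = set(word.lower() for word in df2_name.split())
        # require more than one shared word plus an exact price match
        if len(df1_words & df2_words) > 1 and row.Price == df2_price:
            return 'Delete'
    return ''

df1['Flag'] = df1.apply(check, axis=1)
print(df1['Flag'].tolist())  # ['Delete', '']
```

The first row shares two words ('ventilla', 'home') with 'VENTILLA HOME FA' and has the same price, so it is flagged; the second row matches no price and stays empty.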

How do I remove a huge number of columns in a dataframe based on their names?

I have a dataframe with 60 columns. The column names are years: 1960.0, 1961.0, ..., 2010.0. I want to remove the columns from 1960 to 2006. This is what I have tried so far:
a = list(map(str,map(float,range(1960,2006))))
gdp = gdp.drop(a,axis=1)
gdp
When I run the code, it shows this:
KeyError: "['1960.0' '1961.0' '1962.0' '1963.0' '1964.0' '1965.0' '1966.0' '1967.0'\n '1968.0' '1969.0' '1970.0' '1971.0' '1972.0' '1973.0' '1974.0' '1975.0'\n '1976.0' '1977.0' '1978.0' '1979.0' '1980.0' '1981.0' '1982.0' '1983.0'\n '1984.0' '1985.0' '1986.0' '1987.0' '1988.0' '1989.0' '1990.0' '1991.0'\n '1992.0' '1993.0' '1994.0' '1995.0' '1996.0' '1997.0' '1998.0' '1999.0'\n '2000.0' '2001.0' '2002.0' '2003.0' '2004.0' '2005.0'] not found in axis"
I think the \n characters are interfering here, but I don't know how to make it work. Any help?
The headings of the columns are 1960.0, 1961.0, ..., 2010.0, but it still doesn't work.
I think you need floats, not strings, so remove the conversion to strings:
a = list(map(float,range(1960,2006)))
#or
#a = list(range(1960,2006))
gdp = gdp.drop(a,axis=1)
Or:
gdp = gdp.loc[:, ~gdp.columns.isin(a)]
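A minimal sketch using a hypothetical four-column frame in place of the 60-column one; errors='ignore' skips drop labels this toy frame lacks:

```python
import pandas as pd

# hypothetical stand-in for the 60-column GDP frame
gdp = pd.DataFrame([[1, 2, 3, 4]], columns=[1960.0, 1961.0, 2009.0, 2010.0])

a = list(map(float, range(1960, 2006)))      # float labels, matching the columns
gdp = gdp.drop(columns=a, errors='ignore')   # ignore years not present in this toy frame
print(list(gdp.columns))  # [2009.0, 2010.0]
```

The original KeyError happened because the drop list held strings like '1960.0' while the actual column labels are floats; once the types agree, drop works.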

How to randomly put numbers and strings in Excel using pandas

I have 3000 rows of data in excel
id,product,store,revenue,data,state
1,Ball,,222,nil,
1,Pen,,234,nil,
2,Books,,543,nil,
2,Ink,,123,nil,
I need to fill the 3rd column (store) with a random number between 1 and 5.
My code gives 1 every time, because randint is evaluated only once: df['store'] = df['store'].fillna(random.randint(1,5))
I need to fill the 5th column (state) with a random string from {'CA', 'WD', 'CH', 'AL'}.
I need to create a 6th column, country: if the 5th column is 'CA' or 'CH', map to USA; if 'WD' or 'AL', map to Japan:
{'CA': 'USA', 'CH': 'USA', 'WD': 'Japan', 'AL': 'Japan'}
Let us try
n=len(df)
num=np.random.randint(1,6,size=n)
l={'CA', 'WD','CH', 'AL'}
state=np.random.choice(list(l), n)
df['store'] = df['store'].fillna(pd.Series(num,index=df.index))
df['state'] = df['state'].fillna(pd.Series(state,index=df.index))
df['country'] = df.state.map({'CA': 'USA', 'CH': 'USA', 'WD': 'Japan', 'AL': 'Japan'})
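Putting it together on a tiny made-up frame (only the store and state columns, three rows): filling with a Series draws a fresh random value per row, unlike the scalar randint in the question, which is evaluated once and reused for every NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'store': [np.nan, 3.0, np.nan],
                   'state': [np.nan, 'CA', np.nan]})

n = len(df)
num = np.random.randint(1, 6, size=n)                  # one draw per row, values 1..5
state = np.random.choice(['CA', 'WD', 'CH', 'AL'], n)  # one random state per row

df['store'] = df['store'].fillna(pd.Series(num, index=df.index))
df['state'] = df['state'].fillna(pd.Series(state, index=df.index))
df['country'] = df['state'].map({'CA': 'USA', 'CH': 'USA', 'WD': 'Japan', 'AL': 'Japan'})
print(df)
```

Note that 'WD' is included in the mapping dict here; without it, any row drawing 'WD' would get NaN in the country column.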

SSRS Switch statement in Expression is not working (color code a polygon in a chart)

I have a report that breaks down financials by state, shown as a tablix. I also have a chart as a map where I want to display the same data visually.
The actual data is broken up like this:
NM City 100
NJ City1 100
NJ City2 100
NJ City3 100
NY City 100
NY City2 100
In SSRS, each state is a polygon.
I want to set the fill color of that polygon to be a color based on the Total Value of that state.
The best way to do this would be to just set the color value equal to my formula against the total value. Then I would use that same line of code for every polygon and it would color code accordingly.
However, I do not think the polygons know which state they belong to. For example, is there any way to get the New York Polygon to only look at the NY state value?
In case there isn't, I'm trying to do a SWITCH statement where, for every polygon, it only gets the value where the state name equals whatever I manually input.
=SWITCH
(Max(Fields!State.Value, "CustomersByState") = "NE" , "10000"
Max(Fields!State.Value, "CustomersByState") = "NY" , "20000"
1=1,"Coral")
When I use that line as the expression for the label of that polygon (for testing; if I can make this work I can make anything work), it gives me an error: comma, ')', or a valid expression continuation expected.
I believe you need a comma after "10000" and "20000":
=SWITCH(
    Max(Fields!State.Value, "CustomersByState") = "NE", "10000",
    Max(Fields!State.Value, "CustomersByState") = "NY", "20000",
    1=1, "Coral")