Applying a function to a list of columns of a dataframe? - pandas

I scraped this table from this URL:
"https://www.patriotsoftware.com/blog/accounting/average-cost-living-by-state/"
Which looks like this:
State Annual Mean Wage (All Occupations) Median Monthly Rent Value of a Dollar
0 Alabama $44,930 $998 $1.15
1 Alaska $59,290 $1,748 $0.95
2 Arizona $50,930 $1,356 $1.04
3 Arkansas $42,690 $953 $1.15
4 California $61,290 $2,518 $0.87
And then I wrote this function to help me turn the strings into ints:
def money_string_to_int(s):
    return int(s.replace(",", "").replace("$", ""))

money_string_to_int("$1,23")
My function works when I apply it to a single column. I found this answer about using a function on multiple columns: How to apply a function to multiple columns in Pandas
But my code below does not work, even though it produces no errors:
ls = ['Annual Mean Wage (All Occupations)', 'Median Monthly Rent',
      'Value of a Dollar']
ppe_table[ls] = ppe_table[ls].apply(money_string_to_int)

Let's try:
df.set_index('State').apply(lambda x: x.str.replace('[$,]', '', regex=True).astype(float)).reset_index()
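If you'd rather keep the helper from the question, here is a minimal sketch (assuming the scraped table is stored in ppe_table, as above) that applies it element-wise; DataFrame.apply hands each whole column (a Series) to the function rather than individual strings, so the question's helper never sees plain strings. Note that "Value of a Dollar" holds decimals such as "$1.15", so a float variant of the helper is used here:

def money_string_to_number(s):
    # hypothetical float variant of money_string_to_int, since
    # "Value of a Dollar" contains values like "$1.15"
    return float(s.replace(",", "").replace("$", ""))

ls = ['Annual Mean Wage (All Occupations)', 'Median Monthly Rent',
      'Value of a Dollar']

# map element-wise so each cell string reaches the helper
ppe_table[ls] = ppe_table[ls].apply(lambda col: col.map(money_string_to_number))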

Related

Python - compare multiple columns, search list of keywords in column and compare with another, in two dataframes to generate a third resultset

I have two very different dataframes.
df1 looks like this:
Region  Entity  Desk                                  Function                     Key
US      AAA     Top class, Desk1, Mike's team         Writing, advising            Unique_1
US      AAA     team beta, Blue rats, Tom             task_a, task_d               Unique_2
EMEA    ZZZ     Delta one                             Forecasts, month-end, Sales  Unique_3
JPN     XYZ     Admin                                 task1, task_b, task_g        Unique_4
df2 looks like this:
Region  Entity  Desk                                  Function                             ID
EMEA    ZZZ     Equity, delta one                     Sales, sweet talking, schmoozing     A_01
US      AAA     Desk 1, A team, Top class             Writing,calling,listening, advising  A_02
US      AAA     Desk 2, Ninjas, 2nd team, Tom's team  Secret, private                      A_03
EMEA    DDD     Equity, Private Equity                task1, task2, task3, task4           A_04
JPN     XXX     Admin, Secretaries                    task_a, task_b                       A_05
df2 is a much larger recordset than df1.
Both Desk and Function in each of the dataframes were free-text fields and allowed the input of rubbish data. I am trying to build a new recordset from these dataframes using the following criteria:
where -
df1['Region'] == df2['Region']
AND
df1['Entity'] == df2['Entity']
AND
any of the phrases within df1['Desk'] can be matched to any of the phrases within df2['Desk']
AND
any of the phrases within df1['Function'] can be matched to any of the phrases within df2['Function'].
I need the ultimate output to look something like this:
df2.ID  df1.Key   MATCH
A_02    Unique_1  Exact
        Unique_2  No match
A_01    Unique_3  Exact
        Unique_4  No match
I am really struggling with this. I have both dataframes, but I cannot work out how to loop through df1 and match its columns against df2 as specified above. I've tried merging the dataframes, using np.where, and brute-force looping, but nothing is working. The tricky bit is matching the Desk and Function columns.
Any ideas?
IIUC, one option is to use a cross merge:
def cross_match(df1, df2, col):
    df = df1.merge(df2, how="cross")
    colx, coly = [f"{col}_x", f"{col}_y"]
    df[[colx, coly]] = df[[colx, coly]].apply(lambda x: x.str.lower()
                                                         .str.split(r"\s*,\s*"))
    df["MATCH"] = (pd.Series([any(w in sent for w in lst)
                              for lst, sent in zip(df[f"{col}_x"], df[f"{col}_y"])])
                     .map({True: "Exact"}))
    return df.query("MATCH == 'Exact'")
desk, func = cross_match(df1, df2, "Desk"), cross_match(df1, df2, "Function")

out = (
    pd.merge(desk, func,
             left_on=["Region_x", "Entity_x", "ID"],
             right_on=["Region_y", "Entity_y", "ID"],
             suffixes=("", "_")).set_index("Key")
      .reindex(df1["Key"].unique())
      .fillna({"MATCH": "No match"})
      .reset_index()[["ID", "Key", "MATCH"]]
)
Disclaimer: this approach may get incredibly slow with huge datasets (df1, df2).
Output :
print(out)
ID Key MATCH
0 A_02 Unique_1 Exact
1 NaN Unique_2 No match
2 A_01 Unique_3 Exact
3 NaN Unique_4 No match
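For reference, a minimal reconstruction of the two sample frames (column names taken from the question's tables) makes the snippet above reproducible:

import pandas as pd

df1 = pd.DataFrame({
    "Region": ["US", "US", "EMEA", "JPN"],
    "Entity": ["AAA", "AAA", "ZZZ", "XYZ"],
    "Desk": ["Top class, Desk1, Mike's team", "team beta, Blue rats, Tom",
             "Delta one", "Admin"],
    "Function": ["Writing, advising", "task_a, task_d",
                 "Forecasts, month-end, Sales", "task1, task_b, task_g"],
    "Key": ["Unique_1", "Unique_2", "Unique_3", "Unique_4"],
})

df2 = pd.DataFrame({
    "Region": ["EMEA", "US", "US", "EMEA", "JPN"],
    "Entity": ["ZZZ", "AAA", "AAA", "DDD", "XXX"],
    "Desk": ["Equity, delta one", "Desk 1, A team, Top class",
             "Desk 2, Ninjas, 2nd team, Tom's team", "Equity, Private Equity",
             "Admin, Secretaries"],
    "Function": ["Sales, sweet talking, schmoozing",
                 "Writing,calling,listening, advising", "Secret, private",
                 "task1, task2, task3, task4", "task_a, task_b"],
    "ID": ["A_01", "A_02", "A_03", "A_04", "A_05"],
})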

How to separate entries, and count the occurrences

I'm trying to count which country most celebrities come from. However the csv that I'm working with has multiple countries for a single celeb. e.g. "France, US" for someone with a double nationality.
To count the above, I can use .count() for the entries in the "nationality" column. But, I want to count France, US and any other country separately.
I cannot figure out a way to separate all the entries in column and then, count the occurrences.
I want to be able to reorder my dataframe with these counts, so ideally I want to do the counting inside a structure like
data.groupby(by="nationality").count()
This returns some faulty counts, such as:
"France, US" 1
Assuming this type of data:
data = pd.DataFrame({'nationality': ['France','France, US', 'US', 'France']})
nationality
0 France
1 France, US
2 US
3 France
You need to split and explode, then use value_counts to get the sorted counts per country:
out = (data['nationality']
       .str.split(', ')
       .explode()
       .value_counts()
)
Output:
France 3
US 2
Name: nationality, dtype: int64
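If the goal is to reorder the original rows by these counts (as the question mentions), one hedged sketch is to map the per-country counts back onto each row; ranking a row by the highest count among its listed countries is just one possible interpretation, and the max_count column name is illustrative:

counts = (data['nationality']
          .str.split(', ')
          .explode()
          .value_counts())

# rank each row by the most common country it lists, then sort
data['max_count'] = (data['nationality']
                     .str.split(', ')
                     .apply(lambda lst: max(counts[c] for c in lst)))
print(data.sort_values('max_count', ascending=False))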

Divide rows in two columns with Pandas

I am using Pandas.
For each row, regardless of the County, I would like to divide "AcresBurned" by "CrewsInvolved".
For each County, I would like to sum the total AcresBurned for that County and divide by the sum of the total CrewsInvolved for that County.
I just started coding and am not able to solve this. Please help. Thank you so much.
Counties AcresBurned CrewsInvolved
1 400 2
2 500 3
3 600 5
1 800 9
2 850 8
This is very simple with Pandas. You could create a new column with these operations.
df['Acres_per_Crew'] = df['AcresBurned'] / df['CrewsInvolved']
You could use a groupby clause for viewing the sum of AcresBurned for a county.
df_gb = df.groupby('Counties')[['AcresBurned', 'CrewsInvolved']].sum().reset_index()
df_gb.columns = ['Counties', 'AcresBurnedPerCounty', 'CrewsInvolvedPerCounty']
df = df.merge(df_gb, on='Counties')
Once you've done this, you could create a new column with a similar arithmetic operation to divide AcresBurnedPerCounty by CrewsInvolvedPerCounty.
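For completeness, that last step might look like this (the new column name is just illustrative):

# per-county acres burned divided by per-county crews involved
df['AcresPerCrewPerCounty'] = df['AcresBurnedPerCounty'] / df['CrewsInvolvedPerCounty']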

Loops in Dataframe

I have 4 columns: Country, Year, GDP Annual Growth and Field Size in MM Barrels.
I am looking for a way to create a loop function that generates the mean GDP growth values over the 5 years following the discovery of a field ("Field Size MM Barrels"). Example: in 1961 a discovery was made in Algeria and its size is 2462. What is the average GDP annual growth value over the following 5 years (1962-1967)?
NaN refers to years where no discoveries were made in this case. I would like the loop to add the mean value each time in a column next to Field Size. Any idea how to do that?
Country,Year,GDP Annual Growth,Field_Size_MM_Barrels
Algeria,1961,-13.605441,2462.0
Algeria,1962,-19.685042,2413.0
Algeria,1963,34.313729,NaN
Algeria,1964,5.839413,NaN
Algeria,1965,6.206898,500.0
Yemen,2016,-13.621458,NaN
Yemen,2017,-5.942320,NaN
Yemen,2018,-2.701475,NaN
Divided Neutral Zone: Kuwait/Saudi Arabia,1963,NaN,832.0
Divided Neutral Zone: Kuwait/Saudi Arabia,1967,NaN,1566.0
# read in with
df = pd.read_clipboard(sep=',')
If you could include a sample of the dataframe (say, the first 20 rows), it would help with answering and testing. Here's a possible starting point:
# create a list for the average GDP values
average = []

# go over all rows of df
for row_id in range(len(df)):
    test = df.iloc[row_id]["Field_Size_MM_Barrels"]
    if pd.notna(test):  # a discovery was made in this year
        # positions of the rows covering the next 5 years (clipped at the end)
        row_list = list(range(row_id + 1, min(row_id + 6, len(df))))
        average.append(df["GDP Annual Growth"].iloc[row_list].mean())
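As a possibly faster alternative to the loop, here is a hedged sketch of a vectorized version, assuming the column names from the CSV sample above (e.g. Field_Size_MM_Barrels); the new column name is illustrative. Per country, it takes the mean GDP growth over the up-to-5 rows that follow each row, then keeps that value only for discovery years and stores it in a new column, as the question asks:

def mean_next_5(s):
    # forward-looking window: for each row, the mean of the next 5 values
    # (shorter near the end of a country's data because of min_periods=1)
    return s[::-1].rolling(5, min_periods=1).mean()[::-1].shift(-1)

df = df.sort_values(["Country", "Year"])
df["Mean_GDP_Growth_Next_5y"] = (
    df.groupby("Country")["GDP Annual Growth"]
      .transform(mean_next_5)
      .where(df["Field_Size_MM_Barrels"].notna())
)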

How to access value_counts() data in pandas?

When I do a count of values in pandas, how do I access the column name?
Consider the US Census dataset. I can count the number of counties in each state with:
df2["STNAME"].value_counts()
This returns a Series which looks like this:
Alabama 24
Alaska 23
Arizona 1
etc ...
Name: STNAME, dtype: int64
How do I access the state name (the STNAME), which I'm actually not sure is the index, since in SQL terms this is, I think, just a view on the data?
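For what it's worth, a minimal sketch (assuming the census frame is in df2, as in the question): the state names are the index labels of the Series that value_counts() returns, so they can be read back directly:

counts = df2["STNAME"].value_counts()

# the state names live in the index of the resulting Series
print(counts.index[:3].tolist())   # e.g. ['Alabama', 'Alaska', 'Arizona']
print(counts["Alabama"])           # count for one state, e.g. 24

# or turn it into a regular two-column DataFrame
counts_df = counts.rename_axis("STNAME").reset_index(name="county_count")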