How to access value_counts() data in pandas?

When I do a count of values in pandas, how do I access a column name?
Consider the US Census dataset. I can count the number of counties in each state with:
df2["STNAME"].value_counts()
Which returns a series which looks like this:
Alabama 24
Alaska 23
Arizona 1
etc ...
Name: STNAME, dtype: int64
How do I access the state name (the STNAME)? I'm not actually sure it is the index, since in SQL terms this is, I think, just a view on the data.
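For what it's worth, value_counts() returns a Series whose index holds the distinct values, so the state names can be read from .index. A minimal sketch on made-up rows (the toy df2 below is an assumption standing in for the real Census file, and the names counts and counts_df are only illustrative):

```python
import pandas as pd

# Toy stand-in for the Census data; these rows are made up for illustration.
df2 = pd.DataFrame(
    {"STNAME": ["Alabama", "Alabama", "Alaska", "Arizona", "Alabama", "Alaska"]}
)

counts = df2["STNAME"].value_counts()

# The state names are the index of the resulting Series.
state_names = counts.index.tolist()

# value_counts sorts by count in descending order, so the first label is the most frequent.
top_state = counts.index[0]

# A count can be looked up by its label.
alabama_count = counts["Alabama"]

# Convert the Series into an ordinary two-column DataFrame.
counts_df = counts.rename_axis("STNAME").reset_index(name="count")
```

After this, counts_df has the state name as a regular column, which may be easier to work with if you think in SQL terms.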

Related

How to separate entries, and count the occurrences

I'm trying to count which country most celebrities come from. However the csv that I'm working with has multiple countries for a single celeb. e.g. "France, US" for someone with a double nationality.
To count the above, I can use .count() for the entries in the "nationality" column. But, I want to count France, US and any other country separately.
I cannot figure out a way to separate all the entries in column and then, count the occurrences.
I want to be able to reorder my dataframe with these counts, so I want to count this inside the structure
data.groupby(by="nationality").count()
This returns faulty counts such as:
"France, US" 1
Assuming this type of data:
data = pd.DataFrame({'nationality': ['France','France, US', 'US', 'France']})
nationality
0 France
1 France, US
2 US
3 France
You need to split and explode, then use value_counts to get the sorted counts per country:
out = (data['nationality']
       .str.split(', ')   # "France, US" -> ['France', 'US']
       .explode()         # one row per country
       .value_counts()    # sorted counts per country
      )
Output:
France 3
US 2
Name: nationality, dtype: int64

Drop duplicates based on 2 columns if the value in another column is null - Pandas

If I have a dataframe
Index City Country State
0 Chicago US IL
1 Sacramento US CA
2 Sacramento US
3 Naperville US IL
I want to find rows with duplicate values for 'City' and 'Country' but only drop the row with no entry for 'State',
i.e. drop row #2.
What is the best way to approach this using Pandas?
Use a boolean mask to get the index of the rows to delete, then use drop to remove these rows with inplace=True as an argument:
df.drop(df.loc[(df.duplicated(['City','Country'])
                & df['State'].isna())].index, inplace=True)
print(df)
# Output:
City Country State
0 Chicago US IL
1 Sacramento US CA
3 Naperville US IL
Note: the answer of @QuangHoang is the opposite of this one. I drop the bad rows; he keeps the good rows. Honestly, I prefer his method.
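A rough sketch of the "keep the good rows" style, for comparison. This is not the other answer's exact code. It assumes the missing State is stored as NaN rather than an empty string, and it passes keep=False so that every member of a duplicate group is flagged, which makes the result independent of row order:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "City": ["Chicago", "Sacramento", "Sacramento", "Naperville"],
    "Country": ["US", "US", "US", "US"],
    "State": ["IL", "CA", np.nan, "IL"],  # assumption: the blank State is NaN
})

# keep=False marks all rows of a duplicated (City, Country) pair, not only the later ones.
is_dup = df.duplicated(["City", "Country"], keep=False)

# Keep every row except duplicated ones whose State is missing.
result = df[~(is_dup & df["State"].isna())]
```

Because nothing is modified in place, the original df stays available if the filter needs adjusting.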

Applying a function to list of columns of a dataframe?

I scraped this table from this URL:
"https://www.patriotsoftware.com/blog/accounting/average-cost-living-by-state/"
Which looks like this:
        State  Annual Mean Wage (All Occupations)  Median Monthly Rent  Value of a Dollar
0     Alabama                             $44,930                 $998              $1.15
1      Alaska                             $59,290               $1,748              $0.95
2     Arizona                             $50,930               $1,356              $1.04
3    Arkansas                             $42,690                 $953              $1.15
4  California                             $61,290               $2,518              $0.87
And then I wrote this function to help me turn the strings into ints:
def money_string_to_int(s):
    return int(s.replace(",", "").replace("$",""))

money_string_to_int("$1,23")
My function works when I apply it to only one column. I found this answer about using it on multiple columns: How to apply a function to multiple columns in Pandas
But my code below does not work and produces no errors:
ls = ['Annual Mean Wage (All Occupations)', 'Median Monthly Rent',
'Value of a Dollar']
ppe_table[ls] = ppe_table[ls].apply(money_string_to_int)
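Let's try the following (regex=True is needed on recent pandas versions, where str.replace treats the pattern as a literal string by default):
df.set_index('State').apply(lambda x: x.str.replace('[$,]', '', regex=True).astype(float)).reset_index()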
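As for why the original attempt misbehaves: DataFrame.apply hands the function a whole column (a Series) at a time, not one cell, so a helper written for a single string has to be mapped over each column. A minimal sketch, assuming a two-row excerpt of the table and a hypothetical float variant of the helper (float because "$1.15" cannot be parsed as an int):

```python
import pandas as pd

# Two rows copied from the question's table, as an excerpt.
ppe_table = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "Annual Mean Wage (All Occupations)": ["$44,930", "$59,290"],
    "Median Monthly Rent": ["$998", "$1,748"],
    "Value of a Dollar": ["$1.15", "$0.95"],
})

def money_string_to_float(s):
    # Hypothetical variant of the question's helper: strips "," and "$" from one string.
    return float(s.replace(",", "").replace("$", ""))

ls = ["Annual Mean Wage (All Occupations)", "Median Monthly Rent",
      "Value of a Dollar"]

# apply passes each column as a Series; Series.map then calls the helper cell by cell.
ppe_table[ls] = ppe_table[ls].apply(lambda col: col.map(money_string_to_float))
```

The vectorised str.replace version is likely faster on large tables; this form is mainly useful when the per-cell helper already exists.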

Unable to create new features in Machine learning

I have a dataset. I am using pandas dataframe and named it df.
The dataset has 50,000 rows. Here are the first 5:
Name_Restaurant     cuisines_available           Average cost
Food Heart          Japnese, chinese             60$
Spice n Hungary     Indian, American, mexican    42$
kfc, Lukestreet     Thai, Japnese                29$
Brown bread shop    American                     11$
kfc, Hypert mall    Thai, Japnese                40$
I want to create a column which contains the number of cuisines available.
I am trying this code:
df['no._of_cuisines_available']=df['cuisines_available'].str.len()
Instead of showing the number of cuisines, it shows the number of characters.
For example, for the first row the output should be 2, but it shows 17.
I also need a new column that contains the number of stores for each restaurant. For example,
here kfc has 2 stores: "kfc, Lukestreet" and "kfc, Hypert mall". I have absolutely
no idea how to code this.
i)
df['cuisines_available'].str.split(',').apply(len)
ii)
df['Name_Restaurant'].str.split(',', expand=True).melt()['value'].str.strip().value_counts()
What ii) does: split the names at ',' and store each resulting string in its own column. Then use melt to stack them into one long column, strip the surrounding spaces, and count the individual entries.
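Both ideas can be written back onto the dataframe as new columns. A minimal sketch on the five sample rows, assuming the chain name is whatever precedes the first comma in Name_Restaurant (the column names no_of_cuisines, chain and no_of_stores are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Name_Restaurant": ["Food Heart", "Spice n Hungary", "kfc, Lukestreet",
                        "Brown bread shop", "kfc, Hypert mall"],
    "cuisines_available": ["Japnese, chinese", "Indian, American, mexican",
                           "Thai, Japnese", "American", "Thai, Japnese"],
})

# Number of cuisines: split on commas, then take the length of each resulting list.
df["no_of_cuisines"] = df["cuisines_available"].str.split(",").str.len()

# Assumption: the part before the first comma identifies the chain.
df["chain"] = df["Name_Restaurant"].str.split(",").str[0].str.strip()

# Count rows per chain and map that count back onto every row.
df["no_of_stores"] = df["chain"].map(df["chain"].value_counts())
```

If some restaurant names contain a comma for other reasons, the chain assumption breaks and the split rule would need adjusting.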

Replacing NaN values with group mean

I have a dataframe made of countries, years and many other features. There are many years for a single country:
country  year  population  ... etc.
1        2000  5000
1        2001  NaN
1        2002  4800
2        2000
There are many NaN values in the dataframe.
I want to replace each NaN for a specific country, in every column, with that country's average for the column.
For example, for the NaN in the population column corresponding to country 1, year 2001, I want to use the average population for country 1 over all the years = (5000+4800)/2.
I am using the groupby().mean() method to find the means for each country, but I am running into the following difficulties:
1- Some means come out as NaN when I know for sure there is a value. Why is that?
2- How can I access specific values in the groupby result? In other words, how can I replace every NaN with its correct average?
Thanks a lot.
Use combine_first with the groupby mean:
df.combine_first(df.groupby('country').transform('mean'))
Or
df.fillna(df.groupby('country').transform('mean'))
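A small worked example of the fillna variant on the question's sample rows. It assumes the blank population for country 2 is NaN and restricts the transform to the population column for brevity; group_means and filled are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "country": [1, 1, 1, 2],
    "year": [2000, 2001, 2002, 2000],
    "population": [5000, np.nan, 4800, np.nan],  # assumption: blanks are NaN
})

# transform("mean") returns one value per original row: the mean of that row's country.
group_means = df.groupby("country")[["population"]].transform("mean")

# fillna with a DataFrame fills each NaN from the cell with the same index and column.
# Neither call modifies df in place, so the result has to be assigned.
filled = df.fillna(group_means)
```

Here the 2001 value for country 1 becomes (5000 + 4800) / 2 = 4900, while country 2 stays NaN because it has no observed population to average.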