How to group by the Borough column (the 5 NYC boroughs) and take the average of the total population in each borough - pandas

I need to create a box plot that shows the average population of each borough. I have the population of each zip code in each of the 5 boroughs. How can I get to my preferred result? Open the link to see my dataframe.

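A simple groupby gives the total population per borough:
df.groupby('Borough')['Population'].sum()
For the average across a borough's zip codes, replace .sum() with .mean().
If you want it by Borough and Zip_codes:
df.groupby(['Borough', 'Zip_codes'])['Population'].sum()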
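Since the question asks for the average rather than the total, here is a minimal sketch of the same groupby with mean(). The borough names, zip codes and population figures below are made up for illustration; only the column names Borough, Zip_codes and Population come from the snippets above.

```python
import pandas as pd

# Made-up sample: one row per zip code (values are illustrative only)
df = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Queens", "Queens", "Queens"],
    "Zip_codes": [10451, 10452, 11101, 11102, 11103],
    "Population": [100, 300, 200, 400, 600],
})

# Average zip-code population within each borough
avg_population = df.groupby("Borough")["Population"].mean()
print(avg_population)
```

For the plot itself, df.boxplot(column="Population", by="Borough") should draw one box per borough from the raw zip-code values, so the averaging step is only needed if you want the numbers as well.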

Related

How to separate entries, and count the occurrences

I'm trying to count which country most celebrities come from. However, the CSV I'm working with lists multiple countries for a single celebrity, e.g. "France, US" for someone with dual nationality.
To count the above, I can use .count() on the entries in the "nationality" column. But I want to count France, US and any other country separately.
I cannot figure out a way to separate all the entries in the column and then count the occurrences.
I want to be able to reorder my dataframe by these counts, so I want to do the counting inside this structure:
data.groupby(by="nationality").count()
This returns faulty counts such as
"France, US" 1
Assuming this type of data:
import pandas as pd
data = pd.DataFrame({'nationality': ['France','France, US', 'US', 'France']})
  nationality
0      France
1  France, US
2          US
3      France
You need to split and explode, then use value_counts to get the sorted counts per country:
out = (data['nationality']
       .str.split(', ')
       .explode()
       .value_counts()
      )
Output:
France    3
US        2
Name: nationality, dtype: int64

Return datapoints over polygons using Geopandas

I am trying to get the number of trips for each borough in NYC. I am using the NYC Taxi Fare dataset and have retrieved 1,500,000 data points. The problem is that the procedure below is very slow. Is there another way to calculate the number of trips for each borough? Thank you, I will appreciate any comment or idea.
count=0
results=[]
for index_boro, row_boro in boroughs_gpd.iterrows():
    count=0
    print(row_boro.boro_name)
    geom_boro = row_boro.geometry
    for index_points, row_points in gdf.iterrows():
        if (row_points.geometry.within(geom_boro)):
            count=count+1
    results.append((row_boro.boro_code,count))
    break
a = tuple(results)
a
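A common way to avoid the nested iterrows loops is a spatial join, which lets geopandas match points to polygons using a spatial index. This is a minimal sketch, not a tested solution for the taxi data: it assumes geopandas 0.10 or later (for the predicate keyword of sjoin), and it uses two toy square "boroughs" and four made-up points in place of the real boroughs_gpd and gdf. Only the names boroughs_gpd, gdf, boro_code and boro_name come from the question.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Toy stand-ins for the real data: two square "boroughs" and four points
boroughs_gpd = gpd.GeoDataFrame(
    {"boro_code": [1, 2], "boro_name": ["A", "B"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)
gdf = gpd.GeoDataFrame(
    geometry=[Point(0.5, 0.5), Point(0.2, 0.8), Point(1.5, 0.5), Point(5, 5)],
    crs="EPSG:4326",
)

# Keep each point that lies within a borough polygon, with that borough's columns attached
joined = gpd.sjoin(gdf, boroughs_gpd, how="inner", predicate="within")

# Number of matched points (trips) per borough
trips_per_borough = joined.groupby("boro_code").size()
print(trips_per_borough)
```

Points that fall outside every polygon, like the last one here, are dropped by the inner join. Both frames need to share the same CRS for the join to be meaningful.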

Unable to create new features in Machine learning

I have a dataset. I am using a pandas dataframe named df.
The dataset has 50,000 rows. Here are the first 5:
Name_Restaurant     cuisines_available           Average cost
Food Heart          Japnese, chinese             60$
Spice n Hungary     Indian, American, mexican    42$
kfc, Lukestreet     Thai, Japnese                29$
Brown bread shop    American                     11$
kfc, Hypert mall    Thai, Japnese                40$
I want to create a column that contains the number of cuisines available.
I am trying this code:
df['no._of_cuisines_available']=df['cuisines_available'].str.len()
Instead of showing the number of cuisines, it shows the number of characters.
For example, for the first row the output should be 2, but it shows 17.
I also need a new column that contains the number of stores for each restaurant. For example, here kfc has 2 stores: kfc, Lukestreet and kfc, Hypert mall. I have no idea how to code this.
i) Number of cuisines per row:
df['cuisines_available'].str.split(',').apply(len)
ii) Number of stores per name:
df['Name_Restaurant'].str.split(',', expand=True).melt()['value'].str.strip().value_counts()
What ii) does: it splits each name at ',' and stores each resulting piece in its own column. melt then stacks these into one long column, str.strip removes surrounding spaces, and value_counts counts the individual entries.
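To get both results as new columns on df, here is a small sketch on the five sample rows from the question. It assumes that the text before the first comma in Name_Restaurant is the restaurant name and the rest is the location; that is a guess from the kfc rows. The column names no_of_cuisines_available and no_of_stores are made up for illustration.

```python
import pandas as pd

# The five sample rows from the question
df = pd.DataFrame({
    "Name_Restaurant": ["Food Heart", "Spice n Hungary", "kfc, Lukestreet",
                        "Brown bread shop", "kfc, Hypert mall"],
    "cuisines_available": ["Japnese, chinese", "Indian, American, mexican",
                           "Thai, Japnese", "American", "Thai, Japnese"],
})

# Number of comma-separated cuisines in each row
df["no_of_cuisines_available"] = df["cuisines_available"].str.split(",").str.len()

# Assumption: the part before the first comma identifies the restaurant
brand = df["Name_Restaurant"].str.split(",").str[0].str.strip()

# Map each row's restaurant name to how often that name occurs
df["no_of_stores"] = brand.map(brand.value_counts())
print(df)
```

Unlike ii), this counts only the part before the comma, so location names such as Lukestreet do not show up as separate entries.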

Loops in Dataframe

I have 4 columns: Country, Year, GDP Annual Growth and Field Size in MM Barrels.
I am looking for a way to write a loop that computes the mean GDP growth over the 5 years following the discovery of a field ("Field Size MM Barrels"). Example: in 1961 a discovery was made in Algeria and its size is 2462. What is the average GDP annual growth over the following 5 years (1962-1967)?
NaN marks years in which no discovery was made. I would like the loop to add the mean value each time in a column next to Field Size. Any idea how to do that?
Country,Year,GDP Annual Growth,Field_Size_MM_Barrels
Algeria,1961,-13.605441,2462.0
Algeria,1962,-19.685042,2413.0
Algeria,1963,34.313729,NaN
Algeria,1964,5.839413,NaN
Algeria,1965,6.206898,500.0
Yemen,2016,-13.621458,NaN
Yemen,2017,-5.942320,NaN
Yemen,2018,-2.701475,NaN
Divided Neutral Zone: Kuwait/Saudi Arabia,1963,NaN,832.0
Divided Neutral Zone: Kuwait/Saudi Arabia,1967,NaN,1566.0
# read in with
df = pd.read_clipboard(sep=',')
If you could include a sample of the dataframe (say first 20 rows) then it will help answer/test answers. Here's a possible starting point:
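# create a list of average GDP growth values, one per row
average = []
# go over all rows in df (assumed sorted by Country and Year)
for row_id in range(len(df)):
    size = df.iloc[row_id]["Field_Size_MM_Barrels"]
    if pd.isna(size):
        # no discovery in this row, so there is nothing to average
        average.append(float("nan"))
        continue
    # positions of the next 5 rows, clipped at the end of the frame
    # (note: this does not stop at country boundaries)
    row_list = list(range(row_id + 1, min(row_id + 6, len(df))))
    average.append(df["GDP Annual Growth"].iloc[row_list].mean())
# "Mean_GDP_Next_5" is a hypothetical column name
df["Mean_GDP_Next_5"] = average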
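A loop over row positions can run across country boundaries and ignores gaps in the years. Here is a hedged sketch that works per country and selects by year instead, using the sample rows from the question. It reads "the 5 years following" as year+1 through year+5, which is an assumption (the 1962-1967 range in the question is ambiguous). The column name Mean_GDP_Next_5 is made up.

```python
import io
import pandas as pd

csv_text = """Country,Year,GDP Annual Growth,Field_Size_MM_Barrels
Algeria,1961,-13.605441,2462.0
Algeria,1962,-19.685042,2413.0
Algeria,1963,34.313729,NaN
Algeria,1964,5.839413,NaN
Algeria,1965,6.206898,500.0
Yemen,2016,-13.621458,NaN
Yemen,2017,-5.942320,NaN
Yemen,2018,-2.701475,NaN
Divided Neutral Zone: Kuwait/Saudi Arabia,1963,NaN,832.0
Divided Neutral Zone: Kuwait/Saudi Arabia,1967,NaN,1566.0
"""
df = pd.read_csv(io.StringIO(csv_text))

window = 5  # assumption: average over year+1 .. year+5
result = pd.Series(float("nan"), index=df.index)

for _, group in df.groupby("Country"):
    # GDP growth of this country, indexed by year so we can slice by label
    growth = group.sort_values("Year").set_index("Year")["GDP Annual Growth"]
    for idx, year, size in zip(group.index, group["Year"],
                               group["Field_Size_MM_Barrels"]):
        if pd.notna(size):
            # label-based slice: includes both end years when present
            result.loc[idx] = growth.loc[year + 1 : year + window].mean()

df["Mean_GDP_Next_5"] = result
print(df)
```

For the Algeria 1961 discovery this averages the 1962 to 1965 rows (1966 is not in the sample), giving about 6.67. Rows without a discovery, and discoveries with no GDP data in the window, stay NaN.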

VBA: loop through all the pivot fields of a pivot table and return specified values

I have a dataset whose entries have 5 different attributes and one value. For example, I have the heights of 5000 people. For each person I have their hair color, eye color, nationality, the city they were born in and their mother's name (the 5 dimensions).
No  Eye Color  Hair Color  Nationality  Hometown  Mother's Name  Height
1   Blue       Blond       Swiss        Zürich    Nicole         184
2   Blue       Brown       English      York      Ruby           164
3   Brown      Brown       French       Paris     Sophie         154
etc.
So there are 5 dimensions. The data is set dynamically, so the number of categories in each dimension can vary. I want to compute the average height of people depending on which dimensions I include (from 1 to 5). For example, I might want to retrieve:
The average height of French, blue-eyed people. The next day, only the people born in London. And the week after, the Swiss, blue-eyed, red-haired people born in Geneva whose mother is called Nicole.
So I created a pivot table with Eye Color as Row Labels, Hair Color as Column Labels, the average height as the Data and the last 3 dimensions as Report Filters. This lets me see all the possible combinations of average height that my data implies.
Now my goal is:
I want to create a macro that goes through all the possible combinations of my dimensions (i.e. 2^5-1=31) and stores in a vector all the height averages that are above a certain value, e.g. 190. It could then print them on a worksheet.
I was thinking of using boolean arrays and a For-Each-Next structure, but I must say that I fail to picture how to implement it.
Any ideas?
Thanks for the time and help!
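This is not VBA, but the enumeration itself can be sketched in pandas to show the logic: take every non-empty subset of the 5 dimensions, group by it, and keep the groups whose average height exceeds a threshold. The three rows are the sample from the question. The threshold of 170 is made up so that the tiny sample produces some hits (the question uses 190), and the names people, threshold and hits are illustrative.

```python
from itertools import combinations
import pandas as pd

# The three sample rows from the question
people = pd.DataFrame({
    "Eye Color": ["Blue", "Blue", "Brown"],
    "Hair Color": ["Blond", "Brown", "Brown"],
    "Nationality": ["Swiss", "English", "French"],
    "Hometown": ["Zürich", "York", "Paris"],
    "Mother's Name": ["Nicole", "Ruby", "Sophie"],
    "Height": [184, 164, 154],
})
dimensions = [c for c in people.columns if c != "Height"]
threshold = 170  # made-up cut-off for this tiny sample; the question uses 190

hits = []       # (filter values, average height) pairs above the threshold
n_subsets = 0   # how many dimension subsets were visited
for k in range(1, len(dimensions) + 1):
    for dims in combinations(dimensions, k):
        n_subsets += 1
        means = people.groupby(list(dims))["Height"].mean()
        for key, value in means[means > threshold].items():
            # a single-column group key may be a scalar; normalise to a tuple
            key = key if isinstance(key, tuple) else (key,)
            hits.append((dict(zip(dims, key)), value))

print(n_subsets)  # 2**5 - 1 subsets of the 5 dimensions
for filters, value in hits:
    print(filters, value)
```

In a macro the same structure would be two nested loops: an outer one over the 31 subsets of fields and an inner one over the value combinations that actually occur for that subset.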