Return datapoints over polygons using Geopandas - data-science

I am trying to get the number of trips for each borough in NYC. I am using the NYC Taxi Fare dataset and I have retrieved 1,500,000 data points. The problem is that the procedure below is very, very slow. Are there other procedures to calculate these values, i.e. the number of trips for each borough? Thank you, I will appreciate any comment or idea.
count = 0
results = []
for index_boro, row_boro in boroughs_gpd.iterrows():
    count = 0
    print(row_boro.boro_name)
    geom_boro = row_boro.geometry
    for index_points, row_points in gdf.iterrows():
        if row_points.geometry.within(geom_boro):
            count = count + 1
    results.append((row_boro.boro_code, count))
a = tuple(results)
a
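A nested iterrows() loop runs a point-in-polygon test for every point against every borough, which is why it is so slow. A much faster alternative is usually a spatial join; a minimal sketch, assuming gdf holds the trip points and boroughs_gpd the borough polygons in the same CRS, might look like this:
import geopandas as gpd

# Assumption: gdf (trip points) and boroughs_gpd (borough polygons) share the same CRS.
# sjoin attaches the attributes of the containing borough to each point
# (older GeoPandas versions use op="within" instead of predicate="within").
joined = gpd.sjoin(gdf, boroughs_gpd, how="inner", predicate="within")

# Count trips per borough in one vectorised step.
trips_per_borough = joined.groupby("boro_name").size()
print(trips_per_borough)
sjoin builds a spatial index internally, so it scales far better than pairwise within() checks over 1,500,000 points.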

Related

How to group the Borough column by each of the 5 boroughs in NYC and take the average of the total population in each borough

I need to create a box plot which has the average population of each borough. I have the population of each of the zip codes in each of the 5 boroughs. How can I get to my preferred result? Open the link to see my dataframe.
A simple groupby:
df.groupby('Borough')['Population'].sum()
If you want by Borough and Zip_codes:
df.groupby(['Borough', 'Zip_codes'])['Population'].sum()
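If the end goal is the box plot itself, one possible sketch is a box plot of the zip-code populations within each borough, assuming the columns are named Borough and Population as above and there is one row per zip code:
import matplotlib.pyplot as plt

# Assumption: df has one row per zip code, with 'Borough' and 'Population' columns.
df.boxplot(column='Population', by='Borough')
plt.ylabel('Population')
plt.show()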

Trying to find the highest improvement in score for a given city

I tried grouping it by city, date and score. Then sorting by city and score and finally applying linear regression to calculate the highest slopes.
Code below
from scipy.stats import linregress
result = data.groupby(['facility_city', 'score', 'activity_date'], as_index=False).sum()
result = result.sort_values(by=['facility_city', 'activity_date'], ascending=True)
print(result)
result = result.apply(lambda v: linregress(v['activity_date'], v['score'])[0])
print(result)
An example of the data is
"activity_date","employee_id","facility_address","facility_city","facility_id","facility_name","facility_state","facility_zip","grade","owner_id","owner_name","pe_description","program_element_pe","program_name","program_status","record_id","score","serial_number","service_code","service_description"
2017-05-09,"EE0000593","17660 CHATSWORTH ST","GRANADA HILLS","FA0175397","HOVIK'S FAMOUS MEAT & DELI","CA","91344","A","OW0181955","JOHN'S FAMOUS MEAT & DELI INC.","FOOD MKT RETAIL (25-1,999 SF) HIGH RISK",1612,"HOVIK'S FAMOUS MEAT & DELI","ACTIVE","PR0168541",98,"DAHDRUQZO",1,"ROUTINE INSPECTION"
2017-04-10,"EE0000126","3615 PACIFIC COAST HWY","TORRANCE","FA0242138","SHAKEY'S PIZZA","CA","90505","A","OW0237843","SCO, LLC","RESTAURANT (61-150) SEATS HIGH RISK",1638,"SHAKEY'S PIZZA","ACTIVE","PR0190290",94,"DAL3SBUE0",1,"ROUTINE INSPECTION"
2017-04-04,"EE0000593","17515 CHATSWORTH ST","GRANADA HILLS","FA0007801","BAITH AL HALAL","CA","91344","A","OW0031150","SABIR MOHAMMAD SHAHID","FOOD MKT RETAIL (25-1,999 SF) HIGH RISK",1612,"BAITH AL HALAL","INACTIVE","PR0036723",95,"DAL2PIKJU",1,"ROUTINE INSPECTION"
2017-08-15,"EE0000971","44455 VALLEY CENTRAL WAY","LANCASTER","FA0013858","FOOD 4 LESS #306","CA","93536","A","OW0012108","FOOD 4 LESS, INC.","RESTAURANT (0-30) SEATS HIGH RISK",1632,"FOOD 4 LESS DELI/BAKERY#306","ACTIVE","PR0039905",98,"DA0ZMAJXZ",1,"ROUTINE INSPECTION"
2016-09-26,"EE0000145","11700 SOUTH ST","ARTESIA","FA0179671","PHO LITTLE SAIGON","CA","90701","A","OW0185167","PHO SOUTH ST INC","RESTAURANT (61-150) SEATS HIGH RISK",1638,"PHO LITTLE SAIGON","ACTIVE","PR0173311",96,"DA41DBXA2",1,"ROUTINE INSPECTION"
2016-05-11,"EE0000720","1309 S HOOVER ST","LOS ANGELES","FA0179745","HAPPY TACOS TO GO","CA","90006-4903","A","OW0185239","MAT L. MORA","RESTAURANT (0-30) SEATS HIGH RISK",1632,"HAPPY TACOS TO GO","INACTIVE","PR0173403",96,"DAURQTTVR",1,"ROUTINE INSPECTION"
2017-02-28,"EE0000741","4959 PATATA ST","CUDAHY","FA0012590","EL POTRERO CLUB","CA","90201","B","OW0036634","TSAY, SHYR JIN","RESTAURANT (151 + ) SEATS HIGH RISK",1641,"EL POTRERO CLUB","ACTIVE","PR0041708",87,"DAUNXDSVP",1,"ROUTINE INSPECTION"
I am getting an error, KeyError: 'activity_date', when I try to apply the linear regression. Any tips would be welcome.
You need to groupby again before using apply the way you did. Also, linear regression does not work with dates directly; you'll need to convert them to numerical values first (see this post).
Change your last result assignment to:
result = (result.groupby('facility_city')
          .apply(lambda x: linregress(x.activity_date.map(dt.datetime.toordinal), x.score)[0])
          )
Your full code should look like:
import pandas as pd
import datetime as dt
from scipy.stats import linregress
# To make sure activity_date is a Timestamp:
data['activity_date'] = pd.to_datetime(data.activity_date)
# Sort cities and dates
result = data.sort_values(['facility_city', 'activity_date'])
# Calculate linear regressions and retrieve their slopes
result = (result.groupby('facility_city')
          .apply(lambda x: linregress(x.activity_date.map(dt.datetime.toordinal), x.score)[0])
          )
# Show highest slopes first
result.sort_values(ascending=False)

How to check the highest score among specific columns and compute the average in pandas?

Help with homework problem: "Let us define the "data science experience" of a given person as the person's largest score among Regression, Classification, and Clustering. Compute the average data science experience among all MSIS students."
Beginner to coding. I am trying to figure out how to check amongst columns and compare those columns to each other for the largest value. And then take the average of those found values.
I greatly appreciate your help in advance!
Picture of the sample data set: https://i.stack.imgur.com/9OSjz.png
Provided Code:
import pandas as pd
df = pd.read_csv("cleaned_survey.csv", index_col=0)
df.drop(['ProgSkills','Languages','Expert'],axis=1,inplace=True)
What I have tried so far:
df['data_science_experience'] = df[["Regression","Classification","Clustering"]].values.max()
df['z'] = df[['Regression','Classification','Clustering']].apply(np.max, axis=1)
df['data_science_experience'] = df[["Regression","Classification","Clustering"]].apply(np.max, axis=1)
If you want to get the highest score of column 'hw1' you can get it with:
df['hw1'].max()
df['hw1'] gives you a Series of all the values in that column, and max returns the maximum. For the average, use mean:
df['hw1'].mean()
If you want to find the maximum of multiple columns, you can use:
maximum_list = list()
for col in df.columns:
    maximum_list.append(df[col].max())
highest = max(maximum_list)
avg = sum(maximum_list) / len(maximum_list)
Hope this helps.
First, you want to get only the rows with MSIS in the Program column. That can be done in the following way:
df[df['Program'] == 'MSIS']
Next, you want to get only the Regression, Classification and Clustering columns. The previous query filtered only rows; we can add to that, like this:
df.loc[df['Program'] == 'MSIS', ['Regression', 'Classification', 'Clustering']]
Now, for each row remaining, we want to take the maximum. That can be done by appending .max(axis=1) to the previous line (axis=1 because we want the maximum of each row, not each column).
At this point, we should have a Series where each entry is the highest score of the three categories for one student. Now, all that's left to do is take the mean, which can be done with .mean(). The full code should therefore look like this:
df.loc[df['Program'] == 'MSIS', ['Regression', 'Classification', 'Clustering']].max(axis=1).mean()

Creating similar samples based on three different categorical variables

I am doing an analysis where I am trying to create two similar samples based on three different attributes. I want to create these samples first and then do the analysis to see which of the two samples is better. The categorical variables are sales_group, age_group, and country. So I want to make both samples such that the proportion of countries, age, and sales is similar in both samples.
For example: Sample A and B have following variables in it:
Id Country Age Sales
The proportion of Country in Sample A is:
USA- 58%
UK- 22%
India-8%
France- 6%
Germany- 6%
The proportion of country in Sample B is:
India- 42%
UK- 36%
USA-12%
France-3%
Germany- 5%
The same goes for the other categorical variables: age_group and sales_group.
Thanks in advance for the help.
You do not need to set up a special sampling procedure, because the sample proportion is an unbiased estimator of the population proportion. If you have, say, more than 1,000 observations and each sample contains more than, let us say, 30 rows, the estimates will be quite accurate (Central Limit Theorem).
You can see it in the simulation below:
set.seed(123)
n <- 10000 # Number of rows in the source data frame
df <- data.frame(sales_group = sample(LETTERS[1:4], n, replace = TRUE),
                 age_group = sample(c("old", "young"), n, replace = TRUE),
                 country = sample(c("USA", "UK", "India", "France", "Germany"), n, replace = TRUE),
                 amount = abs(100 * rnorm(n)))
s <- 100 # Number of sampled rows
sampleA <- df[sample(nrow(df), s), ]
sampleB <- df[sample(nrow(df), s), ]
table(sampleA$sales_group)
#  A  B  C  D
# 23 22 32 23
table(sampleB$sales_group)
#  A  B  C  D
# 25 22 28 25
DISCLAIMER: However, if you have some very small or very large proportions and too few samples, you will need to use more advanced procedures such as Laplace smoothing.
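For readers working in pandas rather than R, a rough equivalent of the simulation above might look like this (the column names mirror the question; the data are synthetic):
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
n = 10000  # number of rows in the source data frame

df = pd.DataFrame({
    'sales_group': rng.choice(list('ABCD'), n),
    'age_group': rng.choice(['old', 'young'], n),
    'country': rng.choice(['USA', 'UK', 'India', 'France', 'Germany'], n),
    'amount': np.abs(100 * rng.normal(size=n)),
})

s = 100  # number of sampled rows
sampleA = df.sample(s, random_state=1)
sampleB = df.sample(s, random_state=2)

# Compare the country proportions of the two random samples.
print(sampleA['country'].value_counts(normalize=True))
print(sampleB['country'].value_counts(normalize=True))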

error in dividing 2 pandas series with decimal values (daily stock price)

I am trying to divide 2 pandas columns (the same column divided by a copy of itself shifted by one cell), but I am getting an error. This is surprising, as I have done such computations many times before on time series data and never encountered this issue. Can someone suggest what is going on here? I am computing the daily returns of the Adj Close price of a stock, so I need the answer as a decimal.
I think you need to convert the column to float first, because its dtype is object, which means the values are actually strings:
z = x.astype(float) / y.astype(float)
Or:
data['Adj Close'] = data['Adj Close'].astype(float)
z = data['Adj Close'].shift(-1) / data['Adj Close']
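Putting the conversion and the return calculation together, a minimal sketch might look like this (assuming the price column is named 'Adj Close'; pct_change is just a built-in shortcut for the shifted division, using the usual current-over-previous convention for daily returns):
import pandas as pd

# Assumption: data['Adj Close'] was read in as strings (dtype object).
data['Adj Close'] = pd.to_numeric(data['Adj Close'], errors='coerce')

# Daily return: today's price divided by the previous day's price, minus 1.
daily_returns = data['Adj Close'] / data['Adj Close'].shift(1) - 1

# Equivalent built-in shortcut:
daily_returns = data['Adj Close'].pct_change()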