Are there any attributes to weight the edges of the network (roadways) based on their capacity? For example, the number of lanes for each roadway or the maximum capacity of driveways.
The lanes tag is often available: https://wiki.openstreetmap.org/wiki/Key:lanes
For example, from this tutorial: https://automating-gis-processes.github.io/site/notebooks/L6/network-analysis.html
# Retrieve only edges from the graph
edges = ox.graph_to_gdfs(graph, nodes=False, edges=True)
# Check columns
edges.columns
Index(['u', 'v', 'key', 'osmid', 'oneway', 'lanes', 'name', 'highway',
       'maxspeed', 'length', 'geometry', 'junction', 'bridge', 'access'],
      dtype='object')
print(edges['lanes'].value_counts())
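Raw lanes values are not always clean numbers: they can be strings, lists of strings (when several OSM ways were merged into one edge), or missing. The following is a minimal sketch, in plain Python, of turning such values into a numeric weight. The helper names parse_lanes and capacity_weight, the default of one lane, and the length / lanes formula are all assumptions made for illustration, not part of OSMnx or a standard capacity measure.

```python
import math

def parse_lanes(value, default=1):
    """Return a lane count from a raw OSM-style `lanes` value (hypothetical helper)."""
    # Missing values (None or NaN) fall back to the assumed default.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return default
    # Merged edges may carry several values; take the narrowest section.
    if isinstance(value, (list, tuple)):
        counts = [parse_lanes(v, default) for v in value]
        return min(counts) if counts else default
    try:
        return max(1, int(float(value)))
    except (TypeError, ValueError, OverflowError):
        # Unparseable text falls back to the default.
        return default

def capacity_weight(length_m, lanes_value):
    """Assumed heuristic: more lanes make an edge 'cheaper' per metre."""
    return length_m / parse_lanes(lanes_value)

# Toy edges standing in for rows of the edges GeoDataFrame.
edges = [
    {"u": 1, "v": 2, "length": 300.0, "lanes": "3"},
    {"u": 2, "v": 3, "length": 300.0, "lanes": ["2", "3"]},
    {"u": 3, "v": 4, "length": 300.0, "lanes": float("nan")},
]
for edge in edges:
    edge["weight"] = capacity_weight(edge["length"], edge["lanes"])

print([edge["weight"] for edge in edges])
```

The resulting weight could then be stored as an edge attribute and used in place of length in a shortest-path search; whether that is a sensible proxy for capacity depends on the study.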
I'm trying to identify a suitable bandwidth to use for a geographically weighted regression, but every time I search for the bandwidth it reports that there are missing (NaN) values within the arrays of the dataset, even though every row has a value in every column.
g_y = df_ct2008xy['2008 HP'].values.reshape((-1,1))
g_X = df_ct2008xy[['2008 AF', '2008 MI', '2008 MP', '2008 EB']].values
u = df_ct2008xy['X']
v = df_ct2008xy['Y']
g_coords = list(zip(u,v))
g_X = (g_X - g_X.mean(axis=0)) / g_X.std(axis=0)
g_y = g_y.reshape((-1,1))
g_y = (g_y - g_y.mean(axis=0)) / g_y.std(axis=0)
bw = mgwr.sel_bw.Sel_BW(g_coords,
                        g_y,  # Dependent variable
                        g_X,  # Independent variables
                        fixed=True,  # True for fixed bandwidth, False for adaptive bandwidth
                        spherical=True)  # True for spherical (long-lat) coordinates, False for projected
I searched using numpy to identify if these were individual values using
np.isnan(g_y).any()
and
np.isnan(g_X)
but apparently every value is reported as missing (the checks return True).
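One possible cause, offered as an assumption because the data itself is not shown: if a column contains even a single NaN before standardizing, its mean and standard deviation are NaN, so (x - mean) / std turns every value in that column into NaN. NumPy's mean and std propagate NaN in the same way. The sketch below reproduces the effect in plain Python; the standardize helper is hypothetical and only mirrors the arithmetic used above.

```python
import math

def standardize(column):
    """Z-score a list of floats, mirroring (x - mean) / std."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column]

clean = [1.0, 2.0, 3.0]
one_missing = [1.0, float("nan"), 3.0]  # a single NaN in the raw data

print(standardize(clean))        # finite z-scores
print(standardize(one_missing))  # every entry is NaN
```

If this is what is happening, checking the raw columns before standardizing (for example with df_ct2008xy.isna().sum()) should show where the original gaps are. A column with zero variance would have a similar effect in NumPy, since 0 / 0 also yields NaN.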
I have a list of company names, but these have misspellings and variations. How best can I fix this so every company follows a consistent naming convention (for later groupby, sort_values, etc.)?
pd.DataFrame({'Company': ['Disney','Dinsey', 'Walt Disney','General Motors','General Motor','GM','GE','General Electric','J.P. Morgan','JP Morgan']})
One good hint: FuzzyWuzzy library. "Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package."
Example:
from fuzzywuzzy import process
from fuzzywuzzy import fuzz
str2Match = "apple inc"
strOptions = ["Apple Inc.","apple park","apple incorporated"]
Ratios = process.extract(str2Match,strOptions)
print(Ratios)
# You can also select the string with the highest matching percentage
highest = process.extractOne(str2Match,strOptions)
print(highest)
output:
[('Apple Inc.', 100), ('apple incorporated', 90), ('apple park', 67)]
('Apple Inc.', 100)
Now you just have to create a list with the "right names" and run all the variations against it, so you can pick the best-scoring match for each variation and replace it in your dataset.
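The step described above can be sketched as follows. To keep the example self-contained it uses difflib from the Python standard library instead of FuzzyWuzzy, so the scores are not the same as FuzzyWuzzy's. The canonical list, the alias table for abbreviations (which no string-similarity score will resolve), and the 0.6 cutoff are all assumptions chosen for this toy data.

```python
from difflib import get_close_matches

# Assumed list of "right names" and a hand-made table for abbreviations.
CANONICAL = ["Disney", "General Motors", "General Electric", "JP Morgan"]
ALIASES = {"gm": "General Motors", "ge": "General Electric"}

def normalize(name, canonical=CANONICAL, aliases=ALIASES, cutoff=0.6):
    """Map a raw company name to its canonical spelling, if a close one exists."""
    key = name.strip().lower()
    if key in aliases:
        return aliases[key]
    # Compare case-insensitively, then map back to the canonical spelling.
    lookup = {c.lower(): c for c in canonical}
    match = get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    # Leave the name untouched when nothing is similar enough.
    return lookup[match[0]] if match else name

companies = ["Disney", "Dinsey", "Walt Disney", "General Motors", "General Motor",
             "GM", "GE", "General Electric", "J.P. Morgan", "JP Morgan"]
cleaned = [normalize(c) for c in companies]
print(cleaned)
```

With a pandas DataFrame the same function could be applied as df['Company'].map(normalize). The cutoff is worth tuning on real data, since a low value merges distinct companies and a high value misses misspellings.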
How can I easily compare the distributions of multiple cohorts?
Usually, https://seaborn.pydata.org/generated/seaborn.distplot.html would be a great tool to visually compare distributions. However, due to the size of my dataset, I needed to compress it and only keep the counts.
It was created as:
SELECT age, gender, compress_distributionUDF(collect_list(struct(target_y_n, count, distribution_value))) GROUP BY age, gender
where compress_distributionUDF simply takes a list of tuples and returns the counts per group.
This leaves me with a list of
Row(distribution_value=60.0, count=314251, target_y_n=0)
nested inside a pandas.Series, with one such list per cohort.
Basically, it is similar to:
pd.DataFrame({'foo':[1,2], 'bar':['first', 'second'], 'baz':[{'target_y_n': 0, 'value': 0.5, 'count':1000},{'target_y_n': 1, 'value': 1, 'count':10000}]})
and I wonder how to compare distributions:
within a cohort 0 vs. 1 of target_y_n
over multiple cohorts
in a way which is visually still understandable and not only a mess.
edit
For a single cohort, "Plotting pre aggregated data in python" could be the answer, but how can multiple cohorts be compared (not just in a loop), given that a loop leads to too many plots to compare?
I am still quite confused, but we can start from this and see where it goes. From your example, I am focusing on baz, as it is not clear to me what foo and bar are (I assume cohorts).
So let's focus on baz and plot the different distributions according to target_y_n.
import matplotlib.pyplot as plt
import seaborn as sns
# df is assumed to hold the flattened baz data with columns value, count and target_y_n
sns.catplot(x='value', y='count', data=df, kind='bar', hue='target_y_n', dodge=False, ci=None)
sns.catplot(x='value', y='count', data=df, kind='box', hue='target_y_n', dodge=False)
plt.bar(df[df['target_y_n']==0]['value'],df[df['target_y_n']==0]['count'],width=1)
plt.bar(df[df['target_y_n']==1]['value'],df[df['target_y_n']==1]['count'],width=1)
plt.legend(['Target=0','Target=1'])
sns.barplot(x='value', y='count', data=df, hue='target_y_n', dodge=False, ci=None)
Finally, have a look at seaborn's FacetGrid class to extend your comparison.
g=sns.FacetGrid(df,col='target_y_n',hue = 'target_y_n')
g=g.map(sns.barplot,'value','count',ci=None)
In your case you would have something like:
g=sns.FacetGrid(df,col='target_y_n',row='cohort',hue = 'target_y_n')
g=g.map(sns.barplot,'value','count',ci=None)
And a qqplot option:
import matplotlib.pyplot as plt
from scipy import stats
def qqplot(x, y, **kwargs):
    # Scatter the ordered values of x against the ordered values of y
    _, xr = stats.probplot(x, fit=False)
    _, yr = stats.probplot(y, fit=False)
    plt.scatter(xr, yr, **kwargs)
g=sns.FacetGrid(df,col='cohort',hue = 'target_y_n')
g=g.map(qqplot,'value','count')
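The plotting calls above work on a flat table with one row per bin and columns such as cohort, target_y_n, value and count, whereas the question describes bins nested inside each cohort. Below is a minimal sketch of getting from one to the other in plain Python. The input layout, the key names cohort and bins, and the extra share column are assumptions made for illustration. The share column rescales counts to proportions within each (cohort, target_y_n) group, so that groups of very different size can be compared on one axis.

```python
from collections import defaultdict

# Hypothetical nested input: one record per cohort, each with pre-aggregated bins.
cohorts = [
    {"cohort": "first", "bins": [
        {"target_y_n": 0, "value": 0.5, "count": 1000},
        {"target_y_n": 0, "value": 1.0, "count": 3000},
        {"target_y_n": 1, "value": 0.5, "count": 50},
        {"target_y_n": 1, "value": 1.0, "count": 150},
    ]},
    {"cohort": "second", "bins": [
        {"target_y_n": 0, "value": 0.5, "count": 10},
        {"target_y_n": 1, "value": 1.0, "count": 10000},
    ]},
]

def flatten(cohorts):
    """One row per (cohort, bin) instead of bins nested inside each cohort."""
    return [{"cohort": rec["cohort"], **b} for rec in cohorts for b in rec["bins"]]

def add_share(rows):
    """Add each bin's share of its (cohort, target_y_n) group total."""
    totals = defaultdict(int)
    for r in rows:
        totals[(r["cohort"], r["target_y_n"])] += r["count"]
    return [{**r, "share": r["count"] / totals[(r["cohort"], r["target_y_n"])]}
            for r in rows]

rows = add_share(flatten(cohorts))
for r in rows:
    print(r)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) and plotted with y='share' instead of y='count' in the calls above.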
Let's say I have a dataframe of 80 columns and 1 target column,
for example a bank account table with 80 attributes for each record (account) and 1 target column that indicates whether the client stays or leaves.
What steps and algorithms should I follow to select the most effective columns, those with the highest impact on the target column?
There are a number of steps you can take. I'll give some examples to get you started:
A correlation coefficient, such as Pearson's r (for continuous, roughly normally distributed data) or Spearman's rho (for ordinal data).
Feature importances. I like XGBoost for this, as it includes the handy xgb.ggplot.importance (R) and xgb.plot_importance (Python) functions.
One of the many feature selection options, such as Python's sklearn.feature_selection methods.
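As a rough illustration of the first option, the sketch below ranks features by the absolute value of their Pearson correlation with the target. It is written in plain Python rather than with sklearn or XGBoost, and the column names and toy values are made up for the example. A ranking like this only captures linear, one-feature-at-a-time relationships, so it is a starting point rather than a full selection procedure.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(sxx * syy)

def rank_features(features, target):
    """Sort features by |r| with the target, strongest first."""
    scores = {name: pearson_r(col, target) for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Made-up toy data: target is 1 if the client left, 0 otherwise.
target = [0, 0, 1, 1, 1]
features = {
    "tenure": [1, 2, 3, 4, 5],
    "complaints": [3, 3, 0, 0, 0],
    "noise": [2, 1, 2, 1, 2],
}

ranking = rank_features(features, target)
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```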
This is one way to do it using the Pearson correlation coefficient in RStudio. I used it once when exploring the red_wine dataset: my target column was quality, and I wanted to know the effect of the rest of the columns on it.
In the resulting plot, blue represents a positive relationship and red a negative one, and the closer the value is to 1 or -1, the darker the color.
library(dplyr)
library(corrplot)
c <- cor(
  red_wine %>%
    # first we remove unwanted columns
    dplyr::select(-X) %>%
    dplyr::select(-rating) %>%
    mutate(
      # now we translate quality to a number
      quality = as.numeric(quality)
    )
)
corrplot(c, method = "color", type = "lower", addCoef.col = "gray", title = "Red Wine Variables Correlations", mar=c(0,0,1,0), tl.cex = 0.7, tl.col = "black", number.cex = 0.9)
I am trying to create two similar samples based on three different attributes. I want to create these samples first and then do the analysis to see which of the two samples is better. The categorical variables are sales_group, age_group, and country, so I want the proportions of countries, age groups, and sales groups to be similar in both samples.
For example: Sample A and B have following variables in it:
Id Country Age Sales
The proportion of Country in Sample A is:
USA: 58%
UK: 22%
India: 8%
France: 6%
Germany: 6%
The proportion of Country in Sample B is:
India: 42%
UK: 36%
USA: 12%
France: 3%
Germany: 5%
The same goes for other categorical variables: age_group, and sales_group
Thanks in advance for help
You do not need a special sampling procedure, because the proportion in a simple random sample is an unbiased estimate of the population proportion. If you have, say, more than 1000 observations and each sample contains more than about 30 rows, the estimate will usually be quite close (Central Limit Theorem).
You can see it in the simulation below:
set.seed(123)
n <- 10000 # Number of rows in the source data frame
df <- data.frame(sales_group = sample(LETTERS[1:4], n, replace = TRUE),
                 age_group = sample(c("old", "young"), n, replace = TRUE),
                 country = sample(c("USA", "UK", "India", "France", "Germany"), n, replace = TRUE),
                 amount = abs(100 * rnorm(n)))
s <- 100 # Number of sampled rows
sampleA <- df[sample(nrow(df), s), ]
sampleB <- df[sample(nrow(df), s), ]
table(sampleA$sales_group)
# A B C D
# 23 22 32 23
table(sampleB$sales_group)
# A B C D
# 25 22 28 25
DISCLAIMER: If some proportions are very small or very large and the samples are too small, you will need a more advanced procedure such as Laplace smoothing.
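For reference, additive (Laplace) smoothing replaces the raw proportion count / n with (count + alpha) / (n + alpha * k), where k is the number of categories, so that a category that happens not to appear in a small sample does not get an estimate of exactly zero. A minimal sketch in Python, where the function name, the toy counts, and alpha = 1 are assumptions made for illustration:

```python
def smoothed_proportions(counts, alpha=1.0):
    """Additive smoothing: (count + alpha) / (n + alpha * k) per category."""
    total = sum(counts.values())
    k = len(counts)
    return {cat: (c + alpha) / (total + alpha * k) for cat, c in counts.items()}

# Toy sample of 10 rows in which one country did not appear at all.
counts = {"USA": 7, "UK": 3, "India": 0}
props = smoothed_proportions(counts)
for cat, p in props.items():
    print(f"{cat}: {p:.3f}")
```

The smoothed proportions still sum to 1, and the effect of alpha shrinks as the sample grows.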