How can I add two columns of data to the 'hue' section on Geoplot-Geopandas for a cartogram map? - google-colaboratory

Having trouble selecting 'male' and 'female' for hue when creating a cartogram with geoplot (using geopandas).
I have only managed to select the total population but I would like to compare the state populations of male to female.
I understand that the 'hue' assignment is probably what I need to modify but so far I can only display the total population (Tot_P_P) rather than both "Tot_P_M" and "Tot_P_F".
This is the code so far; it works for total population. I have spent the past two weeks looking through tutorials and websites, but there isn't much information on cartograms.
Working code for total population:
ax = gplt.polyplot(
    gda2020,
    projection=gplt.crs.AlbersEqualArea(),
    figsize=(25, 15)
)
gplt.cartogram(
    gda2020,
    scale="Tot_P_P", limits=(0.2, 1), scale_func=None,
    hue="Tot_P_P",
    cmap='inferno',
    norm=None,
    scheme=None,
    legend=True,
    legend_values=None,
    legend_labels=None,
    legend_kwargs=None,
    legend_var='hue',
    extent=None,
    ax=ax
)
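One possible approach, offered as a sketch rather than a confirmed geoplot recipe: since `hue` takes a single column, derive one column that encodes the male/female comparison (for example the male-to-female ratio) and pass its name as `hue`. The state labels and numbers below are made up for illustration; only `Tot_P_M`, `Tot_P_F`, and `Tot_P_P` come from the question, and `M_to_F` is a hypothetical column name.

```python
import pandas as pd

# Hypothetical stand-in for gda2020 (illustrative numbers, not real census data).
states = pd.DataFrame({
    "STATE": ["A", "B", "C"],
    "Tot_P_M": [490, 1020, 300],
    "Tot_P_F": [510, 980, 300],
})
states["Tot_P_P"] = states["Tot_P_M"] + states["Tot_P_F"]

# One derived column that captures the male/female comparison.
# Values above 1 mean more males than females, below 1 the opposite.
states["M_to_F"] = states["Tot_P_M"] / states["Tot_P_F"]

print(states[["STATE", "M_to_F"]])
```

With the real GeoDataFrame, the same assignment on `gda2020` followed by `hue="M_to_F"` in the existing `gplt.cartogram` call should colour each state by the ratio, while `scale="Tot_P_P"` still sizes it by total population. A diverging colormap such as `'coolwarm'` may read better than `'inferno'` for a ratio centred on 1.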

Related

how to sort values in a horizontal bar graph that already has a variable

How do I sort the values based on top10['ActiveCases']? I can't seem to get the syntax right.
top10=df[0:21]
top10
plt.barh(top10['Country,Other'],width=top10['ActiveCases'])
plt.title("Top 10 countries with highest active cases")
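Before creating the graph, sort by ActiveCases in descending order and keep the first 10 rows:
top10 = df.sort_values(by='ActiveCases', ascending=False).head(10)
top10.plot(kind='barh', x='Country,Other', y='ActiveCases')
This plots the 10 countries with the highest active cases. Note that sort_values returns a new DataFrame, so its result has to be assigned.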

How to group by each of the 5 boroughs in NYC and take the average of the total population in each borough

I need to create a box plot that shows the average population of each borough. I have the population of each zip code in each of the 5 boroughs. How can I get to my preferred result? Open the link to see my dataframe.
A simple groupby gives the total per borough (use .mean() instead of .sum() for the average the question asks for):
df.groupby('Borough')['Population'].sum()
If you want by Borough and Zip_codes:
df.groupby(['Borough', 'Zip_codes'])['Population'].sum()
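The difference between the per-borough total and the per-borough average can be checked on a small made-up table. This is a minimal sketch: the rows are invented, and only the column names `Borough`, `Zip_codes`, and `Population` follow the answer.

```python
import pandas as pd

# Hypothetical zip-code level data; column names follow the answer.
df = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Queens", "Queens", "Queens"],
    "Zip_codes": [10451, 10452, 11101, 11102, 11103],
    "Population": [100, 300, 200, 400, 600],
})

# Total population per borough.
total_by_borough = df.groupby("Borough")["Population"].sum()
# Average zip-code population per borough.
mean_by_borough = df.groupby("Borough")["Population"].mean()

print(total_by_borough)
print(mean_by_borough)
```

For a box plot, the ungrouped zip-code rows are usually what gets plotted, one box per borough, since a box needs a distribution rather than a single average.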

Pandas value_counts() with percentage [duplicate]

I was experimenting with the kaggle.com Titanic data set (data on every person on the Titanic) and came up with a gender breakdown like this:
df = pd.DataFrame({'sex': ['male'] * 577 + ['female'] * 314})
gender = df.sex.value_counts()
gender
male 577
female 314
I would like to find out the percentage of each gender on the Titanic.
My approach is slightly less than ideal:
from __future__ import division
pcts = gender / gender.sum()
pcts
male 0.647587
female 0.352413
Is there a better (more idiomatic) way?
This function is implemented in pandas, actually even in value_counts(). No need to calculate :)
just type:
df.sex.value_counts(normalize=True)
which gives exactly the desired output.
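Please note that value_counts() excludes NA values by default, so the proportions are computed over the non-missing entries only. Pass dropna=False if missing values should be counted as well.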
See here: http://pandas-docs.github.io/pandas-docs-travis/generated/pandas.Series.value_counts.html
(A column of a DataFrame is a Series)
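How missing values interact with `normalize=True` can be seen on a tiny made-up Series (the data below is invented for illustration and is not the Titanic set):

```python
import pandas as pd

# Made-up column with one missing entry.
sex = pd.Series(["male", "male", "female", None])

# Default: the missing row is ignored, shares are relative to 3 entries.
shares = sex.value_counts(normalize=True)
# dropna=False: the missing row gets its own share, relative to 4 entries.
shares_with_na = sex.value_counts(normalize=True, dropna=False)

print(shares)
print(shares_with_na)
```

In the first result the shares are taken over the three non-missing rows; in the second, the missing row appears as its own category and the shares are taken over all four rows.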
If you want to show percentages, one option is value_counts(normalize=True), as answered by @fanfabbb.
That said, for many purposes you might want to show the result as a percentage out of a hundred.
That can be achieved like so:
gender = df.sex.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
In this case, we multiply the results by a hundred, round them to one decimal place, and add the percent sign.
If you want to merge counts with percentages, you can use:
c = df.sex.value_counts(dropna=False)
p = df.sex.value_counts(dropna=False, normalize=True)
pd.concat([c,p], axis=1, keys=['counts', '%'])
I think I would probably do this in one go (without importing division):
1. * df.sex.value_counts() / len(df.sex)
or perhaps, remembering you want a percentage:
100. * df.sex.value_counts() / len(df.sex)
Much of a muchness really, your way looks fine too.

Rstudio and ggplot, stacked bar graph: Two almost identical bits of code, fct_reorder works with one, and not the other?

I wrote this which perfectly reorders the variable mcr_variant by count on my bar graph.
mcrxgenus %>%
mutate(mcr_variant = fct_reorder(mcr_variant, count)) %>%
ggplot( aes(fill=isolate_genus, y=count, x=mcr_variant)) +
geom_bar(position="stack", stat="identity") +
coord_flip() +
labs(x="MCR variant", y="Count", fill="Isolate genus")
I wrote this to display the same dataset a bit differently.
mcrxgenus %>%
mutate(isolate_genus = fct_reorder(isolate_genus, count)) %>%
ggplot( aes(fill=mcr_variant, y=count, x=isolate_genus)) +
geom_bar(position="stack", stat="identity")+
coord_flip()+
labs(x="Isolate genus", y="Count", fill="MCR variant")
It does NOT reorder my bar graph by count, and I have no idea why. It seems to me there should be no reason for this. mcr_variant and isolate_genus are both categorical variables; mcr_variant has 12 levels and isolate_genus has 6 possible levels. That is the only difference I can think of. Has anyone run into this problem before? It's been driving me mad!
When you stack bars up, you're adding their values. fct_reorder, by default, takes the median of the values. So if MCR Variant A has counts 1, 1, 1, 2, 1, its order will be determined by the median count, 1, while its height is the sum of the counts, 6. Meanwhile if MCR Variant B has counts 2, 3, its order will be the median 2.5, but its sum is 5.
You need to make fct_reorder use sum, just like your stacked bar graph. Replace fct_reorder(isolate_genus, count) with fct_reorder(isolate_genus, count, sum).
If this doesn't work, please share a reproducible sample of data, preferably with dput so the classes are preserved and everything is copy/pasteable e.g., dput(mcrxgenus[1:10, ]) for the first 10 rows. Pick a suitable sample to illustrate the problem.
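The arithmetic in the answer above can be checked with a few lines of Python. This is only an illustration of median-based versus sum-based ordering, not R code and not how fct_reorder is implemented; the labels `A` and `B` stand for the two MCR variants in the example.

```python
from statistics import median

# Counts from the example: each list holds the per-segment counts
# that get stacked into one bar.
counts = {"A": [1, 1, 1, 2, 1], "B": [2, 3]}

# Order the bars by the median of their segments (fct_reorder's default summary).
by_median = sorted(counts, key=lambda k: median(counts[k]))
# Order the bars by the sum of their segments (the stacked bar height).
by_sum = sorted(counts, key=lambda k: sum(counts[k]))

print(by_median)
print(by_sum)
```

The two orderings disagree for this data, which is why the bars can look unsorted when the factor is ordered by median but drawn by sum.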

Comparing compressed distributions per cohort

How can I easily compare the distributions of multiple cohorts?
Usually, https://seaborn.pydata.org/generated/seaborn.distplot.html would be a great tool to visually compare distributions. However, due to the size of my dataset, I needed to compress it and only keep the counts.
It was created as:
SELECT age, gender, compress_distributionUDF(collect_list(struct(target_y_n, count, distribution_value))) GROUP BY age, gender
where compress_distributionUDF simply takes a list of tuples and returns the counts per group.
This leaves me with a list of
Row(distribution_value=60.0, count=314251, target_y_n=0)
nested inside a pandas.Series, one per cohort.
Basically, it is similar to:
pd.DataFrame({'foo':[1,2], 'bar':['first', 'second'], 'baz':[{'target_y_n': 0, 'value': 0.5, 'count':1000},{'target_y_n': 1, 'value': 1, 'count':10000}]})
and I wonder how to compare distributions:
within a cohort 0 vs. 1 of target_y_n
over multiple cohorts
in a way which is visually still understandable and not only a mess.
edit
For a single cohort, "Plotting pre aggregated data in python" could be the answer, but how can multiple cohorts be compared (not just in a loop), given that a loop leads to too many plots to compare?
I am still quite confused, but we can start from this and see where it goes. From your example, I am focusing on baz, as it is not clear to me what foo and bar are (I assume cohorts).
So let's focus on baz and plot the different distributions according to target_y_n.
# seaborn 0.12+ requires x and y as keyword arguments
sns.catplot(x='value', y='count', data=df, kind='bar', hue='target_y_n', dodge=False, ci=None)
sns.catplot(x='value', y='count', data=df, kind='box', hue='target_y_n', dodge=False)
plt.bar(df[df['target_y_n']==0]['value'], df[df['target_y_n']==0]['count'], width=1)
plt.bar(df[df['target_y_n']==1]['value'], df[df['target_y_n']==1]['count'], width=1)
plt.legend(['Target=0','Target=1'])
sns.barplot(x='value', y='count', data=df, hue='target_y_n', dodge=False, ci=None)
Finally, have a look at the FacetGrid class to extend your comparison across cohorts.
g=sns.FacetGrid(df,col='target_y_n',hue = 'target_y_n')
g=g.map(sns.barplot,'value','count',ci=None)
In your case you would have something like:
g=sns.FacetGrid(df,col='target_y_n',row='cohort',hue = 'target_y_n')
g=g.map(sns.barplot,'value','count',ci=None)
And a qqplot option:
from scipy import stats
def qqplot(x, y, **kwargs):
    _, xr = stats.probplot(x, fit=False)
    _, yr = stats.probplot(y, fit=False)
    plt.scatter(xr, yr, **kwargs)
g=sns.FacetGrid(df,col='cohort',hue = 'target_y_n')
g=g.map(qqplot,'value','count')
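The plotting calls above expect a flat DataFrame with `value`, `count`, `target_y_n`, and `cohort` columns, whereas the example in the question keeps these nested as dicts in `baz`. A minimal sketch of one way to flatten it, assuming (as the answer does) that `bar` is the cohort label:

```python
import pandas as pd

# Example frame from the question; 'bar' is assumed to be the cohort label.
raw = pd.DataFrame({
    "foo": [1, 2],
    "bar": ["first", "second"],
    "baz": [
        {"target_y_n": 0, "value": 0.5, "count": 1000},
        {"target_y_n": 1, "value": 1, "count": 10000},
    ],
})

# Expand each dict in 'baz' into its own columns, keeping the row index
# so the result lines up with the cohort label.
expanded = pd.DataFrame(raw["baz"].tolist(), index=raw.index)
cohort = raw[["bar"]].rename(columns={"bar": "cohort"})
df = pd.concat([cohort, expanded], axis=1)

print(df)
```

If each cohort holds a list of such records rather than a single dict, calling `raw.explode("baz")` and resetting the index before the expansion step would probably be needed.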