RStudio and ggplot, stacked bar graph: two almost identical bits of code, fct_reorder works with one and not the other?

I wrote this which perfectly reorders the variable mcr_variant by count on my bar graph.
mcrxgenus %>%
  mutate(mcr_variant = fct_reorder(mcr_variant, count)) %>%
  ggplot(aes(fill = isolate_genus, y = count, x = mcr_variant)) +
  geom_bar(position = "stack", stat = "identity") +
  coord_flip() +
  labs(x = "MCR variant", y = "Count", fill = "Isolate genus")
I wrote this to display the same dataset a bit differently.
mcrxgenus %>%
  mutate(isolate_genus = fct_reorder(isolate_genus, count)) %>%
  ggplot(aes(fill = mcr_variant, y = count, x = isolate_genus)) +
  geom_bar(position = "stack", stat = "identity") +
  coord_flip() +
  labs(x = "Isolate genus", y = "Count", fill = "MCR variant")
It does NOT reorder my bar graph by count. I have absolutely no idea what is going on; it seems to me there should be no reason for this. mcr_variant and isolate_genus are both categorical variables; mcr_variant has 12 levels and isolate_genus has 6 possible levels. That is the only difference I can think of. Has anyone run into this problem before? It's been driving me mad!

When you stack bars up, you're adding their values. fct_reorder, by default, takes the median of the values. So if MCR Variant A has counts 1, 1, 1, 2, 1, its order will be determined by the median count, 1, while its height is the sum of the counts, 6. Meanwhile if MCR Variant B has counts 2, 3, its order will be the median 2.5, but its sum is 5.
You need to make fct_reorder use sum, just as the stacked bars do: replace fct_reorder(isolate_genus, count) with fct_reorder(isolate_genus, count, .fun = sum).
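To see the difference between the two summary functions, here is a minimal sketch with made-up genus names and counts (not your data):

library(forcats)

# Hypothetical counts: genus A has many small counts, genus B a few larger ones
genus <- c("A", "A", "A", "A", "A", "B", "B")
count <- c(1, 1, 1, 2, 1, 2, 3)

levels(fct_reorder(genus, count))              # default median: "A" "B" (medians 1 vs. 2.5)
levels(fct_reorder(genus, count, .fun = sum))  # by sum: "B" "A" (sums 5 vs. 6)

With the default median, B outranks A even though A's stacked bar (sum 6) would be taller than B's (sum 5); .fun = sum makes the factor order agree with the bar heights.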
If this doesn't work, please share a reproducible sample of data, preferably with dput so the classes are preserved and everything is copy/pasteable, e.g. dput(mcrxgenus[1:10, ]) for the first 10 rows. Pick a suitable sample to illustrate the problem.

Related

R Apply across a DF column with a dynamic column selection

I've been trying to figure out how to calculate across a DF using lapply/tapply/apply while dynamically selecting the column to use. I have a DF of wind speed at different heights (Date, Year, Wind.100, Wind.120, Wind.80, etc.) and I want to do calculations based on heights that vary depending on which turbine I am simulating.
Height <- 100
Level <- paste0("Wind.", Height)
I've tried:
tapply(df[[Level]], list(DF$Hour, DF$Year), mean)
tapply(df[Level], list(DF$Hour, DF$Year), mean)
tapply(df[,..Level], list(DF$Hour, DF$Year), mean)
but it fails with the error: object 'Level' not found.
There has got to be a way to script this. I know doing a
paste0("df$Wind.",Height)
doesn't work but I can't figure it out.
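For what it's worth, df[[Level]] is the right idiom for selecting a column by a name stored in a variable; note that the attempts above mix df and DF, which is one possible source of lookup errors. A minimal self-contained sketch (the data frame and its columns are made up here):

# Hypothetical data frame with wind speeds at two heights
df <- data.frame(
  Hour = rep(0:3, times = 4),
  Year = rep(2020:2021, each = 8),
  Wind.100 = runif(16, 0, 25),
  Wind.120 = runif(16, 0, 25)
)

Height <- 100
Level <- paste0("Wind.", Height)  # "Wind.100"

# [[ returns the column named by the string in Level as a plain vector
tapply(df[[Level]], list(df$Hour, df$Year), mean)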

Create new data frame of percentages from values of an old data frame?

So I want to create a new data frame by adding the values of the Sometimes and Often columns, dividing by the values of the Total column, and multiplying by 100 to get percentages (unless there is a function in R that does this automatically). How would I go about doing that?
You have added an "sql" tag to your question. Should you prefer SQL over R for reasons of experience and/or knowledge, you might be interested in the fabulous sqldf package, which allows you to use SQL syntax within R. You will have to install it first via install.packages("sqldf"), and then you can use it as in
expl <- data.frame(sometimes = c(1, 2, 4), often = c(2, 2, 2), total = c(6, 9, 8))
library(sqldf)
sqldf("SELECT 100*(sometimes+often)/total FROM expl")
The far more common approach is to add a percent column to the same data.frame instead of introducing a new one. That way, all data are kept together and you do not lose the link to, e.g., the week column.
One way to go about that would be the following one-liner:
expl <- data.frame(sometimes = c(1, 2, 4), often = c(2, 2, 2), total = c(6, 9, 8))
print(expl)
expl$percent = 100 * (expl$sometimes + expl$often)/expl$total
print(expl)
First, it looks as though Total, Sometimes, and Often are character because you have commas in them, so you would need to get rid of the commas and convert them to numeric. You can do that as follows (assuming your dataframe is called mydata):
for(i in c("Total","Sometimes","Often")) mydata[[i]] = as.numeric(gsub(",", "", mydata[[i]])
Then you can use the answer by Bernard:
mydata$percent = 100 * (mydata$Sometimes + mydata$Often)/mydata$Total
Another option using the tidyverse:
library(tidyverse)
newdataframe <- olddataframe %>%
  mutate(percent = (Sometimes + Often) / Total * 100) %>%
  select(percent)
But as said before, it is better to keep the percentage column with the other data; in that case, remove the %>% select(percent) step.
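If, as noted in the previous answer, the columns arrive as character with thousands separators, the cleaning step can be folded into the same pipeline. A sketch under that assumption (the single-row mydata here is made up for illustration):

library(dplyr)

# Hypothetical raw data with comma-formatted numbers stored as character
mydata <- data.frame(Sometimes = "1,200", Often = "300", Total = "2,000")

mydata <- mydata %>%
  mutate(across(c(Sometimes, Often, Total), ~ as.numeric(gsub(",", "", .x)))) %>%
  mutate(percent = 100 * (Sometimes + Often) / Total)  # 75 for this row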

Taking mean of N largest values of group by absolute value

I have some DataFrame:
import numpy as np
import pandas as pd

d = {'fruit': ['apple', 'pear', 'peach'] * 6, 'values': np.random.uniform(-5, 5, 18), 'values2': np.random.uniform(-5, 5, 18)}
df = pd.DataFrame(data=d)
I can take the mean of each fruit group as such:
df.groupby('fruit').mean()
However, for each fruit group, I'd like to take the mean of the N largest values as ranked by absolute value.
So for example, if my values were as follows and N=3:
[ 0.7578507 , 3.81178045, -4.04810913, 3.08887538, 2.87999752, 4.65670954]
The desired outcome would be (4.65670954 + -4.04810913 + 3.81178045) / 3 = ~1.47
Edit - to clarify that sign is preserved in outcome:
(4.65670954 + -20.04810913 + 3.81178045) / 3 = -3.859
Updating with a new approach that I think is simpler. I was avoiding apply like the plague, but maybe this is one of its more acceptable uses. Plus, it handles the fact that you want to average the original signed values as ranked by their absolute values:
def foo(d):
    # pick the group's rows whose absolute values are the 3 largest,
    # then average the original signed values
    return d[d.abs().nlargest(3).index].mean()

out = df.groupby('fruit')['values'].apply(foo)
So you index each group by the 3 largest absolute values, then mean.
And for the record my original, incorrect, and slower code was:
df['values'].abs().groupby(df['fruit']).nlargest(3).groupby("fruit").mean()

geom_nodelabel_repel() position for circular ggraph plot

I have a network diagram (image not reproduced here). I made it using ggraph and added the labels using geom_nodelabel_repel() from ggnetwork:
( ggraph_plot <- ggraph(layout) +
    geom_edge_fan(aes(color = as.factor(responses), edge_width = as.factor(responses))) +
    geom_node_point(aes(color = as.factor(group)), size = 10) +
    geom_nodelabel_repel(aes(label = name, x = x, y = y),
                         segment.size = 1, segment.color = "black", size = 5) +
    scale_color_manual("Group", values = c("#2b83ba", "#d7191c", "#fdae61")) +
    scale_edge_color_manual("Frequency of Communication",
                            values = c("Once a week or more" = "#444444", "Monthly" = "#777777",
                                       "Once every 3 months" = "#888888", "Once a year" = "#999999"),
                            limits = c("Once a week or more", "Monthly", "Once every 3 months", "Once a year")) +
    scale_edge_width_manual("Frequency of Communication",
                            values = c("Once a week or more" = 3, "Monthly" = 2,
                                       "Once every 3 months" = 1, "Once a year" = 0.25),
                            limits = c("Once a week or more", "Monthly", "Once every 3 months", "Once a year")) +
    theme_void() +
    theme(legend.text = element_text(size = 16, face = "bold"),
          legend.title = element_text(size = 16, face = "bold")) )
I want to have the labels on the left side of the plot be off to the left, and the labels on the right side of the plot to be off to the right. I want to do this because the actual labels are quite long (organization names) and they get in the way of the lines in the actual plot.
How can I do this using geom_nodelabel_repel()? I've tried different combinations of box_padding and point_padding, as well as hjust and vjust, but these apply to all labels, and there doesn't seem to be a way to subset or position specific points.
Apologies for not providing a reproducible example but I wasn't sure how to do this without compromising the identities of respondents from my survey.
Well, there is always the manually-intensive, yet effective method of separately adding the geom_node_label_repel function for the nodes on the "left" vs. the "right" of the plot. It's not at all elegant and probably bad coding practice, but I've done similar things myself when I can't figure out an elegant solution. It works really well when you don't have a very large dataset to begin with and if you are not planning to make the same plot over and over again. Basically, it would entail:
1. Identify whether there is a property in your dataset that places points on the "left" vs. the "right". In this case, it doesn't look like it, so you would just have to create a manual list of the entries on the "left" and on the "right" of your plot.
2. Use separate calls to geom_node_label_repel with different nudge_x values. Use any reasonable method to subset the "left" and "right" datapoints: you can create a new column in the dataset, or subset in-line, e.g. data = subset(your.data.frame, property %in% left.list).
For example, if you created a column called subset.side, being either "left" or "right" in your data.frame (here: your.data.frame), your calls to geom_node_label_repel might look something like:
geom_node_label_repel(
  data = subset(your.data.frame, subset.side == 'left'),
  aes(label = name, x = x, y = y),
  segment.size = 1, segment.color = 'black', size = 5,
  nudge_x = -10
) +
geom_node_label_repel(
  data = subset(your.data.frame, subset.side == 'right'),
  aes(label = name, x = x, y = y),
  segment.size = 1, segment.color = 'black', size = 5,
  nudge_x = 10
) +
Alternatively, you can create lists based on the label names themselves. Say you called those lists names.left and names.right; you can then subset accordingly, as represented in the pseudo-code below:
geom_node_label_repel(
  data = subset(your.data.frame, name %in% names.left), ...,
  nudge_x = -10, ...
) +
geom_node_label_repel(
  data = subset(your.data.frame, name %in% names.right), ...,
  nudge_x = 10, ...
)
To be fair, I have not worked with the node geoms before, so I am assuming here that the positioning of the labels will not affect the mapping (as it would not with other geoms).

Comparing compressed distributions per cohort

How can I easily compare the distributions of multiple cohorts?
Usually, https://seaborn.pydata.org/generated/seaborn.distplot.html would be a great tool to visually compare distributions. However, due to the size of my dataset, I needed to compress it and only keep the counts.
It was created as:
SELECT age, gender, compress_distributionUDF(collect_list(struct(target_y_n, count, distribution_value))) GROUP BY age, gender
where compress_distributionUDF simply takes a list of tuples and returns the counts per group.
This leaves me with a list of
Row(distribution_value=60.0, count=314251, target_y_n=0)
nested inside a pandas.Series, one per cohort.
Basically, it is similar to:
pd.DataFrame({'foo':[1,2], 'bar':['first', 'second'], 'baz':[{'target_y_n': 0, 'value': 0.5, 'count':1000},{'target_y_n': 1, 'value': 1, 'count':10000}]})
and I wonder how to compare the distributions:
- within a cohort: 0 vs. 1 of target_y_n
- over multiple cohorts
in a way that is still visually understandable and not just a mess.
Edit: for a single cohort, Plotting pre aggregated data in python could be the answer, but how can multiple cohorts be compared (not just in a loop), as this leads to too many plots to compare?
I am still quite confused, but we can start from this and see where it goes. From your example, I am focusing on baz, as it is not clear to me what foo and bar are (I assume cohorts). So let's focus on baz and plot the different distributions according to target_y_n.
import matplotlib.pyplot as plt
import seaborn as sns

# seaborn catplots (bar and box), colored by target_y_n
sns.catplot('value', 'count', data=df, kind='bar', hue='target_y_n', dodge=False, ci=None)
sns.catplot('value', 'count', data=df, kind='box', hue='target_y_n', dodge=False)
# plain matplotlib bars, one call per value of target_y_n
plt.bar(df[df['target_y_n'] == 0]['value'], df[df['target_y_n'] == 0]['count'], width=1)
plt.bar(df[df['target_y_n'] == 1]['value'], df[df['target_y_n'] == 1]['count'], width=1)
plt.legend(['Target=0', 'Target=1'])
# or a single overlaid barplot
sns.barplot('value', 'count', data=df, hue='target_y_n', dodge=False, ci=None)
Finally, have a look at the FacetGrid class to extend your comparison (see here).
g = sns.FacetGrid(df, col='target_y_n', hue='target_y_n')
g = g.map(sns.barplot, 'value', 'count', ci=None)
In your case you would have something like:
g = sns.FacetGrid(df, col='target_y_n', row='cohort', hue='target_y_n')
g = g.map(sns.barplot, 'value', 'count', ci=None)
And a qqplot option:
from scipy import stats

def qqplot(x, y, **kwargs):
    _, xr = stats.probplot(x, fit=False)
    _, yr = stats.probplot(y, fit=False)
    plt.scatter(xr, yr, **kwargs)

g = sns.FacetGrid(df, col='cohort', hue='target_y_n')
g = g.map(qqplot, 'value', 'count')