Growth curves in R with standard deviation - ggplot2

I am trying to plot my data (replicate results for each strain) and I want only one line per strain: the mean of the replicates for each strain, with points along the line and error bars (the error between replicates).
If you click on the image above, it shows the plot I have so far, which displays WT and WT.1 as separate lines, and likewise for all the other replicates. However, they are replicates of each strain (WT, DrsbR, DsigB), and I want them to appear as one line of mean results for each strain instead. I am using the ggplot2 package and melting the data with the reshape2 package, but I cannot figure out how to make my replicates appear as one line together with error bars (standard deviation of the mean results between replicates).
The image in black and white is what I am looking for in my graph: a separate line for each strain, with the replicate data plotted as mean values.
library(reshape2)
library(ggplot2)
melted<-melt(abs2)
print(abs2)
melted<-melt(abs2,id.vars=1,measure.vars=c("WT","WT.1","DsigB","DsigB.1","DrsbR","DrsbR.1"))
View(melted)
colnames(melted)<-c("Time","Strain","Values")
##line graph for melted data
melted$Time<-as.factor(melted$Time)
abs2line=ggplot(melted,aes(Time,Values))+geom_line(aes(colour=Strain,group=Strain))
abs2line+
stat_summary(fun=mean,
geom="point",
aes(group=Time))+
stat_summary(fun.data=mean_cl_boot,
geom="errorbar",
width=.2)+
xlab("Time")+
ylab("OD600")+
theme_classic()+
labs(title="Growth Curve of Mutant Strains")
summary(melted)
print(melted)

One approach is to take your melted data frame and separate out the "variable" column into "species" and "strain" using the separate() function from tidyr. I don't have your dataset -- it is appreciated if you are able to share your dataset via dput(your.data.frame) for future questions -- so I made a dummy dataset that's similar to yours. Here we have two "species" (red and blue) and two "strains" for each species.
library(reshape2)  # melt()
library(tidyr)     # separate() and the %>% pipe
library(ggplot2)

df <- data.frame(
  time = seq(0, 40, by=10),
  blue = c(0:4),
  blue.1 = c(0, 1.1, 1.9, 3.1, 4.1),
  red = seq(0, 8, by=2),
  red.1 = c(0, 2.1, 4.2, 5.5, 8.2)
)
df.melt <- melt(df,
  id.vars = 'time',
  measure.vars = c('blue', 'blue.1', 'red', 'red.1'))
We can then use tidyr::separate() to separate the resulting "variable" column into a "species" column and a "strain" column. Luckily, your data contains a "." which can be a handy character to use for the separation:
df.melt.mod <- df.melt %>%
separate(col=variable, into=c('species', 'strain'), sep='\\.')
Note: The code above gives a warning because "blue" and "red" contain no "." character, so their "strain" value is filled with NA. That doesn't matter here, since we don't use that column for anything; the same applies to your own dataset.
Then, you can actually just use stat_summary() for all geoms... modify as you see fit for your own visual and thematic preference. Note that order matters for layering, so I plot geom_line first, then geom_point, then geom_errorbar. Also note that you can assign the group=species aesthetic in the base ggplot() call and that mapping applies to all geoms unless overwritten.
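ggplot(df.melt.mod, aes(x=time, y=value, group=species)) +
  # mean of the replicates at each time point, drawn as a line
  stat_summary(
    fun = mean,
    geom='line',
    aes(color=species)) +
  stat_summary(
    fun=mean,
    geom='point') +
  # mean_cl_boot gives a bootstrapped confidence interval of the mean
  # (not the standard deviation) and needs the Hmisc package installed
  stat_summary(
    fun.data=mean_cl_boot,
    geom='errorbar',
    width=0.5) +
  theme_bw()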
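The question asks for standard-deviation error bars, while mean_cl_boot draws a bootstrapped confidence interval. Here is a minimal sketch of one way to get mean ± SD instead, assuming dummy data shaped like the example above. The helper mean_sd() and the base-R reshaping are my own illustration (they are not part of ggplot2), and stripping the ".1" suffix with sub() stands in for the separate() step.

```r
library(ggplot2)

# Dummy wide data in the same shape as the example above (values are made up)
df <- data.frame(
  time   = seq(0, 40, by = 10),
  blue   = 0:4,
  blue.1 = c(0, 1.1, 1.9, 3.1, 4.1),
  red    = seq(0, 8, by = 2),
  red.1  = c(0, 2.1, 4.2, 5.5, 8.2)
)

# Reshape to long format with base R: one row per time point and replicate
long <- data.frame(
  time      = rep(df$time, times = ncol(df) - 1),
  replicate = rep(names(df)[-1], each = nrow(df)),
  value     = unlist(df[-1], use.names = FALSE)
)

# Strip a trailing ".1", ".2", ... so replicates share one group label
long$species <- sub("\\.[0-9]+$", "", long$replicate)

# Hypothetical helper: returns the mean and mean +/- one standard deviation
# in the y / ymin / ymax format that stat_summary(fun.data = ...) expects
mean_sd <- function(x) {
  m <- mean(x)
  s <- sd(x)
  data.frame(y = m, ymin = m - s, ymax = m + s)
}

p <- ggplot(long, aes(x = time, y = value, group = species, color = species)) +
  stat_summary(fun = mean, geom = "line") +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_sd, geom = "errorbar", width = 2) +
  theme_bw()
p
```

With only two replicates per group the standard deviation is a rough estimate, so treat the bars as indicative.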

Related

why is ggplot2 geom_col misreading discrete x axis labels as continuous?

Aim: plot a column chart representing concentration values at discrete sites
Problem: the 14 site labels are numeric, so I think ggplot2 is assuming continuous data and adding spaces for what it sees as 'missing numbers'. I only want 14 columns with 14 marks/labels, matching the 14 values in the dataframe. I've tried converting the sites to factors and to characters, but neither works.
Also, how do you ensure the y-axis ends at '0', so the bottom of the columns meet the x-axis?
Thanks
Data:
Sites: 2,4,6,7,8,9,10,11,12,13,14,15,16,17
Concentration: 10,16,3,15,17,10,11,19,14,12,14,13,18,16
You have two questions in one with two pretty straightforward answers:
1. How to force a discrete axis when your column is a continuous one? To make ggplot2 draw a discrete axis, the data must be discrete. You can force your numeric data to be discrete by converting to a factor. So, instead of x=Sites in your plot code, use x=as.factor(Sites).
2. How to eliminate the white space below the columns in a column plot? You can control the limits of the y axis via the scale_y_continuous() function. By default, the limits extend a bit past the actual data (in this case, from 0 to the max Concentration). You can override that behavior via the expand= argument. Check the documentation for expansion() for more details, but here I'm going to use mult=, which expands the limits by a fraction of the data range. I'm using 0 for the lower end, so the lower axis limit equals the bottom of the columns (0), and 0.05 for the upper end, so the chart extends about 5% past the max value (this matches the default for continuous scales).
Here's the code and resulting plot.
library(ggplot2)
df <- data.frame(
Sites = c(2,4,6,7,8,9,10,11,12,13,14,15,16,17),
Concentration = c(10,16,3,15,17,10,11,19,14,12,14,13,18,16)
)
ggplot(df, aes(x=as.factor(Sites), y=Concentration)) +
geom_col(color="black", fill="lightblue") +
scale_y_continuous(expand=expansion(mult=c(0, 0.05))) +
theme_bw()

Creating bar charts with binary data

I have the following data, which I am trying to use to create a bar chart showing how fruit preference varies with country:
see data table here
I want to create a bar chart that shows preference for apples, oranges, grapes and bananas based on survey location (i.e. x = surveyloc and y = preference frequency of oranges, apples, bananas). I am not quite sure how to do this when dealing with binary data and am hoping to get some assistance.
If you want to show preference for multiple variables (e.g. fruits) across multiple locations when you only have binary data ("yes" or "no", or 0 vs 1), a bar chart is probably not the best option. My recommendation would be something like a tile plot, so that you can convey preferences across the locations at a glance. Here's an example using some dummy data. I'll first show you an example of a bar plot (column plot), then the recommendation I have for you, which would be a tilemap.
Example Dataset
library(ggplot2)
library(dplyr)
library(tidyr)
set.seed(8675309)
df <- data.frame(
location = state.name[1:10],
apples = rbinom(10,1,0.3),
oranges = rbinom(10,1,0.1),
pears = rbinom(10,1,0.25),
grapes = rbinom(10,1,0.6),
mangos = rbinom(10,1,0.65)
)
# tidy data
df <- df %>% pivot_longer(cols = -location) %>%
mutate(value = factor(value))
I created df above initially in the same format you have for your dataset (location | pref1 | pref2 | pref3 | ...). It's difficult to use ggplot2 to plot this type of data easily, since it is designed to handle what is referred to as Tidy Data. This is overall a better strategy for data management and is adaptable to whatever output you wish - I'd recommend reading that vignette for more info. Needless to say, after the code above we have df formatted as a "tidy" table.
Note I've also turned the binary "value" column into a factor (since it only holds "0" or "1", and values of "0.5" and the like don't make sense here with this data).
"Bar Chart"
I put "bar chart" in quotes, because as we are plotting the value (0 or 1) on the y axis and location on the x axis, we are creating a "column chart". "Bar charts" formally only need a list of values and plot count, density, or probability on the y axis. Regardless, here's an example:
bar_plot <-
df %>%
ggplot(aes(x=location, y=value, fill=name)) +
geom_col(position="dodge", color='gray50', width=0.7) +
scale_fill_viridis_d()
bar_plot
We could think about just showing where value==1, but that's probably not going to make things clearer.
Example of Tile Plot
What I think works better here is a tilemap. The idea is that you spread location on the x axis and name (of the fruit) on the y axis, and then show the value field as the color of the resulting tiles. I think it makes things a bit easier to view, and it should work pretty much the same if your data is binary or probabilistic. For probability data, you just don't need to convert to a factor first.
tile_plot <-
df %>%
ggplot(aes(x=location, y=name, fill=value)) +
geom_tile(color='black') +
scale_fill_manual(values=c(`0`="gray90", `1`="skyblue")) +
coord_fixed() +
scale_x_discrete(expand=expansion(0)) +
scale_y_discrete(expand=expansion(0))
tile_plot
To explain a little of what's going on here: we set up the aesthetics as indicated above in ggplot(...). Then we draw the tiles with geom_tile(), where color= sets the line around the tiles. The actual fill colors are defined in scale_fill_manual(). The tiles are forced to be "square" via coord_fixed(), and then I remove the excess area around the tiles via the scale_x_*() and scale_y_*() commands.
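For the probabilistic case mentioned above, here is a rough sketch, assuming each cell holds a preference proportion between 0 and 1. The prob_df data is randomly generated purely for illustration. The idea is to leave value numeric and swap scale_fill_manual() for a continuous scale such as scale_fill_gradient().

```r
library(ggplot2)

set.seed(8675309)

# Hypothetical long-format data: one preference proportion per location and fruit
prob_df <- expand.grid(
  location = state.name[1:10],
  name     = c("apples", "oranges", "pears", "grapes", "mangos"),
  stringsAsFactors = FALSE
)
prob_df$value <- round(runif(nrow(prob_df)), 2)

prob_tile_plot <-
  ggplot(prob_df, aes(x = location, y = name, fill = value)) +
  geom_tile(color = "black") +
  # value stays numeric, so a continuous fill scale replaces the manual one
  scale_fill_gradient(low = "gray90", high = "skyblue", limits = c(0, 1)) +
  coord_fixed() +
  scale_x_discrete(expand = expansion(0)) +
  scale_y_discrete(expand = expansion(0))
prob_tile_plot
```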

how fix the y-axis's rate in plot

I am using a fitted line to estimate the slope of my graphs. The data sets are the same size. But look at these two pictures: the first one seems to have a larger slope, but that's not true. The second one has the larger slope, but because the y-axes have different scales, the first one looks steeper. Is there any way to fix the scale of the y-axis so that I can see by eye which one has the bigger slope?
code:
x = np.array(list(range(0,df.shape[0]))) # = array([0, 1, 2, ..., 3598, 3599, 3600])
df1[skill]=pd.to_numeric(df1[skill])
fit = np.polyfit(x, df1[skill], 1)
fit_fn = np.poly1d(fit)
df['fit_fn(x)']=fit_fn(x)
df[['Hodrick-Prescott filter',skill,'fit_fn(x)']].plot(title=skill + date)
Two ways:
One: use matplotlib.pyplot.axis to get the axis limits of the first figure, then call the same function to set the second figure to the same limits. (You could also use get_ylim and set_ylim, which are specific to the y-axis but require a reference to the Axes object.)
Two: plot both in a single figure created with subplots and set the sharey argument to True. (This is my preference, depending on the intended use.)

Adjusting the y-axis in ggplot (bar size, ordering, formatting)

I have the following data.table:
I would like to have a plot which shows the columns symbol and value in a box plot. The boxes should be ordered by the column value.
My code, that I've tried:
plot1 <- ggplot(symbols, aes(symbol, value, fill = from)) +
geom_bar(stat = 'identity') +
ggtitle(paste0("Total quantity traded: ", format(sum(symbols$quantity), scientific = FALSE, nsmall = 2, big.mark = " "))) +
theme_bw()
plot1
This returns the following plot:
What I would like to change:
- flip x- and y-axis
- show the correct height of boxes (y-axis)...currently the relation between the boxes is not correct.
- decreasing order of the boxes by columns value
- format the y-axis with two digits
- make the x-axis readable...currently the x-axis is just a long bunch of what is written in column symbol.
Thanks upfront for the help!
To make things a bit easier, it is suggested that you post your data frame as the output of dput(your.data.frame), which produces code that can be used to replicate your dataset in R.
With that being said, I recreated your data (it was not too big)--some numbers were rounded a bit to make things easier.
A few comments:
y-axis numbers are odd: The numbers on your y-axis are not numeric. If you type str(your.data.frame) you'll probably notice that "value" is not numeric, but a character or factor. For a character column, this can be remedied via: df$value <- as.numeric(df$value), where df is your dataframe. For a factor, convert through character first, df$value <- as.numeric(as.character(df$value)), because as.numeric() on a factor returns the internal level codes rather than the original numbers.
flipping axis: You can use coord_flip() (typically added to the end of your ggplot call). Be warned that when you do this, your aesthetics flip for the plot, so just keep that in mind.
your dataframe name is also a function/data name in R: This may not be causing any issues (due to your environment), but take care not to give your dataset a name that is already used elsewhere in R. This goes for column/variable names too. I don't think it causes any issues here, but just an FYI.
geom_col vs geom_bar: Check out this documentation link for some description on the differences between geom_bar and geom_col. Basically, you want to use geom_bar when your y-axis is count, and geom_col when your y-axis is a value. Here, you want to plot a value, so choose geom_col(), and not geom_bar().
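To illustrate the first comment, here is a small sketch of what happens when a numeric column has been read in as a factor. The value_factor vector is a hypothetical stand-in built from three of the values in the question.

```r
# Hypothetical column that was read in as a factor instead of numeric
value_factor <- factor(c("87485737", "952445", "753017"))

# as.numeric() on a factor returns the internal level codes, not the numbers
# (levels are sorted as text, so the codes here are 2 3 1)
as.numeric(value_factor)

# Converting through character first recovers the original values
value_numeric <- as.numeric(as.character(value_factor))
value_numeric
```

Checking str() of the data frame before plotting shows which of the two cases applies.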
Fixing the issues in plot
Here's the representation of your data (note I rounded... hopefully got the actual data correct, because I manually had to copy each value):
   from  symbol  quantity      usd    value
1   BTC BTCUSDT 12910.470 6776.340 87485737
2   ETH ETHUSDT  6168.730  154.398   952445
3   BNB BNBUSDT 51002.650   14.764   753017
4   BNB  BNBBTC 31071.280   14.764   458745
5   ETH  ETHBTC  2216.576  154.398   342236
6   LTC LTCUSDT  4332.024   40.481   175368
7   BNB  BNBETH  3150.030   14.764    46507
8   LTC  LTCBTC   922.560   40.481    37346
9   LTC  LTCBNB   521.476   40.481    21110
10  NEO NEOUSDT  2438.353    7.203    17564
11  NEO  NEOBTC   417.930    7.203     3010
Here's the basic plot, flipped:
ggplot(df, aes(symbol, value, fill=from)) +
geom_col() +
coord_flip()
The problem here is that when you plot values... BTCUSDT is huge in comparison. I would suggest you plot on log of value. See this link for some advice on how to do that. I like the scale_y_log10() function, since it just works here pretty well:
ggplot(df, aes(symbol, value, fill=from)) +
geom_col() +
scale_y_log10() +
coord_flip()
If you want to keep the columns in the vertical orientation, you can still do that and avoid having the labels run into each other on the x-axis. In that case, you can rotate the labels via theme(axis.text.x=...). Note the horizontal alignment adjustment (hjust=1), which forces the labels to be "right-aligned":
ggplot(df, aes(symbol, value, fill=from)) +
geom_col() +
scale_y_log10() +
theme(axis.text.x=element_text(angle=45, hjust=1))
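The question also asked for bars in decreasing order of value and two-decimal axis labels. Here is a possible sketch, assuming the rounded data shown above. reorder() is base R, while the label-formatting function two_decimals() is just a hypothetical helper I picked for illustration.

```r
library(ggplot2)

# Rounded values copied from the table above
df <- data.frame(
  from   = c("BTC", "ETH", "BNB", "BNB", "ETH", "LTC",
             "BNB", "LTC", "LTC", "NEO", "NEO"),
  symbol = c("BTCUSDT", "ETHUSDT", "BNBUSDT", "BNBBTC", "ETHBTC", "LTCUSDT",
             "BNBETH", "LTCBTC", "LTCBNB", "NEOUSDT", "NEOBTC"),
  value  = c(87485737, 952445, 753017, 458745, 342236, 175368,
             46507, 37346, 21110, 17564, 3010)
)

# reorder() sorts the factor levels of symbol by value (ascending); after
# coord_flip() the last level is drawn at the top, so the largest bar is on top
df$symbol <- reorder(df$symbol, df$value)

# Hypothetical label helper: two decimals, space as thousands separator
two_decimals <- function(x) {
  format(x, nsmall = 2, big.mark = " ", scientific = FALSE, trim = TRUE)
}

p <- ggplot(df, aes(symbol, value, fill = from)) +
  geom_col() +
  scale_y_log10(labels = two_decimals) +
  coord_flip()
p
```

Use reorder(df$symbol, -df$value) instead if the largest bar should sit at the bottom (or on the left in the vertical orientation).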

Overlaying mixed effects model results with ggplot2

I have been having some difficulty in displaying the results from my lmer model within ggplot2. I am specifically interested in displaying predicted regression lines on top of observed data. The lmer model I am running on this (speech) data is here below:
lmer.declination <- lmer(zlogF0_m60~Center.syll*Tone + (1|Trial) + (1+Tone|Speaker) + (1|Utterance.num), data=data)
The dependent variable here is fundamental frequency (F0), normalized and averaged across the middle 60% of a syllable. The fixed effects are syllable number (Center.syll), counted backwards from the end of a sentence (e.g. -2 is the 3rd last syllable in the sentence). The data here is from a lexical tone language, so the Tone (all low tone /1/, all mid tone /3/, and all high tone /4/) is a discrete fixed effect. The experimental questions are whether F0 falls across the sentences for this language, if so, by how much, and whether tone matters. It was a bit difficult for me to think of a way to produce a toy data set here, but the data can be downloaded here (a 437K file).
In order to extract the model fits, I used the effects package and converted the output to a data frame.
ex <- Effect(c("Center.syll","Tone"),lmer.declination)
ex.df <- as.data.frame(ex)
I plot the data using ggplot2, with the following code:
t.plot <- ggplot(data, aes(factor(Center.syll), zlogF0_m60, group=Tone, color=Tone)) +
  stat_summary(fun.data = mean_cl_boot, geom = "smooth") +
  ylab("Normalized log(F0)") +
  xlab("Syllable number") +
  ggtitle("F0 change across utterances with identical level tones, medial 60% of vowel") +
  geom_pointrange(data=ex.df, mapping=aes(x=Center.syll, y=fit, ymin=lower, ymax=upper)) +
  theme_bw()
t.plot
This produces the following plot:
Predicted trajectories and observed trajectories
The predicted values appear to the left of the observed data, not overlaid on the data itself. Whatever I seem to try, I can not get them to overlap on the observed data. I would ideally like to have a single line drawn rather than a pointrange, but when I attempted to use geom_line, the default was for the line to connect from the upper bound of one point to the lower bound of the next (not at the median/midpoint). Thank you for your help.
(Edit: As the OP pointed out, he did in fact include a link to his data set. My apologies for implying that he didn't.)
First of all, you will have much better luck getting a helpful response if you provide a minimal, complete, and verifiable example (MCVE). Look here for information on how to best do that for R specifically.
Lacking your actual data to work with, I believe your problem is that you're factoring the x-axis for the stat_summary, but not for the geom_pointrange. I mocked up a toy example from the plot you linked to in order to demonstrate:
dat1 <- data.frame(x=c(-6:0, -5:0, -4:0),
y=c(-0.25, -0.5, -0.6, -0.75, -0.8, -0.8, -1.5,
0.5, 0.45, 0.4, 0.2, 0.1, 0,
0.5, 0.9, 0.7, 0.6, 1.1),
z=c(rep('a', 7), rep('b', 6), rep('c', 5)))
dat2 <- data.frame(x=dat1$x,
y=dat1$y + runif(18, -0.2, 0.2),
z=dat1$z,
upper=dat1$y + 0.3 + runif(18, -0.1, 0.1),
lower=dat1$y - 0.3 + runif(18, -0.1, 0.1))
Now, the following call gives me a result similar to the graph you linked to:
ggplot(dat1, aes(factor(x), # note x being factored here
y, group=z, color=z)) +
geom_line() + # (this is a place-holder for your stat_summary)
geom_pointrange(data=dat2,
mapping=aes(x=x, # but x not being factored here
y=y, ymin=lower, ymax=upper))
However, if I remove the factoring of the initial x value, I get the line and the point ranges overlaid:
ggplot(dat1, aes(x, # no more factoring here
y, group=z, color=z)) +
geom_line() +
geom_pointrange(data=dat2,
mapping=aes(x=x, y=y, ymin=lower, ymax=upper))
Note that I still get the overlaid result if I factor both of the x-axes. The two simply have to be consistent.
Again, I can't stress enough how much it helps this entire process if you provide code we can copy/paste into an R session and see what you're seeing. Hopefully this helps you out, but it all goes more smoothly (and quickly) if you help us help you.
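On the request for a single line rather than a pointrange, here is a hedged sketch using toy data. pred and obs are made-up stand-ins for the as.data.frame(Effect(...)) output and the observed data, and I am assuming the fit/lower/upper column names used in the question. Mapping y=fit explicitly for the line and drawing the interval as a ribbon keeps the line on the predicted values, and x is numeric in every layer so the layers line up.

```r
library(ggplot2)

set.seed(1)

# Toy stand-in for the model predictions (hypothetical values)
pred <- data.frame(
  x   = rep(-4:0, times = 2),
  z   = rep(c("a", "b"), each = 5),
  fit = c(-0.2, -0.4, -0.6, -0.7, -0.9, 0.5, 0.45, 0.4, 0.2, 0.1)
)
pred$lower <- pred$fit - 0.3
pred$upper <- pred$fit + 0.3

# Toy observed data scattered around the predictions
obs <- data.frame(
  x = pred$x,
  z = pred$z,
  y = pred$fit + runif(nrow(pred), -0.2, 0.2)
)

# Only x, color and group are shared; each layer maps its own y aesthetics
p <- ggplot(mapping = aes(x = x, color = z, group = z)) +
  geom_point(data = obs, aes(y = y)) +
  geom_ribbon(data = pred, aes(ymin = lower, ymax = upper, fill = z),
              alpha = 0.2, color = NA) +
  geom_line(data = pred, aes(y = fit))
p
```

If the observed layer uses factor(x), the prediction layers need factor(x) as well, for the consistency reason described above.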