Creating bar charts with binary data - ggplot2

I have the following data , which I am trying to use to create a bar chart from to show how preference of fruit varies with country:
see data table here
I want to create a bar chart that shows preference of apples, oranges, grapes and bananas based on survey location (i.e x= surveyloc and Y = pref freq of oranges, apples, bananas). I am not quite sure how to do this when dealing with binary data and am hoping to get some assistance.

If you are looking to see preference for multiple variables (ex. fruits) across multiple locations (ex. locations), when only having binary data ("yes" or "no", or 0 vs 1), a bar chart is probably not the best option. My recommendation would be something like a tile plot so that you can convey at a glance preferences across the locations. Here's an example using some dummy data. I'll first show you an example of a bar plot (column plot), then the recommendation I have for you, which would be a tilemap.
Example Dataset
library(ggplot2)
library(dplyr)
library(tidyr)
set.seed(8675309)
df <- data.frame(
location = state.name[1:10],
apples = rbinom(10,1,0.3),
oranges = rbinom(10,1,0.1),
pears = rbinom(10,1,0.25),
grapes = rbinom(10,1,0.6),
mangos = rbinom(10,1,0.65)
)
# tidy data
df <- df %>% pivot_longer(cols = -location) %>%
mutate(value = factor(value))
I created df above initially in the same format you have for your dataset (location | pref1 | pref2 | pref3 | ...). It's difficult to use ggplot2 to plot this type of data easily, since it is designed to handle what is referred to as Tidy Data. This is overall a better strategy for data management and is adaptable to whatever output you wish - I'd recommend reading that vignette for more info. Needless to say, after the code above we have df formatted as a "tidy" table.
Note I've also turned the binary "value" column into a factor (since it only holds "0" or "1", and values of "0.5" and the like don't make sense here with this data).
"Bar Chart"
I put "bar chart" in quotes, because as we are plotting the value (0 or 1) on the y axis and location on the x axis, we are creating a "column chart". "Bar charts" formally only need a list of values and plot count, density, or probability on the y axis. Regardless, here's an example:
bar_plot <-
df %>%
ggplot(aes(x=location, y=value, fill=name)) +
geom_col(position="dodge", color='gray50', width=0.7) +
scale_fill_viridis_d()
bar_plot
We could think about just showing where value==1, but that's probably not going to make things clearer.
Example of Tile Plot
What I think works better here is a tilemap. The idea is that you spread location on the x axis and name (of the fruit) on the y axis, and then show the value field as the color of the resulting tiles. I think it makes things a bit easier to view, and it should work pretty much the same if your data is binary or probabilistic. For probability data, you just don't need to conver to a factor first.
tile_plot <-
df %>%
ggplot(aes(x=location, y=name, fill=value)) +
geom_tile(color='black') +
scale_fill_manual(values=c(`0`="gray90", `1`="skyblue")) +
coord_fixed() +
scale_x_discrete(expand=expansion(0)) +
scale_y_discrete(expand=expansion(0))
tile_plot
To explain a little what's going on here is that we setup the aesthetics as indicated above in ggplot(...). Then we draw the tiles with geom_tile(), where the color= represents the line around the tiles. The actual fill colors are described in scale_fill_manual(). The tiles are forced to be "sqare" via coord_fixed(), and then I remove excess area around the tiles via the scale_x_*() and scale_y_*() commands.

Related

How to organize facet_grid with too many panels

I am trying to plot gene expression values in a clinical trial. I have come to an scalar, represeting relative quantification, and I have 3 groups of intervention.I am trying to plot it in a bar graph (I am open to alternatives). My database is made up 55 genes to study in 151 samples.
The plot design is not pretty fancy, I would like to discriminate the groups by colours
ggplot(genes[time = 3], aes(grup_int, RQ)) +
stat_summary(fun.y = mean, geom='point') +
stat_summary(fun.data = 'mean_cl_boot', geom='errorbar', width=.25) +
facet_grid(.~gen)
As you see the resolution is low. I was wondering if there is other approach, maybe rearrange the plots or even divide the plot in 2 plots...
Thanks in advance!

How do I display all of my x axis labels in my bar plot in ggplot2?

So I have my graph made which is really great. I am now just trying to get each x=axis label to be there (patient column).Is there a way for me to do this in ggplot? (please note patient is just 1-67). I have tried a couple of other methods like lars or +theme(legend.key.size=unit(1,"in"),legend.title=element_text(size=22),legend.text = element_text(size=17))+axis(1,at=midpts,labels=names(Patient)) but neither have worked. Any advice is appreciated!
[![][1]][1]
ggplot(data=V5.ACE2.double.replacement.and.redo.of.AUC.calculation.CSV.file, mapping=aes(x=Patient,y=Fluorescent.sum.over.240.min,fill=Top.20.),las=2)+geom_bar(stat='identity')+theme(panel.grid.major=element_blank(),panel.grid.minor=element_blank(),panel.background=element_blank())+labs(x="\nPatient Sample")+labs(y="\nFluorescence sum over 240 min")+theme(axis.title.x=element_text(size=26))+theme(axis.title.y=element_text(size=26))+theme(axis.text=element_text(size=18))+theme(legend.key.size=unit(1,"in"),legend.title=element_text(size=22),legend.text = element_text(size=17))
Your x-axis is the numeric patient ID and it's getting configured as a continuous scale. It sounds like you want a categorical scale. Turn the Patient column into a factor with something like this:
library(tidyverse)
V5.ACE2.double.replacement.and.redo.of.AUC.calculation.CSV.file <- V5.ACE2.double.replacement.and.redo.of.AUC.calculation.CSV.file %>% mutate(Patient = as_factor(Patient))
And you'll get a categorical axis.

Changing variable labels/legend in raster plot to discrete characters

I have just made a plot using raster data that consists of 6 different land types and fit them to polygon vectors. I'm trying to change the values on the continuous scale bar (1-6) to the names of each landtype (e.g. grasslands, urban, etc) which is what each different colour represents. I have tried inserting breaks, however then each box in the legend contains labels (1-2, 2-3, 3-4 etc.)
Raster plot where each diff colour represents diff land type
This is my code:
rasterxpolygonplotcode
Example data
library(terra)
r <- rast(nrows=10, ncols=10)
values(r) <- sample(3, ncell(r), replace=TRUE)
cover <- c("forest", "water", "urban")
You can either do:
plot(r, type="classes", levels=cover)
Or first make the raster categorical
levels(r) <- data.frame(id=1:3, cover=c("forest", "water", "urban"))
plot(r)

Adjusting the y-axis in ggplot (bar size, ordering, formatting)

I have the following data.table:
I would like to have a plot which shows the columns symbol and value in a box plot. The boxes should be ordered by the column value.
My code, that I've tried:
plot1 <- ggplot(symbols, aes(symbol, value, fill = from)) +
geom_bar(stat = 'identity') +
ggtitle(paste0("Total quantity traded: ", format(sum(symbols$quantity), scientific = FALSE, nsmall = 2, big.mark = " "))) +
theme_bw()
plot1
This returns the following plot:
What I would like to change:
- flip x- and y-axis
- show the correct height of boxes (y-axis)...currently the relation between the boxes is not correct.
- decreasing order of the boxes by columns value
- format the y-axis with two digits
- make the x-axis readable...currently the x-axis is just a long bunch of what is written in column symbol.
Thanks upfront for the help!
To make things a bit easier, it is suggested that you post your data frame as the output of dput(your.data.frame), which presents code that can be used to replicate your dataset in r.
With that being said, I recreated your data (it was not too big)--some numbers were rounded a bit to make things easier.
A few comments:
y-axis numbers are odd: The numbers on your y-axis are not numeric. If you type str(your.data.frame) you'll probably notice that "value" is not numeric, but a character or factor. This can be easily remedied via: df$value <- as.numeric(df$value), where df is your dataframe.
flipping axis: You can use coord_flip() (typically added to the end of your ggplot call. Be warned that when you do this, your aesthetics flip for the plot, so just keep that in mind.
your dataframe name is also a function/data name in r: This may not be causing any issues (due to your environment), but just be aware to use caution to name your dataset to not have names that are used in r elsewhere. This goes for column/variable names too. I don't think it causes any issues here, but just an FYI
geom_col vs geom_bar: Check out this documentation link for some description on the differences between geom_bar and geom_col. Basically, you want to use geom_bar when your y-axis is count, and geom_col when your y-axis is a value. Here, you want to plot a value, so choose geom_col(), and not geom_bar().
Fixing the issues in plot
Here's the representation of your data (note I rounded... hopefully got the actual data correct, because I manually had to copy each value):
from symbol quantity usd value
1 BTC BTCUSDT 12910.470 6776.340 87485737
2 ETH ETHUSDT 6168.730 154.398 952445
3 BNB BNBUSDT 51002.650 14.764 753017
4 BNB BNBBTC 31071.280 14.764 458745
5 ETH ETHBTC 2216.576 154.398 342236
6 LTC LTCUSDT 4332.024 40.481 175368
7 BNB BNBETH 3150.030 14.764 46507
8 LTC LTCBTC 922.560 40.481 37346
9 LTC LTCBNB 521.476 40.481 21110
10 NEO NEOUSDT 2438.353 7.203 17564
11 NEO NEOBTC 417.930 7.203 3010
Here's the basic plot, flipped:
ggplot(df, aes(symbol, value, fill=from)) +
geom_col() +
coord_flip()
The problem here is that when you plot values... BTCUSDT is huge in comparison. I would suggest you plot on log of value. See this link for some advice on how to do that. I like the scale_y_log10() function, since it just works here pretty well:
ggplot(df, aes(symbol, value, fill=from)) +
geom_col() +
scale_y_log10() +
coord_flip()
If you wanted to keep the columns going in the vertical orientation, you can still do that and avoid having the text run into each other on the x-axis. In that case, you can rotate the labels via theme(axis.text.x=...). Note the adjustments to horizontal and vertical alignment (hjust=1), which forces the labels to be "right-aligned":
ggplot(df, aes(symbol, value, fill=from)) +
geom_col() +
scale_y_log10() +
theme(axis.text.x=element_text(angle=45, hjust=1))

Growth curves in R with standard deviation

I am trying to plot my data (replicate results for each strain) and i want only one line graph for each strain, this means averaged results of replicates for each strain with points along the line with error bars (error between replicate data).
If you click on the image above, it shows the plot i have so far, which displays WT and WT.1 as seperate lines and all other replicates. However, they are replicates of each strain (WT,DrsbR,DsigB) and i want them to appear as one line of mean results for each strain instead. I am using ggplot package- and melting data with reshape package, but cannot figure out how to make my replicates appear as one line together with error bars (standard deviation of mean results between replicates).
The image in black and white is something i am looking for in my graph- seperate line with points of replicate data plotted as a mean value.
library(reshape2)
melted<-melt(abs2)
print(abs2)
melted<-melt(abs2,id=1,measured=c("WT","WT.1","DsigB","DsigB.1","DrsbR","DrsbR.1"))
View(melted)
colnames(melted)<-c("Time","Strain","Values")
##line graph for melted data
melted$Time<-as.factor(melted$Time)
abs2line=ggplot(melted,aes(Time,Values))+geom_line(aes(colour=Strain,group=Strain))
abs2line+
stat_summary(fun=mean,
geom="point",
aes(group=Time))+
stat_summary(fun.data=mean_cl_boot,
geom="errorbar",
width=.2)+
xlab("Time")+
ylab("OD600")+
theme_classic()+
labs(title="Growth Curve of Mutant Strains")
summary(melted)
print(melted)
One approach is to take your melted data frame and separate out the "variable" column into "species" and "strain" using the separate() function from tidyr. I don't have your dataset -- it is appreciated if you are able to share your dataset via dput(your.data.frame) for future questions -- so I made a dummy dataset that's similar to yours. Here we have two "species" (red and blue) and two "strains" for each species.
df <- data.frame(
time = seq(0, 40, by=10),
blue = c(0:4),
blue.1 = c(0, 1.1, 1.9, 3.1, 4.1),
red = seq(0, 8, by=2),
red.1 = c(0, 2.1, 4.2, 5.5, 8.2)
)
df.melt <- melt(df,
id.vars = 'time',
measure.vars = c('blue', 'blue.1', 'red', 'red.1'))
We can then use tidyr::separate() to separate the resulting "variable" column into a "species" column and a "strain" column. Luckily, your data contains a "." which can be a handy character to use for the separation:
df.melt.mod <- df.melt %>%
separate(col=variable, into=c('species', 'strain'), sep='\\.')
Note: The above code will give you a warning related to the point that "blue" and "red" do not have the "." character, thereby giving you NA for the "strain" column. We don't care here, because we're not using that column for anything here. In your own dataset, you can similarly not care too much.
Then, you can actually just use stat_summary() for all geoms... modify as you see fit for your own visual and thematic preference. Note that order matters for layering, so I plot geom_line first, then geom_point, then geom_errorbar. Also note that you can assign the group=species aesthetic in the base ggplot() call and that mapping applies to all geoms unless overwritten.
ggplot(df.melt.mod, aes(x=time, y=value, group=species)) +
stat_summary(
fun = mean,
geom='line',
aes(color=species)) +
stat_summary(
fun=mean,
geom='point') +
stat_summary(
fun.data=mean_cl_boot,
geom='errorbar',
width=0.5) +
theme_bw()