Adjusting the y-axis in ggplot (bar size, ordering, formatting) - ggplot2

I have the following data.table:
I would like to have a plot which shows the columns symbol and value in a box plot. The boxes should be ordered by the column value.
My code, that I've tried:
plot1 <- ggplot(symbols, aes(symbol, value, fill = from)) +
geom_bar(stat = 'identity') +
ggtitle(paste0("Total quantity traded: ", format(sum(symbols$quantity), scientific = FALSE, nsmall = 2, big.mark = " "))) +
theme_bw()
plot1
This returns the following plot:
What I would like to change:
- flip x- and y-axis
- show the correct height of boxes (y-axis)...currently the relation between the boxes is not correct.
- decreasing order of the boxes by columns value
- format the y-axis with two digits
- make the x-axis readable...currently the x-axis is just a long bunch of what is written in column symbol.
Thanks upfront for the help!

To make things a bit easier, it is suggested that you post your data frame as the output of dput(your.data.frame), which presents code that can be used to replicate your dataset in r.
With that being said, I recreated your data (it was not too big)--some numbers were rounded a bit to make things easier.
A few comments:
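y-axis numbers are odd: The values on your y-axis are not being treated as numbers. If you run str(your.data.frame), you'll probably see that "value" is a character or factor rather than numeric. For a character column this is easily remedied via df$value <- as.numeric(df$value), where df is your data frame. For a factor, convert through character first with df$value <- as.numeric(as.character(df$value)), because as.numeric() on a factor returns the internal level codes rather than the original numbers.
flipping axis: You can use coord_flip() (typically added to the end of your ggplot call). Be warned that when you do this, the x and y aesthetics swap places on the plot, so just keep that in mind.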
your dataframe name is also a function/data name in r: This may not be causing any issues (due to your environment), but just be aware to use caution to name your dataset to not have names that are used in r elsewhere. This goes for column/variable names too. I don't think it causes any issues here, but just an FYI
geom_col vs geom_bar: Check out this documentation link for some description on the differences between geom_bar and geom_col. Basically, you want to use geom_bar when your y-axis is count, and geom_col when your y-axis is a value. Here, you want to plot a value, so choose geom_col(), and not geom_bar().
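As a small illustration of the conversion issue mentioned in the first comment, here is a minimal base R sketch. The vector `value` is made up for the example and simply stands in for a numeric column that was read in as a factor.

```r
# Hypothetical column: numbers that were read in as a factor
value <- factor(c("87485737", "952445", "753017"))

# as.numeric() on a factor returns the internal level codes, not the numbers
codes <- as.numeric(value)

# Converting via character first recovers the original numbers
numbers <- as.numeric(as.character(value))

print(codes)
print(numbers)
```

If `value` were a plain character vector, as.numeric(value) alone would be enough.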
Fixing the issues in plot
Here's the representation of your data (note I rounded... hopefully got the actual data correct, because I manually had to copy each value):
   from  symbol  quantity      usd    value
1   BTC BTCUSDT 12910.470 6776.340 87485737
2   ETH ETHUSDT  6168.730  154.398   952445
3   BNB BNBUSDT 51002.650   14.764   753017
4   BNB  BNBBTC 31071.280   14.764   458745
5   ETH  ETHBTC  2216.576  154.398   342236
6   LTC LTCUSDT  4332.024   40.481   175368
7   BNB  BNBETH  3150.030   14.764    46507
8   LTC  LTCBTC   922.560   40.481    37346
9   LTC  LTCBNB   521.476   40.481    21110
10  NEO NEOUSDT  2438.353    7.203    17564
11  NEO  NEOBTC   417.930    7.203     3010
Here's the basic plot, flipped:
ggplot(df, aes(symbol, value, fill=from)) +
geom_col() +
coord_flip()
The problem here is that when you plot values... BTCUSDT is huge in comparison. I would suggest you plot on log of value. See this link for some advice on how to do that. I like the scale_y_log10() function, since it just works here pretty well:
ggplot(df, aes(symbol, value, fill=from)) +
geom_col() +
scale_y_log10() +
coord_flip()
If you wanted to keep the columns going in the vertical orientation, you can still do that and avoid having the labels run into each other on the x-axis. In that case, you can rotate the labels via theme(axis.text.x=...). Note the adjustment to the horizontal alignment (hjust=1), which forces the rotated labels to be "right-aligned" so that they end at their tick marks:
ggplot(df, aes(symbol, value, fill=from)) +
geom_col() +
scale_y_log10() +
theme(axis.text.x=element_text(angle=45, hjust=1))
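The question also asked for the bars in decreasing order of value and for axis labels with two decimals. One possible way to get both, sketched below under a few assumptions: the data frame is rebuilt from the table above (only the columns the plot uses), reorder() sorts the symbols by value, and `two_digits` is a hypothetical helper written for this example, not a ggplot2 function.

```r
library(ggplot2)

# Data recreated from the table above (only the columns used in the plot)
df <- data.frame(
  from   = c("BTC", "ETH", "BNB", "BNB", "ETH", "LTC", "BNB", "LTC", "LTC", "NEO", "NEO"),
  symbol = c("BTCUSDT", "ETHUSDT", "BNBUSDT", "BNBBTC", "ETHBTC", "LTCUSDT",
             "BNBETH", "LTCBTC", "LTCBNB", "NEOUSDT", "NEOBTC"),
  value  = c(87485737, 952445, 753017, 458745, 342236, 175368,
             46507, 37346, 21110, 17564, 3010)
)

# Hypothetical helper: two decimals, space as thousands separator
two_digits <- function(x) {
  format(x, nsmall = 2, big.mark = " ", scientific = FALSE, trim = TRUE)
}

# reorder() turns symbol into a factor whose levels are sorted by value
p <- ggplot(df, aes(reorder(symbol, value), value, fill = from)) +
  geom_col() +
  scale_y_log10(labels = two_digits) +
  coord_flip() +
  labs(x = "symbol")
p
```

With coord_flip(), the smallest value ends up at the bottom and the largest at the top, so the bars read in decreasing order from top to bottom. Using reorder(symbol, -value) would reverse that.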

Related

create separate legend for each row of facet_grid

I am using nested_facet() to plot a large number of experiments, yielding a 9x6 array. Each panel in the array is a stacked bargraph with variables indicated by color, same set of variables common to each row.
My code...
ggplot(data, aes(x=enzyme_drug, y=counts, fill = species)) +
geom_bar(position = "fill", stat = "identity") +
theme(axis.text.x = element_text(angle = 90)) +
guides(fill=guide_legend(ncol=1)) +
facet_grid(pH ~ combo, scales = "free")
The only problem with putting them all together into a facet grid is that the number of variables indicated by color in the figure legend is too high, resulting in adjacent colors that are hard to differentiate.
An easy way out of this would be to create a separate legend for each row, with the small number of variables for each row being much easier to differentiate.
Any ideas on how to do this?
The other alternative, I realise, is to loop the creation of separate facet_grids into a list and then put them together with ggarrange(). This yields unwieldy x-axis labels repeated for each row. I could remove the x-axes manually for each row in the arrangement, but surely there is a simpler method?
Thanks in advance...

why is ggplot2 geom_col misreading discrete x axis labels as continuous?

Aim: plot a column chart representing concentration values at discrete sites
Problem: the 14 site labels are numeric, so I think ggplot2 is assuming continuous data and adding spaces for what it sees as 'missing numbers'. I only want 14 columns with 14 marks/labels, matching the 14 values in the dataframe. I've tried converting the sites to factors and to characters, but neither works.
Also, how do you ensure the y-axis ends at '0', so the bottom of the columns meet the x-axis?
Thanks
Data:
Sites: 2,4,6,7,8,9,10,11,12,13,14,15,16,17
Concentration: 10,16,3,15,17,10,11,19,14,12,14,13,18,16
You have two questions in one with two pretty straightforward answers:
1. How to force a discrete axis when your column is a continuous one? To make ggplot2 draw a discrete axis, the data must be discrete. You can force your numeric data to be discrete by converting to a factor. So, instead of x=Sites in your plot code, use x=as.factor(Sites).
2. How to eliminate the white space below the columns in a column plot? You can control the limits of the y axis via the scale_y_continuous() function. By default, the limits extend a bit past the actual data (in this case, from 0 to the max Concentration). You can override that behavior via the expand= argument. Check the documentation for expansion() for more details, but here I'm going to use mult=, which expands each end of the axis by a multiple of the data range. I'm using 0 for the lower end so that the axis starts exactly at the bottom of the columns (0), and 0.05 for the upper end to extend the axis about 5% past the max value (the default for continuous scales).
Here's the code and resulting plot.
library(ggplot2)
df <- data.frame(
Sites = c(2,4,6,7,8,9,10,11,12,13,14,15,16,17),
Concentration = c(10,16,3,15,17,10,11,19,14,12,14,13,18,16)
)
ggplot(df, aes(x=as.factor(Sites), y=Concentration)) +
geom_col(color="black", fill="lightblue") +
scale_y_continuous(expand=expansion(mult=c(0, 0.05))) +
theme_bw()

Creating bar charts with binary data

I have the following data, which I am trying to use to create a bar chart showing how fruit preference varies with country:
see data table here
I want to create a bar chart that shows preference of apples, oranges, grapes and bananas based on survey location (i.e x= surveyloc and Y = pref freq of oranges, apples, bananas). I am not quite sure how to do this when dealing with binary data and am hoping to get some assistance.
If you are looking to see preference for multiple variables (e.g. fruits) across multiple locations, when you only have binary data ("yes" or "no", or 0 vs 1), a bar chart is probably not the best option. My recommendation would be something like a tile plot, so that you can convey preferences across the locations at a glance. Here's an example using some dummy data. I'll first show you an example of a bar plot (column plot), then the recommendation I have for you, which would be a tilemap.
Example Dataset
library(ggplot2)
library(dplyr)
library(tidyr)
set.seed(8675309)
df <- data.frame(
location = state.name[1:10],
apples = rbinom(10,1,0.3),
oranges = rbinom(10,1,0.1),
pears = rbinom(10,1,0.25),
grapes = rbinom(10,1,0.6),
mangos = rbinom(10,1,0.65)
)
# tidy data
df <- df %>% pivot_longer(cols = -location) %>%
mutate(value = factor(value))
I created df above initially in the same format you have for your dataset (location | pref1 | pref2 | pref3 | ...). It's difficult to use ggplot2 to plot this type of data easily, since it is designed to handle what is referred to as Tidy Data. This is overall a better strategy for data management and is adaptable to whatever output you wish - I'd recommend reading that vignette for more info. Needless to say, after the code above we have df formatted as a "tidy" table.
Note I've also turned the binary "value" column into a factor (since it only holds "0" or "1", and values of "0.5" and the like don't make sense here with this data).
"Bar Chart"
I put "bar chart" in quotes, because as we are plotting the value (0 or 1) on the y axis and location on the x axis, we are creating a "column chart". "Bar charts" formally only need a list of values and plot count, density, or probability on the y axis. Regardless, here's an example:
bar_plot <-
df %>%
ggplot(aes(x=location, y=value, fill=name)) +
geom_col(position="dodge", color='gray50', width=0.7) +
scale_fill_viridis_d()
bar_plot
We could think about just showing where value==1, but that's probably not going to make things clearer.
Example of Tile Plot
What I think works better here is a tilemap. The idea is that you spread location on the x axis and name (of the fruit) on the y axis, and then show the value field as the color of the resulting tiles. I think it makes things a bit easier to view, and it should work pretty much the same whether your data is binary or probabilistic. For probability data, you just don't need to convert to a factor first (and you would use a continuous fill scale instead of scale_fill_manual()).
tile_plot <-
df %>%
ggplot(aes(x=location, y=name, fill=value)) +
geom_tile(color='black') +
scale_fill_manual(values=c(`0`="gray90", `1`="skyblue")) +
coord_fixed() +
scale_x_discrete(expand=expansion(0)) +
scale_y_discrete(expand=expansion(0))
tile_plot
To explain a little of what's going on here: we set up the aesthetics as indicated above in ggplot(...). Then we draw the tiles with geom_tile(), where color= sets the outline drawn around each tile. The actual fill colors are defined in scale_fill_manual(). The tiles are forced to be "square" via coord_fixed(), and then I remove the excess area around the tiles via the scale_x_*() and scale_y_*() commands.
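If the real survey has several respondents per location (an assumption, since the data table is not reproduced here), a preference frequency can be computed first and then plotted. The sketch below uses base R only; the data frame `survey` and its values are invented for illustration, with `surveyloc` taken from the column name mentioned in the question.

```r
# Hypothetical survey: one row per respondent, 1 = likes the fruit, 0 = does not
survey <- data.frame(
  surveyloc = c("UK", "UK", "UK", "France", "France"),
  apples    = c(1, 0, 1, 1, 1),
  oranges   = c(0, 0, 1, 1, 0)
)

# The mean of a 0/1 column is the share of respondents who answered 1
pref_freq <- aggregate(cbind(apples, oranges) ~ surveyloc, data = survey, FUN = mean)
print(pref_freq)
```

The resulting table has one row per location and one frequency column per fruit. After reshaping it with pivot_longer(), the frequencies are continuous values, so either a dodged geom_col() or a tile plot with a continuous fill would be a reasonable way to show them.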

Growth curves in R with standard deviation

I am trying to plot my data (replicate results for each strain) and I want only one line for each strain. That is, I want the averaged results of the replicates for each strain, with points along the line and error bars (the error between replicates).
If you click on the image above, it shows the plot I have so far, which displays WT and WT.1, and all the other replicates, as separate lines. However, they are replicates of each strain (WT, DrsbR, DsigB) and I want them to appear as one line of mean results for each strain instead. I am using the ggplot2 package and melting the data with the reshape2 package, but cannot figure out how to make my replicates appear as one line together with error bars (standard deviation of the mean results between replicates).
The image in black and white is what I am looking for in my graph: a separate line for each strain, with the replicate data plotted as points at their mean value.
library(ggplot2)
library(reshape2)
print(abs2)
# first column is the time; the remaining columns are the replicate measurements
melted<-melt(abs2,id.vars=1,measure.vars=c("WT","WT.1","DsigB","DsigB.1","DrsbR","DrsbR.1"))
View(melted)
colnames(melted)<-c("Time","Strain","Values")
##line graph for melted data
melted$Time<-as.factor(melted$Time)
abs2line=ggplot(melted,aes(Time,Values))+geom_line(aes(colour=Strain,group=Strain))
abs2line+
stat_summary(fun=mean,
geom="point",
aes(group=Time))+
stat_summary(fun.data=mean_cl_boot,
geom="errorbar",
width=.2)+
xlab("Time")+
ylab("OD600")+
theme_classic()+
labs(title="Growth Curve of Mutant Strains")
summary(melted)
print(melted)
One approach is to take your melted data frame and separate out the "variable" column into "species" and "strain" using the separate() function from tidyr. I don't have your dataset -- it is appreciated if you are able to share your dataset via dput(your.data.frame) for future questions -- so I made a dummy dataset that's similar to yours. Here we have two "species" (red and blue) and two "strains" for each species.
library(ggplot2)
library(reshape2)  # melt()
library(tidyr)     # separate()
library(dplyr)     # %>%
df <- data.frame(
time = seq(0, 40, by=10),
blue = c(0:4),
blue.1 = c(0, 1.1, 1.9, 3.1, 4.1),
red = seq(0, 8, by=2),
red.1 = c(0, 2.1, 4.2, 5.5, 8.2)
)
df.melt <- melt(df,
id.vars = 'time',
measure.vars = c('blue', 'blue.1', 'red', 'red.1'))
We can then use tidyr::separate() to separate the resulting "variable" column into a "species" column and a "strain" column. Luckily, your data contains a "." which can be a handy character to use for the separation:
df.melt.mod <- df.melt %>%
separate(col=variable, into=c('species', 'strain'), sep='\\.')
Note: The above code will give you a warning related to the point that "blue" and "red" do not have the "." character, thereby giving you NA for the "strain" column. We don't care here, because we're not using that column for anything here. In your own dataset, you can similarly not care too much.
Then, you can actually just use stat_summary() for all geoms... modify as you see fit for your own visual and thematic preference. Note that order matters for layering, so I plot geom_line first, then geom_point, then geom_errorbar. Also note that you can assign the group=species aesthetic in the base ggplot() call and that mapping applies to all geoms unless overwritten.
ggplot(df.melt.mod, aes(x=time, y=value, group=species)) +
stat_summary(
fun = mean,
geom='line',
aes(color=species)) +
stat_summary(
fun=mean,
geom='point') +
stat_summary(
fun.data=mean_cl_boot,
geom='errorbar',
width=0.5) +
theme_bw()
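Note that mean_cl_boot draws bootstrap confidence limits (and needs the Hmisc package installed), whereas the question asked for the standard deviation between replicates. A minimal sketch of one way to get mean ± one standard deviation is to pass your own summary function to fun.data=. Here `mean_sd` is a hypothetical helper written for this example, and `growth` is dummy data in the same long format as df.melt.mod.

```r
library(ggplot2)

# Hypothetical helper: mean and mean +/- one standard deviation for one group
mean_sd <- function(v) {
  data.frame(y = mean(v), ymin = mean(v) - sd(v), ymax = mean(v) + sd(v))
}

# Dummy long-format data: two replicates per species at each time point
growth <- data.frame(
  time    = rep(seq(0, 40, by = 10), times = 4),
  species = rep(c("blue", "blue", "red", "red"), each = 5),
  value   = c(0:4, 0, 1.1, 1.9, 3.1, 4.1, seq(0, 8, by = 2), 0, 2.1, 4.2, 5.5, 8.2)
)

p <- ggplot(growth, aes(x = time, y = value, group = species, color = species)) +
  stat_summary(fun = mean, geom = "line") +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_sd, geom = "errorbar", width = 0.5) +
  theme_bw()
p
```

stat_summary() calls the function once per group at each x value, so each error bar here is computed from the two replicates of one species at one time point.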

plotting matrices with gnuplot

I am trying to plot a matrix in Gnuplot as I would using imshow in Matplotlib. That means I just want to plot the actual matrix values, not the interpolation between values. I have been able to do this by trying
splot "file.dat" u 1:2:3 ps 5 pt 5 palette
This way we are telling the program to use columns 1,2 and 3 in the file, use squares of size 5 and space the points with very narrow gaps. However the points in my dataset are not evenly spaced and hence I get discontinuities.
Does anyone know a method of plotting matrix values in gnuplot even when the points are not evenly spaced along the x and y axes?
Gnuplot doesn't need to have evenly spaced X and Y axes (see another one of my answers: https://stackoverflow.com/a/10690041/748858). I frequently deal with grids that look like x[i] = f_x(i) and y[j] = f_y(j). This is quite trivial to plot; the datafile just looks like:
#datafile.dat
x1 y1 z11
x1 y2 z12
...
x1 yN z1N
#<--- blank line (leave these comments out of your datafile ;)
x2 y1 z21
x2 y2 z22
...
x2 yN z2N
#<--- blank line
...
...
#<--- blank line
xN y1 zN1
...
xN yN zNN
(note the blank lines)
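If the matrix lives in R (the language used elsewhere on this page), a file in this layout could be produced with something like the sketch below. `write_gnuplot_grid` is a hypothetical helper written for this example, and the grid values are made up. It assumes z[i, j] is the value at (x[i], y[j]).

```r
# Hypothetical helper: write z[i, j] at (x[i], y[j]) as blank-line-separated blocks
write_gnuplot_grid <- function(x, y, z, path) {
  stopifnot(nrow(z) == length(x), ncol(z) == length(y))
  con <- file(path, open = "w")
  on.exit(close(con))
  for (i in seq_along(x)) {
    # One block per x value: "x y z" on each line
    writeLines(paste(x[i], y, z[i, ]), con)
    # A blank line closes the block
    writeLines("", con)
  }
  invisible(path)
}

# Unevenly spaced grid, made up for illustration
x <- c(0, 1, 3)
y <- c(0, 0.5, 2, 5)
z <- outer(x, y, function(a, b) a + b)

out <- write_gnuplot_grid(x, y, z, tempfile(fileext = ".dat"))
lines_written <- readLines(out)
```

Each of the three x values produces a block of four data lines followed by one blank line.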
A datafile like that can be plotted as:
set view map
splot "datafile.dat" u 1:2:3 w pm3d
The option set pm3d corners2color can be used to fine-tune which corner's value is used to color each rectangle.
Also note that you could make essentially the same plot doing this:
set view map
plot "datafile.dat" u 1:2:3 w image
Although I don't use this one myself, so it might fail with a non-equally spaced rectangular grid (you'll need to try it).
Response to your comment
Yes, pm3d does generate (M-1)x(N-1) quadrilaterals as you've alluded to in your comment -- It takes the 4 corners and (by default) averages their value to assign a color. You seem to dislike this -- although (in most cases) I doubt you'd be able to tell a difference in the plot for reasonably large M and N (larger than 20). So, before we go on, you may want to ask yourself if it is really necessary to plot EVERY POINT.
That being said, with a little work, gnuplot can still do what you want. The solution is to specify that a particular corner is to be used to assign the color to the entire quadrilateral.
#specify that the first corner should be used for coloring the quadrilateral
set pm3d corners2color c1 #could also be c2,c3, or c4.
Then simply append the last row and last column of your matrix so that they are plotted twice (making up an extra gridpoint to accommodate the larger dataset). You're not quite there yet: you still need to shift your grid values by half a cell so that your quadrilaterals are centered on the point in question. Which way you shift the cells depends on your choice of corner (c1, c2, c3, c4), so you'll need to play around with it to figure out which one you want.
Note that the problem here isn't gnuplot. It's that there isn't enough information in the datafile to construct an MxN surface given MxN triples. At each point, you need to know its position (x,y), its value (z), and also the size of the quadrilateral to be drawn there, which is more information than you've packed into the file. Of course, you can guess the size at the interior points (just meet halfway), but there's no guessing at the exterior points. "But why not just use the size of the next interior point?" That's a good question, and it would (typically) work well for rectangular grids, but that is only a special case (although a common one), and it would (likely) fail miserably for many other grids. The point is that gnuplot decided that averaging the corners is typically "close enough", but then gives you the option to change it.
See the explanation for the input data here. You may have to change your data file's format accordingly.