plot only those histograms with n number of counts - ggplot2

My data looks like this. With this code I get lots of histograms for all my different Space.Groups.
massaged <- read.delim("~/Downloads/dummy.tsv")
library(tidyverse)
df <- na.omit(massaged)
ggplot() +
geom_histogram(data = df, mapping = aes(x = pH.Value)) +
facet_grid(. ~ Space.Group)
As you can see, most of my histograms contain only a few counts in total.
I am interested only in those histograms with a total number of counts greater than 100. For this particular plot I was trying to do something like
ggplot(data = filter(df, Uniprot.Recommended.Name == "Beta-2-microglobulin", Space.Group == %in% c("P 1 21 1", "P 1", "P 21 21 21", "C 1 2 1"))) +
geom_histogram(mapping = aes(x = pH.Value)) +
facet_grid(. ~ Space.Group)
This does not work, by the way, and it is not very automatic, because I have to make the first plot and then check by hand which Space.Groups are the interesting ones.
My question: is it possible to tell ggplot2 to plot only those histograms with a total number of counts greater than a given number (100, in this case)?
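One possible approach, sketched under assumptions: filter the data before it reaches ggplot2, keeping only the groups that have more than the threshold number of rows. The data frame below is a made-up stand-in (the real file is not shown), and `min_count`, `df_big`, and `p` are hypothetical names used only for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the real data: group "A" has 150 rows, "B" has 20
set.seed(1)
df <- data.frame(
  Space.Group = rep(c("A", "B"), times = c(150, 20)),
  pH.Value = runif(170, min = 4, max = 9)
)

min_count <- 100  # threshold from the question

# Keep only the groups that contain more than min_count rows
df_big <- df %>%
  group_by(Space.Group) %>%
  filter(n() > min_count) %>%
  ungroup()

# Only the surviving groups get a facet
p <- ggplot(df_big, aes(x = pH.Value)) +
  geom_histogram(bins = 30) +
  facet_grid(. ~ Space.Group)
```

Because the filtering happens on the data, no manual inspection of the first plot is needed; changing `min_count` is enough to change which facets appear.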

Related

Sorting a dataframe to create a "quilt" chart using geom_tile in ggplot2

I am looking to create a "quilt" chart that would show the best performing category in each year. This is a fairly common chart, but I have no idea how to produce this in R. Below is a simple reproducible example.
fruit <- c("Apples", "Oranges", "Pears", "Peaches")
yield2020 <- c("200", "100", "250", "125")
yield2021 <- c("40", "90", "85", "100")
yield2022 <- c("150", "110", "150", "170")
DF <- data.frame(fruit, yield2020, yield2021, yield2022)
As you can see in DF, each year, Apples, Oranges, Peaches, and Pears have different output levels. I'm looking to create a geom_tile chart that would show Pears as the top performer in 2020, Peaches in 2021, and Peaches again in 2022, with the other fruit groups shown, color coded, below it. Any advice is greatly appreciated!
I found the below example of a different quilt chart, which might help frame my ultimate goal here
One option to achieve your desired result would be to reshape your data to long and to add a column with the rank per year which could then be mapped on the y aes. Additionally I use scale_y_reverse to put the top performers on top of the chart:
library(dplyr)
library(tidyr)
library(ggplot2)
df <- DF |>
pivot_longer(-fruit, names_to = "year", names_prefix = "yield") |>
mutate(across(c(year, value), as.numeric)) |>
group_by(year) |>
mutate(rank = rank(-value, ties.method = "first"))
ggplot(df, aes(year, rank, fill = fruit)) +
geom_tile() +
geom_text(aes(label = paste(fruit, value, sep = "\n"))) +
scale_y_reverse() +
guides(fill = "none")

Draw ggplots in the plot grid by column - plot_grid() function of the cowplot package

I am using the plot_grid() function of the cowplot package to draw ggplots in a grid and would like to know if there is a way to draw plots by column instead of by row?
library(ggplot2)
library(cowplot)
df <- data.frame(
x = c(3,1,5,5,1),
y = c(2,4,6,4,2)
)
# Create plots: say two each of path plot and polygon plot
p <- ggplot(df)
p1 <- p + geom_path(aes(x,y)) + ggtitle("Path 1")
p2 <- p + geom_polygon(aes(x,y)) + ggtitle("Polygon 1")
p3 <- p + geom_path(aes(y,x)) + ggtitle("Path 2")
p4 <- p + geom_polygon(aes(y,x)) + ggtitle("Polygon 2")
plots <- list(p1,p2,p3,p4)
plot_grid(plotlist=plots, ncol=2) # plots are drawn by row
I would like to have plots p1 and p2 in the first column and p3 and p4 in the second column, something like:
plots <- list(p1, p3, p2, p4) # plot sequence changed
plot_grid(plotlist=plots, ncol=2)
In practice I could have 4, 6, or 8 plots. The number of rows in the plot grid will vary, but there will always be 2 columns. In each case I would like to fill the plot grid by column (vertically), so that my first 2, 3, or 4 plots, as the case may be, appear above one another. I would like to avoid hardcoding these different permutations if I can specify something like par(mfcol = c(n,2)).
As you have observed, plot_grid() draws plots by row. I don't believe there's any way to change that, so if you want to maintain using plot_grid() (which would be probably most convenient), then one approach could be to change the order of the items in your list of plots to match what you need for plot_grid(), given knowledge of the number of columns.
Here's a function I have written that does that. The basic idea is to:
create a list of indexes for number of items in your list (i.e. 1:length(your_list)),
put the index numbers into a matrix with the specified number of columns, filling it by column,
read that matrix back into another vector of indexes by row,
reorder your list according to the newly ordered indexes
I've tried to build in a way to make this work even if the number of items in your list is not divisible by the intended number of columns (like a list of 8 items arranged in 3 columns).
reorder_by_col <- function(myData, col_num) {
  n_rows <- ceiling(length(myData) / col_num)  # rows needed for col_num columns
  x <- seq_along(myData)                       # create index vector
  length(x) <- n_rows * col_num                # pads with NAs as necessary
  temp_matrix <- matrix(x, ncol = col_num, byrow = FALSE)  # fill indexes by column
  new_x <- as.vector(t(temp_matrix))           # read the indexes back by row
  return(myData[new_x])                        # NA indexes become NULL (empty) slots
}
This all was written with a little help from Google and specifically answers to questions posted here and here.
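The index shuffle behind this idea can be checked on plain integers before involving any plots. This is only a sketch: `n_items`, `n_cols`, `idx`, and `row_order` are hypothetical names, and the integers stand in for positions in the plot list.

```r
# Hypothetical example: 6 plots, 2 columns, using integers in place of plots
n_items <- 6
n_cols <- 2

idx <- matrix(seq_len(n_items), ncol = n_cols)  # filled by column: 1 2 3 | 4 5 6
row_order <- as.vector(t(idx))                  # read back row by row

row_order
```

A grid that is filled by row with this order puts items 1, 2, 3 down the first column and items 4, 5, 6 down the second.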
You can now see the difference without reordering:
plots <- list(p1,p2,p3,p4)
plot_grid(plotlist=plots, ncol=2)
... and with reordering using the new method:
newPlots <- reorder_by_col(myData=plots, col_num=2)
plot_grid(plotlist=newPlots, ncol=2)
The argument, byrow, has now been added to plot_grid.
In the case where you would like to have num_plots < nrow * ncol the remaining spots will be empty.
You can now call:
library(ggplot2)
df <- data.frame(
x = 1:10, y1 = 1:10, y2 = (1:10)^2, y3 = (1:10)^3, y4 = (1:10)^4
)
p1 <- ggplot(df, aes(x, y1)) + geom_point()
p2 <- ggplot(df, aes(x, y2)) + geom_point()
p3 <- ggplot(df, aes(x, y3)) + geom_point()
cowplot::plot_grid(p1, p2, p3, byrow = FALSE)

Grouping the factors in ggplot

I am trying to create a graph based on a matrix similar to the one below. I want to group the Erosion values by "Slope".
library(ggplot2)
new_mat<-matrix(,nrow = 135, ncol = 7)
colnames(new_mat)<-c("Scenario","Runoff (mm)","Erosion (t/ac)","Slope","Soil","Tillage","Rotation")
for ( i in 1:nrow(new_mat)){
new_mat[i,2]<-sample(10:50, 1)
new_mat[i,3]<-sample(0.1:20, 1)
new_mat[i,4]<-sample(c("S2","S3","S4","S5","S1"),1)
new_mat[i,5]<-sample(c("Deep","Moderate","Shallow"),1)
new_mat[i,7]<-sample(c("WBP","WBF","WF"),1)
new_mat[i,6]<-sample(c("Intense","Reduced","Notill"),1)
new_mat[i,1]<-paste0(new_mat[i,4],"_",new_mat[i,5],"_",new_mat[i,6],"_",new_mat[i,7],"_")
}
#### Graph part ########
grphs_mat<-as.data.frame(new_mat)
grphs_mat$`Runoff (mm)`<-as.numeric(as.character(grphs_mat$`Runoff (mm)`))
grphs_mat$`Erosion (t/ac)`<-as.numeric(as.character(grphs_mat$`Erosion (t/ac)`))
ggplot(grphs_mat, aes(Scenario, `Erosion (t/ac)`,group=Slope, colour = Slope))+
scale_y_continuous(limits=c(0,max(as.numeric((grphs_mat$`Erosion (t/ac)`)))))+
geom_point()+geom_line()
But when I run this code, the values are spread along the x-axis across all 135 scenarios. What I want is for the grouping to be done by slope, while the other common factors (Soil + Rotation + Tillage) are picked up and placed on the x-axis. For example:
For these five scenarios:
S1_Deep_Intense_WBF_
S2_Deep_Intense_WBF_
S3_Deep_Intense_WBF_
S4_Deep_Intense_WBF_
S5_Deep_Intense_WBF_
The plot should separate S1, S2, S3, S4, and S5, but also recognize that the other factors are the same and place them at a single x-axis position, so that the slope lines are stacked on top of each other across 135/5 = 27 x-axis points. The final figure should look like this (refer to the image). Apologies for not being able to explain it better.
I think I am making a mistake in grouping or in assigning the x-axis values.
I would appreciate your suggestions.
In the example you give, I didn't get every possible factor combination represented so the plots looked a bit weird. What I did instead was start with the following:
set.seed(42)
new_mat <- matrix(,nrow = 1000, ncol = 7)
I then deduplicated this by summarising the values. A possibly relevant step for your analysis is that I made a new variable with the interaction() function that combines three other factors.
library(tidyverse)
df <- grphs_mat
df$x <- with(df, interaction(Rotation, Soil, Tillage))
# The simulation did not yield unique combinations
df <- df %>% group_by(x, Slope) %>%
summarise(n = sum(`Erosion (t/ac)`))
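To see what interaction() produces, here is a small sketch on made-up values. The vectors `rotation`, `soil`, and `tillage` are hypothetical and only mirror the kind of factors used in the question.

```r
# Hypothetical factor values, two observations each
rotation <- c("WBP", "WF")
soil     <- c("Deep", "Shallow")
tillage  <- c("Notill", "Intense")

# One combined factor; labels are the input levels joined by "." (the default sep)
combo <- interaction(rotation, soil, tillage)
as.character(combo)

# By default every possible combination is kept as a level, observed or not;
# drop = TRUE keeps only the combinations that actually occur
nlevels(combo)
nlevels(interaction(rotation, soil, tillage, drop = TRUE))
```

The "." separator in the labels is why a later relabelling step can target dots to make the axis text more readable.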
Next, I plotted this new x variable on the x-axis and used "stack" positions for the lines and points.
g <- ggplot(df, aes(x, y = n, colour = Slope, group = Slope)) +
geom_line(position = "stack") +
geom_point(position = "stack")
To make the x-axis slightly more readable, you can replace the . that the interaction() function placed by newlines.
g + scale_x_discrete(labels = function(x){gsub("\\.", "\n", x)})
Another option is to simply rotate the x axis labels:
g + theme(axis.text.x.bottom = element_text(angle = 90))
There are a few additional options for the x-axis if you go into ggplot2 extension packages.

Plotly is not reading ggplot output well

I am using the following code to plot some data points, and it works well in ggplot. However, when I feed the plot into ggplotly, the visualization and the y-axis labels change completely: the y-axis labels shift to the right and get flipped, and the lines in the center get thinner.
Code
library(ggplot2)
library(tidyverse)
library(plotly)
file2 <- read.csv( text = RCurl::getURL("https://gist.githubusercontent.com/gireeshkbogu/806424c1777ff721a046b3e30e85af5a/raw/50ac0b4696f514677b4987b90305fdf879fbcd84/reproducible.examples.txt"), sep="\t")
p <- ggplot(data=subset(file2,!is.na(datetime)),
aes(x=datetime, y=Count,
color=Type,
group=Subject)) +
geom_point(size=4, alpha=0.6) +
scale_y_continuous(breaks=c(0,1))+
theme(axis.text.x=element_text(angle=90, size = 5))+
facet_grid(Subject ~ ., switch = "y") +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())+
theme(strip.text.y.left = element_text(angle = 0, size=5)) +
scale_color_manual(values=c("red", "#990000", "#330000", "#00CC99", "#0099FF"))
ggplotly(p)
Ggplot image
Ggplotly image
Reproducible Example
Subject datetime Type Count
user1 4/16/20 15:00 A1 1
user1 3/28/20 13:00 A1 1
user2 4/29/20 15:00 A1 1
user2 5/02/20 09:00 A1 1
user1 2/19/20 18:00 A2 1
user1 4/20/20 16:00 A2 1
Converting ggplot to plotly turns out to be surprisingly complicated! Many ggplot features are silently dropped or incorrectly translated over to plotly.
If I am not mistaken, switch = "y" within your facet_grid is being silently dropped.
In addition, you have too many facets in your plot. It looks like "Subject" is creating 30+ facets. I know that it is tempting to try to fit as much data as possible into one plot, but you are really pushing the limits of what you can do with facets here.
I made some modifications. See if this is something you can work with:
library(ggplot2)
library(tidyverse)
library(plotly)
library(RCurl)
# your original file
file2 <- read.csv( text = RCurl::getURL("https://gist.githubusercontent.com/gireeshkbogu/806424c1777ff721a046b3e30e85af5a/raw/50ac0b4696f514677b4987b90305fdf879fbcd84/reproducible.examples.txt"), sep="\t")
head(file2)
# scaling down the dataframe so that you have fewer facets per plot
file3 <- file2 %>%
as_tibble() %>%
na.omit() %>%
filter(Subject %in% c("user1", "user2", "user3", "user4")) %>% # lowercase, matching the sample rows in the question
arrange(Subject, datetime)
head(file3)
# sending the smaller data frame to ggplot
p_2 <- ggplot(data=file3,
aes(x=datetime, y=Count, color=Type, group=Subject)) +
geom_point(size=4, alpha=0.6) +
scale_y_continuous(breaks=c(0,1))+
theme(axis.text.x=element_text(angle=90, size = 5)) +
facet_grid(Subject ~ .) + # removing "Switch" ; it is being dropped by plotly
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
legend.position = "left") + # move legend to left on ggplot
theme(strip.text.y.left = element_text(angle = 0, size=5)) +
scale_color_manual(values=c("red", "#990000", "#330000", "#00CC99", "#0099FF"))
p_2
ggplotly(p_2) %>%
layout(title = "Modified & Scaled Down Plot",
legend = list(orientation = "v", # fine-tune legend directly in plotly,
y = 1, x = -0.1)) # you may need to fiddle with these
The modified code gives me this plot. You will probably need to split "Subject" into a few small groups and draw a separate plot for each group.

How to show higher values in ggplot2 within facet_grid

I have just found the function facet_grid in ggplot2, and it's awesome. My question: I have a list with 6 countries (column HC) and the destinations of flights all around the world. My data look like this:
HC Reason Destination freq Perc
<chr> <chr> <chr> <int> <dbl>
1 Germany Study Germany 9 0.3651116
2 Germany Work Germany 3 0.1488095
3 Germany Others Germany 3 0.4901961
4 Hungary Study Germany 105 21.4285714
5 Hungary Work Germany 118 17.6382661
6 Hungary Others Germany 24 5.0955414
7 Luxembourg Study Germany 362 31.5056571
Is there a way to show only the top ten destinations for each country while still using facet_grid? I'm trying to make a scatter plot this way:
Geograp %>%
gather(key=Destination, value=freq, -Reason, -Qcountry) %>%
rename(HC = Qcountry) %>%
group_by(HC,Reason) %>%
mutate(Perc=freq*100/sum(freq)) %>%
ggplot(aes(x=Perc, y=reorder(Destination,Perc))) +
geom_point(size=3) +
theme_bw() +
facet_grid(HC~Reason) +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(colour = "grey60", linetype = "dashed"))
This produces the following graph, in which I want to avoid the overplotting on the y-axis. Thanks in advance!
You could create a variable indicating the rank of each destination by country and then in the ggplot call select rows with ranking <= 10, e.g.
ggplot(data = mydata[mydata$rank <= 10, ], ....)
PS: Currently you create data and plot data all in one line using pipes. I would separate the data creation and plotting step.
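A minimal sketch of that rank-then-filter idea, assuming dplyr is available. The `flights` data frame is invented for illustration (it is not the asker's data), and the cutoff is 2 instead of 10 only to keep the example small.

```r
library(dplyr)

# Hypothetical data: two home countries, four destinations each
flights <- data.frame(
  HC = rep(c("Germany", "Hungary"), each = 4),
  Destination = rep(c("Spain", "Italy", "France", "Poland"), times = 2),
  freq = c(9, 3, 12, 5, 105, 118, 24, 60)
)

# Step 1: build the plotting data, ranking destinations within each country
top2 <- flights %>%
  group_by(HC) %>%
  mutate(rank = min_rank(desc(freq))) %>%  # rank 1 = highest freq in the country
  filter(rank <= 2) %>%
  ungroup()

top2
```

The filtered data frame can then be passed to ggplot() as a separate second step, which keeps data preparation and plotting apart.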
Since you have not posted your data in a reproducible format (check out dput()), I have used sample data. Using the dplyr package, I grouped by the grp variable (group_by(grp); in your case this is the country), selected the top 10 rows per group (top_n(n = 10, ...)) ranked by the x variable (wt = x; in your case this will be freq), and then plotted the result (a scatter plot in this case):
library(dplyr)
library(ggplot2)
set.seed(123)
d <- data.frame(x = runif(90),grp = gl(3, 30))
d %>%
group_by(grp) %>%
top_n(n = 10, wt = x) %>%
ggplot(aes(x=x, y=grp)) + geom_point()