plot gam results with original x values (not scaled and centred) - ggplot2

I have a dataset that I am modeling with a gam. Because there are two continuous varaibles in the gam, I have centred and scaled these variables before adding them to the model. Therefore, when I use the built-in features in gratia to show the results, the x values are not the same as the original scale. I'd like to plot the results using the scale of the original data.
An example:
library(tidyverse)
library(mgcv)
library(gratia)
set.seed(42)
df <- data.frame(
doy = sample.int(90, 300, replace = TRUE),
year = sample(c(1980:2020), size = 300, replace = TRUE),
site = c(rep("A", 150), rep("B", 80), rep("C", 70)),
sex = sample(c("F", "M"), size = 300, replace = TRUE),
mass = rnorm(300, mean = 500, sd = 50)) %>%
mutate(doy.s = scale(doy, center = TRUE, scale = TRUE),
year.s = scale(year, center = TRUE, scale = TRUE),
across(c(sex, site), as.factor))
m1 <- gam(mass ~
s(year.s, site, bs = "fs", by = sex, k = 5) +
s(doy.s, site, bs = "fs", by = sex, k = 5) +
s(sex, bs = "re"),
data = df, method = "REML", family = gaussian)
draw(m1)
How do I re-plot the last two panels in this figure to show the relationship between year and mass with ggplot?

You can't do this with gratia::draw automatically (unless I'm mistaken).* But you can use gratia::smooth_estimates to get a dataframe which you can then do whatever you like with.
To answer your specific question: to re-plot the last two panels of the plot you provided, but with year unscaled, you can do the following
# Get a tibble of smooth estimates from the model
sm <- gratia::smooth_estimates(m1)
# Add a new column for the unscaled year
sm <- sm %>% mutate(year = mean(df$year) + (year.s * sd(df$year)))
# Plot the smooth s(year.s,site) for sex=F with year unscaled
pF <- sm %>% filter(smooth == "s(year.s,site):sexF" ) %>%
ggplot(aes(x = year, y = est, color=site)) +
geom_line() +
theme(legend.position = "none") +
labs(y = "Partial effect", title = "s(year.s,site)", subtitle = "By: sex; F")
# Plot the smooth s(year.s,site) for sex=M with year unscaled
pM <- sm %>% filter(smooth == "s(year.s,site):sexM" ) %>%
ggplot(aes(x = year, y = est, color=site)) +
geom_line() +
theme(legend.position = "none") +
labs(y = "Partial effect", title = "s(year.s,site)", subtitle = "By: sex; M")
library(patchwork) # use `patchwork` just for easy side-by-side plots
pF + pM
to get:
EDIT: If you also want to shift result on the y-axis as #GavinSimpson (who is the author and maintainer of gratia) mentioned, you can do this with add_constant, adding this code before plotting above:
sm <- sm %>%
add_constant(coef(m1)["(Intercept)"]) %>%
transform_fun(inv_link(m1))
[You should also in general untransform the smooth by the inverse of the model's link function. In your case this is just the identity, so it is not necessary, but in general it would be. That's what the second step above is doing.]
In your example, this results in:
*As mentioned in the custom-plotting vignette for gratia, the goal of draw not to be fully customizable, but just to be useful default. See there for recommendations about custom plots.

Related

How to add count (n) / summary statistics as a label to ggplot2 boxplots?

I am new to R and trying to add count labels to my boxplots, so the sample size per boxplot shows in the graph.
This is my code:
bp_east_EC <-total %>% filter(year %in% c(1977, 2020, 2021, 1992),
sampletype == "groundwater",
East == 1,
#EB == 1,
#N59 == 1,
variable %in% c("EC_uS")) %>%
ggplot(.,aes(x = as.character(year), y = value, colour = as.factor(year))) +
theme_ipsum() +
ggtitle("Groundwater EC, eastern Curacao") +
theme(plot.title = element_text(hjust = 0.5, size=14)) +
theme(legend.position = "none") +
labs(x="", y="uS/cm") +
geom_jitter(color="grey", size=0.4, alpha=0.9) +
geom_boxplot() +
stat_summary(fun.y=mean, geom="point", shape=23, size=2) #shows mean
I have googled a lot and tried different things (with annotate, with return functions, mtext, etc), but it keeps giving different errors. I think I am such a beginner I cannot figure out how to integrate such suggestions into my own code.
Does anybody have an idea what the best way would be for me to approach this?
I would create a new variable that contained your sample sizes per group and plot that number with geom_label. I've generated an example of how to add count/sample sizes to a boxplot using the iris dataset since your example isn't fully reproducible.
library(tidyverse)
data(iris)
# boxplot with no label
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
geom_boxplot()
# boxplot with label
iris %>%
group_by(Species) %>%
mutate(count = n()) %>%
mutate(mean = mean(Sepal.Length)) %>%
ggplot(aes(x = Species, y = Sepal.Length, fill = Species)) +
geom_boxplot() +
geom_label(aes(label= count , y = mean + 0.75), # <- change this to move label up and down
size = 4, position = position_dodge(width = 0.75)) +
geom_jitter(alpha = 0.35, aes(color = Species)) +
stat_summary(fun = mean, geom = "point", shape = 23, size = 6)

ggplot facet different Y axis order based on value

I have a faceted plot wherein I'd like to have the Y-axis labels and the associated values appear in descending order of values (and thereby changing the order of the labels) for each facet. What I have is this, but the order of the labels (and the corresponding values) is the same for each facet.
ggplot(rf,
aes(x = revenues,
y = reorder(AgencyName, revenues))) +
geom_point(stat = "identity",
aes(color = AgencyName),
show.legend = FALSE) +
xlab(NULL) +
ylab(NULL) +
scale_x_continuous(label = scales::comma) +
facet_wrap(~year, ncol = 3, scales = "free_y") +
theme_minimal()
Can someone point me to the solution?
The functions reorder_within and scale_*_reordered from the tidytext package might come in handy.
reorder_within recodes the values into a factor with strings in the form of "VARIABLE___WITHIN". This factor is ordered by the values in each group of WITHIN.
scale_*_reordered removes the "___WITHIN" suffix when plotting the axis labels.
Add scales = "free_y" in facet_wrap to make it work as expected.
Here is an example with generated data:
library(tidyverse)
# Generate data
df <- expand.grid(
year = 2019:2021,
group = paste("Group", toupper(letters[1:8]))
)
set.seed(123)
df$value <- rnorm(nrow(df), mean = 10, sd = 2)
df %>%
mutate(group = tidytext::reorder_within(group, value, within = year)) %>%
ggplot(aes(value, group)) +
geom_point() +
tidytext::scale_y_reordered() +
facet_wrap(vars(year), scales = "free_y")

Add space argument to facet_wrap

facet_wrap() has been recognized for not having a space = "free" argument (https://github.com/tidyverse/ggplot2/issues/2933). This can causes spacing issues on the y-axis of plots.
Create the above figure using the following code:
library(tidyverse)
p <-
mtcars %>%
rownames_to_column() %>%
ggplot(aes(x = disp, y = rowname)) + geom_point() +
facet_wrap(~ carb, ncol = 1, scales = "free_y")
facet_grid on the other hand has a space = "free" argument. Allowing for nice y-axis spacing.
Create the above figure using the following code:
p <-
mtcars %>%
rownames_to_column() %>%
ggplot(aes(x = disp, y = rowname)) + geom_point() +
facet_grid(carb ~ ., scales = "free_y", space = "free_y")
The issue with this is that the label is on the side, not the top. I sometimes have longer facet labels and few rows in the facet. This means the facet label gets cut off.
There is a solution from the ggforce package (comment by ilarischeinin on https://github.com/tidyverse/ggplot2/issues/2933).
p <-
mtcars %>%
rownames_to_column() %>%
ggplot(aes(x = disp, y = rowname)) + geom_point()
p + ggforce::facet_col(vars(carb), scales = "free_y", space = "free")
But, there are limitations leaving ggplot2. For example, I ultimately want a two column figure, and this functionality does not seem possible with ggforce. Is there any way to produce the same result using facet_wrap() so that I can utilize the ncol() argument?
Here is a potential workaround based on https://stackoverflow.com/a/29022188/12957340 :
library(tidyverse)
library(gtable)
library(grid)
p1 <- mtcars %>%
rownames_to_column() %>%
ggplot(aes(x = disp, y = rowname)) + geom_point() +
facet_grid(carb ~ ., scales = "free_y", space = "free_y") +
theme(panel.spacing = unit(1, 'lines'),
strip.text.y = element_text(angle = 0))
gt <- ggplotGrob(p1)
panels <-c(subset(gt$layout, grepl("panel", gt$layout$name), se=t:r))
for(i in rev(panels$t-1)) {
gt = gtable_add_rows(gt, unit(0.5, "lines"), i)
}
panels <-c(subset(gt$layout, grepl("panel", gt$layout$name), se=t:r))
strips <- c(subset(gt$layout, grepl("strip-r", gt$layout$name), se=t:r))
stripText = gtable_filter(gt, "strip-r")
for(i in 1:length(strips$t)) {
gt = gtable_add_grob(gt, stripText$grobs[[i]]$grobs[[1]], t=panels$t[i]-1, l=5)
}
gt = gt[,-6]
for(i in panels$t) {
gt$heights[i-1] = unit(0.8, "lines")
gt$heights[i-2] = unit(0.2, "lines")
}
grid.newpage()
grid.draw(gt)
Created on 2021-12-15 by the reprex package (v2.0.1)
It's not clear to me what you mean by "I ultimately want a two column figure", but if you can come up with an example to illustrate your 'ultimate' expected outcome I can try to adapt this approach and see if it will work or not.

How to extract stat_smooth curve maxima in gpplot panel (facet_grid)?

I have created this plot with 18 grids using facet_grid command and two different fitting equations (for Jan - Apr, and May - Jun). I have two things that I need help with:
(may sound obvious, but) I haven't been able to find on the internet working codes extract a curve maximum for a stat_smooth fit. I'd appreciate if someone could show and explain what the codes mean. This is the closest I could find, but I am not sure what it means:
gb <- ggplot_build(p1)
curve_max <- gb$data[[1]]$x[which(diff(sign(diff(gb$data[[1]]$y)))==-2)+1]
How to add a vertical line to indicate max value on each curve?
Data file (rlc2 <- read_excel)
Plot
plot <- ggplot(rlc2, aes(par, etr, color=month, group=site))+
geom_point()+
stat_smooth(data = subset(rlc2, rlc2$month!="May" & rlc2$month!="Jun"),
method = "glm",
formula = y ~ x + log(x),
se = FALSE,
method.args = list(family = gaussian(link = "log"), start=c(a=0, b=0, c=0)))+
stat_smooth(data = subset(rlc2, rlc2$month=="May" | rlc2$month=="Jun"),
method = "nlsLM",
formula = y ~ M*(1 - exp(-(a*x))),
se = FALSE,
method.args = list(start=c(M=0, a=10)))+
facet_grid(rows = vars(month), cols = vars(site))
plot
field_rlc_plot
Any other advice are also welcome. I am educated as programmer so my codes are probably a bit messy. Thank you for helping.
Try this:
First, fit the data and extract the maximum of the fit.
my.fit <- function(month, site, data) {
fit <- glm(formula = etr ~ par + log(par),
data = data,
family=gaussian(link = "log")
)
#arrange the dersired output in a tibble
tibble(max = max(fit$fitted.values),
site = site,
month = month)
}
#Apply a custom function `my.fit` on each subset of data
#according to month and site using the group_by/nest/map method
# the results are rowbinded and returned in a data.frame
my.max<-
rlc2 %>%
dplyr::group_by(month, site) %>%
tidyr::nest() %>%
purrr::pmap_dfr(my.fit)
Next, join the results back on your data and plot a geom_line
rlc2 %>%
dplyr::left_join(my.max) %>%
ggplot(aes(x = par, y = etr))+
geom_point()+
stat_smooth(data = subset(rlc2, rlc2$month!="May" & rlc2$month!="Jun"),
method = "glm",
formula = y ~ x + log(x),
se = FALSE,
method.args = list(family = gaussian(link = "log"), start=c(a=0, b=0, c=0)))+
stat_smooth(data = subset(rlc2, rlc2$month=="May" | rlc2$month=="Jun"),
method = "nlsLM",
formula = y ~ M*(1 - exp(-(a*x))),
se = FALSE,
method.args = list(start=c(M=0, a=10)))+
geom_line(aes(y=max), col="red")+
facet_grid(rows = vars(month), cols = vars(site))

Depth Profiling visualization

I'm trying to create a depth profile graph with the variables depth, distance and temperature. The data collected is from 9 different points with known distances between them (distance 5m apart, 9 stations, 9 different sets of data). The temperature readings are according to these 9 stations where a sonde was dropped directly down, taking readings of temperature every 2 seconds. Max depth at each of the 9 stations were taken from the boat also.
So the data I have is:
Depth at each of the 9 stations (y axis)
Temperature readings at each of the 9 stations, at around .2m intervals vertical until the bottom was reached (fill area)
distance between the stations, (x axis)
Is it possible to create a depth profile similar to this? (obviously without the greater resolution in this graph)
I've already tried messing around with ggplot2 and raster but I just can't seem to figure out how to do this.
One of the problems I've come across is how to make ggplot2 distinguish between say 5m depth temperature reading at station 1 and 5m temperature reading at station 5 since they have the same depth value.
Even if you can guide me towards another program that would allow me to create a graph like this, that would be great
[ REVISION ]
(Please comment me if you know more suitable interpolation methods, especially not needing to cut under bottoms data.)
ggplot() needs long data form.
library(ggplot2)
# example data
max.depths <- c(1.1, 4, 4.7, 7.7, 8.2, 7.8, 10.7, 12.1, 14.3)
depth.list <- sapply(max.depths, function(x) seq(0, x, 0.2))
temp.list <- list()
set.seed(1); for(i in 1:9) temp.list[[i]] <- sapply(depth.list[[i]], function(x) rnorm(1, 20 - x*0.5, 0.2))
set.seed(1); dist <- c(0, sapply(seq(5, 40, 5), function(x) rnorm(1, x, 1)))
dist.list <- sapply(1:9, function(x) rep(dist[x], length(depth.list[[x]])))
main.df <- data.frame(dist = unlist(dist.list), depth = unlist(depth.list) * -1, temp = unlist(temp.list))
# a raw graph
ggplot(main.df, aes(x = dist, y = depth, z = temp)) +
geom_point(aes(colour = temp), size = 1) +
scale_colour_gradientn(colours = topo.colors(10))
# a relatively raw graph (don't run with this example data)
ggplot(main.df, aes(x = dist, y = depth, z = temp)) +
geom_raster(aes(fill = temp)) + # geom_contour() +
scale_fill_gradientn(colours = topo.colors(10))
If you want a graph such like you showed, you have to do interpolation. Some packages give you spatial interpolation methods. In this example, I used akima package but you should think seriously that which interpolation methods to use.
I used nx = 300 and ny = 300 in below code but I think it would be better to decide those values carefully. Large nx and ny gives a high resolution graph, but don't foreget real nx and ny (in this example, real nx is only 9 and ny is 101).
library(akima); library(dplyr)
interp.data <- interp(main.df$dist, main.df$depth, main.df$temp, nx = 300, ny = 300)
interp.df <- interp.data %>% interp2xyz() %>% as.data.frame()
names(interp.df) <- c("dist", "depth", "temp")
# draw interp.df
ggplot(interp.df, aes(x = dist, y = depth, z = temp)) +
geom_raster(aes(fill = temp)) + # geom_contour() +
scale_fill_gradientn(colours = topo.colors(10))
# to think appropriateness of interpolation (raw and interpolation data)
ggplot(interp.df, aes(x = dist, y = depth, z = temp)) +
geom_raster(aes(fill = temp), alpha = 0.3) + # interpolation
scale_fill_gradientn(colours = topo.colors(10)) +
geom_point(data = main.df, aes(colour = temp), size = 1) + # raw
scale_colour_gradientn(colours = topo.colors(10))
Bottoms don't match !!I found ?interp says "interpolation only within convex hull!", oops... I'm worrid about the interpolation around the problem-area, is it OK ? If no problem, you need only cut the data under the bottoms. If not, ... I can't answer immediately (below is an example code to cut).
bottoms <- max.depths * -1
# calculate bottom values using linear interpolation
approx.bottoms <- approx(dist, bottoms, n = 300) # n must be the same value as interp()'s nx
# change temp values under bottom into NA
library(dplyr)
interp.cut.df <- interp.df %>% cbind(bottoms = approx.bottoms$y) %>%
mutate(temp = ifelse(depth >= bottoms, temp, NA)) %>% select(-bottoms)
ggplot(interp.cut.df, aes(x = dist, y = depth, z = temp)) +
geom_raster(aes(fill = temp)) +
scale_fill_gradientn(colours = topo.colors(10)) +
geom_point(data = main.df, size = 1)
If you want to use stat_contour
It is harder to use stat_contour than geom_raster because it needs a regular grid form. As far as I see your graph, your data (depth and distance) don't form a regular grid, it means it is much difficult to use stat_contour with your raw data. So I used interp.cut.df to draw a contour plot. And stat_contour have a endemic problem (see How to fill in the contour fully using stat_contour), so you need to expand your data.
library(dplyr)
# 1st: change NA into a temp's out range value (I used 0)
interp.contour.df <- interp.cut.df
interp.contour.df[is.na(interp.contour.df)] <- 0
# 2nd: expand the df (It's a little complex, so please use this function)
contour.support.func <- function(df) {
colname <- names(df)
names(df) <- c("x", "y", "z")
Range <- as.data.frame(sapply(df, range))
Dim <- as.data.frame(t(sapply(df, function(x) length(unique(x)))))
arb_z = Range$z[1] - diff(Range$z)/20
df2 <- rbind(df,
expand.grid(x = c(Range$x[1] - diff(Range$x)/20, Range$x[2] + diff(Range$x)/20),
y = seq(Range$y[1], Range$y[2], length = Dim$y), z = arb_z),
expand.grid(x = seq(Range$x[1], Range$x[2], length = Dim$x),
y = c(Range$y[1] - diff(Range$y)/20, Range$y[2] + diff(Range$y)/20), z = arb_z))
names(df2) <- colname
return(df2)
}
interp.contour.df2 <- contour.support.func(interp.contour.df)
# 3rd: check the temp range (these values are used to define contour's border (breaks))
range(interp.cut.df$temp, na.rm=T) # 12.51622 20.18904
# 4th: draw ... the bottom border is dirty !!
ggplot(interp.contour.df2, aes(x = dist, y = depth, z = temp)) +
stat_contour(geom="polygon", breaks = seq(12.51622, 20.18904, length = 11), aes(fill = ..level..)) +
coord_cartesian(xlim = range(dist), ylim = range(bottoms), expand = F) + # cut expanded area
scale_fill_gradientn(colours = topo.colors(10)) # breaks's length is 11, so 10 colors are needed
# [Note]
# You can define the contour's border values (breaks) and colors.
contour.breaks <- c(12.5, 13.5, 14.5, 15.5, 16.5, 17.5, 18.5, 19.5, 20.5)
# = seq(12.5, 20.5, 1) or seq(12.5, 20.5, length = 9)
contour.colors <- c("darkblue", "cyan3", "cyan1", "green3", "green", "yellow2","pink", "darkred")
# breaks's length is 9, so 8 colors are needed.
# 5th: vanish the bottom border by bottom line
approx.df <- data.frame(dist = approx.bottoms$x, depth = approx.bottoms$y, temp = 0) # 0 is dummy value
ggplot(interp.contour.df2, aes(x = dist, y = depth, z = temp)) +
stat_contour(geom="polygon", breaks = contour.breaks, aes(fill = ..level..)) +
coord_cartesian(xlim=range(dist), ylim=range(bottoms), expand = F) +
scale_fill_gradientn(colours = contour.colors) +
geom_line(data = approx.df, lwd=1.5, color="gray50")
bonus: legend technic
library(dplyr)
interp.contour.df3 <- interp.contour.df2 %>% mutate(temp2 = cut(temp, breaks = contour.breaks))
interp.contour.df3$temp2 <- factor(interp.contour.df3$temp2, levels = rev(levels(interp.contour.df3$temp2)))
ggplot(interp.contour.df3, aes(x = dist, y = depth, z = temp)) +
stat_contour(geom="polygon", breaks = contour.breaks, aes(fill = ..level..)) +
coord_cartesian(xlim=range(dist), ylim=range(bottoms), expand = F) +
scale_fill_gradientn(colours = contour.colors, guide = F) + # add guide = F
geom_line(data = approx.df, lwd=1.5, color="gray50") +
geom_point(aes(colour = temp2), pch = 15, alpha = 0) + # add
guides(colour = guide_legend(override.aes = list(colour = rev(contour.colors), alpha = 1, cex = 5))) + # add
labs(colour = "temp") # add
You want to treat this as a 3-D surface with temperature as the z dimension. The given plot is a contour plot and it looks like ggplot2 can do that with stat_contour.
I'm not sure how the contour lines are computed (often it's linear interpolation along a Delaunay triangulation). If you want more control over how to interpolate between your x/y grid points, you can calculate a surface model first and feed those z coordinates into ggplot2.