How to plot confidence intervals for glm models (gamma family)? - ggplot2

I would like to plot the fitted line and its 95% confidence interval from a glm model (Gamma family). For linear models, I have previously plotted confidence intervals with polygons, because the predictions included the fit together with the lower and upper limits. I do not know how to do it here, as the predictions do not include upper and lower limits. I have also tried ggplot, but there the smoothing seems to flatten the curve. Thanks in advance for any help. See code:
library(ggplot2)
# Data
dat <- data.frame(c(45,75,85,2,14,45,45,45,45,45,55,55,65,85,15,15,315,3,40,85,125,115,13,105,
145,125,145,125,205,125,155,125,19,17,145,14,85,65,135,45,40,15,14,10,15,10,10,45,37,30),
c(1.928607e-01, 3.038813e-01, 8.041174e-02, 0.000000e+00, 1.017541e-02, 1.658876e-01, 2.084661e-01,
1.891305e-01, 2.657766e-01, 1.270864e-01, 1.720141e-01, 1.644947e-01, 7.038978e-02, 3.046604e-01,
3.111646e-02, 9.443539e-04, 3.590906e-02, 0.000000e+00, 2.384494e-01, 5.955332e-02, 7.703567e-02,
5.524471e-02, 9.915716e-04, 1.169936e-01, 1.409448e-01, 1.411809e-01, 1.025096e-01, 2.649503e-01,
6.309465e-02, 3.727837e-02, 8.855679e-02, 1.707864e-01, 1.714002e-02, 1.038789e-03, 1.208065e-01,
3.541327e-04, 7.492268e-02, 9.633591e-02, 7.414359e-02, 2.235050e-01, 1.489010e-01, 2.478929e-03,
2.573364e-03, 5.430035e-04, 1.719905e-02, 1.243006e-02, 6.822957e-03, 1.927544e-01, 1.146918e-01, 9.030385e-03))
colnames(dat) <- c("age", "wood")
# Model
model<- glm(wood+0.001 ~ log(age) + I(log(age)^2), data=dat, family = Gamma)
summary(model)
p<-predict(model, data.frame(age=1:200), interval="confidence", level=.95)
p.tr <- 1/p # inverse link according to ?glm
# Plot
plot(1:200, p.tr, type="n", ylim = c(0,.4), xlab="Forest age",
ylab="Proportion",
main="Wood production", yaxt="n")
axis(2, las=2)
lines(1:200, p.tr, ylim=range(p.tr), lwd=2, col=rgb(0, .4, 1))
# How can I add to this plot the 95% confidence intervals of the model?
# Ggplot
# I use this function because there was a warning of "Ignoring unknown parameters: family" and this solves that
binomial_smooth <- function(...) {
geom_smooth(method = "glm", method.args = list(family = "binomial"),formula=y~log(x)+I(log(x)^2),se=FALSE)
}
ggplot(dat, aes(x=age,
y=wood+0.001)) +
binomial_smooth() +
xlab("Forest age") +
ylab("Proportion") +
ggtitle("Wood production") +
xlim(0, 200) +
ylim(0,0.4) +
theme_bw() +
theme (plot.title = element_text(hjust = 0.5), legend.position = "none")
# Why do I get this warning (Warning: In eval(family$initialize) : non-integer #successes in a binomial glm!)?
# Why is the curve smoother here?

I'm not a mathematician or statistician, but I would guess that family = binomial gives you inappropriate estimates because it is not the right distribution here: wood is a continuous proportion, not a count of successes out of a number of trials. That is also what the "non-integer #successes" warning is telling you.
About the confidence intervals:
I used stat_smooth(), see below. It should behave the same as geom_smooth(), though.
dat <- data.frame(c(45,75,85,2,14,45,45,45,45,45,55,55,65,85,15,15,315,3,40,85,125,115,13,105,
145,125,145,125,205,125,155,125,19,17,145,14,85,65,135,45,40,15,14,10,15,10,10,45,37,30),
c(1.928607e-01, 3.038813e-01, 8.041174e-02, 0.000000e+00, 1.017541e-02, 1.658876e-01, 2.084661e-01,
1.891305e-01, 2.657766e-01, 1.270864e-01, 1.720141e-01, 1.644947e-01, 7.038978e-02, 3.046604e-01,
3.111646e-02, 9.443539e-04, 3.590906e-02, 0.000000e+00, 2.384494e-01, 5.955332e-02, 7.703567e-02,
5.524471e-02, 9.915716e-04, 1.169936e-01, 1.409448e-01, 1.411809e-01, 1.025096e-01, 2.649503e-01,
6.309465e-02, 3.727837e-02, 8.855679e-02, 1.707864e-01, 1.714002e-02, 1.038789e-03, 1.208065e-01,
3.541327e-04, 7.492268e-02, 9.633591e-02, 7.414359e-02, 2.235050e-01, 1.489010e-01, 2.478929e-03,
2.573364e-03, 5.430035e-04, 1.719905e-02, 1.243006e-02, 6.822957e-03, 1.927544e-01, 1.146918e-01,
9.030385e-03))
colnames(dat) <- c("age", "wood")
model<- glm(wood+0.001 ~ log(age) + I(log(age)^2), data=dat, family = Gamma)
#summary(model)
p <- predict(model, data.frame(age=1:200)) # link scale; predict.glm ignores interval= and level=
p.tr <- 1/p # inverse link according to ?glm
prediction <- data.frame(age = as.numeric(names(p)), wood = p.tr)
ggplot(data = dat, aes(x = age, y = wood)) +
geom_point() +
geom_line(data= prediction) +
stat_smooth(data = dat, method = "glm",
formula = y+0.001 ~ log(x) + I(log(x)^2),
method.args = list(family = Gamma)) # method.args expects a list
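On the confidence-interval part of the question: predict() for a glm has no interval argument, so it returns only the fit. A common approach is to ask for standard errors on the link scale (se.fit = TRUE), build the band there, and back-transform with the family's inverse link. The following is only a minimal sketch on simulated data. The data frame sim, the names newdat, lwr and upr, and the normal quantile 1.96 are assumptions for illustration and not part of the original post. Using dat and model from the question instead should work the same way, provided the band on the link scale stays away from zero, where the inverse link 1/eta blows up.

```r
library(ggplot2)

# Simulated data (an assumption for illustration): Gamma response whose
# inverse mean is linear in x, matching Gamma's default "inverse" link
set.seed(1)
sim <- data.frame(x = 1:50)
mu <- 1 / (0.5 + 0.1 * sim$x)
sim$y <- rgamma(nrow(sim), shape = 20, rate = 20 / mu)

fit <- glm(y ~ x, data = sim, family = Gamma)

# Predict on the link scale and ask for standard errors
newdat <- data.frame(x = seq(1, 50, length.out = 200))
pr <- predict(fit, newdata = newdat, type = "link", se.fit = TRUE)

# Approximate 95% band on the link scale, then back-transform
ilink <- family(fit)$linkinv
lo <- ilink(pr$fit - 1.96 * pr$se.fit)
hi <- ilink(pr$fit + 1.96 * pr$se.fit)
newdat$fit <- ilink(pr$fit)
# The inverse link is decreasing, so the bounds swap; pmin/pmax sorts them
newdat$lwr <- pmin(lo, hi)
newdat$upr <- pmax(lo, hi)

p <- ggplot(newdat, aes(x, fit)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "lightblue") +
  geom_line() +
  geom_point(data = sim, aes(x, y), shape = 1)
```

Printing p draws the fitted curve with its band over the simulated points. Building the band on the link scale first is what keeps it asymmetric on the response scale.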

plot gam results with original x values (not scaled and centred)

I have a dataset that I am modeling with a gam. Because there are two continuous variables in the gam, I have centred and scaled these variables before adding them to the model. Therefore, when I use the built-in features in gratia to show the results, the x values are not on the original scale. I'd like to plot the results using the scale of the original data.
An example:
library(tidyverse)
library(mgcv)
library(gratia)
set.seed(42)
df <- data.frame(
doy = sample.int(90, 300, replace = TRUE),
year = sample(c(1980:2020), size = 300, replace = TRUE),
site = c(rep("A", 150), rep("B", 80), rep("C", 70)),
sex = sample(c("F", "M"), size = 300, replace = TRUE),
mass = rnorm(300, mean = 500, sd = 50)) %>%
mutate(doy.s = scale(doy, center = TRUE, scale = TRUE),
year.s = scale(year, center = TRUE, scale = TRUE),
across(c(sex, site), as.factor))
m1 <- gam(mass ~
s(year.s, site, bs = "fs", by = sex, k = 5) +
s(doy.s, site, bs = "fs", by = sex, k = 5) +
s(sex, bs = "re"),
data = df, method = "REML", family = gaussian)
draw(m1)
How do I re-plot the last two panels in this figure to show the relationship between year and mass with ggplot?
You can't do this with gratia::draw automatically (unless I'm mistaken).* But you can use gratia::smooth_estimates to get a dataframe which you can then do whatever you like with.
To answer your specific question: to re-plot the last two panels of the plot you provided, but with year unscaled, you can do the following
# Get a tibble of smooth estimates from the model
sm <- gratia::smooth_estimates(m1)
# Add a new column for the unscaled year
sm <- sm %>% mutate(year = mean(df$year) + (year.s * sd(df$year)))
# Plot the smooth s(year.s,site) for sex=F with year unscaled
pF <- sm %>% filter(smooth == "s(year.s,site):sexF" ) %>%
ggplot(aes(x = year, y = est, color=site)) +
geom_line() +
theme(legend.position = "none") +
labs(y = "Partial effect", title = "s(year.s,site)", subtitle = "By: sex; F")
# Plot the smooth s(year.s,site) for sex=M with year unscaled
pM <- sm %>% filter(smooth == "s(year.s,site):sexM" ) %>%
ggplot(aes(x = year, y = est, color=site)) +
geom_line() +
theme(legend.position = "none") +
labs(y = "Partial effect", title = "s(year.s,site)", subtitle = "By: sex; M")
library(patchwork) # use `patchwork` just for easy side-by-side plots
pF + pM
EDIT: If you also want to shift the result on the y-axis, as @GavinSimpson (who is the author and maintainer of gratia) mentioned, you can do this with add_constant, adding this code before the plotting code above:
sm <- sm %>%
add_constant(coef(m1)["(Intercept)"]) %>%
transform_fun(inv_link(m1))
[You should also in general untransform the smooth by the inverse of the model's link function. In your case this is just the identity, so it is not necessary, but in general it would be. That's what the second step above is doing.]
*As mentioned in the custom-plotting vignette for gratia, the goal of draw is not to be fully customizable, but to provide a useful default. See the vignette for recommendations about custom plots.
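The unscaling step used above (multiply by the standard deviation, then add the mean) can be checked in isolation. A minimal base-R sketch with made-up year values follows; year.back, ctr and scl are illustrative names. It also shows that scale() stores the centre and scale it used as attributes, which avoids recomputing them from the data.

```r
# Hypothetical example values; any numeric vector works
year <- c(1980, 1985, 1990, 2000, 2010, 2020)

# scale() returns a one-column matrix and keeps the centre and sd as attributes
year.s <- scale(year, center = TRUE, scale = TRUE)
ctr <- attr(year.s, "scaled:center")
scl <- attr(year.s, "scaled:scale")

# Undo the transformation: multiply by the sd, then add the mean
year.back <- as.numeric(year.s) * scl + ctr

# The stored attributes agree with mean() and sd() of the original data
all.equal(ctr, mean(year))
all.equal(scl, sd(year))
```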

How to add count (n) / summary statistics as a label to ggplot2 boxplots?

I am new to R and trying to add count labels to my boxplots, so the sample size per boxplot shows in the graph.
This is my code:
bp_east_EC <-total %>% filter(year %in% c(1977, 2020, 2021, 1992),
sampletype == "groundwater",
East == 1,
#EB == 1,
#N59 == 1,
variable %in% c("EC_uS")) %>%
ggplot(.,aes(x = as.character(year), y = value, colour = as.factor(year))) +
theme_ipsum() +
ggtitle("Groundwater EC, eastern Curacao") +
theme(plot.title = element_text(hjust = 0.5, size=14)) +
theme(legend.position = "none") +
labs(x="", y="uS/cm") +
geom_jitter(color="grey", size=0.4, alpha=0.9) +
geom_boxplot() +
stat_summary(fun = mean, geom="point", shape=23, size=2) # shows mean (fun.y is deprecated since ggplot2 3.3.0)
I have googled a lot and tried different things (with annotate, with return functions, mtext, etc), but it keeps giving different errors. I think I am such a beginner I cannot figure out how to integrate such suggestions into my own code.
Does anybody have an idea what the best way would be for me to approach this?
I would create a new variable that contained your sample sizes per group and plot that number with geom_label. I've generated an example of how to add count/sample sizes to a boxplot using the iris dataset since your example isn't fully reproducible.
library(tidyverse)
data(iris)
# boxplot with no label
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
geom_boxplot()
# boxplot with label
iris %>%
group_by(Species) %>%
mutate(count = n()) %>%
mutate(mean = mean(Sepal.Length)) %>%
ggplot(aes(x = Species, y = Sepal.Length, fill = Species)) +
geom_boxplot() +
geom_label(aes(label= count , y = mean + 0.75), # <- change this to move label up and down
size = 4, position = position_dodge(width = 0.75)) +
geom_jitter(alpha = 0.35, aes(color = Species)) +
stat_summary(fun = mean, geom = "point", shape = 23, size = 6)
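One thing to be aware of with the code above: mutate() keeps all 150 rows, so geom_label draws the same label once per row, stacked on top of each other. A possible variation, sketched here under the assumption that one label per group is wanted, is to summarise the counts into a small separate data frame and pass that to the label layer. The names label_df, n and y are illustrative, and the 0.3 offset is arbitrary.

```r
library(ggplot2)
library(dplyr)

# One row per group: the sample size and a y position just above the tallest point
label_df <- iris %>%
  group_by(Species) %>%
  summarise(n = n(), y = max(Sepal.Length) + 0.3, .groups = "drop")

p <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  # The label layer uses its own data, so exactly one label is drawn per box
  geom_text(data = label_df,
            aes(x = Species, y = y, label = paste0("n = ", n)),
            inherit.aes = FALSE)
```

Printing p shows the boxplots with "n = 50" above each box. The same label_df could hold a mean or median column if other summary statistics are wanted as labels.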

How to show where networkpersons live and how they are connected

I want to show where network people live and how they are connected. First, I drew a map of the 15 municipalities (based on a SpatialPolygonsDataFrame, drawn with geom_polygon from ggplot2). Second, I placed the network people around the centroids of the polygons. Following the third variant in "Three ways of visualizing a graph on a map" by Markus Konrad (https://datascience.blog.wzb.eu/2018/05/31/three-ways-of-visualizing-a-graph-on-a-map/), I have so far created two layers. As mapcoords I used coord_fixed(ratio = 1/1). To achieve a good result, I had to make manual adjustments in annotation_custom.
My questions:
First, is there a way to adapt the layers to each other without manual intervention?
Second, are there simpler solutions to geographically locate network people and their connections?
My result so far:
maptheme <- theme(panel.grid = element_blank()) +
theme(axis.text = element_blank()) +
theme(axis.ticks = element_blank()) +
theme(axis.title = element_blank()) +
theme(legend.position = "bottom") +
theme(panel.grid = element_blank()) +
theme(panel.background = element_rect(fill = "#596673")) +
theme(plot.margin = unit(c(0, 0, 0.5, 0), 'cm'))
mapcoords <- coord_fixed(ratio=1/1)
theme_transp_overlay <- theme(
panel.background = element_rect(fill = "transparent", color = NA),
plot.background = element_rect(fill = "transparent", color = NA))
ArlMap <- ggplot(ARLmap.data, aes(long, lat)) +
geom_polygon(aes(group=group), colour='white', fill='grey')+
theme(axis.text=element_blank())+
theme(axis.ticks=element_blank())+
theme(axis.title=element_blank())+
mapcoords + maptheme
nodes <- ggplot(nwdata) +
geom_point(aes(x = xkor, y = ykor, size = Btw),
shape = 21, fill = "white", color = "black", # draw nodes
stroke = 0.5) +
scale_size_continuous(guide = "none", range = c(1, 6)) +
mapcoords + maptheme + theme_transp_overlay
ArlMap +
annotation_custom(ggplotGrob(nodes), xmin = min(ARLmap.data$long)+900, xmax = max(ARLmap.data$long)-1200, ymin = min(ARLmap.data$lat)+1500, ymax = max(ARLmap.data$lat))
...
I have reached my goal. I arrived at the solution by consistently taking a geographical approach:
1. The nodes of the network receive lon/lat coordinates. These are determined as rotation coordinates around the centroid of the geographical unit.
2. The connections between the nodes get new start and end points based on these lon/lat coordinates.
3. The plot is limited to the base functions plot, lines and points.
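The answer gives no code, so the following is only a guess at what the three steps could look like in base R. It assumes "rotation coordinates" means points spaced evenly on a circle around a centroid. The helper place_around, the centroid, the radius and the edge list are all made up for illustration, and segments() is used for the connections instead of lines() for brevity.

```r
# Hypothetical helper: place n points evenly on a circle around a centroid
place_around <- function(cx, cy, n, radius) {
  angle <- 2 * pi * (seq_len(n) - 1) / n
  data.frame(x = cx + radius * cos(angle),
             y = cy + radius * sin(angle))
}

# Step 1: node coordinates around a made-up centroid (same units as the map)
nodes <- place_around(cx = 1000, cy = 2000, n = 5, radius = 150)

# Step 2: edges get start and end points by looking up the node coordinates
edges <- data.frame(from = c(1, 1, 3), to = c(2, 4, 5))
edges$x0 <- nodes$x[edges$from]
edges$y0 <- nodes$y[edges$from]
edges$x1 <- nodes$x[edges$to]
edges$y1 <- nodes$y[edges$to]

# Step 3: base graphics only; asp = 1 plays the role of coord_fixed(ratio = 1)
plot(nodes$x, nodes$y, type = "n", asp = 1, xlab = "", ylab = "")
segments(edges$x0, edges$y0, edges$x1, edges$y1, col = "grey40")
points(nodes$x, nodes$y, pch = 21, bg = "white")
```

Because nodes and map share one coordinate system, no manual alignment of layers is needed. In a real map the polygons would be drawn first and the segments and points added on top.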

Smoothing geom_ribbon

I've created a plot with geom_line and geom_ribbon (image 1) and the result is okay, but for the sake of aesthetics, I'd like the line and ribbon to be smoother. I know I can use geom_smooth for the line (image 2), but I'm not sure if it's possible to smooth the ribbon. I could create a geom_smooth line for the top and bottom lines of the ribbon (image 3), but is there any way to fill in the space between those two lines?
A principled way to achieve what you want is to fit a GAM model to your data using the gam() function in mgcv and then apply the predict() function to that model over a finer grid of values for your predictor variable. The grid can cover the span defined by the range of observed values for your predictor variable. The R code below illustrates this process for a concrete example.
# load R packages
library(ggplot2)
library(mgcv)
# simulate some x and y data
# x = predictor; y = response
x <- seq(-10, 10, by = 1)
y <- 1 - 0.5*x - 2*x^2 + rnorm(length(x), mean = 0, sd = 20)
d <- data.frame(x,y)
# plot the simulated data
ggplot(data = d, aes(x,y)) +
geom_point(size=3)
# fit GAM model
m <- gam(y ~ s(x), data = d)
# define finer grid of predictor values
xnew <- seq(-10, 10, by = 0.1)
# apply predict() function to the fitted GAM model
# using the finer grid of x values
p <- predict(m, newdata = data.frame(x = xnew), se.fit = TRUE)
str(p)
# plot the estimated mean values of y (fit) at given x values
# over the finer grid of x values;
# superimpose approximate 95% confidence band for the true
# mean values of y at given x values in the finer grid
g <- data.frame(x = xnew,
fit = p$fit,
lwr = p$fit - 1.96*p$se.fit,
upr = p$fit + 1.96*p$se.fit)
head(g)
theme_set(theme_bw())
ggplot(data = g, aes(x, fit)) +
geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "lightblue") +
geom_line() +
geom_point(data = d, aes(x, y), shape = 1)
This same principle would apply if you were to fit a polynomial regression model to your data using the lm() function.
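As a rough sketch of the lm() variant mentioned above: with a linear model, predict() can return the confidence limits directly through interval = "confidence", so the band does not have to be built by hand. The quadratic via poly(x, 2) and the names m_lm and g_lm are choices made for this example only.

```r
library(ggplot2)

# Same kind of simulated data as above
set.seed(1)
x <- seq(-10, 10, by = 1)
y <- 1 - 0.5 * x - 2 * x^2 + rnorm(length(x), mean = 0, sd = 20)
d <- data.frame(x, y)

# Quadratic polynomial regression
m_lm <- lm(y ~ poly(x, 2), data = d)

# predict.lm returns a matrix with columns fit, lwr and upr
g_lm <- data.frame(x = seq(-10, 10, by = 0.1))
ci <- predict(m_lm, newdata = g_lm, interval = "confidence", level = 0.95)
g_lm <- cbind(g_lm, as.data.frame(ci))

p <- ggplot(g_lm, aes(x, fit)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "lightblue") +
  geom_line() +
  geom_point(data = d, aes(x, y), shape = 1)
```

Here the limits are based on the t distribution rather than the 1.96 normal approximation used for the GAM.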

Depth Profiling visualization

I'm trying to create a depth profile graph with the variables depth, distance and temperature. The data were collected at 9 different points with known distances between them (stations 5 m apart, 9 stations, 9 different sets of data). At each of these 9 stations a sonde was dropped straight down, taking a temperature reading every 2 seconds. The maximum depth at each of the 9 stations was also measured from the boat.
So the data I have is:
Depth at each of the 9 stations (y axis)
Temperature readings at each of the 9 stations, at around .2m intervals vertical until the bottom was reached (fill area)
distance between the stations, (x axis)
Is it possible to create a depth profile similar to this? (obviously without the greater resolution in this graph)
I've already tried messing around with ggplot2 and raster but I just can't seem to figure out how to do this.
One of the problems I've come across is how to make ggplot2 distinguish between say 5m depth temperature reading at station 1 and 5m temperature reading at station 5 since they have the same depth value.
Even if you can guide me towards another program that would allow me to create a graph like this, that would be great
[ REVISION ]
(Please leave a comment if you know a more suitable interpolation method, especially one that does not require cutting off the data below the bottom.)
ggplot() needs the data in long form.
library(ggplot2)
# example data
max.depths <- c(1.1, 4, 4.7, 7.7, 8.2, 7.8, 10.7, 12.1, 14.3)
depth.list <- sapply(max.depths, function(x) seq(0, x, 0.2))
temp.list <- list()
set.seed(1); for(i in 1:9) temp.list[[i]] <- sapply(depth.list[[i]], function(x) rnorm(1, 20 - x*0.5, 0.2))
set.seed(1); dist <- c(0, sapply(seq(5, 40, 5), function(x) rnorm(1, x, 1)))
dist.list <- sapply(1:9, function(x) rep(dist[x], length(depth.list[[x]])))
main.df <- data.frame(dist = unlist(dist.list), depth = unlist(depth.list) * -1, temp = unlist(temp.list))
# a raw graph
ggplot(main.df, aes(x = dist, y = depth, z = temp)) +
geom_point(aes(colour = temp), size = 1) +
scale_colour_gradientn(colours = topo.colors(10))
# a relatively raw graph (don't run with this example data)
ggplot(main.df, aes(x = dist, y = depth, z = temp)) +
geom_raster(aes(fill = temp)) + # geom_contour() +
scale_fill_gradientn(colours = topo.colors(10))
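The long form mentioned above is also what answers the question about telling a 5 m reading at station 1 from a 5 m reading at station 5: every reading becomes its own row and carries its station's distance. A minimal sketch with made-up readings follows; the lists depths and temps and the column names are assumptions for illustration.

```r
# Hypothetical readings: one list element per station, with different lengths
depths <- list(c(0, 0.2, 0.4),
               c(0, 0.2, 0.4, 0.6, 0.8))
temps  <- list(c(20.1, 19.9, 19.8),
               c(20.0, 19.9, 19.7, 19.6, 19.4))
dist   <- c(0, 5)  # distance of each station along the transect (m)

# One row per reading: station id, station distance, negative depth, temperature
long.df <- do.call(rbind, lapply(seq_along(depths), function(i) {
  data.frame(station = i,
             dist    = dist[i],
             depth   = -depths[[i]],
             temp    = temps[[i]])
}))

long.df
```

Two rows can share a depth value because they differ in dist, which is all ggplot needs to place them separately.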
If you want a graph like the one you showed, you have to interpolate. Several packages provide spatial interpolation methods. In this example I used the akima package, but you should think carefully about which interpolation method to use.
I used nx = 300 and ny = 300 in the code below, but these values should be chosen with care. Large nx and ny give a high-resolution graph, but don't forget the real nx and ny (in this example, the real nx is only 9 and ny is 101).
library(akima); library(dplyr)
interp.data <- interp(main.df$dist, main.df$depth, main.df$temp, nx = 300, ny = 300)
interp.df <- interp.data %>% interp2xyz() %>% as.data.frame()
names(interp.df) <- c("dist", "depth", "temp")
# draw interp.df
ggplot(interp.df, aes(x = dist, y = depth, z = temp)) +
geom_raster(aes(fill = temp)) + # geom_contour() +
scale_fill_gradientn(colours = topo.colors(10))
# to think appropriateness of interpolation (raw and interpolation data)
ggplot(interp.df, aes(x = dist, y = depth, z = temp)) +
geom_raster(aes(fill = temp), alpha = 0.3) + # interpolation
scale_fill_gradientn(colours = topo.colors(10)) +
geom_point(data = main.df, aes(colour = temp), size = 1) + # raw
scale_colour_gradientn(colours = topo.colors(10))
The bottoms don't match! I found that ?interp says "interpolation only within convex hull!", oops. I'm worried about the interpolation around the problem area; is it OK? If it is not a problem, you only need to cut the data below the bottom. If it is, I can't give an answer right away. Below is example code for the cut.
bottoms <- max.depths * -1
# calculate bottom values using linear interpolation
approx.bottoms <- approx(dist, bottoms, n = 300) # n must be the same value as interp()'s nx
# change temp values under bottom into NA
library(dplyr)
interp.cut.df <- interp.df %>% cbind(bottoms = approx.bottoms$y) %>%
mutate(temp = ifelse(depth >= bottoms, temp, NA)) %>% select(-bottoms)
ggplot(interp.cut.df, aes(x = dist, y = depth, z = temp)) +
geom_raster(aes(fill = temp)) +
scale_fill_gradientn(colours = topo.colors(10)) +
geom_point(data = main.df, size = 1)
If you want to use stat_contour
It is harder to use stat_contour than geom_raster because it needs data on a regular grid. As far as I can see from your graph, your data (depth and distance) don't form a regular grid, which makes it much more difficult to use stat_contour with your raw data. So I used interp.cut.df to draw a contour plot. stat_contour also has a known problem with filled polygons (see "How to fill in the contour fully using stat_contour"), so you need to expand your data.
library(dplyr)
# 1st: change NA into a temp's out range value (I used 0)
interp.contour.df <- interp.cut.df
interp.contour.df[is.na(interp.contour.df)] <- 0
# 2nd: expand the df (It's a little complex, so please use this function)
contour.support.func <- function(df) {
colname <- names(df)
names(df) <- c("x", "y", "z")
Range <- as.data.frame(sapply(df, range))
Dim <- as.data.frame(t(sapply(df, function(x) length(unique(x)))))
arb_z = Range$z[1] - diff(Range$z)/20
df2 <- rbind(df,
expand.grid(x = c(Range$x[1] - diff(Range$x)/20, Range$x[2] + diff(Range$x)/20),
y = seq(Range$y[1], Range$y[2], length = Dim$y), z = arb_z),
expand.grid(x = seq(Range$x[1], Range$x[2], length = Dim$x),
y = c(Range$y[1] - diff(Range$y)/20, Range$y[2] + diff(Range$y)/20), z = arb_z))
names(df2) <- colname
return(df2)
}
interp.contour.df2 <- contour.support.func(interp.contour.df)
# 3rd: check the temp range (these values are used to define contour's border (breaks))
range(interp.cut.df$temp, na.rm=T) # 12.51622 20.18904
# 4th: draw ... the bottom border is dirty !!
ggplot(interp.contour.df2, aes(x = dist, y = depth, z = temp)) +
stat_contour(geom="polygon", breaks = seq(12.51622, 20.18904, length = 11), aes(fill = after_stat(level))) +
coord_cartesian(xlim = range(dist), ylim = range(bottoms), expand = FALSE) + # cut expanded area
scale_fill_gradientn(colours = topo.colors(10)) # breaks's length is 11, so 10 colors are needed
# [Note]
# You can define the contour's border values (breaks) and colors.
contour.breaks <- c(12.5, 13.5, 14.5, 15.5, 16.5, 17.5, 18.5, 19.5, 20.5)
# = seq(12.5, 20.5, 1) or seq(12.5, 20.5, length = 9)
contour.colors <- c("darkblue", "cyan3", "cyan1", "green3", "green", "yellow2","pink", "darkred")
# breaks's length is 9, so 8 colors are needed.
# 5th: vanish the bottom border by bottom line
approx.df <- data.frame(dist = approx.bottoms$x, depth = approx.bottoms$y, temp = 0) # 0 is dummy value
ggplot(interp.contour.df2, aes(x = dist, y = depth, z = temp)) +
stat_contour(geom="polygon", breaks = contour.breaks, aes(fill = after_stat(level))) +
coord_cartesian(xlim=range(dist), ylim=range(bottoms), expand = FALSE) +
scale_fill_gradientn(colours = contour.colors) +
geom_line(data = approx.df, lwd=1.5, color="gray50")
Bonus: legend technique
library(dplyr)
interp.contour.df3 <- interp.contour.df2 %>% mutate(temp2 = cut(temp, breaks = contour.breaks))
interp.contour.df3$temp2 <- factor(interp.contour.df3$temp2, levels = rev(levels(interp.contour.df3$temp2)))
ggplot(interp.contour.df3, aes(x = dist, y = depth, z = temp)) +
stat_contour(geom="polygon", breaks = contour.breaks, aes(fill = after_stat(level))) +
coord_cartesian(xlim=range(dist), ylim=range(bottoms), expand = FALSE) +
scale_fill_gradientn(colours = contour.colors, guide = "none") + # hide the fill legend
geom_line(data = approx.df, lwd=1.5, color="gray50") +
geom_point(aes(colour = temp2), pch = 15, alpha = 0) + # add
guides(colour = guide_legend(override.aes = list(colour = rev(contour.colors), alpha = 1, cex = 5))) + # add
labs(colour = "temp") # add
You want to treat this as a 3-D surface with temperature as the z dimension. The given plot is a contour plot and it looks like ggplot2 can do that with stat_contour.
I'm not sure how the contour lines are computed (often it's linear interpolation along a Delaunay triangulation). If you want more control over how to interpolate between your x/y grid points, you can calculate a surface model first and feed those z coordinates into ggplot2.
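As one possible reading of "calculate a surface model first", the sketch below fits a local-regression surface with loess() from base R and predicts it on a regular grid that ggplot2 can draw. loess is just one choice among several (a GAM or kriging would be alternatives), and the simulated data, the span of 0.5 and the 50 x 50 grid are assumptions made for this example.

```r
library(ggplot2)

# Simulated casts (an assumption): 9 stations, readings every 0.2 m down to the bottom
set.seed(1)
max.depths <- c(1.1, 4, 4.7, 7.7, 8.2, 7.8, 10.7, 12.1, 14.3)
dist <- seq(0, 40, by = 5)
obs <- do.call(rbind, lapply(seq_along(dist), function(i) {
  d <- seq(0, max.depths[i], by = 0.2)
  data.frame(dist = dist[i], depth = -d,
             temp = rnorm(length(d), mean = 20 - 0.5 * d, sd = 0.2))
}))

# Surface model: temperature as a smooth function of distance and depth
surf <- loess(temp ~ dist + depth, data = obs, degree = 1, span = 0.5)

# Regular grid covering the observed range, filled with model predictions
grid <- expand.grid(dist  = seq(0, 40, length.out = 50),
                    depth = seq(min(obs$depth), 0, length.out = 50))
grid$temp <- predict(surf, newdata = grid)

p <- ggplot(grid, aes(dist, depth, z = temp)) +
  geom_raster(aes(fill = temp)) +
  geom_contour(colour = "grey30") +
  scale_fill_gradientn(colours = topo.colors(10))
```

The grid also covers the area below the bottom at the shallow stations, where the surface is pure extrapolation, so those cells would still need to be masked before the plot is interpreted.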