I am unsure what to do in the following situation, where some levels of the fixed effect are missing within some levels of the random effect, i.e. the data are unbalanced.
Imagine an aquarium with 5,000 individual fish. They are part of 100 different species. I want to test if there is a relationship between their weight (continuous) and whether they are fed by Alan or Susie (there only are two employees that feed fish). Species is the random effect.
My model looks like this: weight ~ employee + (1 + employee | species), i.e. a mixed model (lmer) with a random intercept and a random slope for employee.
But for some species, all fish are fed by the same employee (Alan or Susie). Should I leave these observations in the model, or should I exclude them? Is there some literature on this?
This should be fine. Mixed models handle this kind of imbalance well, unless it's really extreme (e.g. if there were no species, or very few, measured by both employees). A small made-up example is below.
The species for which employee 1's observations are missing have slightly wider confidence intervals; the species for which employee 2's observations are missing have considerably wider CIs on the employee-2 effect. (I'm not sure why these estimates aren't exactly zero; my guess is that it has to do with the particular random-effects values simulated, i.e. the random effects have zero mean overall, so these may be slightly greater than zero to make the overall estimates balance.)
n_emp <- 2
n_spp <- 10
n_rep <- 20
dd <- expand.grid(emp = factor(seq(n_emp)),
                  spp = factor(seq(n_spp)),
                  rep = seq(n_rep))
dd2 <- subset(dd,
              !((emp == "1" & (spp %in% 1:2)) |
                (emp == "2" & (spp %in% 3:4))))
library(lme4)
form <- weight ~ emp + (1 + emp | spp)
## note: the one-sided formula is written out explicitly below because
## form[-2] can hit an edge case in simulate()
dd2$weight <- simulate(~ emp + (1 + emp | spp),
                       seed = 101,
                       newdata = dd2,
                       newparams = list(beta = c(1, 1),
                                        theta = c(1, 1, 1),
                                        sigma = 1),
                       family = gaussian)[[1]]
m <- lmer(form, data = dd2)
rr <- as.data.frame(ranef(m))
rr$miss <- with(rr,
                ifelse(grp %in% 1:2, "miss1",
                       ifelse(grp %in% 3:4, "miss2",
                              "nomiss")))
library(ggplot2)
ggplot(rr, aes(y = grp, x = condval, xmin = condval - 2*condsd,
               xmax = condval + 2*condsd, colour = miss)) +
  geom_pointrange() +
  facet_wrap(~term)
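If you also want to see how the imbalance affects the fixed-effect estimates (not part of the example above, just a quick check), something along these lines should work:
## quick check of the fixed effects and their Wald confidence intervals;
## the employee-2 effect should still be estimable, just less precisely
summary(m)$coefficients
confint(m, parm = "beta_", method = "Wald")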
Currently I'm struggling to find a good way to perform the Hansen/Sargan tests of overidentifying restrictions for a three-stage least squares (3SLS) model on panel data in R. I have been digging all day through different sources and couldn't find a way to produce these tests in R using the well-known systemfit package.
Currently, my code is simple.
violence_c_3sls <- Crime ~ ln_GDP + I(ln_GDP^2) + ln_Gini
income_c_3sls   <- ln_GDP ~ Crime + ln_Gini
gini_c_3sls     <- ln_Gini ~ ln_GDP + I(ln_GDP^2) + Crime

inst <- ~ Educ_Gvmnt_Exp + I(Educ_Gvmnt_Exp^2) + Health_Exp + Pov_Head_Count_1.9

system_c_3sls <- list(violence_c_3sls, income_c_3sls, gini_c_3sls)

fitsur_c_3sls <- systemfit(system_c_3sls, "3SLS", inst = inst, data = df_new,
                           methodResidCov = "noDfCor")
summary(fitsur_c_3sls)
However, adding more instruments to create an over-identified system does not produce any Hansen/Sargan test output, so I assume the test has to be computed separately, presumably from the systemfit object.
Thanks in advance.
With $g$ equations, $l$ exogenous variables, and $k$ regressors, the Sargan test statistic for 3SLS is
$$S = \hat{u}^\top \left( \hat{\Sigma}^{-1} \otimes P_W \right) \hat{u} \;\sim\; \chi^2(gl - k),$$
where $\hat{u}$ is the vector of stacked residuals, $\hat{\Sigma}$ is the estimated residual covariance matrix, and $P_W$ is the projection matrix onto the exogenous variables $W$. See Ch. 12.4 of Davidson & MacKinnon, Econometric Theory and Methods (ETM).
Calculating the Sargan test from systemfit should look something like this:
library(stringr)  # for str_detect() / str_remove()

sargan.systemfit <- function(results3sls) {
  result <- list()
  u <- as.matrix(resid(results3sls))   # model residuals, n x n_eq
  n_eq <- length(results3sls$eq)       # number of equations
  n <- nrow(u)                         # number of observations
  n_reg <- length(coef(results3sls))   # total number of regressors
  # Matrix of instruments, in block-diagonal form with one block per equation
  w <- model.matrix(results3sls, which = 'z')
  # Aggregate into a single block (in case different instruments are used per equation)
  w_list <- lapply(X = 1:n_eq, FUN = function(eq_i) {
    this_eq_label <- results3sls$eq[[eq_i]]$eqnLabel
    this_w <- w[str_detect(rownames(w), this_eq_label),
                str_detect(colnames(w), this_eq_label)]
    colnames(this_w) <- str_remove(colnames(this_w), paste0(this_eq_label, '_'))
    return(this_w)
  })
  w <- do.call(cbind, w_list)
  w <- w[, !duplicated(colnames(w))]
  n_inst <- ncol(w)   # w is n x n_inst, where n_inst is the number of unique instruments/exogenous variables
  # Estimate the residual covariance (or use residCov; asymptotically equivalent)
  var_u <- crossprod(u) / n   # var_u <- results3sls$residCov
  P_w <- w %*% solve(crossprod(w)) %*% t(w)   # projection matrix onto the instruments w
  # as.numeric(u) stacks the residuals into an (n_eq * n) x 1 vector
  result$statistic <- as.numeric(t(as.numeric(u)) %*% kronecker(solve(var_u), P_w) %*% as.numeric(u))
  result$df <- n_inst * n_eq - n_reg
  result$p.value <- 1 - pchisq(result$statistic, result$df)
  result$method <- "Sargan over-identifying restrictions test"
  return(result)
}
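Applied to the fitted object from the question (assuming extra instruments have been added so that the system is actually over-identified, i.e. the degrees of freedom are positive), usage would look something like this:
## hypothetical usage with the fitted 3SLS object from the question
sargan_res <- sargan.systemfit(fitsur_c_3sls)
sargan_res[c("statistic", "df", "p.value")]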
I would like to do a fuzzy sum with raster data in R to build a cumulative resistance layer for my research. I have found packages and functions that do fuzzy sums with vector data, and was wondering if anyone can share resources specifically for combining raster layers with fuzzy logic.
Thank you
You can use spatialEco::fuzzySum for both vector and raster data.
For example, for three terra rasters rast1, rast2 and rast3 it would work as follows:
rFuzzySum <- spatialEco::fuzzySum(c(rast1, rast2, rast3))
Written out explicitly, that is:
rFuzzySum <- (1 - ( (1 - rast1) *
                    (1 - rast2) *
                    (1 - rast3) ) )
Here is an illustration of how you can do that, using the suggestions by MattKummu.
Example data
library(terra)
x <- rast(system.file("ex/logo.tif", package="terra"))
x <- x / max(minmax(x))
Two approaches
a <- 1 - prod(1 - x)
b <- spatialEco::fuzzySum(x)
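As a quick sanity check (assuming fuzzySum() returns a SpatRaster here, as it should for terra input), the two results ought to agree up to floating-point error:
# compare the written-out fuzzy sum with the fuzzySum() result
all.equal(values(a), values(b), check.attributes = FALSE)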
I'm working with some raster data in R using the raster package. I want to calculate and extract some geographic information (e.g., slope, aspect) from the raster, but only at specific points (I also have some data as a SpatialPointsDataFrame at which I want to calculate slope/aspect/etc.). I'm doing this for several high-resolution rasters, and it seems like a poor use of resources to calculate this for every raster cell when I only need maybe 5-10% of them.
I thought maybe the raster::stackApply function might work, but that seems to perform calculations on subsets of a RasterBrick rather than calculations on subsets of a single raster based on point locations (please correct me if I'm wrong). I also thought I could do a for loop, where I extract the cells surrounding each point of interest and iteratively calculate slope/aspect that way. That seems clunky, and I was hoping for a more elegant or built-in solution, but it should work.
These are my thoughts so far on the for loop, but I'm not sure how best to even do this.
# Attach packages
library(rgdal)
library(raster)

# Generate example raster data
r = raster()
set.seed(0)
values(r) = runif(ncell(r), min = 0, max = 1000)

# Generate example point data
df.sp = SpatialPoints(
  coords = cbind(runif(25, min = -100, max = 100),
                 runif(25, min = -50, max = 50)),
  proj4string = crs(r))

# Iterate over each point (seq_along() is safer than 1:nrow() for SpatialPoints)
for (i in seq_along(df.sp)) {
  # Find cell index of the current SpatialPoint
  cell.idx = raster::extract(r, df.sp[i, ], cellnumbers = TRUE)[1]
  # Find indices of the cells surrounding the point of interest
  neighbors.idx = raster::adjacent(r, cell.idx, directions = 16)
  # Get DEM values for the cell and its surrounding cells
  vals.local = r[c(cell.idx, neighbors.idx[, 2])]
  # Somehow convert this back to an appropriate georeferenced matrix
  # r.local = ...
  # Perform geometric calculations on the local raster
  # r.stack = terrain(r.local, opt = c('slope', 'aspect'))
  # Remaining data extraction, etc. (I can take it from here...)
}
In summary, I need a method to calculate slope and aspect from a DEM raster only at specific points as given by a SpatialPoints object. If you know of a pre-built or more elegant solution, great! If not, some help finishing the for loop (how to best extract a neighborhood of surrounding cells and run calculations on that) would be most appreciated as well.
Interesting question. Here is a possible approach.
library(raster)
r <- raster()
set.seed(0)
values(r) <- runif(ncell(r), min = 0, max = 1000)
coords <- cbind(runif(25, min = -100, max = 100),
runif(25, min = -50, max = 50))
x <- rasterize(coords, r)
f <- focal(x, w=matrix(1, nc=3, nr=3), na.rm=TRUE)
rr <- mask(r, f)
slope <- terrain(rr, "slope")
extract(slope, coords)
# [1] 0.0019366236 0.0020670699 0.0006305257 0.0025334280 0.0023480935 0.0007527267 0.0002699272 0.0004699626
# [9] 0.0004869054 0.0025651333 0.0010415805 0.0008574920 0.0010664869 0.0017700297 0.0001666226 0.0008405391
#[17] 0.0017682167 0.0009854172 0.0015350466 0.0017714466 0.0012994945 0.0016563132 0.0003276584 0.0020499529
#[25] 0.0006582073
Probably not much efficiency gain, though, as it still processes all the NA values.
So maybe like this, more along your line of thinking:
cells <- cellFromXY(r, coords)
ngbs <- raster::adjacent(r, cells, pairs=TRUE)
slope <- rep(NA, length(cells))
for (i in 1:length(cells)) {
  ci <- ngbs[ngbs[, 1] == cells[i], 2]
  e <- extentFromCells(r, ci)
  x <- crop(r, e)
  slope[i] <- terrain(x, "slope")[5]
}
slope
#[1] 0.0019366236 0.0020670699 0.0006305257 0.0025334280 0.0023480935 0.0007527267 0.0002699272 0.0004699626
#[9] 0.0004869054 0.0025651333 0.0010415805 0.0008574920 0.0010664869 0.0017700297 0.0001666226 0.0008405391
#[17] 0.0017682167 0.0009854172 0.0015350466 0.0017714466 0.0012994945 0.0016563132 0.0003276584 0.0020499529
#[25] 0.0006582073
But I find that brute force
slope <- terrain(r, "slope")
extract(slope, coords)
is fastest: 9x faster than my first alternative and 4x faster than the second.
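If you want to check the timings on your own raster, a rough comparison along these lines should do; the wrapper functions are just illustrative, and the exact ratios will depend on raster size and resolution.
# rough timing comparison; 'brute' and 'masked' are illustrative wrappers
# around the two approaches shown above
brute <- function() extract(terrain(r, "slope"), coords)
masked <- function() {
  f <- focal(rasterize(coords, r), w = matrix(1, ncol = 3, nrow = 3), na.rm = TRUE)
  extract(terrain(mask(r, f), "slope"), coords)
}
system.time(replicate(10, brute()))
system.time(replicate(10, masked()))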
I have this data (t, TR and h) and need to estimate the parameters (a, b, c, d and e) of this model: h = (a*log(TR) + b) * (c*t^d - e) * 41.59
t <- c(120,60,50,40,30,20,10,180,120,60,50,40,30,20,10,120,60,50,40,30,20,10,120,60,50,40,30,20,10)
TR <- c(2,2,2,2,2,2,2,5,5,5,5,5,5,5,5,10,10,10,10,10,10,10,20,20,20,20,20,20,20)
h <- c(49.4,41.6,39.1,36.2,31.9,25.1,15.9,80.2,70.3,55.2,51.6,47.3,41.7,33.2,22.1,84.2,64.1,59.9,54.6,48.1,38.6,26.2,97.5,72.8,67.9,61.6,54.3,43.7,30.1)
bell <- nls(h ~ (a*log(TR) + b) * (c*t^d - e) * 41.59,
            start = list(a = 0.6, b = 0.3, c = 0.4, d = 0.30, e = 0.4))
I tried the "nls" and the "nls2" process but these didn't work because of this:
Error in nlsModel(formula, mf, start, wts) :
singular gradient matrix at initial parameter estimates
So I found a possible solution following this procedure (I think it's a form of linearly constrained optimization):
Solve this system of equations
U1 <- log(TR)*(c1*t^d1-e1)*41.59
U2 <- (c1*t^d1-e1)*41.59
U3 <- t^d1*(a1*log(TR)+b1)*41.59
U4 <- c1*t^d1*log(t)*((a1*log(TR))+b1)*41.59
U5 <- -(a1*log(TR)+b1)*41.59
Starting from initial values assumed for the model parameters, solve the system of linear equations successively, so that the deviations get closer to zero at each step:
ΣZ = (a1-a) ΣU1,i + (b1-b) ΣU2,i + (c1-c) ΣU3,i + (d1-d) ΣU4,i + (e1-e) ΣU5,i
At each iteration, the parameter values are those obtained in the previous iteration, plus the deviations.
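(As a cross-check on the derivatives: the U's above are just the partial derivatives of the model with respect to each parameter, and R can generate them automatically with deriv(); the values plugged in below are simply the starting values from the nls call above.)
# generate the partial derivatives (U1..U5) symbolically as a cross-check
dh <- deriv(~ (a*log(TR) + b) * (c*t^d - e) * 41.59,
            c("a", "b", "c", "d", "e"),
            function(a, b, c, d, e, t, TR) {})
g <- dh(0.6, 0.3, 0.4, 0.3, 0.4, t, TR)
head(attr(g, "gradient"))  # columns correspond to U1..U5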
How could I do this in R?
I'm sorry, I'm an R beginner and I don't speak English very well.
Many thanks for any help you can give me.
I need to find ranges in order to create a uniform histogram, e.g. splitting these ages into 4 ranges:
data_set = [18,21,22,24,27,27,28,29,30,32,33,33,42,42,45,46]
Is there a function that gives me the ranges so that the histogram is uniform? In this case:
ranges = [(18,24), (27,29), (30,33), (42,46)]
This example is easy; I'd like to know if there is an algorithm that deals with more complex data sets as well.
thanks
You are looking for the quantiles that split up your data equally. This, combined with cut, should work. So, suppose you want n groups.
set.seed(1)
x <- rnorm(1000) # Generate some toy data
n <- 10
uniform <- cut(x, c(-Inf, quantile(x, prob = (1:(n-1))/n), Inf)) # Determine the groups
plot(uniform)
Edit: now corrected to yield the correct cuts at the ends.
Edit2: I don't quite understand the downvote. But this also works in your example:
data_set = c(18,21,22,24,27,27,28,29,30,32,33,33,42,42,45,46)
n <- 4
groups <- cut(data_set, breaks = c(-Inf, quantile(data_set, prob = 1:(n-1)/n), Inf))
levels(groups)
Some minor renaming of the levels may be needed. For slightly better level names, you could also use min(x) and max(x) instead of -Inf and Inf.
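If you also want the ranges reported as the (min, max) of the observed values within each group, as in the question, a small addition does it:
# min and max of the data within each quantile-based group
tapply(data_set, groups, range)
For this data set that returns 18-24, 27-29, 30-33 and 42-46, matching the ranges in the question.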