I would like to find a way to modify the solution provided by Didzis Elferts in response to this post.
I have a model with a random intercept and a random slope, whereas the original question was about a model with just a random intercept. I can't figure out how to adapt the code from that answer to handle both random intercepts and slopes.
The reason I am not using dotplot() or sjPlot to do all this automatically is that I want to build a more customized plot, with the random intercepts and slopes organized under facets that reflect the fixed-effect groups to which the random groups belong. The only way I can see to do this is to extract the random slopes, intercepts, and their variances, and have my fixed-effect categories as the other columns in my data frame. So alternatively, if someone has an easy way to customize a dotplot or another type of plot to sort by fixed-effect categories, I'm all ears!
I've tweaked the reproducible code to show what I mean:
Add a variable to the Dyestuff dataset, which will be used as the random-slope variable:
library("lme4")
Dyestuff$variable <- rep(c("X","Y"), each = 15)
Run a random-intercept and random-slope model (instead of just a random-intercept model):
fit1 <- lmer(Yield ~ variable + (variable|Batch), Dyestuff)
The rest of the code is the same as in the previous post, except that I used condVar = TRUE because postVar has been deprecated since the original question was asked; note, though, that the attribute still needs to be accessed by the name "postVar".
randoms<-ranef(fit1, condVar = TRUE)
qq <- attr(ranef(fit1, condVar = TRUE)[[1]], "postVar")
str(qq)
rand.interc<-randoms$Batch
df <- data.frame(Intercepts = randoms$Batch[,1],
                 sd.interc = 2*sqrt(qq[,,1:length(qq)]),
                 lev.names = rownames(rand.interc))
With both a random intercept and a random slope in qq, the sd.interc line doesn't work. I get this message:
Error in qq[, , 1:length(qq)] : subscript out of bounds
I've played around with it but can't seem to get it right.
I think if you want the standard deviations of the conditional modes for the intercepts you should use sqrt(qq[1,1,]); this extracts the (1,1) element of each "slice" of the conditional-variance array.
df <- data.frame(Intercepts = randoms$Batch[,1],
                 sd.interc = 2*sqrt(qq[1,1,]),
                 lev.names = rownames(rand.interc))
(I'm not sure why you doubled these values, I guess as a preliminary to using them as half-widths of a confidence bar?)
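If you also want the random slopes and their conditional standard deviations, the same logic applies: the slope is the second column of the ranef data frame and the (2,2) element of each slice of qq. A sketch along those lines:

df <- data.frame(Intercepts = randoms$Batch[,1],
                 sd.interc = 2*sqrt(qq[1,1,]),
                 Slopes = randoms$Batch[,2],
                 sd.slope = 2*sqrt(qq[2,2,]),
                 lev.names = rownames(rand.interc))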
However, there are easier tools to use now.
broom.mixed::tidy(fit1, "ran_vals", conf.int = TRUE)
gives
# A tibble: 12 × 8
   effect   group level term        estimate std.error conf.low conf.high
   <chr>    <chr> <chr> <chr>          <dbl>     <dbl>    <dbl>     <dbl>
 1 ran_vals Batch A     (Intercept)   -12.2       14.8    -41.2      16.8
 2 ran_vals Batch B     (Intercept)    -1.93      14.8    -30.9      27.0
 3 ran_vals Batch C     (Intercept)    14.1       14.8    -14.9      43.1
 ...
or as.data.frame(ranef(fit1)) (this is even built into lme4)
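If the end goal is the customized plot, one possible sketch (not the only way; the toy example has no fixed-effect grouping of batches, so here I simply facet by term) is to build a ggplot directly from the tidy output. With your real data you could merge in the fixed-effect categories and facet by those instead:

library(ggplot2)
library(broom.mixed)

re <- tidy(fit1, effects = "ran_vals", conf.int = TRUE)

ggplot(re, aes(x = estimate, y = level)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  facet_wrap(~ term, scales = "free_x") +
  labs(x = "Conditional mode (95% interval)", y = "Batch")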
I imputed a data set with missing data from 81 schools and 13,177 students. I used mice to compute 10 imputations. (Note: this is a simplification of the original problem I ran into, but the problem is reproduced in this simpler version.)
I have the imputed data set saved as an SPSS file called ImputedDL.sav.
When I run any multilevel model and pool the results, the degrees of freedom are much too high for school-level variables. Here is an example with two school-level variables, "MinoritySchC" and "DisadvSchC", which represent, respectively, the proportion of students at the school coded as "minority" (not White or Asian) and the proportion coded as "disadvantaged" (free or reduced lunch). I also include a student-level variable, Female, coded 0 or 1.
The mice code is:
library(mice)
library(lmerTest)
library(broom.mixed)
library(haven)
library(dplyr)
data <- read_sav("F:/scull data/NEW FILES 2022/2022 completed imputations/ImputedDl.sav")
data<-rename(data, .imp = Imputation_)
data<-rename(data, .id = ID)
summary(data)
test2<-as.mids(data)
fit<-with(test2,lmer(Outcome~
Female+MinoritySchC+DisadvSchC+ (1|SchoolID),REML=TRUE))
pool(fit)
summary(pool(fit))
The output reports the following degrees of freedom for the school-level variables:
Intercept df = 13,124.83, MinoritySchC df = 13,167.49, DisadvSchC df = 13,167.49
For the STUDENT-LEVEL VARIABLE, the degrees of freedom are: Female df = 9,546.37
Note that the df are HIGHER for the school-level variables than for the student-level variable, and actually equal to the number of students in the data set minus 10 (the number of imputations, perhaps?)
Using the Satterthwaite adjustment, the df for student-level variables should be a bit less than 13,177, and the df for school-level variables should be on the order of the number of schools (81). The reported degrees of freedom seem to be wrong, and they make me worry that the entire pooling procedure might not be working correctly.
To check my understanding, I ran lmer on Imputation 1 alone, and indeed the output degrees of freedom look correct. Here is the code, which first selects a single imputation and then runs the multilevel analysis.
imp1 <- data[which(data$.imp == 1),]
model2<-lmer(Outcome~
Female+MinoritySchC+DisadvSchC+(1|SchoolID),REML=TRUE,data=imp1)
summary(model2)
In contrast to the multiply imputed run, the output for the lmer program, using only a single imputation, looks like it has reasonable degrees of freedom.
The output reports the following degrees of freedom for the school-level variables:
Intercept df = 85.6, MinoritySchC df = 73.6, DisadvSchC df = 74.7
For the STUDENT-LEVEL VARIABLE, the degrees of freedom are: Female df = 13,110
Is there a way to get correct degrees of freedom when pooling multilevel data? Is the error in computing df symptomatic of other errors? That is, should I trust the mice program to do my pooling for me?
I am trying to do a time series prediction with ARIMA.
So, as a first step, I am doing some series transformations:
#Taking log transform
dflog=np.log(df)
#Taking the exponentially weighted mean
df_expwighted_mean = dflog.ewm(span=12).mean()
#Subtracting the exponentially weighted mean from the log series
df_expwighted_mean_diff = dflog - df_expwighted_mean
#Differencing
df_diff = df_expwighted_mean_diff - df_expwighted_mean_diff.shift()
#Filling NaNs with zero
df_diff = df_diff.fillna(0)
And afterwards, with the code below, I am able to get back to the original series:
# Take the cumulative sum to undo the differencing
bdf_expwighted_mean_diff = df_diff.cumsum()
# Add back the exponentially weighted mean, since we subtracted it earlier
bdf_log=bdf_expwighted_mean_diff + df_expwighted_mean
# Exponentiate to undo the log transform
bdf=np.exp(bdf_log)
But the problem comes when I do this on the predicted series. It fails because I do not have the EWM of the predicted series (pdf_expwighted_mean).
So basically, I want some way to reverse the exponentially weighted mean:
df_expwighted_mean = dflog.ewm(span=12).mean()
Any thoughts?
It doesn't make sense to reverse an exponentially weighted mean in time series prediction. An exponentially weighted mean is used to smooth a time series: basically, you are trying to remove noise that would otherwise make the series hard to predict.
For example: let the red series be your actual data, the blue the EWMA series, and the green the series predicted from the EWMA in the following image.
Once you use the smoothed series to predict, reversing the EWMA would mean adding noise back in. You were able to do it on the source data because you kept the noise from the original series. Usually you just use the predictions on the EWMA as is, i.e. no reversal of the EWMA is required.
In your case, just do cumsum and exp (to reverse the differencing and the log).
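As a rough sketch of that last step (pred_diff is an illustrative name for the ARIMA forecasts of df_diff, not a variable from the original code, and anchoring on the last observed value is my assumption):

import numpy as np

# undo the differencing: cumulative sum, anchored at the last observed
# value of the smoothed, differenced training series
pred_cum = pred_diff.cumsum() + df_expwighted_mean_diff.iloc[-1]

# undo the log transform
pred = np.exp(pred_cum)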
I have some code which uses scipy.integrate.cumtrapz to compute the antiderivative of a sampled signal. I would like to use Simpson's rule instead of the trapezoidal rule. However, scipy.integrate.simps does not seem to have a cumulative counterpart... Am I missing something? Is there a simple way to get a cumulative integration with scipy.integrate.simps?
You can always write your own:
import numpy as np

def cumsimp(func, a, b, num):
    # Integrate func from a to b using num Simpson intervals (each interval is split in two).
    num *= 2
    a = float(a)
    b = float(b)
    h = (b - a) / num
    # interval midpoints get weight 4
    output = 4 * func(a + h * np.arange(1, num, 2))
    # interior edge points get total weight 2, split as 1 into each neighbouring interval
    tmp = func(a + h * np.arange(2, num - 1, 2))
    output[1:] += tmp
    output[:-1] += tmp
    # the two end points get weight 1
    output[0] += func(a)
    output[-1] += func(b)
    # running sum of the per-interval contributions gives the cumulative integral
    return np.cumsum(output * h / 3)

def integ1(x):
    return x

def integ2(x):
    return x**2

def integ0(x):
    return np.ones(np.asarray(x).shape) * 5
First look at the cumulative integral of a constant function and its finite differences.
print(cumsimp(integ0, 0, 10, 5))
[ 10.  20.  30.  40.  50.]
print(np.diff(cumsimp(integ0, 0, 10, 5)))
[ 10.  10.  10.  10.]
Now check a few trivial examples:
print(cumsimp(integ1, 0, 10, 5))
[  2.   8.  18.  32.  50.]
print(cumsimp(integ2, 0, 10, 5))
[   2.66666667   21.33333333   72.          170.66666667  333.33333333]
Writing your integrand explicitly is much easier here than reproducing SciPy's Simpson's rule function in this context. Picking intervals is difficult when you are only given a single array. Do you either:
Use every other value as the edges for Simpson's rule and the remaining values as centers?
Use the array values as edges and interpolate the values of the centers?
There are also a few options for how you want the intervals summed. These complications could be why it isn't coded in SciPy.
Your question was answered a long time ago, but I came across the same problem recently. I wrote some functions to compute such cumulative integrals for equally spaced points; the code can be found on GitHub. The order of the interpolating polynomials ranges from 1 (trapezoidal rule) to 7. As Daniel pointed out in the previous answer, some choices have to be made about how the intervals are summed, especially at the borders; results may thus be slightly different depending on the package you use. Be aware also that the numerical integration may suffer from Runge's phenomenon (unexpected oscillations) for high polynomial orders.
Here is an example:
import numpy as np
from scipy import integrate as sp_integrate
from gradiompy import integrate as gp_integrate
# Definition of the function (polynomial of degree 7)
x = np.linspace(-3,3,num=15)
dx = x[1]-x[0]
y = 8*x + 3*x**2 + x**3 - 2*x**5 + x**6 - 1/5*x**7
y_int = 4*x**2 + x**3 + 1/4*x**4 - 1/3*x**6 + 1/7*x**7 - 1/40*x**8
# Cumulative integral using scipy
y_int_trapz = y_int[0] + sp_integrate.cumulative_trapezoid(y,dx=dx,initial=0)
print('Integration error using scipy.integrate:')
print(' trapezoid = %9.5f' % np.linalg.norm(y_int_trapz-y_int))
# Cumulative integral using gradiompy
y_int_trapz = gp_integrate.cumulative_trapezoid(y,dx=dx,initial=y_int[0])
y_int_simps = gp_integrate.cumulative_simpson(y,dx=dx,initial=y_int[0])
print('\nIntegration error using gradiompy.integrate:')
print(' trapezoid = %9.5f' % np.linalg.norm(y_int_trapz-y_int))
print(' simpson = %9.5f' % np.linalg.norm(y_int_simps-y_int))
# Higher order cumulative integrals
for order in range(5,8,2):
    y_int_composite = gp_integrate.cumulative_composite(y,dx,order=order,initial=y_int[0])
    print(' order %i = %9.5f' % (order,np.linalg.norm(y_int_composite-y_int)))
# Display the values of the cumulative integral
print('\nCumulative integral (with initial offset):\n',y_int_composite)
You should get the following result:
'''
Integration error using scipy.integrate:
trapezoid = 176.10502
Integration error using gradiompy.integrate:
trapezoid = 176.10502
simpson = 2.52551
order 5 = 0.48758
order 7 = 0.00000
Cumulative integral (with initial offset):
[-6.90203571e+02 -2.29979407e+02 -5.92267425e+01 -7.66415188e+00
2.64794452e+00 2.25594840e+00 6.61937372e-01 1.14797061e-13
8.20130517e-01 3.61254267e+00 8.55804341e+00 1.48428883e+01
1.97293221e+01 1.64257877e+01 -1.13464286e+01]
'''
I would go with Daniel's solution, but you need to be careful if the function you are integrating is itself subject to fluctuations. Simpson's rule requires the function to be well-behaved (meaning, in this case, one that is continuous).
There are techniques for making a moderately badly behaved function look better behaved than it really is (really, forms of approximation of your function), but then you have to be sure that the approximation is "adequate" for your purposes. In that case you might also make the intervals non-uniform to handle the problem.
An example might be in considering the flow of a field that, over longer time scales, is approximated by a well-behaved function but which over shorter periods is subject to limited random fluctuations in its density.
I have values in the form of (x, y, z). By creating a list_plot3d plot I can clearly see that they are not quite evenly spaced. They usually form little "blobs" of 3 to 5 points on the xy plane. So for the interpolation and the final "contour" plot to be better, or should I say smoother, do I have to create a rectangular grid (like the squares on a chessboard) so that the blobs of data are somehow "smoothed"? I understand that this might be trivial to some people, but I am trying this for the first time and I am struggling a bit. I have been looking at scipy packages like scipy.interpolate.interp2d, but the graphs produced at the end are really bad. Maybe a brief tutorial on 2D interpolation in SageMath for an amateur like me? Some advice? Thank you.
EDIT:
https://docs.google.com/file/d/0Bxv8ab9PeMQVUFhBYWlldU9ib0E/edit?pli=1
This is mostly the kind of graph it produces, along with this message:
Warning: No more knots can be added because the number of B-spline coefficients
already exceeds the number of data points m. Probably causes: either s or m too small. (fp>s)
    kx,ky=3,3  nx,ny=17,20  m=200  fp=4696.972223  s=0.000000
To get this graph I just run this command:
f_interpolation = scipy.interpolate.interp2d(*zip(*matrix(C)),kind='cubic')
plot_interpolation = contour_plot(lambda x, y: f_interpolation(x, y)[0],
                                  (22.419, 22.439), (37.06, 37.08),
                                  cmap='jet', contours=numpy.arange(0, 1400, 100),
                                  colorbar=True)
plot_all = plot_interpolation
plot_all.show(axes_labels=["m", "m"])
Where matrix(C) can be a huge matrix, like 10000 x 3 or even much larger, like 1000000 x 3. The problem of bad graphs persists even with fewer data, as in the picture attached above, where matrix(C) was only 200 x 3. That's why I am beginning to think that, apart from a possible glitch in the program, my approach to using this command might be totally wrong; hence my asking for advice about using a grid rather than just "throwing" my data into a command.
I've had a similar problem using the scipy.interpolate.interp2d function. My understanding is that the issue arises because the interp1d/interp2d and related functions use an older wrapping of FITPACK for the underlying calculations. I was able to get a problem similar to yours to work using the spline functions, which rely on a newer wrapping of FITPACK. The spline functions can be identified because they seem to all have capital letters in their names here http://docs.scipy.org/doc/scipy/reference/interpolate.html. Within the scipy installation, these newer functions appear to be located in scipy/interpolate/fitpack2.py, while the functions using the older wrappings are in fitpack.py.
For your purposes, RectBivariateSpline is what I believe you want. Here is some sample code for implementing RectBivariateSpline:
import numpy as np
from scipy import interpolate
# Generate unevenly spaced x/y data for axes
npoints = 25
maxaxis = 100
x = (np.random.rand(npoints)*maxaxis) - maxaxis/2.
y = (np.random.rand(npoints)*maxaxis) - maxaxis/2.
xsort = np.sort(x)
ysort = np.sort(y)
# Generate the z-data, which first requires converting
# x/y data into grids
xg, yg = np.meshgrid(xsort,ysort)
z = xg**2 - yg**2
# Generate the interpolated, evenly spaced data
# Note that the min/max of x/y isn't necessarily 0 and 100 since
# randomly chosen points were used. If we want to avoid extrapolation,
# the explicit min/max must be found
interppoints = 100
xinterp = np.linspace(xsort[0],xsort[-1],interppoints)
yinterp = np.linspace(ysort[0],ysort[-1],interppoints)
# Generate the kernel that will be used for interpolation
# Note that the default uses cubic splines (kx=ky=3) for the
# interpolation. Higher order interpolation can be used by setting
# kx and ky to larger integers, i.e.
# interpolate.RectBivariateSpline(xsort,ysort,z,kx=5,ky=5)
kernel = interpolate.RectBivariateSpline(xsort,ysort,z)
# Now evaluate the interpolant on the evenly spaced grid
zinterp = kernel(xinterp, yinterp)
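To inspect the result, a quick contour plot of the interpolated grid could look like this (a sketch, assuming matplotlib is available; note the transpose, since RectBivariateSpline returns values indexed as z[i, j] = f(x[i], y[j]) while contourf expects the first axis to run over y):

import matplotlib.pyplot as plt

plt.contourf(xinterp, yinterp, zinterp.T, levels=20, cmap='jet')
plt.colorbar()
plt.show()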
I'm trying to implement a variable exponential moving average on a time series of intraday data (i.e. 10-second intervals). By variable, I mean that the size of the window included in the moving average depends on another factor (e.g. volatility). I was thinking of the following:
MA(t) = alpha(t) * price(t) + (1 - alpha(t)) * MA(t-1),
where alpha corresponds, for example, to a changing volatility index.
In a backtest on huge series (more than 100,000 points), this computation causes me trouble. I have the complete vectors alpha and price, but for the current value of MA I always need the value calculated just before, so I do not see a vectorized solution so far.
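For concreteness, the plain recursive computation looks roughly like this (a sketch; the starting value MA[1] is an arbitrary choice):

n <- length(price)
MA <- numeric(n)
MA[1] <- price[1]   # assumed starting value
for (t in 2:n) {
  MA[t] <- alpha[t] * price[t] + (1 - alpha[t]) * MA[t - 1]
}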
Another idea I had was to directly apply the existing EMA(.., n = f()) function to every data point, with a different value of f() each time, but I have not found a fast solution that way either.
It would be very kind if somebody could help me with my problem. Other suggestions for how to construct a variable moving average would also be great.
Thanks a lot in advance,
Martin
A very efficient moving average operation is also possible via filter():
## create a weight vector -- this one has equal weights, other schemes possible
weights <- rep(1/nobs, nobs)
## and apply it as a one-sided moving average calculation, see help(filter)
movavg <- as.vector(filter(somevector, weights, method="convolution", sides=1))
That was left-sided only, other choices are possible.
For time series, see the function rollmean in the zoo package.
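For illustration (a sketch with made-up data, not part of the original answer), a right-aligned rolling mean with zoo looks like:

library(zoo)
z <- zoo(runif(100), order.by = 1:100)     # toy series; use your own data here
movavg_zoo <- rollmean(z, k = 10, align = "right", fill = NA)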
You actually don't calculate a moving average, but some kind of weighted cumulative average. A (weighted) moving average would be something like:
price <- runif(100,10,1000)
alpha <- rbeta(100,1,0.5)
tp <- embed(price,2)
ta <- embed(alpha,2)
MA1 <- apply(cbind(tp,ta), 1, function(x){
  weighted.mean(x[1:2], w = 2*x[3:4]/sum(x[3:4]))
})
Make sure you rescale the weights so they sum to the number of observations.
For your own calculation, you could try something like :
n <- length(price)
MAt <- price*alpha
ma.MAt <- matrix(rep(MAt,each=n),nrow=n)
ma.MAt[upper.tri(ma.MAt)] <- 0
tt1 <- sapply(1:n, function(x){
  tmp <- rev(c(rep(0,n-x),1,cumprod(rev(alpha[1:(x-1)])))[1:n])
  sum(ma.MAt[x,]*tmp)
})
This calculates the averages as linear combinations of MAt, with weights defined by the cumulative product of alpha.
On a side note: I assumed the index lies somewhere between 0 and 1.
I just added a VMA function to the TTR package to do this. For example:
library(quantmod) # loads TTR
getSymbols("SPY")
SPY$absCMO <- abs(CMO(Cl(SPY),20))/100
SPY$vma <- VMA(Cl(SPY), SPY$absCMO)
chartSeries(SPY,TA="addTA(SPY$vma,on=1,col='blue')")
x <- xts(rnorm(1e6),Sys.time()-1e6:1)
y <- xts(runif(1e6),Sys.time()-1e6:1)
system.time(VMA(x,y)) # < 0.5s on a 2.2Ghz Centrino
A couple of notes from the documentation:
‘VMA’ calculates a variable-length moving average based on the absolute value of ‘w’. Higher (lower) values of ‘w’ will cause ‘VMA’ to react faster (slower).
The pre-compiled binaries should be on R-forge within 24 hours.