Using "rollmedian" function as a input for "arima" function - zoo

My time-series data includes date-time and temperature columns as follows:
rn25_29_o:
ambtemp dt
1 -1.96 2007-09-28 23:55:00
2 -2.02 2007-09-28 23:57:00
3 -1.92 2007-09-28 23:59:00
4 -1.64 2007-09-29 00:01:00
5 -1.76 2007-09-29 00:03:00
6 -1.83 2007-09-29 00:05:00
I am using a rolling median to smooth out small fluctuations that are caused by imprecise measurements.
unique_timeStamp <- make.time.unique(rn25_29_o$dt)
temp.zoo <- zoo(rn25_29_o$ambtemp, unique_timeStamp)
m.av <- rollmedian(temp.zoo, n, fill = list(NA, NULL, NA))  # n is the (odd) window width
Subsequently, the output of the median smoothing is used to build a temporal model and obtain predictions, using the following code:
te = (x.fit = arima(m.av, order = c(1, 0, 0)))
# fit the model and print the results
x.fore = predict(te, n.ahead=50)
Finally, I encounter the following error:
Error in seq.default(head(tt, 1), tail(tt, 1), deltat) : 'by'
argument is much too small
FYI: the modeling and prediction functions work properly when using the original time-series data.
Please guide me through this error.

The problem comes from the zoo object's irregular POSIXct index: when arima() coerces the series to a regular ts, the grid implied between the first and last timestamp is far too fine, and seq.default() aborts with "'by' argument is much too small". Fitting on the plain vector of temperatures (without the zoo index) avoids this, so the code can be amended to:
Median_ambtemp <- rollmedian(ambtemp, n, fill = list(NA, NULL, NA))
te = (x.fit = arima(Median_ambtemp, order = c(1, 0, 0)))
# fit the model and print the results
x.fore = predict(te, n.ahead=5)
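If you prefer to keep the zoo series, an alternative sketch (assuming rn25_29_o, make.time.unique() and an odd window width n are available as in the question) is to strip the irregular index with coredata() just before fitting:
library(zoo)
temp.zoo <- zoo(rn25_29_o$ambtemp, make.time.unique(rn25_29_o$dt))
m.av <- rollmedian(temp.zoo, n, fill = NA)                   # smoothed, still a zoo object
x.fit <- arima(coredata(na.omit(m.av)), order = c(1, 0, 0))  # plain numeric vector, no index
x.fore <- predict(x.fit, n.ahead = 5)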

Related

Does GrADS have an "astd" (similar to "aave") command I could use?

I would like to have the spatial standard deviation for a variable (let's say temperature). In other words, does GrADS have an "astd" (similar to "aave") command I could use?
There is no such command in GrADS, but you can compute the standard deviation in two ways:
[1] Compute it manually. For example:
*compute the mean
x1 = ave(ts1.1,t=1,t=120)
*compute stdev
s1 = sqrt(ave(pow(ts1.1-x1,2),t=1,t=120)*(n1/(n1-1)))
n1 here is the number of samples (120 in this example).
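As a side note, the n1/(n1-1) factor is just the Bessel correction that turns the population variance into the sample variance. A quick check of the same arithmetic in R (with a hypothetical vector x standing in for the 120 time samples):
set.seed(1)
x <- rnorm(120)                                    # stand-in for the 120 time samples
n <- length(x)
manual <- sqrt(mean((x - mean(x))^2) * n/(n - 1))  # same formula as the GrADS line above
all.equal(manual, sd(x))                           # TRUE: matches R's sample standard deviation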
[2] You can use the built-in 'stat' output option in GrADS,
via 'set stat on' or 'set gxout stat'.
These commands will give you statistics such as the following:
Data Type = grid
Dimensions = 0 1
I Dimension = 1 to 73 Linear 0 5
J Dimension = 1 to 46 Linear -90 4
Sizes = 73 46 3358
Undef value = -2.56e+33
Undef count = 1763 Valid count = 1595
Min, Max = 243.008 302.818
Cmin, cmax, cint = 245 300 5
Stats[sum,sumsqr,root(sumsqr),n]: 452778 1.29046e+08 11359.8 1595
Stats[(sum,sumsqr,root(sumsqr))/n]: 283.874 80906.7 284.441
Stats[(sum,sumsqr,root(sumsqr))/(n-1)]: 284.052 80957.4 284.53
Stats[(sigma,var)(n)]: 17.9565 322.437
Stats[(sigma,var)(n-1)]: 17.9622 322.64
Contouring: 245 to 300 interval 5
Sigma here is the standard deviation and Var is variance.
Is this what you are looking for?

Longitudinal Hierarchical Bayesian regression with JAGS

I'm completely new to JAGS/OpenBUGS, so I would really appreciate a push in the right direction when it comes to specifying my model. I'm using an unbalanced longitudinal data set covering 103 countries over 15 years, of which 12 years are used in this case. The DV is the Gini coefficient, which shouldn't be modeled as log-Normal but rather as Beta, although right now the focus is just on understanding how to compile the model in JAGS. I'm using a fixed-effects model for the time being.
The data and code I'm running:
> head(x)
Year II2 II3 II4 ..... II24
1 1 2.956233 40.90458 4.475183 16.443553
8 1 1.257794 85.47378 2.395186 19.333433
19 1 4.139706 141.07899 2.544640 25.555404
37 1 2.233664 98.51313 3.902835 42.533333
49 1 2.879734 61.39000 1.471334 18.884444
71 1 3.381762 60.23783 3.432614 16.334222
> head(y)
Year II1
1 1 0.3240000
8 1 0.2576667
19 1 0.3132500
37 1 0.2700000
49 1 0.2744286
71 1 0.3250000
dim(x)
1224 23
length(y)
1224
Time <- 12
N <- length(y$II1)  # No. of obs.
dat <- list(x=x, y=y, N=N, Time=Time, p=dim(x)[2])
inits <- function(){list(tau.1=1, tau.2=1, eta=1, alpha=0, beta1=0, beta2=0, beta3=0)}
model6 <- "model{
for(i in 1:N){for(t in 1:Time){
y[i,t]~dlnorm(mu[i,t],tau.1)
mu[i,t] <- inprod(x[i,t],beta[])+alpha[i]}
alpha[i]~dnorm(eta, tau.2)}
for (j in 1:p) {
beta[j]~dnorm(0,0.001)
}
eta~dnorm(0, 0.0001)
tau.2~dgamma(0.01,0.01)
tau.1~dgamma(0.01,0.01)
}"
reg.jags <- jags.model(textConnection(model6), data=dat, inits=inits, n.chains=1, n.adapt=1000)
And I keep getting this runtime error:
Error in jags.model(textConnection(model6), data = dat, inits = inits, :
RUNTIME ERROR:
Compilation error on line 3.
Index out of range taking subset of y
Any suggestions on what I should do differently would be hugely appreciated! I know there are 3 "tricks" you can apply to unbalanced data, but I'm still a little bit confused about how all of this works, i.e. how JAGS reads the data input.
Cheers
J
Your dataframe y only has 2 columns. But Time is 12. Where you have
y[i,t]~dlnorm(mu[i,t],tau.1)
inside a loop
for(t in 1:Time){
think about what happens when t goes up to 3 (on its way to Time=12).
You are asking JAGS to look at y[i,3], which doesn't exist. Hence "Index out of range".
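One way around it (a sketch, not tested against the original data) is to reshape y from long format into an N-by-Time matrix before passing it to JAGS, so that y[i,t] actually exists. Here Country is a hypothetical integer id (1 to 103) recording which country each row of the long data belongs to, and N becomes the number of countries rather than the number of long-format rows:
y.mat <- matrix(NA_real_, nrow = 103, ncol = 12)
y.mat[cbind(y$Country, y$Year)] <- y$II1    # fill the observed country-year cells
dat <- list(y = y.mat, N = 103, Time = 12)  # remaining NAs are treated by JAGS as missing responses
The covariates would need the same treatment: an N x Time x p array, indexed as x[i,t,] inside inprod().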

How to get fitted values from clogit model

I am interested in getting the fitted values at set locations from a clogit model. This includes the population level response and the confidence intervals around it. For example, I have data that looks approximately like this:
set.seed(1)
data <- data.frame(Used = rep(c(1,0,0,0),1250),
Open = round(runif(5000,0,50),0),
Activity = rep(sample(runif(24,.5,1.75),1250, replace=T), each=4),
Strata = rep(1:1250,each=4))
Within the clogit model, Activity does not vary within a stratum, thus there is no Activity main effect.
mod <- clogit(Used ~ Open + I(Open*Activity) + strata(Strata),data=data)
What I want to do is build a newdata frame from which I can eventually plot marginal fitted values at specified values of Open, similar to a newdata design in a traditional glm model, e.g.:
newdata <- data.frame(Open = seq(0,50,1),
Activity = rep(max(data$Activity),51))
However, when I try to run a predict function on the clogit, I get the following error:
fit<-predict(mod,newdata=newdata,type = "expected")
Error in Surv(rep(1, 5000L), Used) : object 'Used' not found
I realize this is because clogit in R is run through coxph, and thus the predict function is trying to predict relative risks between pairs of subjects within the same strata (in this case, Used).
My question, however, is whether there is a way around this. This is easily done in Stata (using the margins command) and manually in Excel; however, I would like to automate it in R since everything else is programmed there. I have also built this manually in R (example code below), but I keep ending up with what appear to be incorrect CIs in my real data, so I would like to rely on the predict function if possible. My code for manual prediction is:
coef <- data.frame(coef = summary(mod)$coefficients[,1],
se = summary(mod)$coefficients[,3])  # column 3 is se(coef)
coef$UpCI <- coef[,1] + (coef[,2]*2) ### this could be *1.96 but using 2 for simplicity
coef$LowCI <-coef[,1] - (coef[,2]*2) ### this could be *1.96 but using 2 for simplicity
fitted<-data.frame(Open= seq(0,50,2),
Activity=rep(max(data$Activity),26))
fitted$Marginal <- exp(coef[1,1]*fitted$Open +
coef[2,1]*fitted$Open*fitted$Activity)/
(1+exp(coef[1,1]*fitted$Open +
coef[2,1]*fitted$Open*fitted$Activity))
fitted$UpCI <- exp(coef[1,3]*fitted$Open +
coef[2,3]*fitted$Open*fitted$Activity)/
(1+exp(coef[1,3]*fitted$Open +
coef[2,3]*fitted$Open*fitted$Activity))
fitted$LowCI <- exp(coef[1,4]*fitted$Open +
coef[2,4]*fitted$Open*fitted$Activity)/
(1+exp(coef[1,4]*fitted$Open +
coef[2,4]*fitted$Open*fitted$Activity))
My end product would ideally look something like this but a product of the predict function....
Example output of fitted values.
Evidently Terry Therneau is less of a purist on the matter of predictions from clogit models: http://markmail.org/search/?q=list%3Aorg.r-project.r-help+predict+clogit#query:list%3Aorg.r-project.r-help%20predict%20clogit%20from%3A%22Therneau%2C%20Terry%20M.%2C%20Ph.D.%22+page:1+mid:tsbl3cbnxywkafv6+state:results
Here's a modification to your code that does generate the 51 predictions. I did need to put in a dummy Strata column.
newdata <- data.frame(Open = seq(0,50,1),
Activity = rep(max(data$Activity),51), Strata=1)
risk <- predict(mod,newdata=newdata,type = "risk")
> risk/(risk+1)
1 2 3 4 5 6 7
0.5194350 0.5190029 0.5185707 0.5181385 0.5177063 0.5172741 0.5168418
8 9 10 11 12 13 14
0.5164096 0.5159773 0.5155449 0.5151126 0.5146802 0.5142478 0.5138154
15 16 17 18 19 20 21
0.5133829 0.5129505 0.5125180 0.5120855 0.5116530 0.5112205 0.5107879
22 23 24 25 26 27 28
0.5103553 0.5099228 0.5094902 0.5090575 0.5086249 0.5081923 0.5077596
29 30 31 32 33 34 35
0.5073270 0.5068943 0.5064616 0.5060289 0.5055962 0.5051635 0.5047308
36 37 38 39 40 41 42
0.5042981 0.5038653 0.5034326 0.5029999 0.5025671 0.5021344 0.5017016
43 44 45 46 47 48 49
0.5012689 0.5008361 0.5004033 0.4999706 0.4995378 0.4991051 0.4986723
50 51
0.4982396 0.4978068
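If you also want interval estimates from the same predict() machinery, one sketch (subject to the same caveats about interpreting clogit predictions) is to request the linear predictor with its standard error and apply the same transform, since type = "risk" is just exp(lp):
lp <- predict(mod, newdata = newdata, type = "lp", se.fit = TRUE)
est   <- exp(lp$fit) / (1 + exp(lp$fit))
upper <- exp(lp$fit + 1.96 * lp$se.fit) / (1 + exp(lp$fit + 1.96 * lp$se.fit))
lower <- exp(lp$fit - 1.96 * lp$se.fit) / (1 + exp(lp$fit - 1.96 * lp$se.fit))
fitted <- data.frame(Open = newdata$Open, est = est, lower = lower, upper = upper)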
{Warning}: It's actually rather difficult for mere mortals to determine which of the R gods to believe on this one. I've learned so much R and statistics from each of those experts. I suspect there are matters of statistical concern or interpretation that I don't really understand.

Most efficient way to shift MultiIndex time series

I have a DataFrame that consists of many stacked time series. The index is (poolId, month) where both are integers, the "month" being the number of months since 2000. What's the best way to calculate one-month lagged versions of multiple variables?
Right now, I do something like:
cols_to_shift = ["bal", ...5 more columns...]
df_shift = df[cols_to_shift].groupby(level=0).transform(lambda x: x.shift(-1))
For my data, this took me a full 60 s to run. (I have 48k different pools and a total of 718k rows.)
I'm converting this from R code and the equivalent data.table call:
dt.shift <- dt[, list(bal=myshift(bal), ...), by=list(poolId)]
only takes 9 s to run. (Here "myshift" is something like "function(x) c(x[-1], NA)".)
Is there a way I can get the pandas version to be back in line speed-wise? I tested this on 0.8.1.
Edit: Here's an example of generating a close-enough data set, so you can get some idea of what I mean:
ids = np.arange(48000)
lens = np.maximum(np.round(15+9.5*np.random.randn(48000)), 1.0).astype(int)
id_vec = np.repeat(ids, lens)
lens_shift = np.concatenate(([0], lens[:-1]))
mon_vec = np.arange(lens.sum()) - np.repeat(np.cumsum(lens_shift), lens)
n = len(mon_vec)
df = pd.DataFrame.from_items([('pool', id_vec), ('month', mon_vec)] + [(c, np.random.rand(n)) for c in 'abcde'])
df = df.set_index(['pool', 'month'])
%time df_shift = df.groupby(level=0).transform(lambda x: x.shift(-1))
That took 64 s when I tried it. This data has every series starting at month 0; really, they should all end at month np.max(lens), with ragged start dates, but good enough.
Edit 2: Here's some comparison R code. This takes 0.8 s. Factor of 80, not good.
library(data.table)
ids <- 1:48000
lens <- as.integer(pmax(1, round(rnorm(ids, mean=15, sd=9.5))))
id.vec <- rep(ids, times=lens)
lens.shift <- c(0, lens[-length(lens)])
mon.vec <- (1:sum(lens)) - rep(cumsum(lens.shift), times=lens)
n <- length(id.vec)
dt <- data.table(pool=id.vec, month=mon.vec, a=rnorm(n), b=rnorm(n), c=rnorm(n), d=rnorm(n), e=rnorm(n))
setkey(dt, pool, month)
myshift <- function(x) c(x[-1], NA)
system.time(dt.shift <- dt[, list(month=month, a=myshift(a), b=myshift(b), c=myshift(c), d=myshift(d), e=myshift(e)), by=pool])
I would suggest you reshape the data and do a single shift versus the groupby approach:
result = df.unstack(0).shift(1).stack()
This switches the order of the levels so you'd want to swap and reorder:
result = result.swaplevel(0, 1).sortlevel(0)
You can verify it's been lagged by one period (you want shift(1) instead of shift(-1)):
In [17]: result.ix[1]
Out[17]:
a b c d e
month
1 0.752511 0.600825 0.328796 0.852869 0.306379
2 0.251120 0.871167 0.977606 0.509303 0.809407
3 0.198327 0.587066 0.778885 0.565666 0.172045
4 0.298184 0.853896 0.164485 0.169562 0.923817
5 0.703668 0.852304 0.030534 0.415467 0.663602
6 0.851866 0.629567 0.918303 0.205008 0.970033
7 0.758121 0.066677 0.433014 0.005454 0.338596
8 0.561382 0.968078 0.586736 0.817569 0.842106
9 0.246986 0.829720 0.522371 0.854840 0.887886
10 0.709550 0.591733 0.919168 0.568988 0.849380
11 0.997787 0.084709 0.664845 0.808106 0.872628
12 0.008661 0.449826 0.841896 0.307360 0.092581
13 0.727409 0.791167 0.518371 0.691875 0.095718
14 0.928342 0.247725 0.754204 0.468484 0.663773
15 0.934902 0.692837 0.367644 0.061359 0.381885
16 0.828492 0.026166 0.050765 0.524551 0.296122
17 0.589907 0.775721 0.061765 0.033213 0.793401
18 0.532189 0.678184 0.747391 0.199283 0.349949
In [18]: df.ix[1]
Out[18]:
a b c d e
month
0 0.752511 0.600825 0.328796 0.852869 0.306379
1 0.251120 0.871167 0.977606 0.509303 0.809407
2 0.198327 0.587066 0.778885 0.565666 0.172045
3 0.298184 0.853896 0.164485 0.169562 0.923817
4 0.703668 0.852304 0.030534 0.415467 0.663602
5 0.851866 0.629567 0.918303 0.205008 0.970033
6 0.758121 0.066677 0.433014 0.005454 0.338596
7 0.561382 0.968078 0.586736 0.817569 0.842106
8 0.246986 0.829720 0.522371 0.854840 0.887886
9 0.709550 0.591733 0.919168 0.568988 0.849380
10 0.997787 0.084709 0.664845 0.808106 0.872628
11 0.008661 0.449826 0.841896 0.307360 0.092581
12 0.727409 0.791167 0.518371 0.691875 0.095718
13 0.928342 0.247725 0.754204 0.468484 0.663773
14 0.934902 0.692837 0.367644 0.061359 0.381885
15 0.828492 0.026166 0.050765 0.524551 0.296122
16 0.589907 0.775721 0.061765 0.033213 0.793401
17 0.532189 0.678184 0.747391 0.199283 0.349949
Perf isn't too bad with this method (it might be a touch slower in 0.9.0):
In [19]: %time result = df.unstack(0).shift(1).stack()
CPU times: user 1.46 s, sys: 0.24 s, total: 1.70 s
Wall time: 1.71 s

Comparing vectors

I am new to R and am trying to find a better solution for accomplishing this fairly simple task efficiently.
I have a data.frame M with 100,000 rows (and many columns, of which two are relevant to this problem; I'll call them M1 and M2). I have another data.frame in which a column V1, with about 10,000 elements, is essential to this task. My task is this:
For each element of V1, find where it occurs in M2 and pull out the corresponding M1. I am able to do this with a for loop, but it is terribly slow! I am used to Matlab and Perl, and this is taking forever in R! Surely there's a better way. I would appreciate any valuable suggestions for accomplishing this task...
for (x in 1:length(V$V1)) {
start[x] = M$M1[M$M2 == V$V1[x]]
}
There is only one element that will match, so I can use the logical statement to directly get the element into the start vector. How can I vectorize this?
Thank you!
Here is another solution, using the same example as @aix.
M[match(V$V1, M$M2),]
To benchmark performance, we can use the R package rbenchmark.
library(rbenchmark)
f_ramnath = function() M[match(V$V1, M$M2),]
f_aix = function() merge(V, M, by.x='V1', by.y='M2', sort=F)
f_chase = function() M[M$M2 %in% V$V1,] # modified to return full data frame
benchmark(f_ramnath(), f_aix(), f_chase(), replications = 10000)
test replications elapsed relative
2 f_aix() 10000 12.907 7.068456
3 f_chase() 10000 2.010 1.100767
1 f_ramnath() 10000 1.826 1.000000
Another option is to use the %in% operator:
> set.seed(1)
> M <- data.frame(M1 = sample(1:20, 15, FALSE), M2 = sample(1:20, 15, FALSE))
> V <- data.frame(V1 = sample(1:20, 10, FALSE))
> M$M1[M$M2 %in% V$V1]
[1] 6 8 11 9 19 1 3 5
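One difference worth noting: %in% returns the matching values in M's row order, while match() keeps the result aligned with V$V1 (with NA wherever a value has no match). With the same toy data:
M$M1[match(V$V1, M$M2)]  # aligned with V$V1; NA for unmatched values
M$M1[M$M2 %in% V$V1]     # matched values only, in M's row order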
Sounds like you're looking for merge:
> M <- data.frame(M1=c(1,2,3,4,10,3,15), M2=c(15,6,7,8,-1,12,5))
> V <- data.frame(V1=c(-1,12,5,7))
> merge(V, M, by.x='V1', by.y='M2', sort=F)
V1 M1
1 -1 10
2 12 3
3 5 15
4 7 3
If V$V1 might contain values not present in M$M2, you may want to specify all.x=T. This will fill in the missing values with NAs instead of omitting them from the result.
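For example, if V also contained a value with no match in M$M2 (say 99, added here purely for illustration):
V2 <- data.frame(V1 = c(-1, 12, 5, 7, 99))
merge(V2, M, by.x = 'V1', by.y = 'M2', all.x = TRUE)
# the row for 99 is kept, with M1 = NA, instead of being dropped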