Currently I'm struggling to find a good way to perform the Hansen/Sargan tests of overidentifying restrictions for a three-stage least squares (3SLS) model on panel data in R. I spent the whole day digging through different forums and couldn't find a way to produce these tests with the well-known systemfit package.
My code is currently quite simple:
library(systemfit)

violence_c_3sls <- Crime ~ ln_GDP + I(ln_GDP^2) + ln_Gini
income_c_3sls   <- ln_GDP ~ Crime + ln_Gini
gini_c_3sls     <- ln_Gini ~ ln_GDP + I(ln_GDP^2) + Crime
inst <- ~ Educ_Gvmnt_Exp + I(Educ_Gvmnt_Exp^2) + Health_Exp + Pov_Head_Count_1.9

system_c_3sls <- list(violence_c_3sls, income_c_3sls, gini_c_3sls)
fitsur_c_3sls <- systemfit(system_c_3sls, "3SLS", inst = inst, data = df_new,
                           methodResidCov = "noDfCor")
summary(fitsur_c_3sls)
However, adding more instruments to create an over-identified system does not add a Hansen/Sargan test to the output, so I assume the test has to be run separately and is presumably computed from the systemfit object.
Thanks in advance.
With g equations, l exogenous variables, and k regressors, the Sargan test statistic for 3SLS is

S = u' (\Sigma^{-1} ⊗ P_W) u,

which is asymptotically χ² with g·l − k degrees of freedom, where u is the vector of stacked residuals, \Sigma is the estimated residual covariance matrix, and P_W is the projection matrix onto the exogenous variables. See Ch. 12.4 of Davidson & MacKinnon, Econometric Theory and Methods (ETM).
Calculating the Sargan test from systemfit should look something like this:
library(stringr)  # for str_detect() and str_remove() below

sargan.systemfit <- function(results3sls){
  result <- list()
  u <- as.matrix(resid(results3sls))   # model residuals, n x n_eq
  n_eq  <- length(results3sls$eq)      # number of equations
  n     <- nrow(u)                     # number of observations
  n_reg <- length(coef(results3sls))   # total number of regressors
  w <- model.matrix(results3sls, which = 'z')  # matrix of instruments, block diagonal with one block per equation
  # Need to aggregate into a single block (in case different instruments are used per equation)
  w_list <- lapply(X = 1:n_eq, FUN = function(eq_i){
    this_eq_label <- results3sls$eq[[eq_i]]$eqnLabel
    this_w <- w[str_detect(rownames(w), this_eq_label), str_detect(colnames(w), this_eq_label)]
    colnames(this_w) <- str_remove(colnames(this_w), paste0(this_eq_label, '_'))
    return(this_w)
  })
  w <- do.call(cbind, w_list)
  w <- w[, !duplicated(colnames(w))]
  n_inst <- ncol(w)  # w is n x n_inst, where n_inst is the number of unique instruments/exogenous variables
  # Estimate the residual covariance (or use residCov; should be asymptotically equivalent)
  var_u <- crossprod(u)/n  # var_u <- results3sls$residCov
  P_w <- w %*% solve(crossprod(w)) %*% t(w)  # projection matrix onto the instruments w
  # as.numeric(u) vectorizes the residuals into an (n_eq*n) x 1 vector
  result$statistic <- as.numeric(t(as.numeric(u)) %*% kronecker(solve(var_u), P_w) %*% as.numeric(u))
  result$df <- n_inst*n_eq - n_reg
  result$p.value <- 1 - pchisq(result$statistic, result$df)
  result$method <- "Sargan over-identifying restrictions test"
  return(result)
}
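Applied to an over-identified version of the system from the question (this assumes fitsur_c_3sls has been re-fitted with enough extra instruments that the degrees of freedom g·l − k are positive), the call would look something like:

sargan_test <- sargan.systemfit(fitsur_c_3sls)  # assumes an over-identified fitsur_c_3sls
sargan_test$statistic
sargan_test$df
sargan_test$p.value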
I have 4 non-negative real variables: A, B, C and X. Based on the problem I am working on, the variable X must belong to the interval [B, C], and the relation is a set of if-else conditions like this:
if A < B:
    X = B
elseif A > C:
    X = C
elseif B <= A <= C:
    X = A
As you can see, it is quite difficult to reformulate this as a mixed-integer programming problem with corresponding binary decision variables (d1, d2 and d3). I have tried reading some notes on if-then formulations with the big-M method at this site:
https://www.math.cuhk.edu.hk/course_builder/1415/math3220/L2%20(without%20solution).pdf but this problem seems more challenging than their tutorial examples.
Could you kindly provide me with a formulation for this situation?
Thank you very much!
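For reference, here is a sketch of one standard big-M encoding of this clamp relation, assuming a known constant M that bounds |A − B|, |A − C| and C − B (big-M only applies if such a bound is available); strict inequalities are relaxed to non-strict ones, which is harmless here because the branches agree at A = B and at A = C:

d1 + d2 + d3 = 1,  with d1, d2, d3 binary
A − B <= M(1 − d1)                           (d1 = 1 forces A <= B)
C − A <= M(1 − d2)                           (d2 = 1 forces A >= C)
B − A <= M(1 − d3),  A − C <= M(1 − d3)      (d3 = 1 forces B <= A <= C)
X − B <= M(1 − d1),  B − X <= M(1 − d1)      (d1 = 1 forces X = B)
X − C <= M(1 − d2),  C − X <= M(1 − d2)      (d2 = 1 forces X = C)
X − A <= M(1 − d3),  A − X <= M(1 − d3)      (d3 = 1 forces X = A)

Whichever branch the solver selects via d1, d2 or d3, only that branch's constraints are binding; the others are slack because of the M(1 − d) terms.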
For Colombia's Observatorio Fiscal[1], I am coding a simple tax minimization problem, using CLP(R) (in SWI-Prolog). I want to use minimize/1 to find the least solution first. It is instead listing the bigger solution first. Here is the code:
:- use_module(library(clpr)).
deduction(_,3). % Anyone can take the standard deduction.
deduction(Who,D) :- itemizedDeduction(Who,D). % Or they can itemize.
income(joe,10). % Joe makes $10 a year.
itemizedDeduction(joe,4). % He can deduct more if he itemizes.
taxableIncome(Who,TI) :-
    deduction(Who,D),
    income(Who,I),
    TI is I - D,
    minimize(TI).
Here is what an interactive session looks like:
?- taxableIncome(joe,N).
N = 7 ;
N = 6 ;
false.
If I switch the word "minimize" to "maximize" it behaves identically. If I include no minimize or maximize clause, it doesn't look for a third solution, but otherwise it behaves the same:
?- taxableIncome(joe,N).
N = 7 ;
N = 6.
[1] The Observatorio Fiscal is a new organization that aims to model the Colombian economy, in order to anticipate the effects of changes in the law, similar to what the Congressional Budget Office or the Tax Policy Center do in the United States.
First, let's add the following definition to the program:
:- op(950,fy, *).
*_.
Using (*)/1, we can generalize away individual goals in the program.
For example, let us generalize away the minimize/1 goal by placing * in front:
taxableIncome(Who,TI) :-
    deduction(Who,D),
    income(Who,I),
    TI #= I - D,
    * minimize(TI).
We now get:
?- taxableIncome(X, Y).
X = joe,
Y = 7 ;
X = joe,
Y = 6.
This shows that CLP(R) in fact has nothing to do with this issue! These answers show that everything is already instantiated at the time minimize/1 is called, so there is nothing left to minimize.
To truly benefit from minimize/1, you must express the task in the form of CLP(R)—or better: CLP(Q)— constraints, then apply minimize/1 on a constrained expression.
Note also that in SWI-Prolog, both CLP(R) and CLP(Q) have elementary mistakes, and you cannot trust their results.
Per Mat's response, I rewrote the program expressing the constraints using CLP. The tricky bit was that I had to first collect all (both) possible values for deduction, then convert those values into a CLP domain. I couldn't get that conversion to work in CLP(R), but I could in CLP(FD):
:- use_module(library(clpfd)).
deduction(_,3). % Anyone can take the same standard deduction.
deduction(Who,D) :- % Or they can itemize.
itemizedDeduction(Who,D).
income(joe,10).
itemizedDeduction(joe,4).
listToDomain([Elt],Elt).
listToDomain([Elt|MoreElts],Elt \/ MoreDom) :-
    MoreElts \= []
    , listToDomain(MoreElts,MoreDom).

taxableIncome(Who,TI) :-
    income(Who,I)
    , findall(D,deduction(Who,D),DList)
    , listToDomain(DList,DDomain)
    % Next are the CLP constraints.
    , DD in DDomain
    , TI #= I-DD
    , labeling([min(TI)],[TI]).
I have this data (t, TR and h) and need to estimate the parameters (a, b, c, d and e) of this model: h = (a*log(TR) + b)*(c*(t^d) - e)*41.59
t <- c(120,60,50,40,30,20,10,180,120,60,50,40,30,20,10,120,60,50,40,30,20,10,120,60,50,40,30,20,10)
TR <- c(2,2,2,2,2,2,2,5,5,5,5,5,5,5,5,10,10,10,10,10,10,10,20,20,20,20,20,20,20)
h <- c(49.4,41.6,39.1,36.2,31.9,25.1,15.9,80.2,70.3,55.2,51.6,47.3,41.7,33.2,22.1,84.2,64.1,59.9,54.6,48.1,38.6,26.2,97.5,72.8,67.9,61.6,54.3,43.7,30.1)
bell <- nls(h ~ (a*log(TR) + b)*(c*(t^d) - e)*41.59,
            start = list(a = 0.6, b = 0.3, c = 0.4, d = 0.30, e = 0.4))
I tried the "nls" and the "nls2" process but these didn't work because of this:
Error in nlsModel(formula, mf, start, wts) :
singular gradient matrix at initial parameter estimates
So, I found a possible solution following this procedure (I think it's a linearly constrained optimization):
Solve this system of equations
U1 <- log(TR)*(c1*t^d1-e1)*41.59
U2 <- (c1*t^d1-e1)*41.59
U3 <- t^d1*(a1*log(TR)+b1)*41.59
U4 <- c1*t^d1*log(t)*((a1*log(TR))+b1)*41.59
U5 <- -(a1*log(TR)+b1)*41.59
Starting from the initial values assumed for the parameters of the model, solve the system of linear equations successively, in order to obtain deviations that are closer to zero each time:
ΣZ = (a1 − a) ΣU1,i + (b1 − b) ΣU2,i + (c1 − c) ΣU3,i + (d1 − d) ΣU4,i + (e1 − e) ΣU5,i
At each iteration, the parameters take the values obtained in the previous iteration, plus the deviations...
How could I do this in R?
I'm sorry, I'm an R beginner and I don't speak English very well.
Many thanks for any help you can give me.
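In case it helps to see the mechanics, the procedure described above is essentially a hand-rolled Gauss-Newton iteration. A minimal sketch in R (using U1..U5 as columns of a derivative matrix and least squares for the parameter increments; the starting values are the same guesses as in the nls call, and convergence is not guaranteed) could look like this:

## Sketch of the iterative linearisation described above (hand-rolled Gauss-Newton)
f <- function(p, t, TR) (p["a"]*log(TR) + p["b"]) * (p["c"]*t^p["d"] - p["e"]) * 41.59

p <- c(a = 0.6, b = 0.3, c = 0.4, d = 0.30, e = 0.4)   # initial guesses
for (iter in 1:100) {
  U <- cbind(
    log(TR) * (p["c"]*t^p["d"] - p["e"]) * 41.59,                    # U1 = dh/da
    (p["c"]*t^p["d"] - p["e"]) * 41.59,                              # U2 = dh/db
    t^p["d"] * (p["a"]*log(TR) + p["b"]) * 41.59,                    # U3 = dh/dc
    p["c"] * t^p["d"] * log(t) * (p["a"]*log(TR) + p["b"]) * 41.59,  # U4 = dh/dd
    -(p["a"]*log(TR) + p["b"]) * 41.59                               # U5 = dh/de
  )
  z     <- h - f(p, t, TR)           # deviations at the current parameters
  delta <- qr.solve(U, z)            # least-squares increments for (a, b, c, d, e)
  p     <- p + delta
  if (max(abs(delta)) < 1e-8) break  # stop once the increments are negligible
}
p

If the derivative matrix is (nearly) singular at the starting values, this will fail in much the same way nls does, so it mainly helps for experimenting with different starting values.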
I'm trying to do a simple genomic track intersection in R, and running into major performance problems, probably related to my use of for loops.
In this situation, I have pre-defined windows at intervals of 100bp and I'm trying to calculate how much of each window is covered by the annotations in mylist. Graphically, it looks something like this:
0 100 200 300 400 500 600
windows: |-----|-----|-----|-----|-----|-----|
mylist: |-| |-----------|
So I wrote some code to do just that, but it's fairly slow and has become a bottleneck in my code:
##window for each 100-bp segment
windows <- numeric(6)
##second track
mylist = vector("list")
mylist[[1]] = c(1,20)
mylist[[2]] = c(120,320)
##do the intersection
for(i in 1:length(mylist)){
st <- floor(mylist[[i]][1]/100)+1
sp <- floor(mylist[[i]][2]/100)+1
for(j in st:sp){
b <- max((j-1)*100, mylist[[i]][1])
e <- min(j*100, mylist[[i]][2])
windows[j] <- windows[j] + e - b + 1
}
}
print(windows)
[1] 20 81 101 21 0 0
Naturally, this is being used on data sets that are much larger than the example I provide here. Through some profiling, I can see that the bottleneck is in the for loops, but my clumsy attempt to vectorize it using *apply functions resulted in code that runs an order of magnitude more slowly.
I suppose I could write something in C, but I'd like to avoid that if possible. Can anyone suggest another approach that will speed this calculation up?
The "Right" thing to do is to use the bioconductor IRanges package, which uses an IntervalTree data structure to represent these ranges.
Having both of your objects in their own IRanges objects, you would then use the findOverlaps function to win.
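As a rough sketch of what that might look like for the example above (this assumes the windows are meant to be the closed intervals [1,100], [101,200], ..., [501,600], and uses pintersect to get the overlap widths):

library(IRanges)
win   <- IRanges(start = seq(1, 501, by = 100), width = 100)  # the six 100-bp windows (assumed closed intervals)
annot <- IRanges(start = c(1, 120), end = c(20, 320))         # mylist as an IRanges object
hits  <- findOverlaps(win, annot)
ov_w  <- width(pintersect(win[queryHits(hits)], annot[subjectHits(hits)]))  # per-pair overlap widths
covered <- numeric(length(win))
sums  <- tapply(ov_w, queryHits(hits), sum)   # sum the overlap widths per window
covered[as.integer(names(sums))] <- sums
covered   # 20 81 100 20 0 0

Note that with these window boundaries the bases at the window edges are not double-counted, so the third and fourth windows come out as 100 and 20 rather than 101 and 21.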
Get it here:
http://www.bioconductor.org/packages/release/bioc/html/IRanges.html
By the by, the internals of the package are written in C, so it's super fast.
EDIT
On second thought, it's not as much of a slam-dunk as I'm suggesting (a one-liner), but you should definitely start using this library if you're working at all with genomic intervals (or other types of intervals) ... you'll likely need to do some set operations and the like. Sorry, I don't have time to provide the exact answer, though.
I just thought it's important to point this library out to you.
So I'm not entirely sure why the third and fourth windows aren't 100 and 20, because that would make more sense to me. Here's a one-liner for that behavior:
Reduce('+', lapply(mylist, function(x) hist(x[1]:x[2], breaks = (0:6) * 100, plot = F)$counts))
Note that you need to specify the upper bound in breaks, but it shouldn't be hard to make another pass to get it if you don't know it in advance.
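For instance, that extra pass could be something like this (a sketch, assuming every element of mylist is a start/end pair):

ub <- ceiling(max(sapply(mylist, function(x) x[2])) / 100)  # number of 100-bp bins needed
Reduce('+', lapply(mylist, function(x) hist(x[1]:x[2], breaks = (0:ub) * 100, plot = FALSE)$counts))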
Okay, so I wasted WAY too much time on this, and still only got a factor of 3 speed-up. Can anyone beat this?
The code:
my <- do.call(rbind,mylist)
myFloor <- floor(my/100)
myRem <- my%%100
#Add intervals, over counting interval endpoints
counts <- table(do.call(c,apply(myFloor,1,function(r) r[1]:r[2])))
windows[as.numeric(names(counts))+1] <- counts*101
#subtract off lower and upper endpoints
lowerUncovered <- tapply(myRem[,1],myFloor[,1],sum)
windows[as.numeric(names(lowerUncovered))+1] <- windows[as.numeric(names(lowerUncovered))+1] - lowerUncovered
upperUncovered <- tapply(myRem[,2],myFloor[,2],function(x) 100*length(x) - sum(x))
windows[as.numeric(names(upperUncovered))+1] <- windows[as.numeric(names(upperUncovered))+1] - upperUncovered
The test:
mylist = vector("list")
for(i in 1:20000){
d <- round(runif(1,,500))
mylist[[i]] <- c(d,d+round(runif(1,,700)))
}
windows <- numeric(200)
new_code <-function(){
my <- do.call(rbind,mylist)
myFloor <- floor(my/100)
myRem <- my%%100
counts <- table(do.call(c,apply(myFloor,1,function(r) r[1]:r[2])))
windows[as.numeric(names(counts))+1] <- counts*101
lowerUncovered <- tapply(myRem[,1],myFloor[,1],sum)
windows[as.numeric(names(lowerUncovered))+1] <- windows[as.numeric(names(lowerUncovered))+1] - lowerUncovered
upperUncovered <- tapply(myRem[,2],myFloor[,2],function(x) 100*length(x) - sum(x))
windows[as.numeric(names(upperUncovered))+1] <- windows[as.numeric(names(upperUncovered))+1] - upperUncovered
#print(windows)
}
#old code
old_code <- function(){
for(i in 1:length(mylist)){
st <- floor(mylist[[i]][1]/100)+1
sp <- floor(mylist[[i]][2]/100)+1
for(j in st:sp){
b <- max((j-1)*100, mylist[[i]][1])
e <- min(j*100, mylist[[i]][2])
windows[j] <- windows[j] + e - b + 1
}
}
#print(windows)
}
system.time(old_code())
system.time(new_code())
The result:
> system.time(old_code())
user system elapsed
2.403 0.021 2.183
> system.time(new_code())
user system elapsed
0.739 0.033 0.588
Very frustrating that the system time is basically 0 while the elapsed time is still so large. I bet if you did go down to C you would get a 50-100X speed-up.
I think I have made it much more complicated...
system.time() didn't help me evaluate performance on such a small dataset.
windows <- numeric(6)
mylist = vector("list")
mylist[[1]] = c(1,20)
mylist[[2]] = c(120,320)
library(plyr)
l_ply(mylist, function(x) {
sapply((floor(x[1]/100)+1) : (floor(x[2]/100)+1), function(z){
eval.parent(parse(text=paste("windows[",z,"] <- ",
min(z*100, x[2]) - max((z-1)*100, x[1]) + 1,sep="")),sys.nframe())
})
})
print(windows)
EDIT
A modification to eliminate eval
g <- llply(mylist, function(x) {
ldply((floor(x[1]/100)+1) : (floor(x[2]/100)+1), function(z){
t(matrix(c(z,min(z*100, x[2]) - max((z-1)*100, x[1]) + 1),nrow=2))
})
})
for(i in 1:length(g)){
windows[unlist(g[[i]][1])] <- unlist(g[[i]][2])
}
I don't have a bright idea, but you can get rid of the inner loop and speed things up a bit. Notice that if a window falls fully within a mylist interval, then you just have to add 100 to the corresponding windows element. So only the st-th and sp-th windows need special handling.
windows <- numeric(100)
for(i in 1:length(mylist)){
win <- mylist[[i]] # for cleaner code
st <- floor(win[1]/100)+1
sp <- floor(win[2]/100)+1
# start and stop are within the same window
if (sp == st){
windows[st] <- windows[st] + (win[2]%%100) - (win[1]%%100) +1
}
# start and stop are in separate windows - take care of edges
if (sp > st){
windows[st] <- windows[st] + 100 - (win[1]%%100) + 1
windows[sp] <- windows[sp] + (win[2]%%100)
}
# windows completely inside win
if (sp > st+1){
windows[(st+1):(sp-1)] <- windows[(st+1):(sp-1)] + 100
}
}
I generated a bigger list:
cuts <- sort(sample(1:10000, 70)) # random interval endpoints
mylist <- split(cuts, gl(35,2))
and got 1.08 sec for 1000 replicates of this version versus 1.72 sec for 1000 replicates for the original. With real data the speed-up will depend on whether the intervals in mylist tend to be much longer than 100 or not.
By the way, one could rewrite the inside loop as a separate function, and then lapply it over mylist, but that does not make it work faster.
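For the record, that variant might look something like this (a sketch that just wraps the three cases above in a function and sums the per-interval vectors):

cover_one <- function(win, n_win = 100) {   # n_win matches length(windows) above
  w  <- numeric(n_win)
  st <- floor(win[1]/100) + 1
  sp <- floor(win[2]/100) + 1
  if (sp == st) w[st] <- (win[2] %% 100) - (win[1] %% 100) + 1
  if (sp > st) {
    w[st] <- 100 - (win[1] %% 100) + 1
    w[sp] <- win[2] %% 100
  }
  if (sp > st + 1) w[(st + 1):(sp - 1)] <- 100
  w
}
windows <- Reduce("+", lapply(mylist, cover_one))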