Circumventing R's `Error in if (nbins > .Machine$integer.max)` - sql

This is a saga which began with the problem of how to do survey weighting. Now that I appear to be doing that correctly, I have hit a bit of a wall (see previous post for details on the import process and where the strata variable came from):
> require(foreign)
> ipums <- read.dta('/path/to/data.dta')
> require(survey)
> ipums.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=perwt)
Error in if (nbins > .Machine$integer.max) stop("attempt to make a table with >= 2^31 elements") :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In pd * (as.integer(cat) - 1L) : NAs produced by integer overflow
2: In pd * nl : NAs produced by integer overflow
> traceback()
9: tabulate(bin, pd)
8: as.vector(data)
7: array(tabulate(bin, pd), dims, dimnames = dn)
6: table(ids[, 1], strata[, 1])
5: inherits(x, "data.frame")
4: is.data.frame(x)
3: rowSums(table(ids[, 1], strata[, 1]) > 0)
2: svydesign.default(id = ~serial, weights = ~perwt, strata = ~strata,
data = ipums)
1: svydesign(id = ~serial, weights = ~perwt, strata = ~strata, data = ipums)
This error seems to come from the tabulate function, which I hoped would be straightforward enough to circumvent, first by changing .Machine$integer.max
> .Machine$integer.max <- 2^40
and when that didn't work the whole source code of tabulate:
> tabulate <- function(bin, nbins = max(1L, bin, na.rm=TRUE))
{
if(!is.numeric(bin) && !is.factor(bin))
stop("'bin' must be numeric or a factor")
#if (nbins > .Machine$integer.max)
if (nbins > 2^40) #replacement line
stop("attempt to make a table with >= 2^31 elements")
.C("R_tabulate",
as.integer(bin),
as.integer(length(bin)),
as.integer(nbins),
ans = integer(nbins),
NAOK = TRUE,
PACKAGE="base")$ans
}
Neither circumvented the problem. Apparently this is one reason why the ff package was created, but what worries me is the extent to which this is a problem I cannot avoid in R. This post seems to indicate that even if I were to use a package that would avoid this problem, I would only be able to access 2^31 elements at a time. My hope was to use sql (either sqlite or postgresql) to get around the memory problems, but I'm afraid I'll spend a while getting that to work, only to run into the same fundamental limit.
Attempting to switch back to Stata doesn't solve the problem either. Again see the previous post for how I use svyset, but the calculation I would like to run causes Stata to hang:
svy: mean age, over(strata)
Whether throwing more memory at it will solve the problem I don't know. I run R on my desktop which has 16 gigs, and I use Stata through a Windows server, currently setting memory allocation to 2000MB, but I could theoretically experiment with increasing that.
So in sum:
Is this a hard limit in R?
Would sql solve my R problems?
If I split it up into many separate files would that fix it (a lot of work...)?
Would throwing a lot of memory at Stata do it?
Am I seriously barking up the wrong tree somehow?

Yes, R uses 32-bit indexes for vectors so they can contain no more than 2^31-1 entries and you are trying to create something with 2^40. There is talk of introducing 64-bit indexes but that will be some way off before appearing in R. Vectors have the stated hard limit and that is it as far as base R is concerned.
I am unfamiliar with the details of what you are doing to offer any further advice on the other parts of your Q.
Why do you want to work with the full data set? Wouldn't a smaller sample that can fit in to the restrictions R places on you be just as useful? You could use SQL to store all the data and query it from R to return a random subset of more appropriate size.

Since this question was asked some time ago, I'd like to point that my answer here uses the version 3.3 of the survey package.
If you check the code of svydesign, you can see that the function that causes all the problem is within a check step that looks whether you should set the nest parameter to TRUE or not. This step can be disabled setting the option check.strata=FALSE.
Of course, you shouldn't disable a check step unless you know what you are doing. In this case, you should be able to decide yourself whether you need to set the nest option to TRUE or FALSE. nest should be set to TRUE when the same PSU (cluster) id is recycled in different strata.
Concretely for the IPUMS dataset, since you are using the serial variable for cluster identification and serial is unique for each household in a given sample, you may want to set nest to FALSE.
So, your survey design line would be:
ipums.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=perwt, check.strata=FALSE, nest=FALSE)
Extra advice: even after circumventing this problem you will find that the code is pretty slow unless you remap strata to a range from 1 to length(unique(ipums$strata)):
ipums$strata <- match(ipums$strata,unique(ipums$strata))

Both #Gavin and #Martin deserve credit for this answer, or at least leading me in the right direction. I'm mostly answering it separately to make it easier to read.
In the order I asked:
Yes 2^31 is a hard limit in R, though it seems to matter what type it is (which is a bit strange given it is the length of the vector, rather than the amount of memory (which I have plenty of) which is the stated problem. Do not convert strata or id variables to factors, that will just fix their length and nullify the effects of subsetting (which is the way to get around this problem).
sql could probably help, provided I learn how to use it correctly. I did the following test:
library(multicore) # make svy fast!
ri.ny <- subset(ipums, statefips_num %in% c(36, 44))
ri.ny.design <- svydesign(id=~serial, weights=~perwt, strata=~strata, data=ri.ny)
svyby(~incwage, ~strata, ri.ny.design, svymean, data=ri.ny, na.rm=TRUE, multicore=TRUE)
ri <- subset(ri.ny, statefips_num==44)
ri.design <- svydesign(id=~serial, weights=~perwt, strata=~strata, data=ri)
ri.mean <- svymean(~incwage, ri.design, data=ri, na.rm=TRUE)
ny <- subset(ri.ny, statefips_num==36)
ny.design <- svydesign(id=~serial, weights=~perwt, strata=~strata, data=ny)
ny.mean <- svymean(~incwage, ny.design, data=ny, na.rm=TRUE, multicore=TRUE)
And found the means to be the same, which seems like a reasonable test.
So: in theory, provided I can split up the calculation by either using plyr or sql, the results should still be fine.
See 2.
Throwing a lot of memory at Stata definitely helps, but now I'm running into annoying formatting issues. I seem to be able to perform most of the calculation I want (much quicker and with more stability as well) but I can't figure out how to get it into the form I want. Will probably ask a separate question on this. I think the short version here is that for big survey data, Stata is much better out of the box.
In many ways yes. Trying to do analysis with data this big is not something I should have taken on lightly, and I'm far from figuring it out even now. I was using the svydesign function correctly, but I didn't really know what's going on. I have a (very slightly) better grasp now, and it's heartening to know I was generally correct about how to solve the problem. #Gavin's general suggestion of trying out small data with external results to compare to is invaluable, something I should have started ages ago. Many thanks to both #Gavin and #Martin.

Related

X and Y inputs in LabVIEW

I am new to LabVIEW and I am trying to read a code written in LabVIEW. The block diagram is this:
This is the program to input x and y functions into the voltage input. It is meant to give an input voltage in different forms (sine, heartshape , etc.) into the fast-steering mirror or galvano mirror x and y axises.
x and y function controls are for inputting a formula for a function, and then we use "evaluation single value" function to input into a daq assistant.
I understand that { 2*(|-Mpi|)/N }*i + -Mpi*pi goes into the x value. However, I dont understand why we use this kind of formula. Why we need to assign a negative value and then do the absolute value of -M*pi. Also, I don`t understand why we need to divide to N and then multiply by i. And finally, why need to add -Mpi again? If you provide any hints about this I would really appreciate it.
This is just a complicated way to write the code/formula. Given what the code looks like (unnecessary wire bends, duplicate loop-input-tunnels, hidden wires, unnecessary coercion dots, failure to use appropriate built-in 'negate' function) not much care has been given in writing it. So while it probably yields the correct results you should not expect it to do so in the most readable way.
To answer you specific questions:
Why we need to assign a negative value and then do the absolute value
We don't. We can just move the negation immediately before the last addition or change that to a subtraction:
{ 2*(|Mpi|)/N }*i - Mpi*pi
And as #yair pointed out: We are not assigning a value here, we are basically flipping the sign of whatever value the user entered.
Why we need to divide to N and then multiply by i
This gives you a fraction between 0 and 1, no matter how many steps you do in your for-loop. Think of N as a sampling rate. I.e. your mirrors will always do the same movement, but a larger N just produces more steps in between.
Why need to add -Mpi again
I would strongly assume this is some kind of quick-and-dirty workaround for a bug that has not been fixed properly. Looking at the code it seems this +Mpi*pi has been added later on in the development process. And while I don't know what the expected values are I would believe that multiplying only one of the summands by Pi is probably wrong.

Setting initial values for non-linear parameters via tabuSearch

I'm trying to fit the lppl model to KLSE index to predict the most probable crash time. Many papers suggested tabuSearch to identify the initial value for non-linear parameters but none of them publish their code. I have tried to fit the mentioned index with the help of NLS And Log-Periodic Power Law (LPPL) in R. But the obtained error and p values are not significant. I believe that the initial values are not accurate. Can anyone help me on how to find the proper initial values?
library(tseries)
library(zoo)
ts<-get.hist.quote(instrument="^KLSE",start="2003-04-18",end="2008-01-30",quote="Close",provider="yahoo",origin="1970-01-01",compression="d",retclass="zoo")
df<-data.frame(ts)
df<-data.frame(Date=as.Date(rownames(df)),Y=df$Close)
df<-df[!is.na(df$Y),]
library(minpack.lm)
library(ggplot2)
df$days<-as.numeric(df$Date-df[1,]$Date)
f<-function(pars,xx){pars$a + (pars$tc - xx)^pars$m *(pars$b+ pars$c * cos(pars$omega*log(pars$tc - xx) + pars$phi))}
resids<-function(p,observed,xx){df$Y-f(p,xx)}
nls.out <- nls.lm(par=list(a=600,b=-266,tc=3000, m=.5,omega=7.8,phi=-4,c=-14),fn = resids, observed = df$Y, xx = df$days, control= nls.lm.control (maxiter =1024, ftol=1e-6, maxfev=1e6))
par<-nls.out$par
nls.final<-nls(Y~(a+(tc-days)^m*(b+c*cos(omega*log(tc-days)+phi))),data=df,start=par,algorithm="plinear",control=nls.control(maxiter=10024,minFactor=1e-8))
summary(nls.final)
I would look at some of the newer research on this topic, there is a good trig modification that will practically guarantee a singular optimization. Additionally, you can use r's built in linear equation solver, to find the linearizable parameters, ergo you will only need to optimize in 3 dimensions. The link below should get you started. I would cite recent literature and personal experience to strongly advise against a tabu search.
https://www.ethz.ch/content/dam/ethz/special-interest/mtec/chair-of-entrepreneurial-risks-dam/documents/dissertation/master%20thesis/MAS_final_Tuncay.pdf

Root finding with a kinked function using NLsolve in Julia

I am currently trying to solve a complementarity problem with a function that features a downward discontinuity, using the mcpsolve() function of the NLsolve package in Julia. The function is reproduced here for specific parameters, and the numbers below refer to the three panels of the figure.
Unfortunately, the algorithm does not always return the interior solution, even though it exists:
In (1), when starting at 0, the algorithm stays at 0, thinking that the boundary constraint binds,
In (2), when starting at 0, the algorithm stops right before the downward jump, even though the solution lies to the right of this point.
These problems occur regardless of the method used - trust region or Newton's method. Ideally, the algorithm would look for potential solutions in the entire set before stopping.
I was wondering if some of you had worked with similar functions, and had found a clever solution to bypass these issues. Note that
Starting to the right of the solution would not solve these problems, as they would also occur for other parametrization - see (3) this time,
It is not known a priori where in the parameter space the particular cases occur.
As an illustrative example, consider the following piece of code. Note that the function is smoother, and yet here as well the algorithm cannot find the solution.
function f!(x,fvec)
if x[1] <= 1.8
fvec[1] = 0.1 * (sin(3*x[1]) - 3)
else
fvec[1] = 0.1 * (x[1]^2 - 7)
end
end
NLsolve.mcpsolve(f!,[0.], [Inf], [0.], reformulation = :smooth, autodiff = true)
Once more, setting the initial value to something else than 0 would only postpone the problem. Also, as pointed out by Halirutan, fzero from the Roots package would probably work, but I'd rather use mcpsolve() as the problem is initially a complementarity problem.
Thank you very much in advance for your help.

Clearing numerical values in Mathematica

I am working on fairly large Mathematica projects and the problem arises that I have to intermittently check numerical results but want to easily revert to having all my constructs in analytical form.
The code is fairly fluid I don't want to use scoping constructs everywhere as they add work overhead. Is there an easy way for identifying and clearing all assignments that are numerical?
EDIT: I really do know that scoping is the way to do this correctly ;-). However, for my workflow I am really just looking for a dirty trick to nix all numerical assignments after the fact instead of having the foresight to put down a Block.
If your assignments are on the top level, you can use something like this:
a = 1;
b = c;
d = 3;
e = d + b;
Cases[DownValues[In],
HoldPattern[lhs_ = rhs_?NumericQ] |
HoldPattern[(lhs_ = rhs_?NumericQ;)] :> Unset[lhs],
3]
This will work if you have a sufficient history length $HistoryLength (defaults to infinity). Note however that, in the above example, e was assigned 3+c, and 3 here was not undone. So, the problem is really ambiguous in formulation, because some numbers could make it into definitions. One way to avoid this is to use SetDelayed for assignments, rather than Set.
Another alternative would be to analyze the names in say Global' context (if that is the context where your symbols live), and then say OwnValues and DownValues of the symbols, in a fashion similar to the above, and remove definitions with purely numerical r.h.s.
But IMO neither of these approaches are robust. I'd still use scoping constructs and try to isolate numerics. One possibility is to wrap you final code in Block, and assign numerical values inside this Block. This seems a much cleaner approach. The work overhead is minimal - you just have to remember which symbols you want to assign the values to. Block will automatically ensure that outside it, the symbols will have no definitions.
EDIT
Yet another possibility is to use local rules. For example, one could define rule[a] = a->1; rule[d]=d->3 instead of the assignments above. You could then apply these rules, extracting them as say
DownValues[rule][[All, 2]], whenever you want to test with some numerical arguments.
Building on Andrew Moylan's solution, one can construct a Block like function that would takes rules:
SetAttributes[BlockRules, HoldRest]
BlockRules[rules_, expr_] :=
Block ## Append[Apply[Set, Hold#rules, {2}], Unevaluated[expr]]
You can then save your numeric rules in a variable, and use BlockRules[ savedrules, code ], or even define a function that would apply a fixed set of rules, kind of like so:
In[76]:= NumericCheck =
Function[body, BlockRules[{a -> 3, b -> 2`}, body], HoldAll];
In[78]:= a + b // NumericCheck
Out[78]= 5.
EDIT In response to Timo's comment, it might be possible to use NotebookEvaluate (new in 8) to achieve the requested effect.
SetAttributes[BlockRules, HoldRest]
BlockRules[rules_, expr_] :=
Block ## Append[Apply[Set, Hold#rules, {2}], Unevaluated[expr]]
nb = CreateDocument[{ExpressionCell[
Defer[Plot[Sin[a x], {x, 0, 2 Pi}]], "Input"],
ExpressionCell[Defer[Integrate[Sin[a x^2], {x, 0, 2 Pi}]],
"Input"]}];
BlockRules[{a -> 4}, NotebookEvaluate[nb, InsertResults -> "True"];]
As the result of this evaluation you get a notebook with your commands evaluated when a was locally set to 4. In order to take it further, you would have to take the notebook
with your code, open a new notebook, evaluate Notebooks[] to identify the notebook of interest and then do :
BlockRules[variablerules,
NotebookEvaluate[NotebookPut[NotebookGet[nbobj]],
InsertResults -> "True"]]
I hope you can make this idea work.

how to optimize looped searching in R?

I am trying to scan a dataset using a loop in R, to see if the data points in the subset of data fulfill some rules, an example is pasted here:
loop.ward <- 1
loop.control.chart <- 1
while (loop.ward <= length(unique(control.chart[,"ward"])))
{
loop.weekly.count <- 3
while (loop.weekly.count <= nrow(control.chart[control.chart[,"ward"]==unique(control.chart[,"ward"])[loop.ward] ,]))
{
temp <- control.chart[control.chart[,"ward"]==unique(control.chart[,"ward"])[loop.ward] ,][(loop.weekly.count-2):(loop.weekly.count),]
if (nrow(temp[temp["UWL.out"]==1,])>=2 | nrow(temp[temp["LWL.out"]==1,])>=2)
{
control.chart[control.chart[,"ward"]==unique(control.chart[,"ward"])[loop.ward],][(loop.weekly.count),"rules.violated"] <- 2
}
loop.control.chart <- loop.control.chart + 1
loop.weekly.count <- loop.weekly.count + 1
}
loop.ward <- loop.ward + 1
}
The code is doing what I need, but in a very slow way (I have around 7000 data points total for all wards (i.e. max(loop.control.chart) = 7000), and 4 rules to do the scanning, it took me 10 minutes to finished one round.
I can't think of anyway of optimizing it (I thought of tapply, but don't know how to implement it), any suggestions?
Thanks!
(And one more trivial question, is the code here readable to you?)
Update
A portion of the data is attached here for your reference.
Could you give a sample of the data? It is difficult to divine the structure from what you've given. I think plyr might be helpful, but I can't be that much more specific without example data.
The code is not that readable. I understand what you are doing, but it seem unnecessarily verbose. It also appears based on some of the stuff you done to be a bit hacked together particularly the if conditional. There is also quite a bit of duplicate code.
I don't mean to be harsh as I'm sure the code makes perfect sense from your view, Perhaps it'll seem clearer with example data.