bartMachine (Bayesian Additive Regression Trees) with >1 million cases in R

I want to use BART via the bartMachine package on a data frame of just over 1 million cases. Even with a lot of tuning of the Java memory settings, R on my MacBook can only complete a BART model for about 5,000 cases; anything above that aborts because the system runs out of memory.
Is there any chance I can use bartMachine() with an input matrix of about 1 million rows (ca. 15 predictors)?
Otherwise, are there any alternative packages that would at least let me run predictor selection using BART?
Thanks for your help!

Have you tried increasing the RAM available to the JVM?
options(java.parameters="-Xmx12g") # must be set before bartMachine (rJava) is loaded

I'm the maintainer of this package. Have you tried "mem_cache_for_speed = FALSE" as an option?
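For reference, a minimal sketch combining both suggestions (the heap size, core count, and the names X and y are placeholders for your own setup):
# Give the JVM a bigger heap; this must run before bartMachine (rJava) is loaded.
options(java.parameters = "-Xmx12g")
library(bartMachine)
set_bart_machine_num_cores(4)   # optional: parallelise the Gibbs sampling

# X: data frame of ~15 predictors, y: response vector (placeholders)
bm <- bartMachine(X = X, y = y,
                  mem_cache_for_speed = FALSE)  # lower memory use at the cost of speed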

CPLEX won't display decision variables after successful solution

I set up a model for a course scheduling optimization problem using IBM CPLEX.
The decision variable is dvar boolean x[course][roomtype][timeslot];, where x is 1 if the course takes place in a room type r during timeslot t.
The model has worked perfectly fine and is feasible for all instances and scenarios I tried it on. Now, for a new scenario, I increased the number of timeslots from 46 to 240, which increased the overall number of decision variables to over 2 million instead of around 300,000.
Now, I can still run the model and after slightly longer run time I get an optimal solution. Yet, the process I had for analysis before was displaying the decision variables, sorting for the ones with the value of 1 and copying and pasting them into Excel for further analysis.
This is no longer possible: CPLEX stops responding for a very long time and then won't let me do anything at all from that point onwards (the decision-variable display seems to have a limited extent). I have to close the program and start again.
I assumed the problem was the RAM or overall memory, so I switched to my university's cloud services. But even with 128 GB of RAM, 12 cores and 500 GB of storage at hand, the performance is exactly the same as on my own laptop.
Any suggestions on what could be the problem or how to export the solution anyway?
Are there variable limits with CPLEX that would make this impossible to solve?
Thanks a lot in advance!
Indeed, displaying huge matrices can freeze the IDE.
You wrote:
"Now, I can still run the model and after slightly longer run time I get an optimal solution. Yet, the process I had for analysis before was displaying the decision variables, sorting for the ones with the value of 1 and copying and pasting them into Excel for further analysis."
You should do that with SheetWrite.
First you build a set of the ones with value 1 and then you export with SheetWrite.
An example from "Excel, Rocket science and optimization":
.mod
range A=1..2;
range B=1..3;
range C=1..4;

dvar int X[A][B][C];

subject to {
  forall(a in A, b in B, c in C) X[a][b][c] == a*b*c;
}

tuple someTuple {
  int a;
  int b;
  int c;
  int value;
};

// keep only the entries whose value is 1
{someTuple} someSet = {<i,j,k,X[i][j][k]> | i in A, j in B, k in C : X[i][j][k] == 1};
.dat
SheetConnection sheet("write3Darray.xlsx");
someSet to SheetWrite(sheet,"A1:D24");

Using memtier_benchmark; every key is missed

My misses/sec counter is populated and there are no hits at all.
The data contains keys ranging from 1 to 300K, and the stored values are strings.
memtier_benchmark -s xx.xxx.xxx.xxx -p xxxxx -P redis -t 1 -n 1 --ratio 0:1 -c 1 -x 2 --key-pattern S:S --authenticate=xxxxxxx --key-prefix=
memtier_benchmark is pretty poorly documented in this regard. If you use it out of the box on a first run, it won't simulate any cache hits, which is pretty useless for a tool designed to test cache performance.
The two key parameters here are:
--key-pattern=[SET:GET]
--ratio=[SET:GET]
--key-pattern defines the names given to keys that are set and the names of keys requested. For instance, if you use S:S, that means that the software will set the first key as memtier-0 and then immediately request memtier-1, then set memtier-1, then request memtier-2 (Go figure...). That's why you get a 100% miss result.
If you set R:R, the software will randomize the digit in the key name for both the Sets and the Gets. This will typically result in a miss ratio of > 90%, depending on how many clients and threads you set. If you're operating a cache with a miss ratio of > 90%, it's questionable whether you should be running a cache at all, so again, this is pretty useless.
To simulate what a real-world cache should be doing, you want a miss ratio of < 50%. To achieve this, you need to increase the number of Gets relative to Sets. The default ratio for memtier_benchmark is 1:10, but again, on a first run (or if you're persistently running this against a cold cache) with the default --key-pattern=S:S, you're still going to get a very high miss ratio. If you keep repeating the test against the same cache you are continually populating, you should see your miss ratio begin to fall, but that may not be something you can rely on if you're testing in an ephemeral environment.
To get a lower miss ratio on a first run, I use:
--key-pattern=S:R --ratio=1:20
This should result in a miss ratio < 50%. That's as good as I've been able to simulate. My actual cache will have a miss ratio of < 5%. I'm still trying to figure out a way to test that with memtier_benchmark.
Also, use --hide-histogram to get rid of the annoying dump of test results.
EDIT:
For a full blown 100% hit / 0% miss ratio, do the following:
Start with a cold, empty cache
Run a test that uses the --ratio= parameter so that only Sets are included in the test, in a very narrow key range:
--hide-histogram --key-pattern=S:S --key-minimum=1 --key-maximum=50 --ratio=1:0
Now, run the test again, and this time, flip the ratio so that only Gets are included:
--hide-histogram --key-pattern=S:S --key-minimum=1 --key-maximum=50 --ratio=0:1
You can then tune the hit/miss ratio by re-running both parts and expanding the --key-maximum= value
Your ratio is 0 Sets and 1 Get. Change it to 1:1 or something similar.
You need to populate the server before trying a read-only benchmark.
To populate it, you can run a write-only workload that writes every key at least once.

Hyperopt: set timeouts and modify the search space during execution

Could someone help with the following:
How do I set a timeout for each individual trial? And a timeout for the total experiment?
How do I set up a progressive strategy that eliminates/prunes a percentage of the worst-scoring branches of the search space at different stages of the experiment (while still using the current optimization algorithms)? E.g. at 30% of the total experiment budget it could remove the 50% worst-scoring classifiers and their whole branches of hyperparameters from upcoming trials, then do the same again at 60%, and so on.
Thanks a lot!
Following my exchange on hyperopt's github:
There is no per-trial timeout, but hyperopt-sklearn implements its own solution by simply wrapping the objective function. Look for "fn_with_timeout" at https://github.com/hyperopt/hyperopt-sklearn/ .
from issue 210: "the optimizers are stateless, and fmin stores all state of the experiment in the trials object. So if you remove some experiments from the trials object, it's as if they never happened. use fmin's "max_evals" parameter to interrupt search as often as you need to make these sorts of modifications. It should be fine to use repeated calls with e.g. max_evals increasing by 1 every time if you want really fine grained control."
Thanks for looking into this, @doxav. I've written some code that addresses question 1, taking part of fn_with_timeout from hyperopt-sklearn and adapting it for standard Hyperopt cost functions.
You can find it here:
https://gist.github.com/hunse/247d91d14aaa8f32b24533767353e35d

Circumventing R's `Error in if (nbins > .Machine$integer.max)`

This is a saga which began with the problem of how to do survey weighting. Now that I appear to be doing that correctly, I have hit a bit of a wall (see previous post for details on the import process and where the strata variable came from):
> require(foreign)
> ipums <- read.dta('/path/to/data.dta')
> require(survey)
> ipums.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=perwt)
Error in if (nbins > .Machine$integer.max) stop("attempt to make a table with >= 2^31 elements") :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In pd * (as.integer(cat) - 1L) : NAs produced by integer overflow
2: In pd * nl : NAs produced by integer overflow
> traceback()
9: tabulate(bin, pd)
8: as.vector(data)
7: array(tabulate(bin, pd), dims, dimnames = dn)
6: table(ids[, 1], strata[, 1])
5: inherits(x, "data.frame")
4: is.data.frame(x)
3: rowSums(table(ids[, 1], strata[, 1]) > 0)
2: svydesign.default(id = ~serial, weights = ~perwt, strata = ~strata,
data = ipums)
1: svydesign(id = ~serial, weights = ~perwt, strata = ~strata, data = ipums)
This error seems to come from the tabulate function, which I hoped would be straightforward enough to circumvent, first by changing .Machine$integer.max:
> .Machine$integer.max <- 2^40
and, when that didn't work, by redefining the whole source code of tabulate:
> tabulate <- function(bin, nbins = max(1L, bin, na.rm = TRUE))
{
    if (!is.numeric(bin) && !is.factor(bin))
        stop("'bin' must be numeric or a factor")
    #if (nbins > .Machine$integer.max)
    if (nbins > 2^40) # replacement line
        stop("attempt to make a table with >= 2^31 elements")
    .C("R_tabulate",
       as.integer(bin),
       as.integer(length(bin)),
       as.integer(nbins),
       ans = integer(nbins),
       NAOK = TRUE,
       PACKAGE = "base")$ans
}
Neither circumvented the problem. Apparently this is one reason why the ff package was created, but what worries me is the extent to which this is a problem I cannot avoid in R. This post seems to indicate that even if I were to use a package that would avoid this problem, I would only be able to access 2^31 elements at a time. My hope was to use sql (either sqlite or postgresql) to get around the memory problems, but I'm afraid I'll spend a while getting that to work, only to run into the same fundamental limit.
Attempting to switch back to Stata doesn't solve the problem either. Again see the previous post for how I use svyset, but the calculation I would like to run causes Stata to hang:
svy: mean age, over(strata)
Whether throwing more memory at it will solve the problem I don't know. I run R on my desktop which has 16 gigs, and I use Stata through a Windows server, currently setting memory allocation to 2000MB, but I could theoretically experiment with increasing that.
So in sum:
1. Is this a hard limit in R?
2. Would SQL solve my R problems?
3. If I split it up into many separate files, would that fix it (a lot of work...)?
4. Would throwing a lot of memory at Stata do it?
5. Am I seriously barking up the wrong tree somehow?
Yes, R uses 32-bit indexes for vectors, so they can contain no more than 2^31 - 1 entries, and you are trying to create something with 2^40. There is talk of introducing 64-bit indexes, but that is some way off from appearing in R. Vectors have the stated hard limit and that is it as far as base R is concerned.
I am too unfamiliar with the details of what you are doing to offer any further advice on the other parts of your question.
Why do you want to work with the full data set? Wouldn't a smaller sample that fits into the restrictions R places on you be just as useful? You could use SQL to store all the data and query it from R to return a random subset of a more appropriate size.
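A minimal sketch of that approach, assuming the data has already been loaded into an SQLite table called ipums (the file, table and column names are placeholders):
library(DBI)
library(RSQLite)
library(survey)

con <- dbConnect(RSQLite::SQLite(), "ipums.sqlite")

# pull a random subset of manageable size instead of the full table
sub <- dbGetQuery(con, "
  SELECT serial, strata, perwt, age, incwage
  FROM ipums
  ORDER BY RANDOM()
  LIMIT 100000")
dbDisconnect(con)

sub.design <- svydesign(id = ~serial, strata = ~strata, weights = ~perwt, data = sub)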
Since this question was asked some time ago, I'd like to point out that my answer uses version 3.3 of the survey package.
If you check the code of svydesign, you can see that the call causing all the trouble sits inside a check step that works out whether you should set the nest parameter to TRUE or not. This step can be disabled by setting the option check.strata=FALSE.
Of course, you shouldn't disable a check step unless you know what you are doing. In this case, you should be able to decide yourself whether you need to set the nest option to TRUE or FALSE. nest should be set to TRUE when the same PSU (cluster) id is recycled in different strata.
Concretely for the IPUMS dataset, since you are using the serial variable for cluster identification and serial is unique for each household in a given sample, you may want to set nest to FALSE.
So, your survey design line would be:
ipums.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=perwt, check.strata=FALSE, nest=FALSE)
Extra advice: even after circumventing this problem you will find that the code is pretty slow unless you remap strata to a range from 1 to length(unique(ipums$strata)):
ipums$strata <- match(ipums$strata,unique(ipums$strata))
Both @Gavin and @Martin deserve credit for this answer, or at least for leading me in the right direction. I'm mostly answering it separately to make it easier to read.
In the order I asked:
1. Yes, 2^31 is a hard limit in R, though it seems to matter what type it is (which is a bit strange, given that the stated problem is the length of the vector rather than the amount of memory, of which I have plenty). Do not convert strata or id variables to factors; that just fixes their full set of levels in place and nullifies the effect of subsetting (which is the way to get around this problem).
2. SQL could probably help, provided I learn how to use it correctly. I did the following test:
library(multicore) # make svy fast!

# Rhode Island and New York together
ri.ny <- subset(ipums, statefips_num %in% c(36, 44))
ri.ny.design <- svydesign(id=~serial, weights=~perwt, strata=~strata, data=ri.ny)
svyby(~incwage, ~strata, ri.ny.design, svymean, data=ri.ny, na.rm=TRUE, multicore=TRUE)

# Rhode Island on its own
ri <- subset(ri.ny, statefips_num==44)
ri.design <- svydesign(id=~serial, weights=~perwt, strata=~strata, data=ri)
ri.mean <- svymean(~incwage, ri.design, data=ri, na.rm=TRUE)

# New York on its own
ny <- subset(ri.ny, statefips_num==36)
ny.design <- svydesign(id=~serial, weights=~perwt, strata=~strata, data=ny)
ny.mean <- svymean(~incwage, ny.design, data=ny, na.rm=TRUE, multicore=TRUE)
And found the means to be the same, which seems like a reasonable test.
So: in theory, provided I can split up the calculation by either using plyr or sql, the results should still be fine.
3. See 2.
4. Throwing a lot of memory at Stata definitely helps, but now I'm running into annoying formatting issues. I seem to be able to perform most of the calculations I want (much more quickly and with more stability as well), but I can't figure out how to get the output into the form I want. Will probably ask a separate question on this. I think the short version here is that for big survey data, Stata is much better out of the box.
5. In many ways, yes. Trying to do analysis with data this big is not something I should have taken on lightly, and I'm far from figuring it out even now. I was using the svydesign function correctly, but I didn't really understand what was going on. I have a (very slightly) better grasp now, and it's heartening to know I was generally correct about how to solve the problem. @Gavin's general suggestion of trying out small data with external results to compare to is invaluable, something I should have started ages ago. Many thanks to both @Gavin and @Martin.

Quick divisibility check in ZX81 BASIC

Since many of the Project Euler problems require you to do divisibility checks quite a number of times, I've been trying to figure out the fastest way to perform this task in ZX81 BASIC.
So far I've compared (N/D) to INT(N/D) to check whether N is divisible by D or not.
I have been thinking about doing the test in Z80 machine code, but I haven't yet figured out how to access the BASIC variables from the machine code.
How can it be achieved?
You can do this very fast in machine code by subtracting repeatedly. Basically you have a procedure like:
set accumulator to N
subtract D
if carry flag is set then it is not divisible
if zero flag is set then it is divisible
otherwise repeat subtraction until one of the above occurs
The 8-bit version would be something like:
DIVISIBLE_TEST:
LD B,10                      ; divisor D (operand patched from BASIC via POKE)
LD A,100                     ; dividend N (operand patched from BASIC via POKE)
DIVISIBLE_TEST_LOOP:
SUB B                        ; A = A - D
JR C, END_DIVISIBLE_TEST     ; went below zero: not divisible
JR Z, END_DIVISIBLE_TEST     ; hit exactly zero: divisible
JR DIVISIBLE_TEST_LOOP
END_DIVISIBLE_TEST:
LD B,A                       ; result returned in BC: zero means divisible
LD C,0
RET
Now you can call it from BASIC using USR. What USR returns is whatever is in the BC register pair, so you would probably want to do something like:
REM poke the memory addresses with the operands to load the registers
POKE X+1, D
POKE X+3, N
LET r = USR X
IF r = 0 THEN GOTO isdivisible
IF r <> 0 THEN GOTO isnotdivisible
This is an introduction to Z80 that I wrote, which should help you figure this out. It also explains the flags if you're not familiar with them.
There are loads more links to good Z80 material on the main site, although it is Spectrum- rather than ZX81-focused.
A 16 bit version would be quite similar but using register pair operations. If you need to go beyond 16 bits it would get a bit more convoluted.
How you load this is up to you - but the traditional method is using DATA statements and POKEs. You may prefer to have an assembler figure out the machine code for you though!
Your existing solution may be good enough. Only replace it with something faster if you find it to be a bottleneck in profiling.
(Said with a straight face, of course.)
And anyway, on the ZX81 you can just switch to FAST mode.
I don't know if RANDOMIZE USR is available on the ZX81, but I think it can be used to call assembly routines. To pass arguments, you might need to use POKE to set some fixed memory locations before executing RANDOMIZE USR.
I remember finding a list of routines implemented in the ROM to support ZX BASIC. I'm sure there are a few that perform floating-point operations.
An alternative to floating point is fixed-point math. It's a lot faster in this kind of situation, where there is no math coprocessor.
You might also find more information in back issues of Sinclair User. They published some articles related to programming the ZX Spectrum.
You should place the values in some known memory locations first, then use the same locations from within the Z80 assembler. There is no parameter passing between the two.
This is based on what I (still) remember of the ZX Spectrum 48. Good luck, but you might consider upgrading your hardware. ;/
The problem with Z80 machine code is that it has no floating point ops (and no integer divide or multiply, for that matter). Implementing your own FP library in Z80 assembler is not trivial. Of course, you can use the built-in BASIC routines, but then you may as well just stick with BASIC.