I'm learning Bayesian data analysis. I'm trying to replicate Trond Reitan's tutorials in Stan; they were originally written for WinBUGS.
Specifically, I have the following data and model:
weta.windata<-list(numdet=c(0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 1, 1, 2, 0, 3, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, 0, 3, 1, 1, 3, 1, 1, 2, 0, 2, 1, 1, 1, 1,0, 0, 0, 2, 0, 2, 4, 3, 1, 0, 0, 2, 0, 2, 2, 1, 0, 0, 1),
numvisit=c(4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4,4, 4, 4, 4, 4, 4, 4 ,4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3),
nsites=72)
model_string1="
data{
int nsites;
real<lower=0> numdet[nsites];
real<lower=0> numvisit[nsites];
}
parameters{
real<lower=0> p;
real<lower=0> psi;
int<lower=0> z[nsites];
}
model{
p~uniform(0,1);
psi~uniform(0,1);
for(i in 1:nsites){
z[i]~ bernoulli(psi);
p.site[i]~z[i]*p;
numdet[i]~binomial(numvisit[i],p.site[i]);
}
}
"
mcmc_samples <- stan(model_code=model_string1,
data=weta.windata,
pars=c("p","psi","z"),
chains=3, iter=30000, warmup=10000)
The context is detecting wetas in fields. There are 72 sites. For each site, researchers visited several times (numvisit) and recorded the number of visits on which a weta was found (numdet).
There is a latent variable z describing whether a site has wetas or not. psi is the probability that a site has wetas, and p is the detection rate.
The problem is that I cannot declare z as an integer array. Stan reports:
parameters or transformed parameters cannot be integer or integer array; found declared type int, parameter name=z
Problem with declaration.
However, if I set z to be real, that is,
real<lower=0> z[nsites];
then the sampling statement fails, because bernoulli expects an integer outcome rather than a real:
No matches for:
real ~ bernoulli(real)
I'm very new to Stan. Forgive me if this question is very silly.
Stan doesn't support integer parameters or hacks to let you pretend real variables are integers. What it does support is marginalizing the integer variables out of the density. You can then reconstruct them with much more efficiency and much higher tail resolution.
The chapter in the manual on latent discrete parameters is the place to start. It includes an implementation of the CJS population models, which may be familiar. I implemented the Dorazio and Royle occupancy models as a case study, and Hiroki Ito translated the entire Kéry and Schaub book to Stan. They're all linked under users >> documentation on the web site.
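To make the marginalization concrete for the weta model in the question, here is a minimal sketch, not taken from the manual or the case studies mentioned above. The R helpers (site_terms, marginal_loglik, prob_occupied) and the Stan program in model_string_marginal are illustrative names of my own. The Stan code uses the older array syntax seen in the question, and it assumes numdet and numvisit are declared as int rather than real.

```r
# Marginal likelihood of one site, with the occupancy indicator z summed out:
#   P(y | n, p, psi) = psi * Binomial(y | n, p) + (1 - psi) * [y == 0]
site_terms <- function(p, psi, numdet, numvisit) {
  list(
    occupied = psi * dbinom(numdet, numvisit, p),  # z = 1 and y detections
    empty    = (1 - psi) * (numdet == 0)           # z = 0 forces y = 0
  )
}

marginal_loglik <- function(p, psi, numdet, numvisit) {
  terms <- site_terms(p, psi, numdet, numvisit)
  sum(log(terms$occupied + terms$empty))
}

# P(z = 1 | y, n, p, psi): recovers the discrete variable after the fact.
prob_occupied <- function(p, psi, numdet, numvisit) {
  terms <- site_terms(p, psi, numdet, numvisit)
  terms$occupied / (terms$occupied + terms$empty)
}

# The same density as a Stan program (a sketch; newer Stan releases
# write the declarations as `array[nsites] int numdet;`).
model_string_marginal <- "
data {
  int<lower=1> nsites;
  int<lower=0> numdet[nsites];
  int<lower=0> numvisit[nsites];
}
parameters {
  real<lower=0, upper=1> p;    // detection probability
  real<lower=0, upper=1> psi;  // occupancy probability
}
model {
  // the declared bounds imply uniform(0, 1) priors on p and psi
  for (i in 1:nsites) {
    if (numdet[i] > 0) {
      target += log(psi) + binomial_lpmf(numdet[i] | numvisit[i], p);
    } else {
      target += log_sum_exp(log(psi) + binomial_lpmf(0 | numvisit[i], p),
                            log1m(psi));
    }
  }
}
"

# One site with no detections in 4 visits, one site with a single detection:
prob_occupied(p = 0.5, psi = 0.5, numdet = c(0, 1), numvisit = c(4, 4))
# 1/17 for the first site, exactly 1 for the second
```

The string can presumably be passed to stan() as model_code with pars = c("p", "psi"). If Stan complains about the data types, wrapping numdet and numvisit in as.integer() should help.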
I ran into this mysterious error with ulam while answering practice problems in Statistical Rethinking. When you're constructing a list to pass to the data argument of ulam, be sure to use = rather than <- for assignment. If you don't, the list you construct won't have named components, and a missing name produces this error.
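The difference is easy to see in plain R, without ulam involved. In this small sketch the variable names (N, x, x_vals) are made up for illustration:

```r
x_vals <- c(1, 2, 3)

# `<-` inside list() assigns N and x in the calling environment,
# but the list elements themselves end up without names.
unnamed <- list(N <- 3, x <- x_vals)

# `=` inside list() names the elements, which is what a data list needs.
named <- list(N = 3, x = x_vals)

names(unnamed)  # NULL
names(named)    # "N" "x"
```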
I have written the following function to calculate a check digit in R.
verhoeffCheck <- function(x)
{
## calculates check digit based on Verhoeff algorithm
## note that due to the way strsplit works, to call for vector x, use sapply(x,verhoeffCheck)
## check for string since leading zeros with numbers will be lost
if (class(x)!="character"){stop("Must enter a string")}
#split and convert to numbers
digs <- strsplit(x,"")[[1]]
digs <- as.numeric(digs)
digs <- rev(digs) ## right to left algorithm
## tables required for D_5 group
d5_mult <- matrix(c(
0:9,
c(1:4,0,6:9,5),
c(2:4,0:1,7:9,5:6),
c(3:4,0:2,8:9,5:7),
c(4,0:3,9,5:8),
c(5,9:6,0,4:1),
c(6:5,9:7,1:0,4:2),
c(7:5,9:8,2:0,4:3),
c(8:5,9,3:0,4),
9:0
),10,10,byrow=T)
d5_perm <- matrix(c(
0:9,
c(1,5,7,6,2,8,3,0,9,4),
c(5,8,0,3,7,9,6,1,4,2),
c(8,9,1,6,0,4,3,5,2,7),
c(9,4,5,3,1,2,6,8,7,0),
c(4,2,8,6,5,7,3,9,0,1),
c(2,7,9,3,8,0,6,4,1,5),
c(7,0,4,6,9,1,3,2,5,8)
),8,10,byrow=T)
d5_inv <- c(0,4:1,5:9)
## apply algorithm - note 1-based indexing in R
d <- 0
for (i in 1:length(digs)){
d <- d5_mult[d+1,(d5_perm[(i%%8)+1,digs[i]+1])+1]
}
d5_inv[d+1]
}
To run this on a vector of strings, sapply must be used. This is partly because strsplit returns a list of vectors. It hurts performance even for moderately sized inputs.
How could this function be vectorized?
I am also aware that some performance is lost in having to create the tables in each iteration. Would storing these in a new environment be a better solution?
We begin by defining the lookup matrices. I've laid them out in a way
that should make them easier to check against a reference, e.g.
http://en.wikipedia.org/wiki/Verhoeff_algorithm.
d5_mult <- matrix(as.integer(c(
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
1, 2, 3, 4, 0, 6, 7, 8, 9, 5,
2, 3, 4, 0, 1, 7, 8, 9, 5, 6,
3, 4, 0, 1, 2, 8, 9, 5, 6, 7,
4, 0, 1, 2, 3, 9, 5, 6, 7, 8,
5, 9, 8, 7, 6, 0, 4, 3, 2, 1,
6, 5, 9, 8, 7, 1, 0, 4, 3, 2,
7, 6, 5, 9, 8, 2, 1, 0, 4, 3,
8, 7, 6, 5, 9, 3, 2, 1, 0, 4,
9, 8, 7, 6, 5, 4, 3, 2, 1, 0
)), ncol = 10, byrow = TRUE)
d5_perm <- matrix(as.integer(c(
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
1, 5, 7, 6, 2, 8, 3, 0, 9, 4,
5, 8, 0, 3, 7, 9, 6, 1, 4, 2,
8, 9, 1, 6, 0, 4, 3, 5, 2, 7,
9, 4, 5, 3, 1, 2, 6, 8, 7, 0,
4, 2, 8, 6, 5, 7, 3, 9, 0, 1,
2, 7, 9, 3, 8, 0, 6, 4, 1, 5,
7, 0, 4, 6, 9, 1, 3, 2, 5, 8
)), ncol = 10, byrow = TRUE)
d5_inv <- as.integer(c(0, 4, 3, 2, 1, 5, 6, 7, 8, 9))
Next, we'll define the check function, and try it out with a test input.
I've followed the derivation in Wikipedia as closely as possible.
p <- function(i, n_i) {
d5_perm[(i %% 8) + 1, n_i + 1] + 1
}
d <- function(c, p) {
d5_mult[c + 1, p]
}
verhoeff <- function(x) {
#split and convert to numbers
digs <- strsplit(as.character(x), "")[[1]]
digs <- as.numeric(digs)
digs <- rev(digs) ## right to left algorithm
## apply algorithm - note 1-based indexing in R
c <- 0
for (i in 1:length(digs)) {
c <- d(c, p(i, digs[i]))
}
d5_inv[c + 1]
}
verhoeff(142857)
## [1] 0
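One way to sanity-check a generated digit is to append it to the input and run the validation form of the algorithm, which starts at position 0 and should finish with 0. The sketch below is my own addition. It repeats the tables in compact form under hypothetical names (mult_tbl, perm_tbl) so that it runs on its own, and verhoeff_validate is an illustrative name.

```r
# Compact copies of the D5 multiplication and permutation tables
mult_tbl <- matrix(c(
  0:9, c(1:4,0,6:9,5), c(2:4,0:1,7:9,5:6), c(3:4,0:2,8:9,5:7),
  c(4,0:3,9,5:8), c(5,9:6,0,4:1), c(6:5,9:7,1:0,4:2), c(7:5,9:8,2:0,4:3),
  c(8:5,9,3:0,4), 9:0), 10, 10, byrow = TRUE)
perm_tbl <- matrix(c(
  0:9, c(1,5,7,6,2,8,3,0,9,4), c(5,8,0,3,7,9,6,1,4,2),
  c(8,9,1,6,0,4,3,5,2,7), c(9,4,5,3,1,2,6,8,7,0), c(4,2,8,6,5,7,3,9,0,1),
  c(2,7,9,3,8,0,6,4,1,5), c(7,0,4,6,9,1,3,2,5,8)), 8, 10, byrow = TRUE)

# Validate a digit string whose last character is the check digit.
verhoeff_validate <- function(x) {
  stopifnot(is.character(x), length(x) == 1)
  digs <- rev(as.integer(strsplit(x, "")[[1]]))
  state <- 0
  for (k in seq_along(digs)) {
    pos <- k - 1  # positions start at 0; the check digit sits at position 0
    state <- mult_tbl[state + 1, perm_tbl[(pos %% 8) + 1, digs[k] + 1] + 1]
  }
  state == 0
}

verhoeff_validate("1428570")  # 142857 followed by its check digit 0
verhoeff_validate("2363")     # Wikipedia's worked example: 236 with check digit 3
verhoeff_validate("2364")     # a wrong check digit should fail
```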
This function is fundamentally iterative, as each iteration depends on
the value of the previous. This means that we're unlikely to be able to
vectorise in R, so if we want to vectorise, we'll need to use Rcpp.
However, before we turn to that, it's worth exploring if we can do the
initial split faster. First we do a little microbenchmark to see if it's
worth bothering:
library(microbenchmark)
digits <- function(x) {
digs <- strsplit(as.character(x), "")[[1]]
digs <- as.numeric(digs)
rev(digs)
}
microbenchmark(
digits(142857),
verhoeff(142857)
)
## Unit: microseconds
## expr min lq median uq max neval
## digits(142857) 11.30 12.01 12.43 12.85 28.79 100
## verhoeff(142857) 32.24 33.81 34.66 35.47 95.85 100
It looks like it! On my computer, digits() accounts for
about a third of the run time. A little searching on Stack Overflow reveals
another approach to turning a number into
digits:
digits2 <- function(x) {
n <- floor(log10(x))
x %/% 10^(0:n) %% 10
}
digits2(12345)
## [1] 5 4 3 2 1
microbenchmark(
digits(142857),
digits2(142857)
)
## Unit: microseconds
## expr min lq median uq max neval
## digits(142857) 11.495 12.102 12.468 12.834 79.60 100
## digits2(142857) 2.322 2.784 3.358 3.561 13.69 100
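Before swapping one helper for the other, it may be worth confirming that they agree and noting where the arithmetic version breaks down. The sketch below repeats both helpers so it runs on its own. digits2_safe is a hypothetical guarded variant, not part of the original answer. Because both versions take a number, leading zeros are lost, which is why the question insisted on strings.

```r
digits <- function(x) {
  digs <- strsplit(as.character(x), "")[[1]]
  rev(as.numeric(digs))
}

digits2 <- function(x) {
  n <- floor(log10(x))
  x %/% 10^(0:n) %% 10
}

# Guarded variant: log10(0) is -Inf, so zero needs its own branch,
# and negative or fractional inputs are rejected up front.
digits2_safe <- function(x) {
  stopifnot(is.numeric(x), length(x) == 1, x >= 0, x == floor(x))
  if (x == 0) return(0)
  digits2(x)
}

all(digits(142857) == digits2_safe(142857))  # both give 7 5 8 2 4 1
digits2_safe(0)                              # 0 instead of an error
```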
digits2() is a lot faster than digits() but it has limited impact on
the whole runtime.
verhoeff2 <- function(x) {
digs <- digits2(x)
c <- 0
for (i in 1:length(digs)) {
c <- d(c, p(i, digs[i]))
}
d5_inv[c + 1]
}
verhoeff2(142857)
## [1] 0
microbenchmark(
verhoeff(142857),
verhoeff2(142857)
)
## Unit: microseconds
## expr min lq median uq max neval
## verhoeff(142857) 33.06 34.49 35.19 35.92 73.38 100
## verhoeff2(142857) 20.98 22.58 24.05 25.28 48.69 100
To make it even faster we could try C++.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
int verhoeff3_c(IntegerVector digits, IntegerMatrix mult, IntegerMatrix perm,
IntegerVector inv) {
int n = digits.size();
int c = 0;
for(int i = 0; i < n; ++i) {
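// perm row is (i + 1) % 8: position 0 is reserved for the check digit,
// which matches the 1-based loop index in the R version
int p = perm((i + 1) % 8, digits[i]);
c = mult(c, p);
}
return inv[c];
}
verhoeff3 <- function(x) {
verhoeff3_c(digits(x), d5_mult, d5_perm, d5_inv)
}
verhoeff3(142857)
## [1] 0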
microbenchmark(
verhoeff2(142857),
verhoeff3(142857)
)
## Unit: microseconds
## expr min lq median uq max neval
## verhoeff2(142857) 21.00 22.85 25.53 27.11 63.71 100
## verhoeff3(142857) 16.75 17.99 18.87 19.64 79.54 100
That doesn't yield much of an improvement. Maybe we can do better if we
pass the number to C++ and process the digits in a loop:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
int verhoeff4_c(int number, IntegerMatrix mult, IntegerMatrix perm,
IntegerVector inv) {
int c = 0;
// the loop declares its own i; (i + 1) % 8 matches the 1-based index in the R version
for (int i = 0; number > 0; ++i, number /= 10) {
int p = perm((i + 1) % 8, number % 10);
c = mult(c, p);
}
return inv[c];
}
verhoeff4 <- function(x) {
verhoeff4_c(x, d5_mult, d5_perm, d5_inv)
}
verhoeff4(142857)
## [1] 0
microbenchmark(
verhoeff2(142857),
verhoeff3(142857),
verhoeff4(142857)
)
## Unit: microseconds
## expr min lq median uq max neval
## verhoeff2(142857) 21.808 24.910 26.838 27.797 64.22 100
## verhoeff3(142857) 17.699 18.742 19.599 20.764 81.67 100
## verhoeff4(142857) 3.143 3.797 4.095 4.396 13.21 100
And we get a payoff: verhoeff4() is roughly 6 times faster than
verhoeff2().
If your input strings can contain different numbers of characters, then I don't see any way round lapply calls (or a plyr equivalent). The trick is to move them inside the function, so verhoeffCheck can accept vector inputs. This way you only need to create the matrices once.
verhoeffCheckNew <- function(x)
{
## calculates check digit based on Verhoeff algorithm
## check for string since leading zeros with numbers will be lost
if (!is.character(x)) stop("Must enter a string")
#split and convert to numbers
digs <- strsplit(x, "")
digs <- lapply(digs, function(x) rev(as.numeric(x)))
## tables required for D_5 group
d5_mult <- matrix(c(
0:9,
c(1:4,0,6:9,5),
c(2:4,0:1,7:9,5:6),
c(3:4,0:2,8:9,5:7),
c(4,0:3,9,5:8),
c(5,9:6,0,4:1),
c(6:5,9:7,1:0,4:2),
c(7:5,9:8,2:0,4:3),
c(8:5,9,3:0,4),
9:0
),10,10,byrow=T)
d5_perm <- matrix(c(
0:9,
c(1,5,7,6,2,8,3,0,9,4),
c(5,8,0,3,7,9,6,1,4,2),
c(8,9,1,6,0,4,3,5,2,7),
c(9,4,5,3,1,2,6,8,7,0),
c(4,2,8,6,5,7,3,9,0,1),
c(2,7,9,3,8,0,6,4,1,5),
c(7,0,4,6,9,1,3,2,5,8)
),8,10,byrow=T)
d5_inv <- c(0,4:1,5:9)
## apply algorithm - note 1-based indexing in R
sapply(digs, function(x)
{
d <- 0
for (i in 1:length(x)){
d <- d5_mult[d + 1, (d5_perm[(i %% 8) + 1, x[i] + 1]) + 1]
}
d5_inv[d+1]
})
}
Since d depends on what it was previously, there is no easy way to vectorise the for loop.
My version runs in about half the time for 1e5 strings.
rand_string <- function(n = 12)
{
paste(sample(as.character(0:9), sample(n, 1), replace = TRUE), collapse = "")
}
big_test <- replicate(1e5, rand_string())
tic()
res1 <- unname(sapply(big_test, verhoeffCheck))
toc()
tic()
res2 <- verhoeffCheckNew(big_test)
toc()
identical(res1, res2) #hopefully TRUE!
See this question for tic and toc.
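If tic and toc are not available, base R's system.time gives the same kind of measurement. A minimal sketch, where slow_sum is a made-up stand-in for the function being timed:

```r
# A deliberately loop-based function to have something to time
slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i
  total
}

# system.time() evaluates the expression and returns user/system/elapsed times
timing <- system.time(result <- slow_sum(1e5))
timing[["elapsed"]]  # wall-clock seconds; the value varies by machine
```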
Further thoughts:
You may want additional input checking for "" and other strings that return NA when converted to numeric.
Since you are dealing exclusively with integers, you may get a slight performance benefit from using them rather than doubles. (Use as.integer rather than as.numeric and append L to the values in your matrices.)
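A different angle, not from the answers here: the loop over digit positions has to stay sequential, but each step can be applied to every input string at once using matrix indexing. Then the R-level loop runs only as many times as the longest string has digits. The sketch below is self-contained and uses integer tables as suggested above. verhoeff_vec is a hypothetical name, and it assumes every string contains only the characters 0 to 9.

```r
d5_mult <- matrix(as.integer(c(
  0:9, c(1:4,0,6:9,5), c(2:4,0:1,7:9,5:6), c(3:4,0:2,8:9,5:7),
  c(4,0:3,9,5:8), c(5,9:6,0,4:1), c(6:5,9:7,1:0,4:2), c(7:5,9:8,2:0,4:3),
  c(8:5,9,3:0,4), 9:0)), 10, 10, byrow = TRUE)
d5_perm <- matrix(as.integer(c(
  0:9, c(1,5,7,6,2,8,3,0,9,4), c(5,8,0,3,7,9,6,1,4,2),
  c(8,9,1,6,0,4,3,5,2,7), c(9,4,5,3,1,2,6,8,7,0), c(4,2,8,6,5,7,3,9,0,1),
  c(2,7,9,3,8,0,6,4,1,5), c(7,0,4,6,9,1,3,2,5,8))), 8, 10, byrow = TRUE)
d5_inv <- as.integer(c(0, 4:1, 5:9))

verhoeff_vec <- function(x) {
  stopifnot(is.character(x))
  n <- nchar(x)
  state <- integer(length(x))            # running value, one per string
  for (i in seq_len(max(n, 0L))) {
    active <- n >= i                     # strings that still have an i-th digit
    pos <- n[active] - i + 1L            # i-th character counted from the right
    dig <- as.integer(substring(x[active], pos, pos))
    # two-column matrix indexing looks up one table cell per active string
    p <- d5_perm[cbind((i %% 8L) + 1L, dig + 1L)]
    state[active] <- d5_mult[cbind(state[active] + 1L, p + 1L)]
  }
  d5_inv[state + 1L]
}

verhoeff_vec(c("142857", "236", "12345"))  # 0 3 1
```

Whether this beats the lapply version depends on the number of strings versus their length, so timing it on real data is advisable.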
Richie C answered the vectorisation question nicely. As for creating the tables only once without cluttering the global namespace, one quick solution that does not require a package is
verhoeffCheck <- local(function(x)
{
## calculates check digit based on Verhoeff algorithm
## note that due to the way strsplit works, to call for vector x, use sapply(x,verhoeffCheck)
## check for string since leading zeros with numbers will be lost
if (class(x)!="character"){stop("Must enter a string")}
#split and convert to numbers
digs <- strsplit(x,"")[[1]]
digs <- as.numeric(digs)
digs <- rev(digs) ## right to left algorithm
## apply algorithm - note 1-based indexing in R
d <- 0
for (i in 1:length(digs)){
d <- d5_mult[d+1,(d5_perm[(i%%8)+1,digs[i]+1])+1]
}
d5_inv[d+1]
})
assign("d5_mult", matrix(c(
0:9, c(1:4,0,6:9,5), c(2:4,0:1,7:9,5:6), c(3:4,0:2,8:9,5:7),
c(4,0:3,9,5:8), c(5,9:6,0,4:1), c(6:5,9:7,1:0,4:2), c(7:5,9:8,2:0,4:3),
c(8:5,9,3:0,4), 9:0), 10, 10, byrow = TRUE),
envir = environment(verhoeffCheck))
assign("d5_perm", matrix(c(
0:9, c(1,5,7,6,2,8,3,0,9,4), c(5,8,0,3,7,9,6,1,4,2),
c(8,9,1,6,0,4,3,5,2,7), c(9,4,5,3,1,2,6,8,7,0), c(4,2,8,6,5,7,3,9,0,1),
c(2,7,9,3,8,0,6,4,1,5), c(7,0,4,6,9,1,3,2,5,8)), 8, 10, byrow = TRUE),
envir = environment(verhoeffCheck))
assign("d5_inv", c(0,4:1,5:9), envir = environment(verhoeffCheck))
## Now just use the function
which keeps the data in the environment of the function. You can time it to see how much faster it is.
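An equivalent arrangement, sketched here as an alternative rather than as the method above, builds the tables inside the local() block itself so that no assign() calls are needed afterwards. The name verhoeff_check_closure is hypothetical.

```r
verhoeff_check_closure <- local({
  # Tables are created once, when local() is evaluated, and live only in
  # the enclosing environment of the returned function.
  d5_mult <- matrix(c(
    0:9, c(1:4,0,6:9,5), c(2:4,0:1,7:9,5:6), c(3:4,0:2,8:9,5:7),
    c(4,0:3,9,5:8), c(5,9:6,0,4:1), c(6:5,9:7,1:0,4:2), c(7:5,9:8,2:0,4:3),
    c(8:5,9,3:0,4), 9:0), 10, 10, byrow = TRUE)
  d5_perm <- matrix(c(
    0:9, c(1,5,7,6,2,8,3,0,9,4), c(5,8,0,3,7,9,6,1,4,2),
    c(8,9,1,6,0,4,3,5,2,7), c(9,4,5,3,1,2,6,8,7,0), c(4,2,8,6,5,7,3,9,0,1),
    c(2,7,9,3,8,0,6,4,1,5), c(7,0,4,6,9,1,3,2,5,8)), 8, 10, byrow = TRUE)
  d5_inv <- c(0, 4:1, 5:9)

  function(x) {
    if (!is.character(x) || length(x) != 1) stop("Must enter a single string")
    digs <- rev(as.numeric(strsplit(x, "")[[1]]))  # right-to-left algorithm
    d <- 0
    for (i in seq_along(digs)) {
      d <- d5_mult[d + 1, d5_perm[(i %% 8) + 1, digs[i] + 1] + 1]
    }
    d5_inv[d + 1]
  }
})

verhoeff_check_closure("142857")  # 0
verhoeff_check_closure("236")     # 3
```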
Hope this helps.
Allan