Mixture prior not working in JAGS, only when likelihood term included - bayesian

The code at the bottom will replicate the problem, just copy and paste it into R.
What I want is for the mean and precision to be (-100, 100) 30% of the time, and (200, 1000) for 70% of the time. Think of it as lined up in a, b, and p.
So 'pick' should be 1 30% of the time, and 2 70% of the time.
What actually happens is that on every iteration, pick is 2 (or 1 if the first element of p is the larger one). You can see this in the summary, where the quantiles for 'pick', 'testa', and 'testb' remain unchanged throughout. The strangest thing is that if you remove the likelihood loop, pick then works exactly as intended.
I hope this explains the problem, if not let me know. It's my first time posting so I'm bound to have messed things up.
library(rjags)
n = 10
y <- rnorm(n, 5, 10)
a = c(-100, 200)
b = c(100, 1000)
p = c(0.3, 0.7)
## Model
mod_str = "model{
# Likelihood
for (i in 1:n){
y[i] ~ dnorm(mu, 10)
}
# ISSUE HERE: MIXTURE PRIOR
mu ~ dnorm(a[pick], b[pick])
pick ~ dcat(p[1:2])
testa = a[pick]
testb = b[pick]
}"
model = jags.model(textConnection(mod_str), data = list(y = y, n=n, a=a, b=b, p=p), n.chains=1)
update(model, 10000)
res = coda.samples(model, variable.names = c('pick', 'testa', 'testb', 'mu'), n.iter = 10000)
summary(res)

I think you are having problems for a couple of reasons. First, the data that you have supplied to the model (i.e., y) is not a mixture of normal distributions. As a result, the model itself has no need to mix. I would instead generate data something like this:
set.seed(320)
# number of samples
n <- 10
# Because it is a mixture of 2 we can just use an indicator variable.
# here, pick (in the long run), would be '1' 30% of the time.
pick <- rbinom(n, 1, p[1])
# generate the data. b is in terms of precision so we are converting this
# to standard deviations (which is what R wants).
y_det <- pick * rnorm(n, a[1], sqrt(1/b[1])) + (1 - pick) * rnorm(n, a[2], sqrt(1/b[2]))
# add a small amount of noise, can change to be more as necessary.
y <- rnorm(n, y_det, 1)
These data look more like what you would want to supply to a mixture model.
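The data-generation step above can also be sketched outside R. The following is a minimal Python illustration, not part of the original answer: the function name `simulate_mixture` and the sample size of 10000 are assumptions made for the example, while `a`, `b`, and `p` mirror the vectors from the question. It shows the precision-to-standard-deviation conversion and that, in the prior alone, the first component is chosen roughly 30% of the time.

```python
import math
import random

def simulate_mixture(n, a, b, p, seed=320):
    """Draw n values from a two-component normal mixture.

    a: component means, b: component precisions (JAGS convention),
    p: component probabilities. Names mirror the R snippet above.
    """
    rng = random.Random(seed)
    # JAGS parameterises dnorm by precision; convert to standard deviation.
    sd = [math.sqrt(1.0 / prec) for prec in b]
    picks, ys = [], []
    for _ in range(n):
        # Component 1 with probability p[0], otherwise component 2.
        k = 0 if rng.random() < p[0] else 1
        picks.append(k + 1)  # 1-based label, like dcat in JAGS
        ys.append(rng.gauss(a[k], sd[k]))
    return picks, ys

picks, ys = simulate_mixture(10000, a=[-100, 200], b=[100, 1000], p=[0.3, 0.7])
share_first = picks.count(1) / len(picks)
print("share of component 1:", round(share_first, 3))
```

Because the precisions are large, every draw sits very close to -100 or 200, which is why the indicators in the fitted models further down are so sharply determined.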
Following this, I would code the model up in a similar way as I did the data generation process. I want some indicator variable to jump between the two normal distributions. Thus, mu may change for each scalar in y.
mod_str = "model{
# Likelihood
for (i in 1:n){
y[i] ~ dnorm(mu[i], 10)
mu[i] <- mu_ind[i] * a_mu + (1 - mu_ind[i]) * b_mu
mu_ind[i] ~ dbern(p[1])
}
a_mu ~ dnorm(a[1], b[1])
b_mu ~ dnorm(a[2], b[2])
}"
model = jags.model(textConnection(mod_str), data = list(y = y, n=n, a=a, b=b, p=p), n.chains=1)
update(model, 10000)
res = coda.samples(model, variable.names = c('mu_ind', 'a_mu', 'b_mu'), n.iter = 10000)
summary(res)
2.5% 25% 50% 75% 97.5%
a_mu -100.4 -100.3 -100.2 -100.1 -100
b_mu 199.9 200.0 200.0 200.0 200
mu_ind[1] 0.0 0.0 0.0 0.0 0
mu_ind[2] 1.0 1.0 1.0 1.0 1
mu_ind[3] 0.0 0.0 0.0 0.0 0
mu_ind[4] 1.0 1.0 1.0 1.0 1
mu_ind[5] 0.0 0.0 0.0 0.0 0
mu_ind[6] 0.0 0.0 0.0 0.0 0
mu_ind[7] 1.0 1.0 1.0 1.0 1
mu_ind[8] 0.0 0.0 0.0 0.0 0
mu_ind[9] 0.0 0.0 0.0 0.0 0
mu_ind[10] 1.0 1.0 1.0 1.0 1
If you supplied more data, you would (in the long run) have the indicator variable mu_ind take the value of 1 30% of the time. If you had more than 2 distributions you could instead use dcat. Thus, an alternative and more generalized way of doing this would be (and I am borrowing heavily from this post by John Kruschke):
mod_str = "model {
# Likelihood:
for( i in 1 : n ) {
y[i] ~ dnorm( mu[i] , 10 )
mu[i] <- muOfpick[ pick[i] ]
pick[i] ~ dcat( p[1:2] )
}
# Prior:
for ( i in 1:2 ) {
muOfpick[i] ~ dnorm( a[i] , b[i] )
}
}"
model = jags.model(textConnection(mod_str), data = list(y = y, n=n, a=a, b=b, p=p), n.chains=1)
update(model, 10000)
res = coda.samples(model, variable.names = c('pick', 'muOfpick'), n.iter = 10000)
summary(res)
2.5% 25% 50% 75% 97.5%
muOfpick[1] -100.4 -100.3 -100.2 -100.1 -100
muOfpick[2] 199.9 200.0 200.0 200.0 200
pick[1] 2.0 2.0 2.0 2.0 2
pick[2] 1.0 1.0 1.0 1.0 1
pick[3] 2.0 2.0 2.0 2.0 2
pick[4] 1.0 1.0 1.0 1.0 1
pick[5] 2.0 2.0 2.0 2.0 2
pick[6] 2.0 2.0 2.0 2.0 2
pick[7] 1.0 1.0 1.0 1.0 1
pick[8] 2.0 2.0 2.0 2.0 2
pick[9] 2.0 2.0 2.0 2.0 2
pick[10] 1.0 1.0 1.0 1.0 1
The link above includes even more priors (e.g., a Dirichlet prior on the probabilities incorporated into the Categorical distribution).

Julia "strange" behaviour using fill() and .+=

I observe an unexpected behaviour for ".+=" in my code (it's probably just me, I'm rather new to Julia). Consider the following example:
julia> b = fill(zeros(2,2),1,3)
1×3 Array{Array{Float64,2},2}:
[0.0 0.0; 0.0 0.0] [0.0 0.0; 0.0 0.0] [0.0 0.0; 0.0 0.0]
julia> b[1] += ones(2,2)
2×2 Array{Float64,2}:
1.0 1.0
1.0 1.0
julia> b
1×3 Array{Array{Float64,2},2}:
[1.0 1.0; 1.0 1.0] [0.0 0.0; 0.0 0.0] [0.0 0.0; 0.0 0.0]
julia> b[2] .+= ones(2,2)
2×2 Array{Float64,2}:
1.0 1.0
1.0 1.0
julia> b
1×3 Array{Array{Float64,2},2}:
[1.0 1.0; 1.0 1.0] [1.0 1.0; 1.0 1.0] [1.0 1.0; 1.0 1.0]
As can be seen, the last command changed not only the value of b[2] but also that of b[3], while b[1] remains the same as before (*), as we can confirm by running:
julia> b[2] .+= ones(2,2)
2×2 Array{Float64,2}:
2.0 2.0
2.0 2.0
julia> b
1×3 Array{Array{Float64,2},2}:
[1.0 1.0; 1.0 1.0] [2.0 2.0; 2.0 2.0] [2.0 2.0; 2.0 2.0]
Now, using simply "+=" instead I can obtain the behaviour I would have expected also for ".+=", that is:
julia> b = fill(zeros(2,2),1,3); b[2]+=ones(2,2); b
1×3 Array{Array{Float64,2},2}:
[0.0 0.0; 0.0 0.0] [1.0 1.0; 1.0 1.0] [0.0 0.0; 0.0 0.0]
Can anyone explain why this happens? I could of course just use +=, or maybe something other than an Array of Arrays, but since I'm striving for speed (my code needs to perform these operations millions of times, and on much larger matrices) and .+= is considerably faster, I would like to understand whether I can still exploit this feature.
Thank you all in advance!
EDIT: (*) apparently only because b[1] was not zero. If I run:
julia> b = fill(zeros(2,2),1,3); b[2]+=ones(2,2);
julia> b[1] .+= 10 .*ones(2,2); b
[10.0 10.0; 10.0 10.0] [1.0 1.0; 1.0 1.0] [10.0 10.0; 10.0 10.0]
you can see that only the zero-values are changed. This beats me.
This happens because of the combination of several factors. Let's try and make things clearer.
First, b = fill(zeros(2,2),1,3) does not create a new zeros(2,2) for each element of b; instead it creates one 2x2 array of zeros, and sets all elements of b to that unique array. In short, this line behaves equivalently to
z = zeros(2,2)
b = Array{Array{Float64,2},2}(undef, 1, 3)
for i in eachindex(b)
    b[i] = z
end
therefore, modifying z[1,1] or any of the b[i,j][1,1] would modify the other values as well. To illustrate this:
julia> b = fill(zeros(2,2),1,3)
1×3 Array{Array{Float64,2},2}:
[0.0 0.0; 0.0 0.0] [0.0 0.0; 0.0 0.0] [0.0 0.0; 0.0 0.0]
# All three elements are THE SAME underlying array
julia> b[1] === b[2] === b[3]
true
# Mutating one of them mutates the others as well
julia> b[1,1][1,1] = 42
42
julia> b
1×3 Array{Array{Float64,2},2}:
[42.0 0.0; 0.0 0.0] [42.0 0.0; 0.0 0.0] [42.0 0.0; 0.0 0.0]
Second, b[1] += ones(2,2) is equivalent to b[1] = b[1] + ones(2,2). This implies a succession of operations:
a new array (let's call it tmp) is created to hold the sum of b[1] and ones(2,2)
b[1] is rebound to that new array, thereby losing its connection to z (and to all other elements of b).
This is a variation on the classical theme that although both involve = signs in their notations, mutation and assignment are not the same thing. Again, to illustrate:
julia> b = fill(zeros(2,2),1,3)
1×3 Array{Array{Float64,2},2}:
[0.0 0.0; 0.0 0.0] [0.0 0.0; 0.0 0.0] [0.0 0.0; 0.0 0.0]
# All elements are THE SAME underlying array
julia> b[1] === b[2] === b[3]
true
# But that connection is lost when `b[1]` is re-bound (not mutated) to a new array
julia> b[1] = ones(2,2)
2×2 Array{Float64,2}:
1.0 1.0
1.0 1.0
# Now b[1] is no longer the same underlying array as b[2]
julia> b[1] === b[2]
false
# But b[2] and b[3] still share the same array (they haven't been re-bound to anything else)
julia> b[2] === b[3]
true
Third, b[2] .+= ones(2,2) is a different beast altogether. It does not imply re-binding anything to a newly created array; instead, it mutates the array b[2] in place. It effectively behaves like:
for i in eachindex(b[2])
    b[2][i] += 1 # or b[2][i] = b[2][i] + 1
end
Neither b itself nor even b[2] is re-bound to anything, only elements of it are modified in place. And in your example this affects b[3] as well, since both b[2] and b[3] are bound to the same underlying array.
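The same distinction between rebinding and in-place mutation exists in other languages. The sketch below is a Python analogy, not Julia code: the variable names mirror the Julia example, and the explicit loop stands in for Julia's `.+=`. It is only meant to illustrate the aliasing, under the assumption that nested Python lists are a fair stand-in for an array of arrays.

```python
# Python analogy (not Julia): one inner list shared by every slot,
# much like fill(zeros(2,2), 1, 3) shares a single matrix.
z = [0.0, 0.0]
b = [z, z, z]
shared_before = b[0] is b[1] is b[2]  # all three slots are the same object

# Rebinding: build a new list and store it in slot 0
# (comparable to b[1] += ones(2,2) in Julia).
b[0] = [x + 1.0 for x in b[0]]

# In-place mutation through slot 1
# (comparable to b[2] .+= ones(2,2) in Julia).
for i in range(len(b[1])):
    b[1][i] += 1.0

# Independent inner lists: a fresh list per slot.
c = [[0.0, 0.0] for _ in range(3)]
c[1][0] = 42.0

print(shared_before, b[0] is b[1], b[1] is b[2])
print(b)
print(c)
```

After the rebinding, slot 0 no longer shares storage with the others, while the in-place loop changes slots 1 and 2 together because they still refer to the same list.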
Because b is filled with the same matrix, not three identical matrices. .+= changes the contents of that matrix in place, so every element of b that refers to it changes as well. += on the other hand creates a new matrix and assigns it back to b[1]. To see this, you can use the === operator:
b = fill(zeros(2,2),1,3)
b[1] === b[2] # true
b[1] += zeros(2, 2) # a new matrix is created and assigned back to b[1]
b[1] == b[2] # true, they are all zeros
b[1] === b[2] # false, they are not the same matrix
There is actually an example in the help message of fill function pointing out exactly this problem. You can find it by running ?fill in the REPL.
...
If x is an object reference, all elements will refer to the same object:
julia> A = fill(zeros(2), 2);
julia> A[1][1] = 42; # modifies both A[1][1] and A[2][1]
julia> A
2-element Array{Array{Float64,1},1}:
[42.0, 0.0]
[42.0, 0.0]
There are various ways to create an array of independent matrices. One is to use a comprehension:
c = [zeros(2,2) for _ in 1:1, _ in 1:3]
c[1] === c[2] # false

Create histogram like bins for a range including negative numbers

I have numbers in a range from -4 to 4, including 0, as in
-0.526350041828112
-0.125648350883331
0.991377353361933
1.079241128983
1.06322905224238
1.17477528478982
-0.0651086035371559
0.818471811380787
0.0355593553368815
I need to create histogram-like buckets, and have been trying to use this
BEGIN { delta = (delta == "" ? 0.1 : delta) }
{
    bucketNr = int(($0+delta) / delta)
    cnt[bucketNr]++
    numBuckets = (numBuckets > bucketNr ? numBuckets : bucketNr)
}
END {
    for (bucketNr=1; bucketNr<=numBuckets; bucketNr++) {
        end = beg + delta
        printf "%0.1f %0.1f %d\n", beg, end, cnt[bucketNr]
        beg = end
    }
}
from Create bins with awk histogram-like
The output would look like
-2.4 -2.1 8
-2.1 -1.8 25
-1.8 -1.5 108
-1.5 -1.2 298
-1.2 -0.9 773
-0.9 -0.6 1067
-0.6 -0.3 1914
-0.3 0.0 4174
0.0 0.3 3969
0.3 0.6 2826
0.6 0.9 1460
0.9 1.2 752
1.2 1.5 396
1.5 1.8 121
1.8 2.1 48
2.1 2.4 13
2.4 2.7 1
2.7 3.0 1
I'm thinking I would have to run this 2x, one with delta let's say 0.3 and another with delta -0.3, and cat the two together.
But I'm not sure this intuition is correct.
This might work for you:
BEGIN { delta = (delta == "" ? 0.1 : delta) }
{
    bucketNr = int(($0<0?$0-delta:$0)/delta)
    cnt[bucketNr]++
    maxBucket = (maxBucket > bucketNr ? maxBucket : bucketNr)
    minBucket = (minBucket < bucketNr ? minBucket : bucketNr)
}
END {
    beg = minBucket*delta
    for (bucketNr=minBucket; bucketNr<=maxBucket; bucketNr++) {
        end = beg + delta
        printf "%0.1f %0.1f %d\n", beg, end, cnt[bucketNr]
        beg = end
    }
}
It's basically the code you posted + handling negative numbers.
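The negative-number handling above compensates for awk's int() truncating toward zero. The same bucketing can be sketched with a true floor, which needs no special case for negatives. This is a Python illustration rather than awk; the function name `histogram_bins` is made up for the example, and the sample values are the nine numbers from the question.

```python
import math
from collections import Counter

def histogram_bins(values, delta=0.1):
    """Count values per bin [k*delta, (k+1)*delta) using floor."""
    counts = Counter(math.floor(v / delta) for v in values)
    if not counts:
        return []
    lo, hi = min(counts), max(counts)
    # Emit every bin between the extremes, including empty ones.
    return [(k * delta, (k + 1) * delta, counts.get(k, 0))
            for k in range(lo, hi + 1)]

sample = [-0.526350041828112, -0.125648350883331, 0.991377353361933,
          1.079241128983, 1.06322905224238, 1.17477528478982,
          -0.0651086035371559, 0.818471811380787, 0.0355593553368815]
rows = histogram_bins(sample, delta=0.3)
for beg, end, n in rows:
    print("%0.1f %0.1f %d" % (beg, end, n))
```

Values that fall exactly on a bin edge can land in the neighbouring bin because of floating-point division, so this is a sketch rather than a drop-in replacement.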

How to use abs function in a objective function of JuMP + Julia

I would like to solve a simple linear optimization problem with JuMP and Julia.
This is my code:
using JuMP
using Mosek
model = Model(solver=MosekSolver())
@variable(model, 2.5 <= z1 <= 5.0)
@variable(model, -1.0 <= z2 <= 1.0)
@objective(model, Min, abs(z1+5.0) + abs(z2-3.0))
status = solve(model)
println("Objective value: ", getobjectivevalue(model))
println("z1:",getvalue(z1))
println("z2:",getvalue(z2))
However, I got this error message.
> ERROR: LoadError: MethodError: no method matching
> abs(::JuMP.GenericAffExpr{Float64,JuMP.Variable}) Closest candidates
> are: abs(!Matched::Bool) at bool.jl:77 abs(!Matched::Float16) at
> float.jl:512 abs(!Matched::Float32) at float.jl:513
How can I use abs function in the JuMP code?
My problem is solved by @rickhg12hs's comment.
If I use @NLobjective instead of @objective, it works.
This is the final code.
using JuMP
using Mosek
model = Model(solver=MosekSolver())
@variable(model, 2.5 <= z1 <= 5.0)
@variable(model, -1.0 <= z2 <= 1.0)
@NLobjective(model, Min, abs(z1+5.0) + abs(z2-3.0))
status = solve(model)
println("Objective value: ", getobjectivevalue(model))
println("z1:",getvalue(z1))
println("z2:",getvalue(z2))
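For reference, an absolute value in a minimization objective can also be kept linear with the standard epigraph reformulation. This is not what the code above does (it switches to a nonlinear objective); it is an alternative sketch, and the auxiliary variables t_1 and t_2 are names introduced here for illustration only:

```latex
\begin{aligned}
\min_{z_1, z_2, t_1, t_2}\quad & t_1 + t_2 \\
\text{s.t.}\quad & t_1 \ge z_1 + 5, \quad t_1 \ge -(z_1 + 5), \\
& t_2 \ge z_2 - 3, \quad t_2 \ge -(z_2 - 3), \\
& 2.5 \le z_1 \le 5, \quad -1 \le z_2 \le 1 .
\end{aligned}
```

At an optimum each t_i equals the corresponding absolute value, so the minimum is unchanged: here z_1 = 2.5 and z_2 = 1, giving 7.5 + 2 = 9.5. The trick only works when absolute values are minimized; maximizing them is not convex, which appears to be why the next answer resorts to binary variables.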
I did it a different way:
using JuMP
using GLPK
AvgOperationtime = [1 2]#[2.0 2.0 2.0 3.3333333333333335 2.5 2.0 2.0 2.5 2.5 2.0 2.0]
Operationsnumberremovecounter = [1 0;1 1]#[1.0 1.0 1.0 1.0 -0.0 1.0 1.0 1.0 -0.0 1.0 1.0; 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0; 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0; 1.0 1.0 1.0 -0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0; 1.0 1.0 1.0 -0.0 1.0 1.0 1.0 -0.0 1.0 1.0 1.0]
Modelnumber = 2
Operationsnumber = 2
Basecaseworkload = 2
y = 0.1
Highestnumber = 999
Solver = GLPK.Optimizer
#Operationtime[1,1 X;0,9 2]
m = Model(with_optimizer(Solver));
@variable(m, Operationtime[1:Modelnumber,1:Operationsnumber]>=0);
@variable(m, Absoluttime[1:Modelnumber,1:Operationsnumber]>=0);
@variable(m, Absolutchoice[1:Modelnumber,1:Operationsnumber,1:2], Bin);
@objective(m, Max, sum(Absoluttime[M,O]*Operationsnumberremovecounter[M,O] for M=1:Modelnumber,O=1:Operationsnumber))
#How much Time can differ
@constraint(m, BorderOperationtime1[M=1:Modelnumber,O=1:Operationsnumber], AvgOperationtime[O]*(1-y) <= Operationtime[M,O]);
@constraint(m, BorderOperationtime2[M=1:Modelnumber,O=1:Operationsnumber], AvgOperationtime[O]*(1+y) >= Operationtime[M,O]);
#Workload
@constraint(m, Worklimit[O=1:Operationsnumber], sum(Operationtime[M,O]*Operationsnumberremovecounter[M,O] for M=1:Modelnumber) == Basecaseworkload);
#Absolut
@constraint(m, Absolutchoice1[M=1:Modelnumber,O=1:Operationsnumber], sum(Absolutchoice[M,O,X] for X=1:2) == 1);
@constraint(m, Absoluttime1[M=1:Modelnumber,O=1:Operationsnumber], Absoluttime[M,O] <= Operationtime[M,O]-AvgOperationtime[O]+Absolutchoice[M,O,1]*Highestnumber);
@constraint(m, Absoluttime2[M=1:Modelnumber,O=1:Operationsnumber], Absoluttime[M,O] <= AvgOperationtime[O]-Operationtime[M,O]+Absolutchoice[M,O,2]*Highestnumber);
optimize!(m);
println("Termination status: ", JuMP.termination_status(m));
println("Primal status: ", JuMP.primal_status(m));

Using pandas to plot - array error

I have a file that looks like this:
> loc.38167 h3k4me1 1.8299 1.5343 0.0 0.0 1.8299 1.5343 0.0 ....
> loc.08652 h3k4me3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ....
I want to plot 500 random 'loc.' points on a graph. Each loc. has 100 values. I use the following python script:
import math
import seaborn as sns

# Read each row into a dict: location id -> list of float values (Python 2 idioms)
file = open('h3k4me3.tab.data')
data = {}
for line in file:
    cols = line.strip().split('\t')
    vals = map(float,cols[2:])
    data[cols[0]] = vals
file.close()
randomA = data.keys()[:500]
window = int(math.ceil(5000.0 / 100))
xticks = range(-2500,2500,window)
sns.tsplot([data[k] for k in randomA],time=xticks)
However, I get
ValueError: arrays must all be same length
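One plausible reading of this error is that not every row yields the same number of values, so the rows cannot be stacked into a rectangular array. That is an assumption, since the input file is not shown. A small diagnostic along the following lines could confirm or rule it out; the helper name `row_lengths` and the three sample rows are hypothetical and only stand in for the real file.

```python
from collections import Counter

def row_lengths(lines, sep="\t", skip=2):
    """Map each row key to its number of values (columns after the first `skip`)."""
    lengths = {}
    for line in lines:
        cols = line.strip().split(sep)
        if len(cols) <= skip:
            continue  # blank or malformed line
        lengths[cols[0]] = len(cols) - skip
    return lengths

# Hypothetical three-row sample; the second row is one value short.
sample_lines = [
    "loc.1\th3k4me3\t0.0\t1.5\t2.0\n",
    "loc.2\th3k4me3\t0.0\t1.5\n",
    "loc.3\th3k4me3\t0.1\t0.2\t0.3\n",
]
lengths = row_lengths(sample_lines)
histogram = Counter(lengths.values())      # how many rows have each length
expected = histogram.most_common(1)[0][0]  # treat the most frequent length as intended
ragged = sorted(k for k, n in lengths.items() if n != expected)
print("rows with an unexpected length:", ragged)
```

If the real file reports any ragged rows, dropping or padding them before plotting would be the next thing to try. The x-axis vector also has to have as many entries as each row has values.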

Faster way to split a string and count characters using R?

I'm looking for a faster way to calculate GC content for DNA strings read in from a FASTA file. This boils down to taking a string and counting the number of times that the letter 'G' or 'C' appears. I also want to specify the range of characters to consider.
I have a working function that is fairly slow, and it's causing a bottleneck in my code. It looks like this:
##
## count the number of GCs in the characters between start and stop
##
gcCount <- function(line, st, sp){
  chars = strsplit(as.character(line),"")[[1]]
  numGC = 0
  for(j in st:sp){
    ##nested ifs faster than an OR (|) construction
    if(chars[[j]] == "g"){
      numGC <- numGC + 1
    }else if(chars[[j]] == "G"){
      numGC <- numGC + 1
    }else if(chars[[j]] == "c"){
      numGC <- numGC + 1
    }else if(chars[[j]] == "C"){
      numGC <- numGC + 1
    }
  }
  return(numGC)
}
Running Rprof gives me the following output:
> a = "GCCCAAAATTTTCCGGatttaagcagacataaattcgagg"
> Rprof(filename="Rprof.out")
> for(i in 1:500000){gcCount(a,1,40)};
> Rprof(NULL)
> summaryRprof(filename="Rprof.out")
self.time self.pct total.time total.pct
"gcCount" 77.36 76.8 100.74 100.0
"==" 18.30 18.2 18.30 18.2
"strsplit" 3.58 3.6 3.64 3.6
"+" 1.14 1.1 1.14 1.1
":" 0.30 0.3 0.30 0.3
"as.logical" 0.04 0.0 0.04 0.0
"as.character" 0.02 0.0 0.02 0.0
$by.total
total.time total.pct self.time self.pct
"gcCount" 100.74 100.0 77.36 76.8
"==" 18.30 18.2 18.30 18.2
"strsplit" 3.64 3.6 3.58 3.6
"+" 1.14 1.1 1.14 1.1
":" 0.30 0.3 0.30 0.3
"as.logical" 0.04 0.0 0.04 0.0
"as.character" 0.02 0.0 0.02 0.0
$sampling.time
[1] 100.74
Any advice for making this code faster?
Better to not split at all, just count the matches:
gcCount2 <- function(line, st, sp){
sum(gregexpr('[GCgc]', substr(line, st, sp))[[1]] > 0)
}
That's an order of magnitude faster.
A small C function that just iterates over the characters would be yet another order of magnitude faster.
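The idea of counting matches in a substring without splitting it into characters carries over to other languages. As a minimal sketch in Python (not R, and not the C function mentioned above), with `gc_count` as an illustrative name and 1-based inclusive positions chosen to match R's substr:

```python
def gc_count(seq, start, stop):
    """Count G/C (either case) in positions start..stop, 1-based and inclusive."""
    window = seq[start - 1:stop].upper()
    return window.count("G") + window.count("C")

a = "GCCCAAAATTTTCCGGatttaagcagacataaattcgagg"
total = gc_count(a, 1, 40)    # whole 40-character test string
first16 = gc_count(a, 1, 16)  # upper-case prefix only
print(total, first16)
```

The work is done by built-in string methods on the slice, which mirrors the approach of letting gregexpr scan the substring instead of looping over characters.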
A one liner:
table(strsplit(toupper(a), '')[[1]])
I don't know that it's any faster, but you might want to look at the R package seqinR - http://pbil.univ-lyon1.fr/software/seqinr/home.php?lang=eng. It is an excellent, general bioinformatics package with many methods for sequence analysis. It's in CRAN (which seems to be down as I write this).
GC content would be:
mysequence <- s2c("agtctggggggccccttttaagtagatagatagctagtcgta")
GC(mysequence) # 0.4761905
That's from a string, you can also read in a fasta file using "read.fasta()".
There's no need to use a loop here.
Try this:
gcCount <- function(line, st, sp){
chars = strsplit(as.character(line),"")[[1]][st:sp]
length(which(tolower(chars) == "g" | tolower(chars) == "c"))
}
Try this function from the stringi package:
> stri_count_fixed("GCCCAAAATTTTCCGG",c("G","C"))
[1] 3 5
or you can use regex version to count g and G
> stri_count_regex("GCCCAAAATTTTCCGGggcc",c("G|g|C|c"))
[1] 12
or you can use tolower function first and then stri_count
> stri_trans_tolower("GCCCAAAATTTTCCGGggcc")
[1] "gcccaaaattttccggggcc"
time performance
> microbenchmark(gcCount(x,1,40),gcCount2(x,1,40), stri_count_regex(x,c("[GgCc]")))
Unit: microseconds
expr min lq median uq max neval
gcCount(x, 1, 40) 109.568 112.42 113.771 116.473 146.492 100
gcCount2(x, 1, 40) 15.010 16.51 18.312 19.213 40.826 100
stri_count_regex(x, c("[GgCc]")) 15.610 16.51 18.912 20.112 61.239 100
another example for longer string. stri_dup replicates string n-times
> stri_dup("abc",3)
[1] "abcabcabc"
As you can see, for longer sequence stri_count is faster :)
> y <- stri_dup("GCCCAAAATTTTCCGGatttaagcagacataaattcgagg",100)
> microbenchmark(gcCount(y,1,40*100),gcCount2(y,1,40*100), stri_count_regex(y,c("[GgCc]")))
Unit: microseconds
expr min lq median uq max neval
gcCount(y, 1, 40 * 100) 10367.880 10597.5235 10744.4655 11655.685 12523.828 100
gcCount2(y, 1, 40 * 100) 360.225 369.5315 383.6400 399.100 438.274 100
stri_count_regex(y, c("[GgCc]")) 131.483 137.9370 151.8955 176.511 221.839 100
Thanks to all for this post,
To optimize a script in which I want to calculate the GC content of 100M sequences of 200bp, I ended up testing the different methods proposed here. Ken Williams' method performed best (2.5 hours), better than seqinr (3.6 hours). Using stringr's str_count reduced this to 1.5 hours.
In the end I coded it in C++ and called it using Rcpp, which cuts the computation time down to 10 minutes!
Here is the C++ code:
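#include <Rcpp.h>
#include <string>
using namespace Rcpp;
// [[Rcpp::export]]
float pGC_cpp(std::string s) {
  // Count upper-case G and C only; lower-case bases are not counted.
  int count = 0;
  for (std::size_t i = 0; i < s.size(); i++) {
    if (s[i] == 'G') count++;
    else if (s[i] == 'C') count++;
  }
  // GC content as a percentage of the sequence length.
  float pGC = (float)count / s.size();
  pGC = pGC * 100;
  return pGC;
}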
Which I call from R typing:
library(Rcpp)
sourceCpp("pGC_cpp.cpp")
pGC_cpp("ATGCCC")