Probability of a record to belong to the same data-set given a Bayesian Network built with BNlearn - bayesian-networks

I'm trying to determine the probability of a new record to belong to an existing data-set. I'm using the BNlearn R package to build a Bayesian Network using a large training set.
I then want to assess how anomalous a new record is. For this I want to get a probability for a record for which I have full evidence but don't need to predict any variable.
The cpquery method seems to require at least one variable to predict.
The documentation states that the predict method will ignore entries with full evidence.
I spent a day searching the BNlearn documentation without success. So I think it is either not possible with BNlearn or I'm missing the right vocabulary to find what I need in the docs.
Any insights from someone who has more experience with BNlearn are welcome.

The cpquery function estimates the conditional probability of an event given some evidence. However, the bnlearn documentation states:
"If either event or evidence is set to TRUE an unconditional probability query is performed with respect to that argument."
For example, with the asia dataset:
library(bnlearn)
data(asia)
bn.dag <- model2network("[A][S][T|A][L|S][B|S][D|B:E][E|T:L][X|E]")
bn.fitted <- bn.fit(bn.dag, asia)
# cpquery is a Monte Carlo estimate, so repeat the query to see its spread.
prob <- numeric(1000)  # must exist before it is filled inside the loop
for (i in seq_along(prob)) {
  prob[i] <- cpquery(bn.fitted,
                     event = (A == "no") & (S == "no") & (T == "no") & (L == "no") &
                             (B == "no") & (E == "no") & (X == "no") & (D == "no"),
                     evidence = TRUE)
}
summary(prob)
# Result:
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#  0.2714  0.2864  0.2908  0.2909  0.2954  0.3132
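Because the record has full evidence, its probability under the network is also available exactly, without sampling: by the chain rule it is the product of each node's conditional probability given its parents. Below is a minimal Python sketch of that calculation on a made-up three-node network (the node names, structure and probability tables are assumptions for illustration, not the asia network). I believe bnlearn exposes the same quantity through logLik() on the fitted network, but check the documentation before relying on that.

```python
import math

# Hypothetical toy network S -> L -> X with made-up probability tables.
parents = {"S": [], "L": ["S"], "X": ["L"]}
cpt = {
    "S": {(): {"yes": 0.5, "no": 0.5}},
    "L": {("yes",): {"yes": 0.1, "no": 0.9},
          ("no",): {"yes": 0.01, "no": 0.99}},
    "X": {("yes",): {"yes": 0.9, "no": 0.1},
          ("no",): {"yes": 0.05, "no": 0.95}},
}

def log_joint(record):
    """Log-probability of a fully observed record: sum of log P(node | parents)."""
    total = 0.0
    for node, pa in parents.items():
        parent_values = tuple(record[p] for p in pa)
        total += math.log(cpt[node][parent_values][record[node]])
    return total

record = {"S": "no", "L": "no", "X": "no"}
p = math.exp(log_joint(record))  # equals 0.5 * 0.99 * 0.95
```

A low log-probability relative to the training records would flag the new record as unusual. Working on the log scale avoids underflow when the network has many nodes.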

Running a logistic model in JAGS - Can you vectorize instead of looping over individual cases?

I'm fairly new to JAGS, so this may be a dumb question. I'm trying to run a model in JAGS that predicts the probability that a one-dimensional random walk process will cross boundary A before crossing boundary B. This model can be solved analytically via the following logistic model:
Pr(A,B) = 1/(1 + exp(-2 * (d/sigma) * theta))
where "d" is the mean drift rate (positive values indicate drift toward boundary A), "sigma" is the standard deviation of that drift rate and "theta" is the distance between the starting point and the boundary (assumed to be equal for both boundaries).
My dataset consists of 50 participants, who each provide 1800 observations. My model assumes that d is determined by a particular combination of observed environmental variables (which I'll just call 'x'), and a weighting coefficient that relates x to d (which I'll call 'beta'). Thus, there are three parameters: beta, sigma, and theta. I'd like to estimate a single set of parameters for each participant. My intention is to eventually run a hierarchical model, where group level parameters influence individual level parameters. However, for simplicity, here I will just consider a model in which I estimate a single set of parameters for one participant (and thus the model is not hierarchical).
My model in rjags would be as follows:
model{
  for ( i in 1:Ntotal ) {
    d[i] <- x[i] * beta
    probA[i] <- 1/(1+exp(-2 * (d[i]/sigma) * theta ) )
    y[i] ~ dbern(probA[i])
  }
  beta ~ dunif(-10,10)
  sigma ~ dunif(0,10)
  theta ~ dunif(0,10)
}
This model runs fine, but takes ages to run. I'm not sure how JAGS carries out the code, but if this code were run in R, it would be rather inefficient because it would have to loop over cases, running the model for each case individually. The time required to run the analysis would therefore increase rapidly as the sample size increases. I have a rather large sample, so this is a concern.
Is there a way to vectorise this code so that it can calculate the likelihood for all of the data points at once? For example, if I were to run this as a simple maximum likelihood model, I would vectorize the model and calculate the probability of the data given particular parameter values for all 1800 cases provided by the participant (and thus would not need the for loop). I would then take the log of these likelihoods and add them together to give a single log-likelihood for all the observations given by the participant. This method has enormous time savings. Is there a way to do this in JAGS?
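For reference, the summed log-likelihood described above can be sketched in a few lines of plain Python. The data values below are made up for illustration. The sketch also shows something worth checking in the simplified model: beta, sigma and theta enter the likelihood only through the single ratio beta * theta / sigma, so different parameter triples with the same ratio give the same log-likelihood.

```python
import math

def log_likelihood(x, y, beta, sigma, theta):
    """Sum of Bernoulli log-likelihoods under Pr(A) = 1/(1 + exp(-2*(d/sigma)*theta))."""
    total = 0.0
    for xi, yi in zip(x, y):
        d = xi * beta
        p = 1.0 / (1.0 + math.exp(-2.0 * (d / sigma) * theta))
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total

# Made-up data for illustration only.
x = [-1.0, -0.5, 0.2, 0.8, 1.5]
y = [0, 1, 0, 1, 1]

# Three different parameter triples, all with beta * theta / sigma = 1.5.
ll_1 = log_likelihood(x, y, beta=1.0, sigma=2.0, theta=3.0)
ll_2 = log_likelihood(x, y, beta=2.0, sigma=4.0, theta=3.0)
ll_3 = log_likelihood(x, y, beta=1.0, sigma=1.0, theta=1.5)
```

All three values agree up to floating-point rounding, which is why a sampler cannot separate the three parameters from the data alone.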
I hope I am not misreading the situation (and I apologize in advance if I am), but your question seems to come from a conceptual misunderstanding of how JAGS works (or WinBUGS or OpenBUGS, for that matter).
Your model code is not executed line by line, because it is not written in a procedural programming language. So vectorizing will not help.
What you wrote is a description of your model, because the JAGS language is declarative.
Once JAGS reads your model, it builds an MCMC sampler (in effect, a transition kernel) whose stationary distribution is the posterior distribution of your parameters given your observed data. JAGS does nothing else with your model code.
All the time you spent waiting for the program to run was actually spent waiting (and hoping) for the MCMC to reach its relaxation time.
So what is making your program take so long is that the resulting chain must have poor relaxation (mixing) properties, or something along those lines.
That is why vectorizing a model description that is read only once will be of very little help.
So your problem lies somewhere else.
I hope this helps and, if not, sorry.
All the best.
You can't vectorise in the same way that you would in R, but if you can group observations with the same probability expression (i.e. common d[i]) then you can use a Binomial rather than Bernoulli distribution which will help enormously. If each observation has a unique d[i] then you are stuck I'm afraid.
Another alternative is to look at Stan which is generally faster for large data sets like yours.
Matt
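To illustrate the grouping idea, here is a minimal Python sketch (made-up data, and the simplified model from the question). Observations that share the same x value share the same probability, so they can be collapsed into a count of successes out of n trials. The Binomial log-likelihood of the counts then differs from the summed Bernoulli log-likelihood only by a constant that does not depend on the parameters. In JAGS the grouped version would use dbin with one node per distinct x value instead of one dbern node per observation.

```python
import math
from collections import defaultdict

def prob_a(x, beta, sigma, theta):
    return 1.0 / (1.0 + math.exp(-2.0 * (x * beta / sigma) * theta))

def bernoulli_loglik(xs, ys, beta, sigma, theta):
    total = 0.0
    for x, y in zip(xs, ys):
        p = prob_a(x, beta, sigma, theta)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def group_counts(xs, ys):
    """Collapse observations with identical x into [successes, trials]."""
    counts = defaultdict(lambda: [0, 0])
    for x, y in zip(xs, ys):
        counts[x][0] += y
        counts[x][1] += 1
    return dict(counts)

def binomial_loglik(counts, beta, sigma, theta):
    total = 0.0
    for x, (k, n) in counts.items():
        p = prob_a(x, beta, sigma, theta)
        total += (math.log(math.comb(n, k))
                  + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return total

# Made-up data: ten observations but only three distinct x values.
xs = [0.5] * 4 + [-0.5] * 3 + [1.0] * 3
ys = [1, 1, 0, 1, 0, 1, 0, 1, 1, 1]
counts = group_counts(xs, ys)

params = dict(beta=1.0, sigma=2.0, theta=1.5)
gap = binomial_loglik(counts, **params) - bernoulli_loglik(xs, ys, **params)
```

Here ten Bernoulli terms become three Binomial terms. The saving only materialises when many observations really do share the same predictor values.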
Thanks for the responses. Yes, you make a good point that the parameters in the model I showed might be unidentified.
I should've pointed out that model was a simplified version. The full model is below:
model{
  for ( i in 1:Ntotal ) {
    aExpectancy[i] <- 1/(1+exp(-gamma*(aTimeRemaining[i] - aDiscrepancy[i]*aExpectedLag[i]) ) )
    bExpectancy[i] <- 1/(1+exp(-gamma*(bTimeRemaining[i] - bDiscrepancy[i]*bExpectedLag[i]) ) )
    aUtility[i] <- aValence[i]*aExpectancy[i]/(1 + discount * (aTimeRemaining[i]))
    bUtility[i] <- bValence[i]*bExpectancy[i]/(1 + discount * (bTimeRemaining[i]))
    aMotivationalValueMean[i] <- aUtility[i]*aQualityMean[i]
    bMotivationalValueMean[i] <- bUtility[i]*bQualityMean[i]
    aMotivationalValueVariance[i] <- (aUtility[i]*aQualitySD[i])^2 + (bUtility[i]*bQualitySD[i])^2
    bMotivationalValueVariance[i] <- (aUtility[i]*aQualitySD[i])^2 + (bUtility[i]*bQualitySD[i])^2
    mvDiffVariance[i] <- aMotivationalValueVariance[i] + bMotivationalValueVariance[i]
    meanDrift[i] <- (aMotivationalValueMean[i] - bMotivationalValueMean[i])
    probA[i] <- 1/(1+exp(-2*(meanDrift[i]/sqrt(mvDiffVariance[i])) *theta ) )
    y[i] ~ dbern(probA[i])
  }
  theta ~ dunif(0,10)
  discount ~ dunif(0,10)
  gamma ~ dunif(0,1)
}
In this model, the estimated parameters are theta, discount, and gamma, and these parameters can be recovered.
When I run the model on the observations for a single participant (Ntotal = 1800), the model takes about 5 minutes to run, which is totally fine.
However, when I run the model on the entire sample (45 participants X 1800 cases each = 78,900 observations), I've had it running for 24 hours and it's less than 50% of the way through.
This seems odd, as I would expect it to just take 45 times as long, so 4 or 5 hours at most. Am I missing something?

Select the row and column element of a dataframe and decide the regression variables

I have a pandas dataframe containing 10 columns and 1000 rows. I want to perform a multinomial logit regression. However, I want to use a different independent variable depending on the choice.
For example, I have the Y as "mode chosen" i.e car, bus, train etc. and I have attributes of the modes as X such as "car travel time" "bus travel time" etc.
I want to perform MNLogit wherein, when the chosen mode is car, the X must be car travel time and when the chosen mode is bus, the X must be bus travel time. So what I want to do is
if data.mode_chosen == "car":
    X = data["car travel time"]
elif data.mode_chosen == "bus":
    X = data["bus travel time"]
# .....
I want to access the Nth row and Mth column, find whether it is car or bus or train, and then choose my independent variables according to that choice.
PS: I am an amateur programmer and new to this forum, kindly let me know if I have not made my question clear.
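For the data-access part of the question, a minimal pandas sketch is below. The column names follow the question; the four rows and the "train travel time" column are made up for illustration. It builds, row by row, the name of the column that matches the chosen mode and pulls that value out. Whether a single "chosen mode" regressor is the right input for MNLogit is a separate modelling question that this sketch does not settle.

```python
import pandas as pd

# Made-up example rows; column names follow the question.
data = pd.DataFrame({
    "mode_chosen": ["car", "bus", "train", "car"],
    "car travel time": [20, 35, 50, 15],
    "bus travel time": [30, 25, 60, 40],
    "train travel time": [45, 40, 30, 55],
})

# For each row, the name of the column that holds the chosen mode's travel time.
time_col = data["mode_chosen"] + " travel time"

# Look up, row by row, the value in the column named by time_col.
data["chosen travel time"] = [
    data.at[idx, col] for idx, col in time_col.items()
]

# Accessing a single cell by position: row n, column m (both zero-based).
n, m = 1, 0
cell = data.iloc[n, m]
```

The comparison in the question (`data.mode_chosen == "car"`) produces a whole column of True/False values, which is why it cannot be used directly in a plain `if` statement.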

Statistical procedure decision

I have two problems in hand:
1. I have a dependent variable, let's say GDP, and many other independent variables. I need to know what procedure I can use to find which of the IVs are leading or lagging indicators. I have to develop the model in SAS and Excel.
2. Based on some buy/sell rules derived from the crossing of an x-day EMA and a y-day SMA, I need to compute returns. I need to know which procedure I should use to find what values of x and y will give me the best returns (x and y being taken from an array of prefixed pairs like (200,50), (300,30), etc.). Can a neural network be used here? If so, can anyone give me a link to some documentation on how to carry this out?
Ad 1: probably easiest is to calculate the linear correlation between the time series. Using both simultaneous and shifted time series will tell you something about lead/lag.
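Ad 2: look into optimization, not neural networks. The initial and easiest approach is a grid search: calculate the returns for each combination of x and y, then keep the best one. A sketch in MATLAB-style code (calculate_returns is your own function):
x = 50:50:500;
y = 10:10:100;
returns = zeros(numel(x), numel(y));   % "return" itself is a reserved word
for i = 1:numel(x)
    for j = 1:numel(y)
        returns(i,j) = calculate_returns(x(i), y(j));
    end
end
[bestReturn, k] = max(returns(:));
[iBest, jBest] = ind2sub(size(returns), k);   % best pair is x(iBest), y(jBest)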
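The shifted-correlation idea from "Ad 1" can be sketched as follows in plain Python. The two series are synthetic (an indicator that leads the target by exactly two steps), and the function names are made up for illustration. With the sign convention used here, a positive best lag means the indicator leads the target.

```python
import math

def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

def lagged_correlations(target, indicator, max_lag):
    """corr(target[t], indicator[t - lag]) for lag = -max_lag .. max_lag."""
    n = len(target)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = target[lag:], indicator[:n - lag]
        else:
            a, b = target[:n + lag], indicator[-lag:]
        result[lag] = pearson(a, b)
    return result

# Synthetic data: the target repeats the indicator two steps later.
indicator = [math.sin(0.3 * t) + 0.1 * math.cos(1.7 * t) for t in range(60)]
target = [0.0, 0.0] + indicator[:-2]

corrs = lagged_correlations(target, indicator, max_lag=4)
best_lag = max(corrs, key=corrs.get)
```

On real macroeconomic series it is usually worth differencing or detrending first, since two trending series correlate strongly at every lag.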

fmincon : impose vector greater than zero constraint

How do you impose a constraint that all values in a vector you are trying to optimize for are greater than zero, using fmincon()?
According to the documentation, I need some parameters A and b, where A*x ≤ b, but I think if I make A a vector of -1's and b 0, then I will have optimized for the sum of x>0, instead of each value of x greater than 0.
Just in case you need it, here is my code. I am trying to optimize over a vector (x) such that the (componentwise) product of x and a matrix (called multiplierMatrix) makes a matrix for which the sum of the columns is x.
function [sse] = myfun(x) % this is a nested function
    bigMatrix = repmat(x,1,120) .* multiplierMatrix;
    answer = sum(bigMatrix,1)';
    sse = sum((expectedAnswer - answer).^2);
end
xGuess = ones(120,1);
[sse xVals] = fmincon(@myfun,xGuess,???);
Let me know if I need to explain my problem better. Thanks for your help in advance!
You can use the lower bound:
xGuess = ones(120,1);
lb = zeros(120,1);
[sse xVals] = fmincon(@myfun,xGuess, [],[],[],[], lb);
Note that xVals and sse should probably be swapped (if their names mean anything): fmincon returns the solution vector first and the objective value second.
The lower bound lb means that elements in your decision variable x will never fall below the corresponding element in lb, which is what you are after here.
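The empties ([]) indicate you're not using linear constraints (A, b, Aeq, beq), only the lower bounds lb. If you did want to use A and b instead, A = -eye(120) with b = zeros(120,1) constrains each element separately; a single row of -1's only constrains the sum, as you suspected.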
Some advice: fmincon is a pretty advanced function. You'd better memorize the documentation on it, and play with it for a few hours, using many different example problems.

Maximizing in mathematica with multiple maxima

I'm trying to compute the maximum of a function of one variable that has several peaks [plot not shown]
(it is calculated from a non-trivial convolution, so, no, I don't have an expression for it)
Using the command:
NMaximize[{f[x], 0 < x < 1}, x, AccuracyGoal -> 4, PrecisionGoal -> 4]
(I'm not that worried about super accuracy, a rough estimate of 10^-4 is already enough)
The result of this is x* = 0.55, which is not what it should be (i.e., it is picking the third peak).
Is there any way of telling Mathematica that the global maximum is the first one when counting from x = 0 (I know this is always true), or of making Mathematica search with a better approach? (Notice, I don't want things like a Simulated Annealing approach; each evaluation is very costly!)
Thanks very much!
Try FindMaximum with a starting point of 0 or some similarly small value.
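The idea behind that suggestion is a local search started at the left edge, which climbs the first peak it meets and stops there. Below is a rough Python sketch of the same idea, not of what FindMaximum does internally: it marches right in small steps until the function value drops, then refines the bracketed peak with a golden-section search. The three-peak test function, the step size and the function name are assumptions for illustration; the step must be smaller than the width of the first peak, or the march can skip over it.

```python
import math

def first_peak(f, lo, hi, step=0.02, tol=1e-4):
    """Locate the first local maximum of f to the right of lo."""
    # Phase 1: march right until the function value drops.
    left, mid = lo, lo
    f_mid = f(lo)
    x = lo + step
    while x <= hi:
        fx = f(x)
        if fx < f_mid:
            break  # the peak is now bracketed in [left, x]
        left, mid, f_mid = mid, x, fx
        x += step
    right = min(x, hi)

    # Phase 2: golden-section search for the maximum inside [left, right].
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = left, right
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2

# Made-up test function with peaks at 0.15, 0.35 and 0.55; the first is the highest.
def f(x):
    return (1.00 * math.exp(-((x - 0.15) / 0.03) ** 2)
            + 0.90 * math.exp(-((x - 0.35) / 0.03) ** 2)
            + 0.95 * math.exp(-((x - 0.55) / 0.03) ** 2))

x_star = first_peak(f, 0.0, 1.0)
```

Each phase needs only a handful of function evaluations, which matters when every evaluation is an expensive convolution.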