Negative binomial, Poisson-gamma mixture in WinBUGS - Bayesian

WinBUGS trap error
model
{
    for (i in 1:5323) {
        Y[i] ~ dpois(mu[i])                  # NB model as a Poisson-gamma mixture
        mu[i] ~ dgamma(b[i], a[i])           # gamma mixing distribution for the Poisson mean
        a[i] <- b[i] / Emu[i]
        b[i] <- B * X[i]
        Emu[i] <- beta0 * pow(X[i], beta1)   # model equation
    }
    # Priors
    beta0 ~ dunif(0, 10)   # parameter
    beta1 ~ dunif(0, 10)   # parameter
    B ~ dunif(0, 10)       # over-dispersion parameter
}
X[] Y[]
1.5 0
2.9 0
1.49 0
0.39 0
3.89 0
2.03 0
0.91 0
0.89 0
0.97 0
2.16 0
0.04 0
1.12 1
2.26 0
3.6 1
1.94 0
0.41 1
2 0
0.9 0
0.9 0
0.9 0
0.1 0
0.88 1
0.91 0
6.84 2
3.14 3
END
This is just a sample of the data. The model question comes from Ezra Hauer, "The Art of Regression Modeling in Road Safety", Section 8.3.2. The model stops with an **undefined real result** error.
The aim is a fully Bayesian, one-step model, not empirical Bayes.
The results should be close to the MLE estimates: beta0 = 1.65, beta1 = 0.871, over-dispersion = 0.531.
X is the only covariate and Y is the observed collision count, so X cannot be zero or negative and Y cannot be negative. When the same Poisson-gamma mixture is fitted by maximum likelihood, the estimation works.
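As a cross-check, a minimal sketch of that maximum-likelihood fit in Python with statsmodels (the CSV file name is a hypothetical export of the Excel data). Taking logs of the model equation gives log(Emu) = log(beta0) + beta1 * log(X), i.e. a negative binomial regression with a log link; note that statsmodels' NB2 uses a constant over-dispersion alpha, whereas the model above lets the shape b[i] = B * X[i] vary with X, so the dispersion estimates are only roughly comparable:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.discrete.discrete_model import NegativeBinomial

df = pd.read_csv("collisions.csv")        # hypothetical export of the Excel data
exog = sm.add_constant(np.log(df["X"]))   # log(Emu) = log(beta0) + beta1 * log(X)
res = NegativeBinomial(df["Y"], exog).fit()

print(np.exp(res.params["const"]))        # beta0
print(res.params["X"])                    # beta1
print(res.params["alpha"])                # NB2 over-dispersion (see caveat above)
```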
How can I make this model work and resolve the error in WinBUGS?

The data is in Excel. The model ran fine when I selected only the 1,000 largest observations.
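That symptom points at the smaller X values: with b[i] = B * X[i], a tiny X drives the gamma shape parameter toward zero, which WinBUGS typically reports as an undefined real result. A quick, hedged way to screen for such rows in Python (again assuming a hypothetical CSV export of the Excel sheet):

```python
import pandas as pd

df = pd.read_csv("collisions.csv")   # hypothetical export of the Excel data

# Rows that would break pow(X, beta1) or the dgamma parameters outright.
print(df[(df["X"] <= 0) | (df["Y"] < 0) | df[["X", "Y"]].isna().any(axis=1)])

# Very small X gives a near-zero gamma shape b[i] = B * X[i]; these rows
# are the likely culprits if the model only runs on the largest values.
print(df.nsmallest(20, "X"))
```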

Related

Internal node predictions of xgboost model

Is it possible to calculate the internal node predictions of an xgboost model? The R package, gbm, provides a prediction for internal nodes of each tree.
The xgboost output, however, only shows predictions for the final leaves of the model.
xgboost output:
Notice that the Quality column has the final prediction for the leaf node in row 6. I would like that value for each of the internal nodes as well.
Tree Node ID Feature Split Yes No Missing Quality Cover
1: 0 0 0-0 Sex=female 0.50000 0-1 0-2 0-1 246.6042790 222.75
2: 0 1 0-1 Age 13.00000 0-3 0-4 0-4 22.3424225 144.25
3: 0 2 0-2 Pclass=3 0.50000 0-5 0-6 0-5 60.1275253 78.50
4: 0 3 0-3 SibSp 2.50000 0-7 0-8 0-7 23.6302433 9.25
5: 0 4 0-4 Fare 26.26875 0-9 0-10 0-9 21.4425507 135.00
6: 0 5 0-5 Leaf NA <NA> <NA> <NA> 0.1747126 42.50
R gbm output:
In the R gbm package output, the prediction column contains values for both leaf nodes (SplitVar == -1) and the internal nodes. I would like access to these values from the xgboost model.
SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight Prediction
0 1 0.000000000 1 8 15 32.564591 445 0.001132514
1 2 9.500000000 2 3 7 3.844470 282 -0.085827382
2 -1 0.119585850 -1 -1 -1 0.000000 15 0.119585850
3 0 1.000000000 4 5 6 3.047926 207 -0.092846157
4 -1 -0.118731665 -1 -1 -1 0.000000 165 -0.118731665
5 -1 0.008846912 -1 -1 -1 0.000000 42 0.008846912
6 -1 -0.092846157 -1 -1 -1 0.000000 207 -0.092846157
Question:
How do I access or calculate predictions for the internal nodes of an xgboost model? I would like to use them for a greedy, poor man's version of SHAP scores.
The solution to this problem is to dump the xgboost model as JSON with with_stats=True. That adds the cover statistic to the output, which can be used to distribute the leaf predictions up through the internal nodes:
from anytree import AnyNode   # the parsed JSON dump is held in an anytree structure
from numpy import float32

# `Leaf` and the left/right/cover/contrib attributes come from the author's
# own wrapper around the JSON dump, not from xgboost itself.
def _calculate_contribution(node: AnyNode) -> float32:
    if isinstance(node, Leaf):
        return node.contrib
    return (
        node.left.cover * _calculate_contribution(node.left)
        + node.right.cover * _calculate_contribution(node.right)
    ) / node.cover
The internal contribution is the weighted average of the child contributions. Using this method, the generated results exactly match those returned when calling the predict method with pred_contribs=True and approx_contribs=True.
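If you would rather avoid wrapper classes, a self-contained sketch of the same cover-weighted recursion working directly on the JSON dump might look like this (`bst` is assumed to be a trained xgboost.Booster; the node keys nodeid, children, cover, and leaf are what get_dump emits when with_stats=True):

```python
import json
import xgboost as xgb

def internal_predictions(bst: xgb.Booster) -> list:
    """Return, per tree, a dict mapping nodeid -> contribution."""
    trees = [json.loads(t) for t in bst.get_dump(dump_format="json", with_stats=True)]

    def walk(node, out):
        if "leaf" in node:                 # leaf: its contribution is its value
            out[node["nodeid"]] = node["leaf"]
            return node["leaf"]
        left, right = node["children"]
        # Internal node: cover-weighted average of the child contributions.
        value = (left["cover"] * walk(left, out)
                 + right["cover"] * walk(right, out)) / node["cover"]
        out[node["nodeid"]] = value
        return value

    results = []
    for tree in trees:
        contribs = {}
        walk(tree, contribs)
        results.append(contribs)
    return results
```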

Modelling an efficiency drop of machine after certain hours of work

I am building a linear programming model of machines manufacturing some parts in AMPL, using the CPLEX solver.
My problem is how to model a 10% step drop in machine efficiency after 100 hours of work. I know that the proper approach in my case is to use a binary variable; unfortunately I have no idea how to set it up.
Could you show an example of how to model the behaviour described above in linear programming?
EDIT: attaching current AMPL code
Data:
data;
param M := 3; # machines count
param N := 5; # parts count
param efficiency: # efficiency [pc / h]
1 2 3 4 5 :=
1 0.85 1.30 0.65 1.50 0.40
2 0.65 0.80 0.55 1.50 0.70
3 1.20 0.95 0.35 1.70 0.40;
param R := 1 45 # machines costs [usd / h]
2 35
3 40;
param C := 1 60 # minimal manufactured counts of particular parts
2 60
3 60
4 120
5 120;
Model:
model;
param M; # machines count
param N; # parts count
param efficiency {1..M, 1..N}; # efficiency [pc / h]
param R {1..M}; # machines costs [usd / h]
param C {1..N}; # minimal manufactured counts of particular parts
var t {1..M, 1..N}; # time[machine,part]
var d {n in 1..N} = sum {m in 1..M} t[m,n] * efficiency[m,n];
minimize cost: sum {m in 1..M, n in 1..N} R[m] * t[m,n];
subject to c1 {n in 1..N}:
d[n] >= C[n];
subject to t1 {m in 1..M}:
sum {n in 1..N} t[m,n] <= 180; # machine time must be less than 180
subject to t2 {m in 1..M, n in 1..N}:
t[m,n] >= 0; # times must be non-negative
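One standard formulation of the step drop, sketched here in Python with PuLP rather than AMPL (a hedged single-machine, single-part illustration with made-up numbers, not a rewrite of the full model above): split the machine's time into t1, the first 100 hours at full efficiency, and t2, the hours beyond 100 at 90% efficiency, with a binary z forcing t2 to stay at zero until t1 has reached the full 100 hours.

```python
import pulp

e, Tmax, demand, cost = 1.2, 180, 150, 40   # made-up efficiency [pc/h], h, pc, usd/h

m = pulp.LpProblem("efficiency_drop", pulp.LpMinimize)
t1 = pulp.LpVariable("t1", lowBound=0, upBound=100)          # first 100 h, full efficiency
t2 = pulp.LpVariable("t2", lowBound=0, upBound=Tmax - 100)   # overtime at 90% efficiency
z = pulp.LpVariable("z", cat="Binary")                       # 1 once the first 100 h are used

m += cost * (t1 + t2)                    # objective: machine cost
m += e * t1 + 0.9 * e * t2 >= demand     # production requirement
m += t1 >= 100 * z                       # t2 is allowed only after t1 hits 100 h...
m += t2 <= (Tmax - 100) * z              # ...otherwise t2 is forced to 0
m.solve()
print(pulp.value(t1), pulp.value(t2))
```

In the AMPL model above, the same idea would be two time variables and one binary per machine, with constraints of the same shape as the last two lines.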

Time-series prediction by separating dependent and independent variables

Suppose, I have this kind of data:
date pollution dew temp press wnd_dir wnd_spd snow rain
2010-01-02 00:00:00 129.0 -16 -4.0 1020.0 SE 1.79 0 0
2010-01-02 01:00:00 148.0 -15 -4.0 1020.0 SE 2.68 0 0
2010-01-02 02:00:00 159.0 -11 -5.0 1021.0 SE 3.57 0 0
2010-01-02 03:00:00 181.0 -7 -5.0 1022.0 SE 5.36 1 0
2010-01-02 04:00:00 138.0 -7 -5.0 1022.0 SE 6.25 2 0
I want to apply a neural network for time-series prediction of pollution.
It should be noted that the other variables (dew, temp, press, wnd_dir, wnd_spd, snow, rain) are independent variables; pollution is the dependent one.
If I implement an LSTM as in here, the LSTM learns all the variables together and the model can predict all of them.
But it is not necessary to predict the independent variables; the only requirement is pollution, the dependent variable.
Is there any way to implement an LSTM, or a better architecture, that takes the independent variables as inputs but learns and predicts only the dependent variable, and so gives a better prediction of pollution?
It seems like the example is predicting only pollution already. If you look at the reframed data:
var1(t-1) var2(t-1) var3(t-1) var4(t-1) var5(t-1) var6(t-1) \
1 0.129779 0.352941 0.245902 0.527273 0.666667 0.002290
2 0.148893 0.367647 0.245902 0.527273 0.666667 0.003811
3 0.159960 0.426471 0.229508 0.545454 0.666667 0.005332
4 0.182093 0.485294 0.229508 0.563637 0.666667 0.008391
5 0.138833 0.485294 0.229508 0.563637 0.666667 0.009912
var7(t-1) var8(t-1) var1(t)
1 0.000000 0.0 0.148893
2 0.000000 0.0 0.159960
3 0.000000 0.0 0.182093
4 0.037037 0.0 0.138833
5 0.074074 0.0 0.109658
var1 appears to be pollution. As you can see, you have the values from the previous step (t-1) for all variables, and the value at the current step t only for pollution (var1(t)).
This last column is what the example feeds as y, as you can see in these lines:
# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
So the network is already predicting only pollution.
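To make that concrete, here is a minimal sketch of the corresponding network (hypothetical random stand-in data; in the tutorial, train_X and train_y come from the reframed table above). All eight variables enter as inputs, but the single Dense(1) output means only pollution is predicted:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Stand-in data with the tutorial's shapes: 8 features at t-1,
# a single target (pollution at t).
n_samples, n_steps, n_features = 1000, 1, 8
train_X = np.random.rand(n_samples, n_steps, n_features).astype("float32")
train_y = np.random.rand(n_samples).astype("float32")   # pollution only

model = Sequential([
    LSTM(50, input_shape=(n_steps, n_features)),
    Dense(1),                             # one output unit: pollution at t
])
model.compile(loss="mae", optimizer="adam")
model.fit(train_X, train_y, epochs=10, batch_size=72, verbose=0)
```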

Predictive Maintenance - How to use Bayesian Optimization with objective function and Logistic Regression with Gradient Descent together?

I'm trying to reproduce the problem shown on arimo.com.
It is an example of how to build a predictive-maintenance machine learning model for hard-drive failures. The section I really don't understand is how Bayesian Optimization with a custom objective function and Logistic Regression with Gradient Descent are used together. What are the hyper-parameters being optimized? What is the flow of the problem?
As described in our previous post, Bayesian Optimization [6] is used
to find the best hyperparameter values. The objective function to be
optimized in the hyperparameter tuning is the following score measured
on the validation set:
S = alpha * fnr + (1 - alpha) * fpr
where fpr and fnr are the False Positive and False Negative rates
obtained on the validation set. Our goal is to keep False Positive
rate low, therefore we use alpha = 0.2. Since the validation set is
highly unbalanced, we found out that standard scores like Precision,
F1-score, etc. do not work well. In fact, using this custom score is
crucial for the model to obtain a good performance generally.
Note that we only use the above score when running Bayesian
Optimization. To train logistic regression models, we use Gradient
Descent with the usual ridge loss function.
My dataframe before features selection:
index date serial_number model capacity_bytes failure Read Error Rate Reallocated Sectors Count Power-On Hours (POH) Temperature Current Pending Sector Count age yesterday_temperature yesterday_age yesterday_reallocated_sectors_count yesterday_read_error_rate yesterday_current_pending_sector_count yesterday_power_on_hours tomorrow_failure
0 77947 2013-04-11 MJ0331YNG69A0A Hitachi HDS5C3030ALA630 3000592982016 0 0 0 4909 29 0 36348284.0 29.0 20799895.0 0.0 0.0 0.0 4885.0 0.0
1 79327 2013-04-11 MJ1311YNG7EWXA Hitachi HDS5C3030ALA630 3000592982016 0 0 0 8831 24 0 36829839.0 24.0 21280074.0 0.0 0.0 0.0 8807.0 0.0
2 79592 2013-04-11 MJ1311YNG2ZD9A Hitachi HDS5C3030ALA630 3000592982016 0 0 0 13732 26 0 36924206.0 26.0 21374176.0 0.0 0.0 0.0 13708.0 0.0
3 80715 2013-04-11 MJ1311YNG2ZDBA Hitachi HDS5C3030ALA630 3000592982016 0 0 0 12745 27 0 37313742.0 27.0 21762591.0 0.0 0.0 0.0 12721.0 0.0
4 79958 2013-04-11 MJ1323YNG1EK0C Hitachi HDS5C3030ALA630 3000592982016 0 524289 0 13922 27 0 37050016.0 27.0 21499620.0 0.0 0.0 0.0 13898.0 0.0
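Reading the quoted passage, the flow appears to be two nested loops: Bayesian Optimization (outer) proposes hyperparameter values, and for each proposal a ridge-penalized logistic regression is trained by gradient descent (inner) and scored with S on the validation set. A hedged sketch with scikit-learn and scikit-optimize, using synthetic unbalanced data and assuming the tuned hyperparameter is the ridge strength (the post does not list the exact hyperparameters):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from skopt import gp_minimize

# Synthetic stand-in for the (highly unbalanced) drive data.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

ALPHA = 0.2  # weight from the quoted passage: S = alpha*FNR + (1 - alpha)*FPR

def objective(params):
    (log10_reg,) = params
    # Inner loop: logistic regression trained by (stochastic) gradient descent
    # with an L2 (ridge) penalty; the regularization strength is the
    # hyperparameter being tuned here (an assumption on our part).
    clf = SGDClassifier(loss="log_loss", penalty="l2",
                        alpha=10.0 ** log10_reg, max_iter=1000)
    clf.fit(X_train, y_train)
    tn, fp, fn, tp = confusion_matrix(y_val, clf.predict(X_val)).ravel()
    fnr, fpr = fn / (fn + tp), fp / (fp + tn)
    return ALPHA * fnr + (1 - ALPHA) * fpr   # custom score S on the validation set

# Outer loop: Bayesian Optimization searches log10 of the ridge strength.
result = gp_minimize(objective, [(-6.0, 0.0)], n_calls=30, random_state=0)
print(result.x, result.fun)
```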

Can anyone help me evaluate testing set data in Weka?

I have one training dataset and one testing dataset. I am using the Weka Explorer, trying to build a model with the Random Forest algorithm. After building the model, when I apply it to my testing set via the "Supplied test set" option and "Re-evaluate model on current test set", it shows the output below.
What am I doing wrong?
Training Model:
=== Evaluation on training set ===
Time taken to test model on training data: 0.24 seconds
=== Summary ===
Correctly Classified Instances 5243 98.9245 %
Incorrectly Classified Instances 57 1.0755 %
Kappa statistic 0.9439
Mean absolute error 0.0453
Root mean squared error 0.1137
Relative absolute error 23.2184 %
Root relative squared error 36.4074 %
Coverage of cases (0.95 level) 100 %
Mean rel. region size (0.95 level) 59.3019 %
Total Number of Instances 5300
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.996 0.067 0.992 0.996 0.994 0.944 0.999 1.000 0
0.933 0.004 0.968 0.933 0.950 0.944 0.999 0.990 1
Weighted Avg. 0.989 0.060 0.989 0.989 0.989 0.944 0.999 0.999
=== Confusion Matrix ===
a b <-- classified as
4702 18 | a = 0
39 541 | b = 1
Model Implement on my testing dataset:
=== Evaluation on test set ===
Time taken to test model on supplied test set: 0.22 seconds
=== Summary ===
Total Number of Instances 0
Ignored Class Unknown Instances 4000
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.000 0.000 0.000 0.000 0.000 0.000 ? ? 0
0.000 0.000 0.000 0.000 0.000 0.000 ? ? 1
Weighted Avg. NaN NaN NaN NaN NaN NaN NaN NaN
=== Confusion Matrix ===
a b <-- classified as
0 0 | a = 0
0 0 | b = 1
Your test dataset does not appear to have labels: note the line "Ignored Class Unknown Instances 4000", which means Weka skipped all 4000 instances because their class values are missing.
You can only evaluate prediction quality using labeled data; with an unlabeled test set you can still output the predictions themselves, but no accuracy statistics.