Bayesian Networks with multiple layers

So I'm trying to solve a problem with Bayesian networks. I know the conditional probabilities of some event, say that it will rain. Suppose that I measure (boolean) values from each of four sensors (A1 - A4). I know the prior probability of rain, and I know the probability of rain given the measurement on each of the sensors.
Now I add in a new twist. A4 is no longer available, but B1 and B2 are (they are also boolean sensors). I know the conditional probabilities of both B1 and B2 given the measurement of A4. How do I incorporate those probabilities into my Bayesian network to replace the lost data from A4?
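For concreteness, the layered structure being described (Rain → A1..A4, with B1 and B2 as children of A4) can be written down as an ordinary discrete Bayesian network and queried with A4 left unobserved, so inference marginalises over it. The sketch below uses pgmpy with made-up probabilities; note that its CPDs are parameterised as P(sensor | Rain), whereas the question states P(rain | sensor), so a Bayes-rule conversion would be needed before plugging in real numbers.

    # Layered BN sketch with pgmpy; all probabilities are placeholders.
    from pgmpy.factors.discrete import TabularCPD
    from pgmpy.inference import VariableElimination
    from pgmpy.models import BayesianNetwork

    model = BayesianNetwork(
        [("Rain", "A1"), ("Rain", "A2"), ("Rain", "A3"), ("Rain", "A4"),
         ("A4", "B1"), ("A4", "B2")]
    )

    def sensor_cpd(name, parent):
        # P(sensor | parent), made-up numbers
        return TabularCPD(name, 2, [[0.9, 0.3], [0.1, 0.7]],
                          evidence=[parent], evidence_card=[2])

    model.add_cpds(
        TabularCPD("Rain", 2, [[0.8], [0.2]]),      # prior P(Rain)
        sensor_cpd("A1", "Rain"), sensor_cpd("A2", "Rain"),
        sensor_cpd("A3", "Rain"), sensor_cpd("A4", "Rain"),
        sensor_cpd("B1", "A4"), sensor_cpd("B2", "A4"),
    )
    assert model.check_model()

    # A4 is never observed; inference simply marginalises over it
    infer = VariableElimination(model)
    print(infer.query(["Rain"],
                      evidence={"A1": 1, "A2": 0, "A3": 1, "B1": 1, "B2": 0}))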

Your problem is a perfect fit for Multi-Entity Bayesian Networks (MEBN). MEBN is an extension of standard BNs that uses First-Order Logic (FOL). It basically allows nodes to be added and/or removed based on the specific situation at hand. You define a template for creating BNs on the fly, based on the knowledge currently available.
There are several papers on it available on the Web. A classic reference to this work is "Multi-Entity Bayesian Networks Without Multi-Tears".
We have implemented MEBN inside UnBBayes. You can get a copy of it by following the instructions at http://sourceforge.net/p/unbbayes/discussion/156015/thread/cb2e0887/. An example can be seen in the paper "Probabilistic Ontology and Knowledge Fusion for Procurement Fraud Detection in Brazil" at http://link.springer.com/chapter/10.1007/978-3-642-35975-0_2.
If you are interested in it, I can give you more pointers later on.
Cheers,
Rommel

Related

Multiple-input multiple-output CNN with custom loss function

I have a set of m x n 2D input arrays, namely A, B and C, and I have to predict two 2D output arrays, namely d and e, for which I do have the expected values. You can think of the inputs/outputs as grey images if you like.
Because the spatial information is relevant (these are actually 2D physical domains), I want to use a Convolutional Neural Network to predict d and e. My design (not tested yet) looks as follows:
Because I have multiple inputs, I guess I should use multiple columns (or branches) to find different features for each of the inputs (they look fairly different). Each of these columns follows an encoder-decoder architecture used in segmentation (see SegNet): a Conv2D block involves a convolution + batch normalisation + ReLU layer, and a Deconv2D block involves a deconvolution + batch normalisation + ReLU.
Then, I can merge the outputs of the columns by, for example, concatenating, averaging or taking the maximum. To obtain the original m x n shape for each of the outputs, I have seen that I could do this with a 1 x 1 kernel convolution.
I want to predict the two outputs from that single layer. Is that okay from the network structure point of view? Finally, my loss function depends on the outputs themselves compared to the targets, plus another relation I want to impose.
I would like to have some expert opinion on this, since this is my first CNN design and I am not sure if it makes sense as it is now and/or if there are better approaches (or network architectures) to this problem.
I posted this originally on datascience but did not get much feedback. I am now posting it here since there is a bigger community on these topics, plus I would be very grateful to receive implementation tips besides network architectural ones. Thanks.
I think your design makes sense in general:
Since A, B, and C are fairly different, you pass each input through its own transform sub-network and then fuse them together; this is your intermediate representation.
From the intermediate representation, you apply additional CNN layers to decode D and E, respectively.
Several things:
A, B, and C looking different does not necessarily mean you can't stack them together as a 3-channel input. The decision should be based on whether the values in A, B, and C have different meanings. For example, suppose A is a grayscale image, B is a depth map, and C is also a grayscale image captured by a different camera. Then A and B are better processed in your suggested way, but A and C can be concatenated into one single input before feeding it to your network.
D and E are two outputs of the network and will be trained in a multi-task manner. Of course, they should share some latent features, and one should split at these features to apply a downstream branch with non-shared weights for each output. However, where to split is usually tricky.
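For illustration, here is a minimal Keras sketch of the branch-then-fuse-then-split layout described in this answer; the layer sizes, the simple per-output loss, and all names are placeholders rather than anything taken from the question.

    # Minimal sketch of the branch -> fuse -> two-head layout described above.
    # All layer sizes, the loss, and variable names are placeholders.
    import tensorflow as tf
    from tensorflow.keras import layers, Model

    m, n = 64, 64  # assumed spatial size

    def encoder(x):
        # one Conv2D "block": convolution + batch norm + ReLU, as in the question
        x = layers.Conv2D(32, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        return layers.ReLU()(x)

    def decoder_head(x):
        x = layers.Conv2DTranspose(32, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        # 1x1 convolution to recover an m x n single-channel map
        return layers.Conv2D(1, 1, padding="same")(x)

    inputs = [layers.Input(shape=(m, n, 1), name=name) for name in ("A", "B", "C")]
    branches = [encoder(inp) for inp in inputs]      # one column per input
    fused = layers.Concatenate()(branches)           # fuse the columns
    d_out = decoder_head(fused)                      # head for output d
    e_out = decoder_head(fused)                      # head for output e
    model = Model(inputs, [d_out, e_out])

    def custom_loss(y_true, y_pred):
        # per-output MSE; an extra relation term coupling d and e would need
        # a model-level loss (e.g. model.add_loss) instead of this per-output one
        return tf.reduce_mean(tf.square(y_true - y_pred))

    model.compile(optimizer="adam", loss=custom_loss)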
It is really a broad question, asking for answers that rely mostly on opinion. Here are my two cents though, which you might find interesting as they do not go along with the previous answers here and on datascience.
First, I wouldn't go with separate columns for each input. AFAIK, when different inputs are processed by different columns, it is almost always the case that the network is some sort of Siamese network and the columns share the same weights, or at least the columns all need to produce a similar code. That is not your case here, so I would simply not bother.
Second, you are blessed with a problem with a dense output and no need to learn a code. This should direct you straight to U-nets, which outperform any bottleneck-designed network without much effort. U-nets were introduced for dense segmentation, but they shine at any dense-output problem really.
In short, just stack your inputs together and use a U-net.
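To make the suggestion concrete, here is a rough Keras sketch of a small U-Net over the stacked inputs; the depth, filter counts and plain MSE loss are placeholders, not anything specified in the question or answer.

    # Rough sketch of "stack the inputs and use a U-Net".
    # Depth, filter counts and names are placeholders.
    from tensorflow.keras import layers, Model

    m, n = 64, 64  # assumed spatial size, divisible by 4 for the two poolings here

    x_in = layers.Input(shape=(m, n, 3))          # A, B, C stacked as 3 channels

    # encoder
    c1 = layers.Conv2D(32, 3, padding="same", activation="relu")(x_in)
    p1 = layers.MaxPooling2D()(c1)
    c2 = layers.Conv2D(64, 3, padding="same", activation="relu")(p1)
    p2 = layers.MaxPooling2D()(c2)

    # bottleneck
    b = layers.Conv2D(128, 3, padding="same", activation="relu")(p2)

    # decoder with skip connections (the defining U-Net ingredient)
    u2 = layers.Conv2DTranspose(64, 3, strides=2, padding="same")(b)
    u2 = layers.Concatenate()([u2, c2])
    u2 = layers.Conv2D(64, 3, padding="same", activation="relu")(u2)
    u1 = layers.Conv2DTranspose(32, 3, strides=2, padding="same")(u2)
    u1 = layers.Concatenate()([u1, c1])
    u1 = layers.Conv2D(32, 3, padding="same", activation="relu")(u1)

    # two dense output maps, d and e, produced by the same decoder
    out = layers.Conv2D(2, 1, padding="same")(u1)
    model = Model(x_in, out)
    model.compile(optimizer="adam", loss="mse")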

Is multiple regression the best approach for optimization?

I am being asked to take a look at a scenario where a company has many projects that they wish to complete, but, as with any company, budget comes into play. There is a Y value, a predefined score, with multiple X inputs. There are also 3 main constraints: Capital Cost, Expense Cost and Time for Completion in Months.
The ask is whether an algorithmic approach could be used to optimize which projects should be done for the year given the 3 constraints. The approach should also give different results if the constraint values change. The suggested method is multiple regression, though I have looked into different approaches in detail. I would like to ask the wider community: has anyone dealt with a similar problem, and what approaches have you used?
The first thing we should understand is that a conclusion about something is not based on a single argument.
This comes from communication theory: every human builds a frame of knowledge (an understanding, a conclusion), and that frame is constructed from many pieces of knowledge / information.
The consequence is that we cannot use single-variable linear regression to create an ML / DL system.
At the very least we should use two different variables to form a sub-conclusion. If we insist on using a single variable with linear regression (y = mx + c), it is like forcing the computer to predict something with low accuracy. Whatever optimization method you pick, the accuracy stays low, because single-variable linear regression applied to real life amounts to predicting a 'habit' from data rather than calculating the real condition.
That means we should use multiple linear regression (y = m1*x1 + m2*x2 + ... + c) so the computer can understand / draw a conclusion / build a regression model. But it is not quite that simple: because the computer tries to draw a conclusion from data with multiple characteristics / variances, you must classify the data and the conclusions.
As an example, try to make the computer understand Pythagoras.
We know the Pythagorean formula is c = ((a^2) + (b^2))^(1/2), and we want our computer to predict the hypotenuse (c) from two input values (a and b). To do that, we should build a model, i.e. a multiple linear regression formula, for Pythagoras.
Step 1: of course, we should generate a multi-characteristic Pythagorean data set.
Here is an example:
a b c
3 4 5
8 6 10
3 14 etc. (try to put in 10 to 20 rows of data)
Then try to derive a regression formula with multiple regression to predict c based on the a and b values.
You will find that the regression is highly accurate (higher than 98%) for some values and not so accurate (under 90%) for others. For example, a=3 with b=14 or b=15 will give a low-accuracy result (under 90%).
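If you prefer to try this outside a spreadsheet, the exercise above can be reproduced in a few lines. The sketch below uses numpy and scikit-learn, and takes "accuracy" to mean one minus the relative error, which is an assumption since the accuracy measure is not defined precisely here.

    # Fit a multiple linear regression c ~ a + b on Pythagorean data and look
    # at how accurate each prediction is ("accuracy" = 1 - relative error).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    a = rng.uniform(1, 20, size=15)
    b = rng.uniform(1, 20, size=15)
    c = np.sqrt(a**2 + b**2)              # true Pythagorean hypotenuse

    X = np.column_stack([a, b])
    reg = LinearRegression().fit(X, c)    # c ≈ m1*a + m2*b + intercept
    pred = reg.predict(X)

    accuracy = 1 - np.abs(pred - c) / c   # per-sample accuracy
    for ai, bi, ci, pi, acc in zip(a, b, c, pred, accuracy):
        print(f"a={ai:5.1f} b={bi:5.1f} true c={ci:6.2f} "
              f"pred={pi:6.2f} accuracy={acc:6.1%}")

    # points with low accuracy could then be split off into their own group
    # and refit, which is the manual grouping procedure described below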
So you must perform an optimization. But how?
I know many methods to optimize, but I found, in a manual way, that if I exclude the data points that give low-accuracy results, put them in a separate group, and then recalculate the regression on that excluded group, I get a significantly better result. Repeat this until you reach the accuracy target that you want.
Each group of data, with its own new regression, is a new class.
This means I end up with several multiple regressions based on the data I input (one regression per group of data / class), and the accuracy is really high, 99% - 99.99%.
With several classes, each regression functions as a 'label' for its class; this is what happens in the background of the automated computation. In many modules, the user feels they are putting in a 'string' object as the label, but in truth the string object is bound to a regression that was constructed as the label.
With some conditional parameters you can get good ML with a minimal amount of training data.
Try it in Excel / LibreOffice before going any further.
Try to follow the tutorial from this video
and implement it on simple data that is easy to construct in Excel, like Pythagoras.
So the answer is yes: multiple regression is the best approach for optimization.

How are samples inside a PU calculated in the intra mode of HEVC?

I've read several articles about intra prediction in HEVC and I still have some questions.
For a PU of NxN pixels, we use 4xN + 1 reference samples (the row above and above-right of the PU, the column to the left and below-left of the PU, and the sample at the top-left corner). Then, based on the MPM, a mode is selected to work with.
I now have a row of reference samples, a column of reference samples and a mode. Based on this, how are the samples inside the PU calculated?
In this article http://codepaint.kaist.ac.kr/wp-content/uploads/2013/10/Intra-Coding-of-the-HEVC-Standard.pdf there are ready-to-use formulae which take the coordinates and the selected mode as parameters. Is it really that simple?
Now, imagine we have a picture of a checkerboard. How can intra prediction be used? In some cases, we might not want to use the reference samples of a previously decoded PU. How do we deal with that?
Thanks
I now have a row of reference samples, a column of reference samples and a mode. Based on this, how are the samples inside the PU calculated?
As stated in the article, the encoder first decides on the mode and the sizes of the PUs and TUs during the RDO process. From the list of modes, let's say mode number 25 is chosen to predict the current block. Mode number 25 is one of the angular modes, so we use the formula given for angular modes and obtain the output. It is worth mentioning that although the formula is simple, the details of preparing the reference samples make it a little tricky.
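To make the angular formula concrete, here is a rough sketch of the core interpolation for the positive-angle vertical modes only; it ignores the reference-sample filtering, the negative-angle reference extension and the boundary smoothing that the standard also specifies, so it is an illustration rather than a conforming implementation.

    import numpy as np

    def angular_vertical_predict(ref_top, intra_pred_angle, N):
        """Simplified HEVC angular prediction for positive-angle vertical modes.

        ref_top: row of reference samples above the block, starting at the
                 sample above-left of the block (length >= 2*N + 1).
        intra_pred_angle: angle parameter of the chosen mode (0..32 here).
        N: block size (NxN).
        """
        pred = np.zeros((N, N), dtype=np.int32)
        for y in range(N):
            # projected displacement of row y onto the reference row
            idx = ((y + 1) * intra_pred_angle) >> 5       # integer part
            fact = ((y + 1) * intra_pred_angle) & 31      # fractional part (1/32)
            for x in range(N):
                if fact == 0:
                    pred[y, x] = ref_top[x + idx + 1]
                else:
                    # linear interpolation between two neighbouring references
                    pred[y, x] = ((32 - fact) * ref_top[x + idx + 1]
                                  + fact * ref_top[x + idx + 2] + 16) >> 5
        return pred

    # e.g. an 8x8 block predicted from a flat reference row with angle 32
    block = angular_vertical_predict(np.full(17, 128, dtype=np.int32), 32, 8)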
Now, imagine we have a picture of a checkerboard. How can intra prediction be used?
First, the prediction mode should be found. Let's say we decided on mode X; we then refer to the formula related to mode X and form our prediction block, similar to what was discussed for the previous question.
In some cases, we might not want to use the reference samples of a previously decoded PU. How do we deal with that?
Intra prediction is basically formed from these reference samples, and if you are not using these pixels you are not doing INTRA prediction. Maybe you should shift to INTER prediction, which uses blocks from other frames and motion vectors (MVs) to predict the current block.
The question is interesting to me.
I can simply say that the mode is selected by the encoder.
In the HEVC encoder, all 35 modes are tried (in view of the complexity, the encoder uses fast algorithms to simplify the selection process; you can find papers on this), and finally the encoder selects the best mode through the RDO process. So the decoder cannot select the reference samples on its own; the decoder has to use the same samples as the encoder.
In SCC (screen content coding), which is an extension of HEVC, the IBC (intra block copy) mode is used to select reference samples from the already reconstructed area.

How to identify relevant features in WEKA?

I would like to perform feature analysis in WEKA. I have a data set of 8 features and 65 instances.
I would like to use the feature selection and optimization functionalities that are available for machine learning methods like SVM.
For example, in WEKA I would like to know how I can display which of the features contribute most to the classification result.
I think that WEKA provides a nice graphical user interface and allows a very detailed analysis of the influence of single features. But I don't know how to use it. Any help?
You have two options:
You can perform attribute selection using filters. For instance, you can use the AttributeSelection tab (or filter) with the search method Ranker and the attribute evaluation metric InfoGainAttributeEval. This way you get a ranked list of the most predictive features according to their Information Gain scores. I have done this many times with good results. Sometimes it even helps to increase the accuracy of SVMs, which are known not to need (much) feature selection. You can try other search methods in order to find subgroups of coupled predictors, and other metrics.
You can just look at the coefficients in the SVM output. For instance, in linear SVMs the classifier is a linear function of the form a1·f1 + a2·f2 + ... + an·fn + f(n+1) > 0, where the ai are the attribute values for an instance and the fi are the "weights" obtained by the SVM training algorithm. Consequently, weights with values close to 0 represent attributes that do not count for much and are thus bad predictors; extreme weights (either positive or negative) represent good predictors.
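If you want to reproduce both options programmatically rather than through the WEKA GUI, here is a rough scikit-learn sketch (not WEKA itself): mutual information stands in for InfoGainAttributeEval, and the weights of a linear SVM are read off directly. The dataset and parameter values are placeholders, not the questioner's 8-feature set.

    # Rough scikit-learn analogue of the two options above (not WEKA itself).
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVC

    data = load_breast_cancer()
    X, y = data.data, data.target

    # Option 1: ranked list of predictive features (mutual information ~ info gain)
    scores = mutual_info_classif(X, y, random_state=0)
    for i in np.argsort(scores)[::-1][:5]:
        print(f"{data.feature_names[i]:<25s} score={scores[i]:.3f}")

    # Option 2: weights of a linear SVM; near-zero weights = weak predictors,
    # large positive or negative weights = strong predictors
    Xs = StandardScaler().fit_transform(X)      # scale so weights are comparable
    svm = LinearSVC(C=1.0, max_iter=10000).fit(Xs, y)
    weights = svm.coef_.ravel()
    for i in np.argsort(np.abs(weights))[::-1][:5]:
        print(f"{data.feature_names[i]:<25s} weight={weights[i]:+.3f}")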
Additionally, you can check the visualization options available for a particular classifier (e.g. J48 is a decision tree, and the attribute used in the root test is the best predictor). You can check the AttributeSelection tab visualization options as well.
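In the same spirit, the "look at the root test" trick can be checked in code; the sketch below uses scikit-learn's CART tree as a stand-in for J48, on a placeholder dataset.

    # Analogous to inspecting J48's root test in WEKA: fit a decision tree and
    # report which attribute it chose for the root split.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    data = load_iris()
    tree = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

    root_feature = tree.tree_.feature[0]      # index of the attribute at the root
    print("Root split on:", data.feature_names[root_feature])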

Probability Density Function with Zero Standard Deviation

I am now implementing an email filtering application using the Naive Bayes algorithm. My application uses the Spambase Data Set from the UCI Machine Learning Repository. Since the attributes are continuous, I calculate the probability using the Probability Density Function (PDF). However, when I evaluate the data using k-fold cross-validation, a training fold may contain only 0 for one of its attributes. As a result, I get a standard deviation of 0, the PDF returns NaN, and a huge number of spam messages are misclassified with that training set. What should I do to fix the problem?
You could use a discrete PDF, which will always be bounded.
Alternatively, simply ignore any attribute with zero variance. There is no point in including distributions with zero variance, because they won't actually do anything. For example, suppose you want to know how old I am, and I tell you that I live on planet Earth. That shouldn't change your estimate, because every piece of data you have is for people on planet Earth.
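One concrete way to implement either fix is sketched below: floor the variance with a small epsilon before evaluating the Gaussian PDF (this is essentially what scikit-learn's GaussianNB does through its var_smoothing parameter), or drop zero-variance attributes up front. The numbers are made up.

    # Avoid the NaN by flooring the variance, or by dropping constant attributes.
    import numpy as np

    def gaussian_log_pdf(x, mean, var, eps=1e-9):
        var = max(var, eps)                  # zero-variance attribute -> tiny floor
        return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

    # attribute that is 0 for every training instance
    values = np.zeros(50)
    print(gaussian_log_pdf(0.0, values.mean(), values.var()))   # finite, no NaN

    # or simply remove zero-variance attributes from the training matrix X
    X = np.random.rand(50, 5)
    X[:, 2] = 0.0                            # simulate an all-zero attribute
    X_reduced = X[:, X.var(axis=0) > 0]      # keeps only informative attributes

    # the flooring idea in scikit-learn:
    # from sklearn.naive_bayes import GaussianNB
    # GaussianNB(var_smoothing=1e-9)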