I want to test Bayesian network independence between M and D.
To test it, I should marginalize out all descendant nodes of M and D.
But node D is itself a descendant of M, so I am confused: should I marginalize out D in order to test independence between M and D?
Similarly, how do I test independence between M and I in a situation like this?
Thanks.
You may find it easier to keep the ancestors (see: ancestral graph) rather than to remove (marginalize out) the descendants :-)
Indeed ancestors({D,M}) = {M,I,D}. The ancestral graph of M -> I -> D is M - I - D. M and D are connected, and therefore not independent. (The same holds for M and I.)
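As a quick sanity check, the ancestral-graph procedure can be sketched in a few lines of plain Python (the graph and node names follow the M -> I -> D example above; this is an illustration, not library code):

```python
# Sketch of the ancestral-graph test for the chain M -> I -> D.
# `parents` maps each node to its parents in the DAG.
parents = {"M": set(), "I": {"M"}, "D": {"I"}}

def ancestors(nodes):
    """All ancestors of `nodes`, including the nodes themselves."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def connected(a, b):
    """Are a and b connected in the undirected ancestral graph of {a, b}?"""
    keep = ancestors({a, b})
    # undirected edges of the induced subgraph
    edges = {frozenset((u, p)) for u in keep for p in parents[u] if p in keep}
    # moralize: marry parents that share a child (a no-op for a simple chain)
    for u in keep:
        ps = [p for p in parents[u] if p in keep]
        edges |= {frozenset((p, q)) for p in ps for q in ps if p != q}
    # breadth-first search from a towards b
    frontier, seen = {a}, {a}
    while frontier:
        frontier = {v for e in edges for u in frontier if u in e
                    for v in e if v not in seen}
        seen |= frontier
    return b in seen

print(connected("M", "D"))  # True: M and D are dependent
print(connected("M", "I"))  # True: M and I are dependent
```

Connectivity in the (moralized) ancestral graph corresponds to dependence; if `connected` returned False, the two variables would be marginally independent.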
In the GPT-2 paper, under Section 2, page 3, it says:
Since the supervised objective is the same as the unsupervised objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective.
I didn't follow this line of reasoning. What is the logic behind concluding this?
The underlying principle here is: if f is a function with domain D and S is a subset of D, and if d maximizes f over D and d happens to lie in S, then d also maximizes f over S.
In simpler words: a maximizer over the whole domain that lies in a subset is also a maximizer over that subset.
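That principle is easy to verify on a toy function (the numbers here are arbitrary, chosen only for illustration):

```python
# Toy check: a global maximizer over the full domain D that happens to lie
# in a subset S is also a maximizer of f restricted to S.
D = {0, 1, 2, 3, 4, 5}
S = {1, 3, 5}
f = lambda x: -(x - 3) ** 2  # peaks at x = 3

d_star = max(D, key=f)                    # global maximizer over D (here: 3)
assert d_star in S                        # it happens to lie in the subset
assert f(d_star) == max(f(x) for x in S)  # so it also maximizes f over S
```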
Now how does this apply to GPT-2? Let's look at how GPT-2 is trained.
First step: GPT-2 uses unsupervised training to learn the distribution of the next letter in a sequence by examining examples in a huge corpus of existing text. By this point, it should be able to output valid words and be able to complete things like "Hello ther" to "Hello there".
Second step: GPT-2 uses supervised training on specific tasks, such as answering questions posed to it: "Who wrote the book The Origin of Species?" Answer: "Charles Darwin".
Question: Does the second step of supervised training undo general knowledge that GPT-2 learned in the first step?
Answer: No, the question-answer pair "Who wrote the book the origin of species? Charles Darwin." is itself valid English text that comes from the same distribution that the network is trying to learn in the first place. It may well even appear verbatim in the corpus of text from step 1. Therefore, these supervised examples are elements of the same domain (valid English text) and optimizing the loss function to get these supervised examples correct is working towards the same objective as optimizing the loss function to get the unsupervised examples correct.
In simpler words, supervised question-answer pairs or other specific tasks that GPT-2 was trained to do use examples from the same underlying distribution as the unsupervised corpus text, so they are optimizing towards the same goal and will have the same global optimum.
Caveat: you can still accidentally end up in a local minimum due to (over)training on these supervised examples that you might not have run into otherwise. However, GPT-2 was revolutionary in its field, and whether or not this happened with GPT-2, it still made significant progress over the prior state of the art.
Is it possible to calculate an equal error rate (EER) for a multi-class classification problem?
I'm working on a biometric user authentication problem.
If yes, can someone please provide me with some information on how to calculate it?
If not, please provide some alternatives to EER?
Your question is related to this one: ROC for multiclass classification, since EER (equal error rate) is calculated from the ROC by adjusting the acceptance threshold.
I will review the conceptual process of what EER means for multiclass classification.
Suppose you have n > 2 classes, for example A, B, C, and a set of samples x in X with their true labels. The idea is to binarize the problem by converting it into n binary classification problems: for each class (say A) and sample (say x) there are two possibilities: x is in A, or x is not in A. If a in A is classified in class A, this is a true positive; if a is classified in B or C, it is a false rejection or false negative. Similarly, if b in B is classified in A, it is a false acceptance or false positive. Then for each class you can compute the FAR (false acceptance rate) and FRR (false rejection rate), adjust the thresholds, and compute the EER for each class. Finally, you can average over the per-class EERs. Another approach is to compute the FRR and FAR per class, average them first, and then adjust the thresholds so that the average FRR and average FAR are equal (this is more complicated).
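As a minimal sketch of the first strategy (numpy only, with made-up scores and labels), the per-class EER in the one-vs-rest setting can be found by sweeping the acceptance threshold until FAR and FRR cross:

```python
import numpy as np

def eer(scores, is_genuine):
    """Equal error rate for one binary (one-vs-rest) problem.

    scores: higher means 'accept'; is_genuine: boolean mask of true positives.
    Sweeps all thresholds and returns the rate where FAR and FRR are closest.
    """
    scores = np.asarray(scores, dtype=float)
    is_genuine = np.asarray(is_genuine, dtype=bool)
    best = None
    for t in np.sort(np.unique(scores)):
        accepted = scores >= t
        far = np.mean(accepted[~is_genuine])   # impostors accepted
        frr = np.mean(~accepted[is_genuine])   # genuine samples rejected
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]

# Made-up scores for the binary problem "class A vs. rest":
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]   # 1 = genuine (in A), 0 = impostor
print(eer(scores, np.array(labels) == 1))  # 0.25
```

Repeating this for the "B vs. rest" and "C vs. rest" problems and averaging the results gives the first (average-of-EERs) approach described above.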
An introduction for the "One vs. Rest" strategy can be found in TowardsDataScience and Data Science Stack Exchange.
Open issues:
In binary classification the meaning of the threshold is clear; in multiclass classification it is not. sklearn probably handles this in the background when plotting the ROC.
How to implement both binary EER and non-binary EER in TensorFlow, for efficient training of deep neural networks.
I hope it helps, and I will be happy to see comments and additions which will make this issue clearer and answer the open issues I wrote above.
I have a set of 2D input arrays of size m x n, namely A, B, C, and I have to predict two 2D output arrays, namely d and e, for which I do have the expected values. You can think of the inputs/outputs as grey images if you like.
Because the spatial information is relevant (these are actually 2D physical domains), I want to use a convolutional neural network to predict d and e. My design (not tested yet) looks as follows:
Because I have multiple inputs, I guess I should use multiple columns (or branches) to extract different features from each of the inputs (they look fairly different). Each of these columns follows an encoder-decoder architecture used in segmentation (see SegNet): a Conv2D block involves a convolution + batch normalisation + ReLU; a Deconv2D block involves a deconvolution + batch normalisation + ReLU.
Then I can merge the outputs of the columns by concatenating, averaging, or taking the maximum, for example. To obtain the original m x n shape for each of the outputs, I have seen that I could do this with a 1 x 1 kernel convolution.
I want to predict the two outputs from that single layer. Is that okay from the network-structure point of view? Finally, my loss function depends on the outputs themselves compared to the targets, plus another relation I want to impose.
I would like to have some expert opinion on this, since this is my first design of a CNN and I am not sure whether it makes sense as it is now and/or whether there are better approaches (or network architectures) to this problem.
I posted this originally on Data Science but did not get much feedback. I am now posting it here since there is a bigger community on these topics, plus I would be very grateful to receive implementation tips besides architectural ones. Thanks.
I think your design makes sense in general:
since A, B, and C are fairly different, you give each input its own transform sub-network and then fuse them together, which gives your intermediate representation.
from the intermediate representation, you apply additional CNN layers to decode D and E, respectively.
Several things:
A, B, and C looking different does not necessarily mean you can't stack them together as a 3-channel input. The decision should be based on whether the values in A, B, and C have different meanings. For example, if A is a grayscale image, B is a depth map, and C is also a grayscale image captured by a different camera, then A and B are better processed in your suggested way, but A and C can be concatenated into one single input before feeding it to the network.
D and E are two outputs of the network and will be trained in a multi-task manner. Of course they should share some latent features, and one should split at this feature level to apply a downstream non-shared-weight branch for each output. However, where to split is usually tricky.
This is really a broad question, asking for answers relying mostly on opinions. Here are my two cents though, which you might find interesting as they do not go along with the previous answers here and on Data Science.
First, I wouldn't go with separate columns for each input. AFAIK, when different inputs are processed by different columns, it is almost always the case that the network is some sort of Siamese network and the columns share the same weights, or at least the columns all need to produce a similar code. That is not your case here, so I would simply not bother.
Second, you are blessed with a problem with a dense output and no need to learn a code. This should direct you straight to U-nets, which outperform any bottleneck-designed network without much effort. U-nets were introduced for dense segmentation, but they shine at any dense-output problem really.
In short, just stack your inputs together and use a U-net.
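As a minimal numpy walk-through of that suggestion, here is the shape bookkeeping of stacking the inputs and using a U-net-style skip connection (the "conv", "pool", and "upsample" operations below are crude stand-ins for real layers, shown only to track shapes):

```python
import numpy as np

# Three m x n inputs stacked as channels of one 3-channel input.
m, n = 32, 32
rng = np.random.default_rng(0)
A, B, C = (rng.random((m, n)) for _ in range(3))
x = np.stack([A, B, C], axis=-1)               # (32, 32, 3)

enc = x.mean(axis=-1, keepdims=True)           # stand-in for a conv block: (32, 32, 1)
down = enc[::2, ::2, :]                        # stand-in for 2x2 pooling: (16, 16, 1)
up = down.repeat(2, axis=0).repeat(2, axis=1)  # nearest-neighbour upsampling: (32, 32, 1)
skip = np.concatenate([up, enc], axis=-1)      # U-net skip connection: (32, 32, 2)

d = skip.mean(axis=-1)                         # head for output d: (32, 32)
e = skip.max(axis=-1)                          # head for output e: (32, 32)
print(d.shape, e.shape)                        # (32, 32) (32, 32)
```

The key point is that the skip connection concatenates high-resolution encoder features with upsampled decoder features, so both output heads keep the full m x n resolution without having to squeeze everything through a bottleneck code.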
I'm starting to study machine learning and have a basic knowledge of it. If I consider a generic machine learning algorithm M, I would like to know what its precise inputs and outputs are. I'm not referring to some implementation in a particular programming language; I'm talking about the theory of machine learning.
Take the example of supervised learning. The input of M should be the collection of pairs related to the function f that the algorithm must learn, so that it builds some function h which approximates f. Should the output of M be h?
And what about unsupervised machine learning?
The output of ML algorithms is whatever you want it to be.
For example:
Regression: 1 value
Classification: n classes (with the probability that the input belongs to each class)
Text summarization: one word, one character, a batch of them, or the whole summarized text.
As you see, the output will be what you need it to be.
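For instance, the two most common output shapes from the list above can be sketched with plain numpy (the activation values here are made up):

```python
import numpy as np

# A made-up final-layer activation vector of a network with n = 3 classes.
logits = np.array([2.0, 1.0, 0.1])

# Regression head: a single value (the reduction used is just a stand-in).
regression_output = logits.mean()

# Classification head: n class probabilities via softmax.
exp = np.exp(logits - logits.max())   # shift by the max for numerical stability
probs = exp / exp.sum()

print(probs.shape)       # (3,): one probability per class
print(probs.sum())       # probabilities sum to 1
```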
This question is about a concept in the paper "Identifying independence in Bayesian networks", pages 2 and 3.
In a Bayesian network, each node represents a variable and each arrow represents a dependence. A standard query on a Bayesian network looks like this: given a variable b, a Bayesian network D, and the value y of a set of variables Y, the task is to compute P(b|y) given the evidence y.
Then we should determine:
(1) whether the answer to the query is sensitive to the value of a variable a;
(2) whether the answer to the query is sensitive to the parameters p_a = P(a|pa(a)) stored at node a.
Here I am confused by (2).
First, I think each node represents a random variable, so why is the information p_a = P(a|pa(a)) also stored at the node? What does this mean?
Second, considering the conditional independence between variables b and a, why should we treat (1) and (2) differently?
Thank you.
The link to the paper:
http://www.cs.technion.ac.il/~dang/journal_papers/geiger1990identifying.pdf
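A tiny worked example may make "parameters stored at a node" concrete. In a two-node network a -> b, node b stores the conditional table P(b|a); changing those numbers changes the answer to a query such as P(b), while a query like P(a) never touches b's table and so is insensitive to it (all the probabilities below are made up):

```python
# Chain a -> b.  Node a stores P(a); node b stores the table P(b | a).
p_a = {0: 0.6, 1: 0.4}            # P(a = 0), P(a = 1)

def p_b(cpt_b):
    """P(b = 1), computed by summing out a: sum_a P(a) * P(b=1 | a)."""
    return sum(p_a[a] * cpt_b[a] for a in (0, 1))

cpt1 = {0: 0.2, 1: 0.9}           # one choice of P(b = 1 | a) at node b
cpt2 = {0: 0.5, 1: 0.5}           # a different table stored at node b

print(p_b(cpt1))  # 0.48 -- the query P(b) is sensitive to b's parameters
print(p_b(cpt2))  # 0.5  -- same network structure, different answer
```

This is exactly the distinction in (1) vs. (2): sensitivity to the observed value of a variable is a statement about the evidence, while sensitivity to a node's stored table is a statement about the network's numerical parameters.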