Bayesian probability for an final event based on two prior sequential events - bayesian

say i have a a three step process for a business outcome.
user login and register the course, say the probability of registration is P(R)
user complete 50% of the course, say the probability of is achievement is P(A)
user complete the 100% of the course,, say the probability of completion is P(C)
Now, if i assume these processes are independent then the probability a user will complete the course ( by registering for the course, complete it 50%, and then complete is 100%) is:
P(A)*P(A)*P(C)
However, these events are not independent, as P(A) depends on P(R) and P(C) depends on P(A) and P(R).
So, if I have to do it via bayesian how would I calculate this.
Sorry, I am new to this and maybe completely missing something.
thanks for the help

Related

Party people come an go: How to use Bayes to know who is in or out?

story/problem
Imagine there are N party guests and a bouncer guarding the only door. Before the party starts, all the guests are outside (that's for sure). However, once the party starts, people come and go. Each time such an event occurs, the bouncer makes a note for each potential guest of the likelihood that it could have been him or her. One could call this score the bouncer's classification confidence. For each event, this is a list of N candidates that adds up to one. All in all, T events have been observed until the next morning.
Unfortunately, some valuables were stolen that night. To narrow down the group of suspects, the host checked the bouncers notes. However, he soon found contradictory and therefore unreliable data: For example, according to the data, the same person entered the place of high confidences twice in a row, which we know is impossible. Therefore, he attempts by cleaning the data from contradictions to improve the quality of the classifications.
where I am stuck/what I tried
First I solved this issue by formulating a Linear Program and solved this in Python. However, as the number of guests increase, this soon becomes computationally infeasible. Therefore want use Bayes' theorem to compute the probability of the guests presence.
Since we have prior beliefs of the guests attending or not and repeated information updates, I felt that a bayesian approach would be the natural way to tackle this problem.
Even though this approach seemed intuitive to me, I faced some problems:
(1) The problem of being 100% sure of the first state
Being 100% sure of a person being outside at time t=0 seems to "break" the Bayesian formular. Let A be person A and data be the information available by the bouncer.
If the prior P(A present) is either 0 or 1 the formular collapses to 0 or 1 respectively. Am I wrong here?
(2) The problem of using all available information
I did not find a way not only to use data that happened before time step t but also information gathered at a later time. E.g. given three candidates the bouncer makes three observations with confidence ct:
cout->in1=(0.5, 0.4, 0.1)T,
cout->in2=(0.4, 0.4, 0.2)T,
cout->in3=(0.9, 0.05, 0.05)T.
Using only information that was available until t, the most plausible solution would be: (A entered, B entered, C entered) in. However, using all the information available this is a better solution: (B entered, C entered, A entered).
(3) The problem of noisy observations.
I was thinking of using a Bernoulli distribution as prior, however, I do not observe binary events but confidences.
Since I'm stuck, I'm looking forward to your help on Bayesian reasoning and ultimately finding the thief.

Are there benefits to having Actor and Critic use significantly different models?

In Actor-Critic methods the Actor and Critic are assigned two complimentary, but different goals. I'm trying to understand whether the differences between these goals (updating a policy and updating a value function) are large enough to warrant different models for the Actor and Critic, or if they are of similar enough complexity that the same model should be reused for simplicity. I realize that this could be very situational, but not in what way. For example, does the balance shift as the model complexity grows?
Please let me know if there are any rules of thumb for this, or if you know of a specific publication that addresses the issue.
The empirical results suggest the exact opposite - that it is important to have the same network doing both (up to some final layer/head). The main reason for this is that learning value network (critis) provides signal for shaping represntation of the policy (actor) that otherwise would be nearly impossible to get.
In fact if you think about these, these are extremely similar goals, since for optimal deterministic policy
pi(s) = arg max_a Q(s, a) = arg max_a V(T(s, a))
where T is the transition dynamics.

How to interpret "Value Loss" chart in TensorBoard?

I have a target-finding, obstacle-avoiding helicopter in Unity Machine Learning Agents. Looking at the TensorBoard for my training, I'm trying to get a feel for how to interpret the "Losses/Value Loss".
I've googled many articles on ML Loss, like this one, but I can't seem to get an intuitive understanding yet of what it all means for my little helicopter and possible changes I should implement, if any. (The helicopter is rewarded by getting closer and again for reaching the target, and punished by getting further or colliding. It measures a variety of things like relative speed, relative target position, ray sensors and so on, and it does basically work in target-finding, whereas more complicated maze type obstacles have not been tested or trained on yet. It's using 3 layers.) Thanks!
In reinforcement learning and specifically regarding actor/critic algorithms, value loss is the difference (or an average of many such differences) between the learning algorithm's expectation of a state's value and the empirically observed value of that state.
What is a state's value? A state's value is, in short, how much reward you can expect given that you start in that state. Immediate reward contributes completely to this amount. Reward that can possibly occur but not immediately contribute less, with more distant occurrences contributing less and less. We call this reduction in contribution to value a "discount", or we say that these rewards are "discounted".
Expected value is how much the critic part of the algorithm predicts the value to be. In the case of a critic implemented as a neural network, it's the output of the neural network with the state as its input.
Empirically observed value is the amount you get when you add up the rewards that you actually got when you left that state, plus any rewards (discounted by some amount) you got immediately after that for some number of steps (we'll say after these steps you ended up on state X), and (perhaps, depending on implementation) plus some discounted amount based on the value of state X.
In short, the smaller it is, the better it got at predicting how well it is going to perform. This doesn't mean that it gets better at playing - after all, one can be terrible at a game yet be accurate at predicting that they will lose and when they will lose if they learn to choose actions that will make them lose quickly!

Deep neural network diverges after convergence

I implemented the A3C network in https://arxiv.org/abs/1602.01783 in TensorFlow.
At this point I'm 90% sure the algorithm is implemented correctly. However, the network diverges after convergence. See the attached image that I got from a toy example where the maximum episode reward is 7.
When it diverges, policy network starts giving a single action very high probability (>0.9) for most states.
What should I check for this kind of problem? Is there any reference for it?
Note that in Figure 1 of the original paper the authors say:
For asynchronous methods we average over the best 5
models from 50 experiments.
That can mean that in lot of cases the algorithm does not work that well. From my experience, A3C often diverges, even after convergence. Carefull learning-rate scheduling can help. Or do what the authors did - learn several agents with different seed and pick the one performing the best on your validation data. You could also employ early stopping when validation error becomes to increase.

Maximum flow in the undirected graph

How can I find maximum flow in this undirected graph? Can anyone show the step?
There is algorithm called Ford-Fulkerson algorithm which gives the maximum flow of a flow network in polynomial time, you can look it up in the book Algorithm Design by Kleinberg and Tardos, or even in CLRS.
The only thing you need to do to solve your problem is that you should replace every edge in your undirected grah by two edges backwards and forwards with the same capacity and then solve your problem using the Ford-Fulkerson algorithm. It can be easily proven that in such conversion, flow only propagates through one of the two edges and always one of them is not used.