The proofs of the front-door adjustment that I've read take three steps:
Show P(M|do(X)) is identifiable
Show P(Y|do(M)) is identifiable
Combine the do-free expressions for P(M|do(X)) and P(Y|do(M)) to obtain P(Y|do(X))
where Y, X, M meet the assumptions for the front-door adjustment. A graph meeting these assumptions is:
X->M;M->Y;U->X;U->Y
I'm sure I'm being daft here, but I don't understand what justifies simply multiplying the expressions together to get P(Y|do(X)).
This is like saying:
P(Y|do(X)) = Σ_m P(Y|do(M=m)) * P(M=m|do(X))
(where perhaps the assumptions for the front-door adjustment are necessary) but I don't recognize this rule in my study of causal inference.
Formal description: In a graph with a mediator and an unobserved confounder like the one you described, the path X->M->Y is confounded. However, the edge X->M is unconfounded (the backdoor path X<-U->Y<-M is blocked, because Y is a collider on it), and we can identify the effect of M on Y by controlling for X (which blocks the backdoor path M<-X<-U->Y). Doing so gives us the two causal quantities that make up the path X->Y. For a graph with any kind of causal functions, propagating an effect along several edges means composing the functions that describe each edge. In the case of linear functions, this composition simply amounts to multiplying the path coefficients (the assumption of linearity is very common in the context of causal inference).
Just like you said, we can write this as
P(M|do(X)) = P(M|X)                         # no controls necessary
P(Y|do(M)) = Σ_x P(Y|M, X=x) P(X=x)         # adjust for X via the backdoor criterion
P(Y|do(X)) = Σ_m P(M=m|do(X)) P(Y|do(M=m))  # composing the two pieces
Intuition: The whole "trick" of the front door procedure lies in realizing that we can break up the estimation of a causal path along more than one edge into the estimation of its components, which can be solved without adjustment (X->M in our case), or using the backdoor criterion (M->Y in our case). Then, we can simply compose the individual parts to get the overall effect.
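To make the composition concrete, here is a small numerical sketch. The structural equations and coefficients below are made up purely for illustration; the point is that the front-door formula, computed from observational quantities only, matches the effect of actually intervening on X:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Structural model for X -> M -> Y with unobserved confounder U -> X, U -> Y
U = rng.random(n) < 0.5
X = rng.random(n) < 0.3 + 0.4 * U
M = rng.random(n) < 0.2 + 0.6 * X
Y = rng.random(n) < 0.1 + 0.4 * M + 0.3 * U

# Front-door estimate of P(Y=1 | do(X=1)) from observational data only:
#   sum_m P(M=m | X=1) * sum_x P(Y=1 | M=m, X=x) * P(X=x)
fd = 0.0
for m in (0, 1):
    p_m = (M[X == 1] == m).mean()                        # P(M=m | X=1)
    adj = sum(Y[(M == m) & (X == x)].mean() * (X == x).mean()
              for x in (0, 1))                           # backdoor-adjusted P(Y=1 | do(M=m))
    fd += p_m * adj

# Ground truth by actually intervening: force X=1 and re-run the mechanism
M_do = rng.random(n) < 0.2 + 0.6 * 1
Y_do = rng.random(n) < 0.1 + 0.4 * M_do + 0.3 * U

print(fd, Y_do.mean())   # both close to 0.57
```

Note that a naive observational estimate P(Y=1|X=1) would be biased here because of U; the front-door composition is not.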
A nice explanation is also given in this video.
What I mean by the title is that I sometimes come across code that requires NumPy operations (for example sum or mean) along a specified axis. For example:
np.sum([[0, 1], [0, 5]], axis=1)  # array([1, 5])
I can grasp this concept, but do we ever actually perform these operations along higher dimensions, or is that not a thing? And if so, how do you build intuition for high-dimensional datasets, and how do you make sure you are working along the right dimension/axis?
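For a concrete version of what I mean in higher dimensions (the shapes here are made up), the rule of thumb is that `axis=k` collapses dimension k, so the result's shape is the input's shape with that axis removed:

```python
import numpy as np

# A stack of 2 matrices, each 3x4 -> shape (2, 3, 4)
a = np.arange(24).reshape(2, 3, 4)

print(a.sum(axis=0).shape)   # (3, 4): the 2 matrices summed elementwise
print(a.sum(axis=1).shape)   # (2, 4): column sums within each matrix
print(a.sum(axis=2).shape)   # (2, 3): row sums within each matrix
print(a.sum(axis=(1, 2)))    # [ 66 210]: one total per matrix
```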
To be clear, I am referring to "self-attention" of the type described in Hierarchical Attention Networks for Document Classification and implemented in many places, for example: here. I am not referring to the seq2seq type of attention used in encoder-decoder models (i.e. Bahdanau), although my question might apply to that as well... I am just not as familiar with it.
Self-attention basically just computes a weighted average of RNN hidden states (a generalization of mean-pooling, i.e. un-weighted average). When there are variable length sequences in the same batch, they will typically be zero-padded to the length of the longest sequence in the batch (if using dynamic RNN). When the attention weights are computed for each sequence, the final step is a softmax, so the attention weights sum to 1.
However, in every attention implementation I have seen, there is no care taken to mask out, or otherwise cancel, the effects of the zero-padding on the attention weights. This seems wrong to me, but I fear maybe I am missing something since nobody else seems bothered by this.
For example, consider a sequence of length 2, zero-padded to length 5. Ultimately this leads to the attention weights being computed as the softmax of a similarly 0-padded vector, e.g.:
weights = softmax([0.1, 0.2, 0, 0, 0]) = [0.21, 0.23, 0.19, 0.19, 0.19]
and because exp(0)=1, the zero-padding in effect "waters down" the attention weights. This can be easily fixed, after the softmax operation, by multiplying the weights with a binary mask, i.e.
mask = [1, 1, 0, 0, 0]
and then re-normalizing the weights to sum to 1. Which would result in:
weights = [0.48, 0.52, 0, 0, 0]
When I do this, I almost always see a performance boost (in the accuracy of my models - I am doing document classification/regression). So why does nobody do this?
For a while I considered that maybe all that matters is the relative values of the attention weights (i.e., ratios), since the gradient doesn't pass through the zero-padding anyway. But then why would we use softmax at all, as opposed to just exp(.), if normalization doesn't matter? (plus, that wouldn't explain the performance boost...)
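A minimal NumPy sketch of the fix I mean, together with the equivalent variant of masking before the softmax (scores are the toy example above; `softmax` is defined inline):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

scores = np.array([0.1, 0.2, 0.0, 0.0, 0.0])   # last 3 positions are padding
mask = np.array([1.0, 1.0, 0.0, 0.0, 0.0])

w = softmax(scores)           # padding contributes exp(0)=1 each: weights diluted

# Fix 1: zero out the padded weights after the softmax, then renormalize
w_fix1 = w * mask
w_fix1 /= w_fix1.sum()

# Fix 2 (equivalent): add a large negative number to padded scores before softmax
w_fix2 = softmax(scores + (1.0 - mask) * -1e9)

print(w_fix1)   # ≈ [0.475, 0.525, 0, 0, 0]; fix 2 gives the same result
```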
Great question! I believe your concern is valid, and zero attention scores for the padded encoder outputs do affect the attention. However, there are a few aspects that you have to keep in mind:
There are different score functions; the one in tf-rnn-attention uses a simple linear + tanh + linear transformation. But even this score function can learn to output negative scores. If you look at the code and imagine the inputs consist of zeros, the vector v is not necessarily zero due to the bias, and the dot product with u_omega can push it further down to strongly negative numbers (in other words, a plain simple NN with a non-linearity can make both positive and negative predictions). Strongly negative scores don't water down the high scores in the softmax.
Due to the bucketing technique, the sequences within a bucket usually have roughly the same length, so it's unlikely for half of an input sequence to be padded with zeros. Of course, this doesn't fix anything; it just means that in real applications the negative effect of the padding is naturally limited.
You mentioned it in the end, but I'd like to stress it too: the final attended output is the weighted sum of encoder outputs, i.e. relative values actually matter. Take your own example and compute the weighted sum in this case:
the first one is 0.21 * o1 + 0.23 * o2 (the rest is zero)
the second one is 0.48 * o1 + 0.52 * o2 (the rest is zero too)
Yes, the magnitude of the second vector is about twice as big, but that isn't a critical issue, because it then goes through a linear layer. And the relative attention on o2 versus o1 is the same in both cases (0.52/0.48 ≈ 0.23/0.21 ≈ e^(0.2-0.1)): masking and renormalizing after the softmax rescales the weights, but it cannot change their ratios.
What this means is that even if the attention weights won't do a good job in learning to ignore zero outputs, the end effect on the output vector is still good enough for the decoder to take the right outputs into account, in this case to concentrate on o2.
Hope this convinces you that re-normalization isn't that critical, though it will probably speed up learning if actually applied.
The BERT implementation applies a padding mask when calculating the attention scores. It adds 0 to the non-padding attention scores and -10000 to the padding attention scores; e^(-10000) is vanishingly small relative to the other exponentiated scores.
attention_score = [0.1, 0.2, 0, 0, 0]
mask = [0, 0, -10000, -10000, -10000]  # -10000 (a large negative value) at the padded positions
attention_score += mask                # padded scores become ≈ -10000
weights = softmax(attention_score)     # ≈ [0.48, 0.52, 0, 0, 0]
When applying an algorithm in data science, we often do feature scaling on the input data set. I would like to know whether it is a mandatory step, or whether there is any technique that decides when to perform feature scaling, such as:
1) Data visualization
2) Statistical values
Feature scaling is needed if your inputs have a wide range of variation; if they are already normalized, then you don't need it.
There is no precise rule to follow. As a basic rule, consider that normalized inputs usually work better than non-normalized ones.
If you create a model with two numerical features, and one has high values, like salary (e.g. 2345, 1756, 34521), while the other has low values, like age (e.g. 33, 17, 29), then the feature with higher values will clearly dominate the model.
To avoid this, we should scale both features to the same level before modeling.
And it depends on the algorithm you are using to build the model; only some models need feature scaling, not all.
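As a minimal sketch of what "scaling both features to the same level" means, here is z-score standardization of the salary/age example above in NumPy (scikit-learn's `StandardScaler` implements the same transform):

```python
import numpy as np

# Columns: salary, age -- very different scales
X = np.array([[2345.0, 33.0],
              [1756.0, 17.0],
              [34521.0, 29.0]])

# Standardize each column to zero mean and unit variance (z-score)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled)   # both columns now have mean 0 and std 1
```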
I've found a grammar online which I want to rewrite to BNF so I can use it in a grammatical evolution experiment. From what I've read online BNF is given by this form:
<symbol> ::= <expression> | <term>
...but I don't see where probabilities factor into it.
In a probabilistic context-free grammar (PCFG), every production is also assigned a probability. How you choose to write this probability is up to you; I don't know of a standard notation.
Generally, the probabilities are learned rather than assigned, so the representation issue doesn't come up; the system is given a normal CFG as well as a large corpus with corresponding parse trees, and it derives probabilities by analysing the parse trees.
Note that PCFGs are usually ambiguous. Probabilities are not used to decide whether a sentence is in the language but rather which parse is correct, so with an unambiguous grammar, the probabilities would be of little use.
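There is no single standard, but one concrete convention (used, for example, by NLTK's `PCFG.fromstring`) appends each rule's probability in square brackets. The toy grammar below is purely illustrative:

```
S  -> NP VP  [1.0]
NP -> Det N  [0.6]
NP -> N      [0.4]
VP -> V NP   [0.7]
VP -> V      [0.3]
```

The probabilities of all rules sharing the same left-hand side must sum to 1, so that expanding a nonterminal is a well-defined random choice.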