How to find the input that maximize the Neural Network output in Tensorflow - tensorflow

I'm using Tensorflow (2.4) and Keras to build my neural network model. It takes two tensors as inputs and gives a scalar output. The network is already trained and, from now on, it has fixed weights. It is possible, given one of the two inputs, to find the value of the other input that maximise the output value?
Thank you in advance

In theory, yes.
Lets call your network model f. It takes two inputs x and y and outputs f(x, y). Then, assuming x and f are fixed, you can find the value y* that maximize f(x, y) as follows:
calculate the gradient of f with respect to y. Then, there are two possibilities.
there exists stationary points. Just set df/dy = 0 and solve for y. This gives the y* at which there is either a maximum or a minimum. Compute f(x, y*) to check weather y* gives a maximum or a minimum.
there are no stationary points (or there is no maximum). Here, you need to study where f decreases or increases if y varies. To do this, look for df/dy > 0 (increases) and df/dy < 0 (decreases). You will find that, asymptotically, the function increases. Simply take y*=a, where a is the closest value to such asymptote that you can take (given your data type precision).

Related

Does variational autoencoder make distribution based on only latent representation?

If my latent representation of variational autoencoder(VAE) is r, and my dataset is x, does vae's latent representation follows normalization based on r or x?
If r= 10, that means it has 10 means and variance (multi-gussain) and distribution comes from data whole data x?
Or r = 10 constructs one distribution based on r, and every sample try to follow this distribution
I'm confused about which one is correct
VAE constructs a mapping e(x) -> Z (encoder), and d(z) -> X (decoder). This means that every elements of your input space x will be mapped through an encoder e(x) into a single, r-dimensional Gaussian. It is not a "mixture", it is just a single gaussian with diagonal covariance matrix.
I'll add my 2 cents to #lejlot answer.
Your encoder in VAE will map your sample to a distribution, that in your case has 10 dimensions... that distribution is used to say "ok my best estimate of this property of this sample is mu, but I'm not too sure, so consider that it might have variance sigma"
Therefore, you have a distribution for each sample.
However, in order to make sampling easier in VAE, we ask the VAE to keep the distributions as close to a known one, that is the standard normal distribution (we know "where the distributions are located", if you check the latent space in a normal AE you will see that you will have groups far from eachother).

STAN - Defining priors for dependent random variables

Background: I have a simulation model which has unobserved parameters. I created a metamodel using artificial neural networks (ANN) because the runtime was very long for the simulation model. I am trying to estimate the unobserved parameters using Bayesian calibration, where priors are based on current knowledge, and the likelihood of observing data is being estimated from the metamodel.
Query: I have two random variables X and Y for which I am trying to get the posterior distribution using STAN. The prior distribution of X is uniform, U(0,2). The prior for Y is also uniform, but it will always exceed X i.e., Y ~ U(X,2). Since Y is linked to X, how can I define the prior distribution for Y in STAN such that the constraint Y>X holds? I am new to STAN, so I would appreciate any suggestions or guidance on how to proceed. Thank you so much!
Stan's ordered vectors are what you need. Create an ordered vector of length 2 (I'll call it beta) in the parameters block, like this:
parameters {
ordered<lower=0,upper=2>[2] beta;
}
Ordered vectors are constrained such that each element is greater than the previous element. So beta[1] will be your estimate of X and beta[2] will be your estimate of Y.
(To make sure I understand your model correctly: you have two parameters, X and Y, and your only prior knowledge about them is that they both lie in [0, 2] and Y > X. X and Y describe some aspect of the distribution of your data - for example, maybe X is the mean of some other random variable Z, for which you have observations. Do I have that right?)
I believe Stan's priors are uniform by default, but you can make sure of this by specifying a prior for beta in the model block:
model {
beta ~ uniform(0, 2);
...
}

Self-Attention Explainability of the Output Score Matrix

I am learning about attention models, and following along with Jay Alammar's amazing blog tutorial on The Illustrated Transformer. He gives a great walkthrough for how the attention scores are calculated, but I get a bit lost at a certain point, and am not seeing how the attention score Z matrix he explains is used to interpret strength of associations between different words within an input sequence.
He mentions that given some input matrix X, with shape N x D, where N is the number of elements in an input sequence, and D is the input dimensionality, we multiply X with three separate weight matrices of shape D x d, where d is some lower dimensionality that represents the projected space of the query, key, and value matrices:
The query and key matrices are dotted, and then divided by a scaling factor usually the square root of the projected dimensionality, and then run through a softmax function. This produces a weight matrix of size N x N, which is multiplied by the value matrix to get an output Z of shape N x d, which Jay says
That concludes the self-attention calculation. The resulting vector is
one we can send along to the feed-forward neural network.
The screenshot from his blog for this calculation is below:
However, this is where I'm confused. Z is N x d. However, I don't particularly understand what I'm supposed to do with this matrix from an interpretability sense, and as far as I understand, for a particular sequence element (ie. the word cats in the sequence I love pets, especially cats), self-attention is supposed to score other parts of the sequence high when it is relevant or strong associated with that word embedding. However, I'd expect then that Z is N x N, so I could say that I can select the Z[i,j] and say for the i-th word in the sequence, the j-th word relates or associates with it this or that much.
In fact, wouldn't it make much more sense to use only the softmax output of the weights (without multiplying them by the value matrix), since it already is N x N? In essence, how is Jay determining the strength of these associations in this particular sequence with the word it?
This is an N by 1 relationship he is showing - there are N values that correspond with the strength of association to the word it.

taking the gradient in Tensorflow, tf.gradient

I am using this function of tensorflow to get my function jacobian. Came across two problems:
The tensorflow documentation is contradicted to itself in the following two paragraph if I am not mistaken:
gradients() adds ops to the graph to output the partial derivatives of ys with respect to xs. It returns a list of Tensor of length len(xs) where each tensor is the sum(dy/dx) for y in ys.
Blockquote
Blockquote
Returns:
A list of sum(dy/dx) for each x in xs.
Blockquote
According to my test, it is, in fact, return a vector of len(ys) which is the sum(dy/dx) for each x in xs.
I do not understand why they designed it in a way that the return is the sum of the columns(or row, depending on how you define your Jacobian).
How can I really get the Jacobian?
4.In the loss, I need the partial derivative of my function with respect to input (x), but when I am optimizing with respect to the network weights, I define x as a placeholder whose value is fed later, and weights are variable, in this case, can I still define the symbolic derivative of function with respect to input (x)? and put it in the loss? ( which later when we optimize with respect to weights will bring second order derivative of the function.)
I think you are right and there is a typo there, it was probably meant to be "of length len(ys)".
For efficiency. I can't explain exactly the reasoning, but this seems to be a pretty fundamental characteristic of how TensorFlow handles automatic differentiation. See issue #675.
There is no straightforward way to get the Jacobian matrix in TensorFlow. Take a look at this answer and again issue #675. Basically, you need one call to tf.gradients per column/row.
Yes, of course. You can compute whatever gradients you want, there is no real difference between a placeholder and any other operation really. There are a few operations that do not have a gradient because it is not well defined or not implemented (in which case it will generally return 0), but that's all.

Faster way to perform point-wise interplation of numpy array?

I have a 3D datacube, with two spatial dimensions and the third being a multi-band spectrum at each point of the 2D image.
H[x, y, bands]
Given a wavelength (or band number), I would like to extract the 2D image corresponding to that wavelength. This would be simply an array slice like H[:,:,bnd]. Similarly, given a spatial location (i,j) the spectrum at that location is H[i,j].
I would also like to 'smooth' the image spectrally, to counter low-light noise in the spectra. That is for band bnd, I choose a window of size wind and fit a n-degree polynomial to the spectrum in that window. With polyfit and polyval I can find the fitted spectral value at that point for band bnd.
Now, if I want the whole image of bnd from the fitted value, then I have to perform this windowed-fitting at each (i,j) of the image. I also want the 2nd-derivative image of bnd, that is, the value of the 2nd-derivative of the fitted spectrum at each point.
Running over the points, I could polyfit-polyval-polyder each of the x*y spectra. While this works, this is a point-wise operation. Is there some pytho-numponic way to do this faster?
If you do least-squares polynomial fitting to points (x+dx[i],y[i]) for a fixed set of dx and then evaluate the resulting polynomial at x, the result is a (fixed) linear combination of the y[i]. The same is true for the derivatives of the polynomial. So you just need a linear combination of the slices. Look up "Savitzky-Golay filters".
EDITED to add a brief example of how S-G filters work. I haven't checked any of the details and you should therefore not rely on it to be correct.
So, suppose you take a filter of width 5 and degree 2. That is, for each band (ignoring, for the moment, ones at the start and end) we'll take that one and the two on either side, fit a quadratic curve, and look at its value in the middle.
So, if f(x) ~= ax^2+bx+c and f(-2),f(-1),f(0),f(1),f(2) = p,q,r,s,t then we want 4a-2b+c ~= p, a-b+c ~= q, etc. Least-squares fitting means minimizing (4a-2b+c-p)^2 + (a-b+c-q)^2 + (c-r)^2 + (a+b+c-s)^2 + (4a+2b+c-t)^2, which means (taking partial derivatives w.r.t. a,b,c):
4(4a-2b+c-p)+(a-b+c-q)+(a+b+c-s)+4(4a+2b+c-t)=0
-2(4a-2b+c-p)-(a-b+c-q)+(a+b+c-s)+2(4a+2b+c-t)=0
(4a-2b+c-p)+(a-b+c-q)+(c-r)+(a+b+c-s)+(4a+2b+c-t)=0
or, simplifying,
22a+10c = 4p+q+s+4t
10b = -2p-q+s+2t
10a+5c = p+q+r+s+t
so a,b,c = p-q/2-r-s/2+t, (2(t-p)+(s-q))/10, (p+q+r+s+t)/5-(2p-q-2r-s+2t).
And of course c is the value of the fitted polynomial at 0, and therefore is the smoothed value we want. So for each spatial position, we have a vector of input spectral data, from which we compute the smoothed spectral data by multiplying by a matrix whose rows (apart from the first and last couple) look like [0 ... 0 -9/5 4/5 11/5 4/5 -9/5 0 ... 0], with the central 11/5 on the main diagonal of the matrix.
So you could do a matrix multiplication for each spatial position; but since it's the same matrix everywhere you can do it with a single call to tensordot. So if S contains the matrix I just described (er, wait, no, the transpose of the matrix I just described) and A is your 3-dimensional data cube, your spectrally-smoothed data cube would be numpy.tensordot(A,S).
This would be a good point at which to repeat my warning: I haven't checked any of the details in the few paragraphs above, which are just meant to give an indication of how it all works and why you can do the whole thing in a single linear-algebra operation.