Gradient of LSTM one to many model output w.r.t input - tensorflow

I have an LSTM model that takes as input a vector x0 (dimension : n) and returns a sequence of vectors (size : T x n). I require the derivative of each sequence w.r.t x0 (size: n x n). Thus I require a Jacobian matrix of size (T x n x n). What is the most efficient way to do this in TensorFLow. I need this for optimization research that requires a function that takes x0 and returns the derivative information. Having searched all available options, I don't have any good way to approach this. Any help (psuedo code, documentations, posts etc) will be very beneficial.

Not entirely sure if this is what you are looking for, but Tensorflow has eager execution (allows for interactive debugging) and a pretty comprehensive framework for automatic computation of gradients. The combination of those two might assist. This link includes some details and example code which may lend itself to your specific needs: https://www.tensorflow.org/guide/eager and in particular this section: https://www.tensorflow.org/guide/eager#computing_gradients along with others on the post.
I hope this helps.

Related

Understanding Time2Vec embedding for implementing this as a keras layer

The paper time2vector link (the relevant theory is in section 4) shows an approach to include a time embedding for features to improve model performance. I would like to give this a try. I found a implementation as keras layer which I changed a little bit. Basically it creates two matrices for one feature:
(1) linear = w * x + b
(2) periodic = sin(w * x + b)
Currently I choose this feature manually. Concerning the paper there are a few things i don't understand. The first thing is the term k as the number of sinusoids. The authors use up to 64 sinusoids. What does this mean? I have just 1 sinusoid at the moment, right? Secondly I'm about to put every feature I have through the sinus transformation for me dataset that would make 6 (sinusoids) periodic features. The authors use only one linear term. How should I choose the feature for the linear term? Unfortunately the code from the paper is not available anymore. Has anyone worked with time embeddings or even with this particularly approach?
For my limited understanding, the linear transformation of time is a fixed element of the produced embedding and the parameter K allows you to select how many different learned time representations you want to use in your model. So, the resulting embedding has a size of K+1 elements.

taking the gradient in Tensorflow, tf.gradient

I am using this function of tensorflow to get my function jacobian. Came across two problems:
The tensorflow documentation is contradicted to itself in the following two paragraph if I am not mistaken:
gradients() adds ops to the graph to output the partial derivatives of ys with respect to xs. It returns a list of Tensor of length len(xs) where each tensor is the sum(dy/dx) for y in ys.
Blockquote
Blockquote
Returns:
A list of sum(dy/dx) for each x in xs.
Blockquote
According to my test, it is, in fact, return a vector of len(ys) which is the sum(dy/dx) for each x in xs.
I do not understand why they designed it in a way that the return is the sum of the columns(or row, depending on how you define your Jacobian).
How can I really get the Jacobian?
4.In the loss, I need the partial derivative of my function with respect to input (x), but when I am optimizing with respect to the network weights, I define x as a placeholder whose value is fed later, and weights are variable, in this case, can I still define the symbolic derivative of function with respect to input (x)? and put it in the loss? ( which later when we optimize with respect to weights will bring second order derivative of the function.)
I think you are right and there is a typo there, it was probably meant to be "of length len(ys)".
For efficiency. I can't explain exactly the reasoning, but this seems to be a pretty fundamental characteristic of how TensorFlow handles automatic differentiation. See issue #675.
There is no straightforward way to get the Jacobian matrix in TensorFlow. Take a look at this answer and again issue #675. Basically, you need one call to tf.gradients per column/row.
Yes, of course. You can compute whatever gradients you want, there is no real difference between a placeholder and any other operation really. There are a few operations that do not have a gradient because it is not well defined or not implemented (in which case it will generally return 0), but that's all.

how tensorflow handles complex gradient?

Let z is a complex variable, C(z) is its conjugation.
In complex analysis theory, the derivative of C(z) w.r.t z don't exist. But in tesnsorflow, we can calculate dC(z)/dz and the result is just 1.
Here is an example:
x = tf.placeholder('complex64',(2,2))
y = tf.reduce_sum(tf.conj(x))
z = tf.gradients(y,x)
sess = tf.Session()
X = np.random.rand(2,2)+1.j*np.random.rand(2,2)
X = X.astype('complex64')
Z = sess.run(z,{x:X})[0]
The input X is
[[0.17014372+0.71475762j 0.57455420+0.00144318j]
[0.57871044+0.61303568j 0.48074263+0.7623235j ]]
and the result Z is
[[1.-0.j 1.-0.j]
[1.-0.j 1.-0.j]]
I don't understand why the gradient is set to be 1?
And I want to know how tensorflow handles the complex gradients in general.
How?
The equation used by Tensorflow for the gradient is:
Where the '*' means conjugate.
When using the definition of the partial derivatives wrt z and z* it uses Wirtinger Calculus. Wirtinger calculus enables to calculate the derivative wrt a complex variable for non-holomorphic functions. The Wirtinger definition is:
Why this definition?
When using for example Complex-Valued Neural Networks (CVNN) the gradients will be used over non-holomorphic, real-valued scalar function of one or several complex variables, tensorflow definition of a gradient can then be written as:
This definition corresponds with the literature of CVNN like for example chapter 4 section 4.3 of this book or Amin et al. (between countless examples).
Bit late, but I came across this issue recently too.
The key point is that TensorFlow defines the "gradient" of a complex-valued function f(z) of a complex variable as "the gradient of the real map F: (x,y) -> Re(f(x+iy)), expressed as a complex number" (the gradient of that real map is a vector in R^2, so we can express it as a complex number in the obvious way).
Presumably the reason for that definition is that in TF one is usually concerned with gradients for the purpose of running gradient descent on a loss function, and in particular for identifying the direction of maximum increase/decrease of that loss function. Using the above definition of gradient means that a complex-valued function of complex variables can be used as a loss function in a standard gradient descent algorithm, and the result will be that the real part of the function gets minimised (which seems to me a somewhat reasonable interpretation of "optimise this complex-valued function").
Now, to your question, an equivalent way to write that definition of gradient is
gradient(f) := dF/dx + idF/dy = conj(df/dz + dconj(f)/dz)
(you can easily verify that using the definition of d/dz). That's how TensorFlow handles complex gradients. As for the case of f(z):=conj(z), we have df/dz=0 (as you mention) and dconj(f)/dz=1, giving gradient(f)=1.
I wrote up a longer explanation here, if you're interested: https://github.com/tensorflow/tensorflow/issues/3348#issuecomment-512101921

Elementwise Sampling with map_fn Slow

Say that I want to sample a matrix with each entry sampled from a distribution defined by an entry in another matrix. I unroll my matrix and apply map_fn to each element. With a relatively small matrix (128 x 128), the following gives me several PoolAllocator warnings (GTX TITAN Black) and does not train in any reasonable amount of time.
def sample(x):
samples = tf.map_fn(lambda z:
tf.random_normal([1], mean=z,
stddev=tf.sqrt(z * (1 - z))),
tf.reshape(x, [-1])) # apply to each element
return tf.cond(is_training, lambda: tf.reshape(samples, shape=tf.shape(x)),
lambda: tf.tanh(x))
Is there a better way to apply an elementwise operation like this?
Your code will run much faster if you can use Tensor-at-a-time operations instead of elementwise operations like tf.map_fn.
Here it looks like you want to sample from a normal distribution for each element, where the parameters of the distribution are different for each value in an input tensor. Try something like this:
def sample(x):
samples = tf.random_normal(shape=[128, 128]) * tf.sqrt(x * (1 - x)) + x
tf.random_normal() generates a normal distribution with mean 0.0 and standard deviation 1.0 by default. You can use point-wise tensor operations to fix up the standard deviation (by multiplying) and the mean (by adding) for each element. In fact, if you look at how tf.random_normal() is implemented, that's precisely what it does internally.
(You would probably also do better using a Python conditional to distinguish training from test time.)
If you plan to do this sort of thing a lot, you might file a feature request on github asking to generalize tf.random_normal to accept Tensors with more general shapes for mean and stddev. I see no reason why that shouldn't be supported.
Hope that helps!
See the tensorflow.contrib.distributions module, which has a Normal class with a sample method that does this for you.

Check how fast numpy.linalg.lstsq is finding convergence

I have a question concerning NumPy module linalg.lstsq(a,b). There is any possibility to check how fast this method is finding convergence? I mean some of characteristics which indicate how fast computation is going to convergence?
Thank you in advanced for brain storm.
The Numpy function linalg.lstsq uses singular value decomposition (SVD) to solve the least-square problem. Thus, if your matrix A is n by n, it will require n^3 flops.
More precisely, I think that the function uses the Householder Bidiagonalization to compute the SVD and so, if your matrix is m by n, the complexity will be O(max(m, n) * min(m, n)^2).