Does tensorflow recompute these values?

If I call x,y = sess.run([X,f(X)]), is X computed once or twice? I'm asking because in my case the value of X is not deterministic, and it's necessary that f be evaluated on the same 'instance' of X.

To make sure that f uses the current value of X, you can set up control dependencies:
with tf.control_dependencies([X]):
    y = f(X)
x, y_ = sess.run([X, y])

It will only compute it once. Within a single sess.run call, each operation in the graph is evaluated at most once, so both fetched values see the same instance of X. It would not make sense to recompute dependent values: just about all tensors in a TensorFlow model depend on one another.
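To convince yourself, here is a minimal sketch (TF 1.x style, matching the rest of this thread; the random op and the doubling are stand-ins for your non-deterministic X and your f):

import tensorflow as tf

X = tf.random_uniform(())   # stand-in for a non-deterministic X
Y = 2.0 * X                 # stand-in for f(X)

with tf.Session() as sess:
    x, y = sess.run([X, Y])
    print(x, y)             # y == 2 * x on every run, so X was evaluated only once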

Related

Automatic Differentiation with respect to rank-based computations

I'm new to automatic differentiation programming, so this may be a naive question. Below is a simplified version of what I'm trying to solve.
I have two input arrays - a vector A of size N and a matrix B of shape (N, M) - as well as a parameter vector theta of size M. I define a new array C(theta) = B * theta to get a new vector of size N. I then obtain the indices of the elements that fall in the upper and lower quartiles of C, and use them to create new arrays A_low(theta) = A[lower-quartile indices of C] and A_high(theta) = A[upper-quartile indices of C]. Clearly these two depend on theta, but is it possible to differentiate A_low and A_high w.r.t. theta?
My attempts so far seem to suggest no - I have tried the Python libraries autograd, JAX and TensorFlow, but they all return a gradient of zero. (The approaches I have tried so far involve using argsort or extracting the relevant sub-arrays using tf.top_k.)
What I'm seeking help with is either a proof that the derivative is not defined (or cannot be analytically computed) or if it does exist, a suggestion on how to estimate it. My eventual goal is to minimize some function f(A_low, A_high) wrt theta.
This is the JAX computation that I wrote based on your description:
import numpy as np
import jax.numpy as jnp
import jax
from jax import lax

N = 10
M = 20

rng = np.random.default_rng(0)
A = jnp.array(rng.random((N,)))
B = jnp.array(rng.random((N, M)))
theta = jnp.array(rng.random(M))

def f(A, B, theta, k=3):
    C = B @ theta
    _, i_upper = lax.top_k(C, k)
    _, i_lower = lax.top_k(-C, k)
    return A[i_lower], A[i_upper]

x, y = f(A, B, theta)
dx_dtheta, dy_dtheta = jax.jacobian(f, argnums=2)(A, B, theta)
The derivatives are all zero, and I believe this is correct, because the change in value of the outputs does not depend on the change in value of theta.
But, you might ask, how can this be? After all, theta enters into the computation, and if you put in a different value for theta, you get different outputs. How could the gradient be zero?
What you must keep in mind, though, is that differentiation doesn't measure whether an input affects an output. It measures the change in output given an infinitesimal change in input.
Let's use a slightly simpler function as an example:
import jax
import jax.numpy as jnp

A = jnp.array([1.0, 2.0, 3.0])
theta = jnp.array([5.0, 1.0, 3.0])

def f(A, theta):
    return A[jnp.argmax(theta)]

x = f(A, theta)
dx_dtheta = jax.grad(f, argnums=1)(A, theta)
Here the result of differentiating f with respect to theta is all zero, for the same reasons as above. Why? If you make an infinitesimal change to theta, it will in general not affect the sort order of theta. Thus, the entries you choose from A do not change given an infinitesimal change in theta, and thus the derivative with respect to theta is zero.
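You can check this directly by also differentiating with respect to A (a quick addition to the example above, not part of the original question): the gradient with respect to A is a one-hot vector that picks out the argmax entry, while the gradient with respect to theta is identically zero.

dx_dA = jax.grad(f, argnums=0)(A, theta)
print(dx_dA)       # [1. 0. 0.] -- selects A[argmax(theta)], i.e. A[0]
print(dx_dtheta)   # [0. 0. 0.] -- an infinitesimal change to theta does not change the selection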
Now, you might argue that there are circumstances where this is not the case: for example, if two values in theta are very close together, then certainly perturbing one even infinitesimally could change their respective rank. This is true, but the gradient resulting from this procedure is undefined (the change in output is not smooth with respect to the change in input). The good news is this discontinuity is one-sided: if you perturb in the other direction, there is no change in rank and the gradient is well-defined. In order to avoid undefined gradients, most autodiff systems will implicitly use this safer definition of a derivative for rank-based computations.
The result is that the value of the output does not change when you infinitesimally perturb the input, which is another way of saying the gradient is zero. And this is not a failure of autodiff – it is the correct gradient given the definition of differentiation that autodiff is built on. Moreover, were you to try changing to a different definition of the derivative at these discontinuities, the best you could hope for would be undefined outputs, so the definition that results in zeros is arguably more useful and correct.

STAN - Defining priors for dependent random variables

Background: I have a simulation model which has unobserved parameters. I created a metamodel using artificial neural networks (ANN) because the runtime was very long for the simulation model. I am trying to estimate the unobserved parameters using Bayesian calibration, where priors are based on current knowledge, and the likelihood of observing data is being estimated from the metamodel.
Query: I have two random variables X and Y for which I am trying to get the posterior distribution using STAN. The prior distribution of X is uniform, U(0,2). The prior for Y is also uniform, but it will always exceed X i.e., Y ~ U(X,2). Since Y is linked to X, how can I define the prior distribution for Y in STAN such that the constraint Y>X holds? I am new to STAN, so I would appreciate any suggestions or guidance on how to proceed. Thank you so much!
Stan's ordered vectors are what you need. Create an ordered vector of length 2 (I'll call it beta) in the parameters block, like this:
parameters {
  ordered<lower=0,upper=2>[2] beta;
}
Ordered vectors are constrained such that each element is greater than the previous element. So beta[1] will be your estimate of X and beta[2] will be your estimate of Y.
(To make sure I understand your model correctly: you have two parameters, X and Y, and your only prior knowledge about them is that they both lie in [0, 2] and Y > X. X and Y describe some aspect of the distribution of your data - for example, maybe X is the mean of some other random variable Z, for which you have observations. Do I have that right?)
I believe Stan's priors are uniform by default, but you can make sure of this by specifying a prior for beta in the model block:
model {
  beta ~ uniform(0, 2);
  ...
}

Toggling between tensors in two separate batches

Let's say x and y are two N-dimensional tensors, where both have the same dimensions and the first dimension is of size S (the batch size). Let's say b is a 1-dimensional tensor of booleans, of size S.
I want to produce z, a N-dimensional tensor defined as:
z[i] = b[i] ? x[i] : y[i] for i from 0 to (S-1)
where x[i] refers to the i-th (N-1)-dimensional slice of x.
What is the easiest way to do this? I thought tf.cond would work, but it only accepts scalar-valued predicates. Thank you!
tf.where should work here: when the condition is a rank-1 boolean tensor of length S and x and y have higher rank, it selects whole slices along the first (batch) dimension, which is exactly z[i] = b[i] ? x[i] : y[i]. If you find yourself wanting a batch version of conditional execution (where one or both branches are expensive to compute and should only run when selected), that's also possible, but it takes more work than a single op.
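A minimal sketch of the tf.where approach (TF 1.x, with illustrative shapes):

import tensorflow as tf

S = 4
x = tf.random_uniform([S, 2, 3])
y = tf.random_uniform([S, 2, 3])
b = tf.constant([True, False, True, False])

# With a rank-1 boolean condition, tf.where picks whole slices along the
# first axis: z[i] = x[i] if b[i] else y[i].
z = tf.where(b, x, y)

with tf.Session() as sess:
    print(sess.run(z).shape)   # (4, 2, 3)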

How to fetch gradients with respect to certain occurrences of variables in tensorflow?

Since tensorflow supports variable reuse, some parts of the compute graph may occur multiple times in both the forward and backward passes. So my question is: is it possible to update variables with respect to particular occurrences of them in the compute graph?
For example, in X_A->Y_B->Y_A->Y_B, Y_B occurs twice; how can each occurrence be updated separately? I mean: first treat the latter occurrence as a constant and update the former, then do the opposite.
A simpler example: say X_A, Y_B, Y_A are all scalar variables, and let Z = X_A * Y_B * Y_A * Y_B. Here the gradient of Z w.r.t. each single occurrence of Y_B is X_A * Y_B * Y_A, but the gradient of Z w.r.t. Y_B as a whole is 2 * X_A * Y_B * Y_A. In this example computing the per-occurrence gradients may seem unnecessary, but such computations are not always commutative.
In the first example, the gradient with respect to the latter occurrence can be computed by calling tf.stop_gradient on X_A->Y_B. But I cannot think of a way to fetch the gradient for the former one. Is there a way to do it in tensorflow's python API?
Edit:
#Seven provided an example of how to deal with this when reusing a single variable. However, it is often a whole variable scope that is reused, containing many variables and the functions that manage them. As far as I know, there is no built-in way to reuse a variable scope while applying tf.stop_gradient to all the variables it contains.
In my understanding, when you use A = tf.stop_gradient(A), A will be treated as a constant during differentiation. I have an example here, maybe it can help you.
import tensorflow as tf

wa = tf.get_variable('a', shape=(), dtype=tf.float32,
                     initializer=tf.constant_initializer(1.5))
b = tf.get_variable('b', shape=(), dtype=tf.float32,
                     initializer=tf.constant_initializer(7))
x = tf.placeholder(tf.float32, shape=())

# The first factor is wrapped in tf.stop_gradient, so it is treated as a
# constant when differentiating; only the second wa*x contributes to dl/dx.
l = tf.stop_gradient(wa * x) * (wa * x + b)
op_gradient = tf.gradients(l, x)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
print(sess.run([op_gradient], feed_dict={x: 11}))
I have a workaround for this question. Define a custom getter for the variable scope in question, which wraps the default getter with tf.stop_gradient. This makes every variable returned in that scope a Tensor contributing no gradients, though sometimes things get complicated because the getter returns a Tensor instead of a Variable, for example when using tf.nn.batch_norm.
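A rough sketch of that custom-getter idea (TF 1.x; the names and the tiny graph are just illustrative):

import tensorflow as tf

def stop_gradient_getter(getter, name, *args, **kwargs):
    # Wrap the default getter so every variable fetched through it is
    # returned behind tf.stop_gradient (note: as a Tensor, not a Variable).
    return tf.stop_gradient(getter(name, *args, **kwargs))

with tf.variable_scope('shared'):
    w = tf.get_variable('w', shape=(), initializer=tf.constant_initializer(2.0))

with tf.variable_scope('shared', reuse=True, custom_getter=stop_gradient_getter):
    w_frozen = tf.get_variable('w')   # same storage, but contributes no gradients

x = tf.placeholder(tf.float32, shape=())
z = w * x * w_frozen                  # only the first occurrence of w receives a gradient
grad_w, = tf.gradients(z, [w])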

How to create a recurrent variable with TensorFlow

This sounds super easy but I cannot find any info on the internet. I am probably lacking some fundamental understanding.
I would like to do something simple: a recurrent variable. Say:
Z(t) = W * Z(t-1)
with some fixed (but trainable) W.
I tried things like:
initializer = tf.random_uniform_initializer(0., 1.)
with tf.variable_scope('recurrent', initializer=initializer):
    Z = tf.get_variable('Z', shape=[...])
    Z = tf.matmul(W, Z)
But of course, within a session, if I do Z.eval(), it gives a coherent value of Z, but Z itself is not updated.
Hence my question: how do you create a recurrent variable that gets updated when running the graph with TensorFlow?
Thank you very much for your help!
When you write a statement like
Z = tf.matmul(W, Z)
you are updating the Python variable Z and not TensorFlow's internal storage associated with the TensorFlow variable Z. Please have a look at the section on stateful operations in the TensorFlow documentation to get an idea of how TensorFlow manages state. To answer your specific question, you have to use the tf.assign operation to update TensorFlow's Z variable, as follows:
Z = tf.assign(Z, tf.matmul(W, Z))
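For example, a minimal sketch (TF 1.x, with an illustrative 2x2 W) might look like this:

import tensorflow as tf

W = tf.get_variable('W', shape=[2, 2],
                    initializer=tf.random_uniform_initializer(0., 1.))
Z = tf.get_variable('Z', shape=[2, 1],
                    initializer=tf.random_uniform_initializer(0., 1.))

# Running this op computes W * Z and writes the result back into Z's storage.
step = tf.assign(Z, tf.matmul(W, Z))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(3):
        print(sess.run(step))   # Z advances one recurrence step per call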