Let's say x and y are two N-dimensional tensors, where both have the same dimensions and the first dimension is of size S (the batch size). Let's say b is a 1-dimensional tensor of booleans, of size S.
I want to produce z, an N-dimensional tensor defined as:
z[i] = b[i] ? x[i] : y[i] for i from 0 to (S-1)
where x[i] refers to the i-th (N-1)-dimensional slice of x.
What is the easiest way to do this? I thought tf.cond would work, but it only accepts scalar-valued predicates. Thank you!
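tf.where should work: it selects elementwise between x and y based on a boolean condition. Depending on the TensorFlow version, b may need to be reshaped to [S, 1, ..., 1] so that it broadcasts against x and y. If you find yourself wanting a batch version of conditional execution (where one or both branches are expensive to compute), that's also possible.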
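As a rough illustration of the selection rule z[i] = b[i] ? x[i] : y[i], here is a minimal NumPy sketch (not TensorFlow code). The helper name batch_select is hypothetical; it reshapes b to (S, 1, ..., 1) so that each boolean applies to a whole slice.

```python
import numpy as np

def batch_select(b, x, y):
    """Hypothetical helper: pick x[i] where b[i] is True, else y[i]."""
    b = np.asarray(b, dtype=bool)
    # Reshape b from (S,) to (S, 1, ..., 1) so it broadcasts over each slice.
    b_expanded = b.reshape((-1,) + (1,) * (x.ndim - 1))
    return np.where(b_expanded, x, y)

x = np.zeros((3, 2, 2))            # batch of 3 slices, all zeros
y = np.ones((3, 2, 2))             # batch of 3 slices, all ones
b = np.array([True, False, True])  # one boolean per batch entry

z = batch_select(b, x, y)
```

Here z[0] and z[2] come from x, and z[1] comes from y. The same reshape-then-select pattern should carry over to tf.where, though that is an assumption about the TensorFlow version in use.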
I'm new to automatic differentiation programming, so this may be a naive question. Below is a simplified version of what I'm trying to solve.
I have two input arrays, a vector A of size N and a matrix B of shape (N, M), as well as a parameter vector theta of size M. I define a new array C(theta) = B * theta to get a new vector of size N. I then obtain the indices of elements that fall in the upper and lower quartile of C, and use them to create the new arrays A_low(theta) = A[lower quartile indices of C] and A_high(theta) = A[upper quartile indices of C]. Clearly these two do depend on theta, but is it possible to differentiate A_low and A_high w.r.t. theta?
My attempts so far seem to suggest no: I have tried the Python libraries autograd, JAX and TensorFlow, but they all return a gradient of zero. (The approaches I have tried so far involve using argsort or extracting the relevant sub-arrays using tf.top_k.)
What I'm seeking help with is either a proof that the derivative is not defined (or cannot be analytically computed) or if it does exist, a suggestion on how to estimate it. My eventual goal is to minimize some function f(A_low, A_high) wrt theta.
This is the JAX computation that I wrote based on your description:
import numpy as np
import jax.numpy as jnp
import jax
N = 10
M = 20
rng = np.random.default_rng(0)
A = jnp.array(rng.random((N,)))
B = jnp.array(rng.random((N, M)))
theta = jnp.array(rng.random(M))
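def f(A, B, theta, k=3):
    C = B @ theta  # matrix-vector product, shape (N,)
    _, i_upper = jax.lax.top_k(C, k)   # indices of the k largest entries of C
    _, i_lower = jax.lax.top_k(-C, k)  # indices of the k smallest entries of C
    return A[i_lower], A[i_upper]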
x, y = f(A, B, theta)
dx_dtheta, dy_dtheta = jax.jacobian(f, argnums=2)(A, B, theta)
The derivatives are all zero, and I believe this is correct, because the change in value of the outputs does not depend on the change in value of theta.
But, you might ask, how can this be? After all, theta enters into the computation, and if you put in a different value for theta, you get different outputs. How could the gradient be zero?
What you must keep in mind, though, is that differentiation doesn't measure whether an input affects an output. It measures the change in output given an infinitesimal change in input.
Let's use a slightly simpler function as an example:
import jax
import jax.numpy as jnp
A = jnp.array([1.0, 2.0, 3.0])
theta = jnp.array([5.0, 1.0, 3.0])
def f(A, theta):
    return A[jnp.argmax(theta)]
x = f(A, theta)
dx_dtheta = jax.grad(f, argnums=1)(A, theta)
Here the result of differentiating f with respect to theta is all zero, for the same reasons as above. Why? If you make an infinitesimal change to theta, it will in general not affect the sort order of theta. Thus, the entries you choose from A do not change given an infinitesimal change in theta, and thus the derivative with respect to theta is zero.
Now, you might argue that there are circumstances where this is not the case: for example, if two values in theta are very close together, then certainly perturbing one even infinitesimally could change their respective rank. This is true, but the gradient resulting from this procedure is undefined (the change in output is not smooth with respect to the change in input). The good news is this discontinuity is one-sided: if you perturb in the other direction, there is no change in rank and the gradient is well-defined. In order to avoid undefined gradients, most autodiff systems will implicitly use this safer definition of a derivative for rank-based computations.
The result is that the value of the output does not change when you infinitesimally perturb the input, which is another way of saying the gradient is zero. And this is not a failure of autodiff – it is the correct gradient given the definition of differentiation that autodiff is built on. Moreover, were you to try changing to a different definition of the derivative at these discontinuities, the best you could hope for would be undefined outputs, so the definition that results in zeros is arguably more useful and correct.
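The argument can be checked numerically without any autodiff library. The following is a minimal NumPy sketch, with a hypothetical helper finite_difference_grad, that bumps each entry of theta by a small eps and measures the change in the output of the argmax example.

```python
import numpy as np

A = np.array([1.0, 2.0, 3.0])
theta = np.array([5.0, 1.0, 3.0])

def f(A, theta):
    # Select the entry of A at the position of the largest theta.
    return A[np.argmax(theta)]

def finite_difference_grad(theta, eps=1e-6):
    """Hypothetical helper: forward-difference estimate of df/dtheta."""
    base = f(A, theta)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        bumped = theta.copy()
        bumped[i] += eps  # small perturbation of one entry
        grad[i] = (f(A, bumped) - base) / eps
    return grad

fd_grad = finite_difference_grad(theta)

# A large (non-infinitesimal) change does alter the output.
far_theta = np.array([5.0, 1.0, 7.0])
changed = f(A, far_theta)
```

The small perturbations never change which index argmax picks, so every entry of fd_grad is zero, while the large change in far_theta moves the selection from A[0] to A[2]. This is the distinction between "theta affects the output" and "the output has a nonzero derivative with respect to theta".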
I understand that chebvander2d and chebval2d return the Vandermonde matrix and fitted values for 2D inputs, and chebfit returns the coefficients for 1D-input series, but how do I get the coefficients for 2D-input series?
Short answer: It looks to me like this is not yet implemented. The whole of 2D polynomials seems more like a draft with some stub functions (as of June 2020).
Long answer (I came looking for the same thing, so I dug a little deeper):
First of all, this applies to all of the polynomial classes, not only chebyshev, so you also cannot fit an "ordinary" polynomial (power series). In fact, you cannot even construct one.
To understand the programming problem, let me recap what a 2D polynomial looks like as a math formula, using an example polynomial of degree 2:
p(x, y) = c_00 + c_10 x + c_01 y + c_20 x^2 + c_11 xy + c_02 y^2
here the indices of c refer to the powers of x and y (the sum of the exponents must be <= degree).
First thing to notice is that, for degree d, there are (d+1)(d+2)/2 coefficients.
They could be stored in the upper left part of a matrix or in a 1D array, e.g. arranged as in the formula above.
The documentation of functions like numpy.polynomial.polynomial.polyval2d implies that numpy expects the matrix variant: p(x, y) = sum_i,j c_i,j * x^i * y^j.
Side note: it may be confusing that the row index i ("y-coordinate") of the matrix is used as the exponent of x, not y; maybe the roles of i and j should be switched if this is eventually implemented, or at least there should be a note in the documentation.
This leads to the core problem: the data structure for the 2D coefficients is not defined anywhere; only indirectly, like above, it can be guessed that a matrix should be used. But compared to a 1D array this is a waste of space, and evaluation of the polynomial takes two nested loops instead of just one. Also: does the matrix have to be initialized with np.zeros or do the implemented functions make sure that the lower right part is never touched so that np.empty can be used?
If the whole (d+1)^2 matrix were used, as the polyval2d function doc suggests, the degree of the polynomial would actually be 2*d (if c_d,d != 0), since the highest term would be x^d * y^d.
To test this, I wanted to construct a numpy.polynomial.polynomial.Polynomial (yes, three times polynomial) and check the degree attribute:
import numpy as np
import numpy.polynomial.polynomial as poly
coef = np.array([
    [5.00, 5.01, 5.02],
    [5.10, 5.11, 0.  ],
    [5.20, 0.  , 0.  ]
])
polyObj = poly.Polynomial(coef)
print(polyObj.degree)
This gave a ValueError: Coefficient array is not 1-d before the print statement was reached. So while polyval2d expects a 2D coefficient array, it is not (yet) possible to construct such a polynomial - not manually like this at least. With this insight, it is not surprising that there is no function (yet) that computes a fit for 2D polynomials.
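Even without a dedicated 2D fit function, the coefficients can be estimated by hand. The sketch below is one possible approach, not a NumPy-provided routine: it builds the 2D Vandermonde matrix with chebvander2d, solves the linear least-squares problem with np.linalg.lstsq, and reshapes the solution into the matrix layout that chebval2d expects. The helper name chebfit2d and the sample data are assumptions made for illustration, and the fit uses the full (deg_x+1) x (deg_y+1) coefficient matrix rather than only the upper left triangle.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

def chebfit2d(x, y, z, deg):
    """Hypothetical helper: least-squares fit of 2D Chebyshev coefficients."""
    deg_x, deg_y = deg
    # Columns are T_i(x) * T_j(y), ordered as (deg_y + 1) * i + j.
    V = cheb.chebvander2d(x, y, [deg_x, deg_y])
    coef, *_ = np.linalg.lstsq(V, z, rcond=None)
    # Reshape the flat solution into the c[i, j] matrix used by chebval2d.
    return coef.reshape(deg_x + 1, deg_y + 1)

# Synthetic data generated from known coefficients (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = rng.uniform(-1, 1, 200)
true_coef = np.array([
    [1.0,  0.5, 0.0],
    [2.0, -1.0, 0.0],
    [0.3,  0.0, 0.0],
])
z = cheb.chebval2d(x, y, true_coef)

fit_coef = chebfit2d(x, y, z, (2, 2))
```

For noise-free data like this, fit_coef should match true_coef up to floating-point error. The same pattern should apply to the other polynomial modules by swapping in, for example, polyvander2d and polyval2d.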
I want to construct a weight matrix in which certain elements are zero and never change, and the other elements are variables. For example:
[[0,0,a,0],[0,0,b,0],[0,0,0,c],[0,0,0,d]]
This is a tf variable, and all zeros stay unchanged. Only a, b, c, d are tuned using gradient descent.
Does anyone know how to define such a matrix?
You should look into SparseTensor. It is highly optimised for operations where tensor consists of many zeros.
So, in your case, to initialise SparseTensor:
a,b,c,d = 10,20,30,40
sparse = tf.SparseTensor([[0,2], [1,2], [2,3], [3,3]], [a,b,c,d], [4,4])
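To make the idea concrete, here is a minimal NumPy sketch (not TensorFlow code) of the dense matrix those indices and values describe, and of an update that only ever touches the four free entries. The helper build_weight, the toy loss 0.5 * sum(W**2), and the learning rate are all assumptions chosen for illustration.

```python
import numpy as np

# Positions of the free parameters a, b, c, d in the 4x4 matrix.
indices = np.array([[0, 2], [1, 2], [2, 3], [3, 3]])
rows, cols = indices[:, 0], indices[:, 1]

def build_weight(values, shape=(4, 4)):
    """Hypothetical helper: scatter the free values into a zero matrix."""
    W = np.zeros(shape)
    W[rows, cols] = values
    return W

values = np.array([10.0, 20.0, 30.0, 40.0])  # a, b, c, d
W = build_weight(values)

# Toy loss 0.5 * sum(W**2) has gradient dL/dW = W. Only the entries at the
# free positions are gathered, so the zeros are never updated.
grad_W = W
grad_values = grad_W[rows, cols]
values_new = values - 0.1 * grad_values
W_new = build_weight(values_new)
```

Because only values is treated as the parameter, every other entry of the matrix stays exactly zero after the update.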
Let T be a tensor of shape [n,f], which represents a batch. Now I want to slice T into m tensors along axis=0. The value of m depends on the current batch. I have another tensor I of shape [m,2] which stores pairs of indices which indicate where the slices should occur.
I am not really sure how to "iterate" over the indices to apply tf.slice. Any ideas?
Can this somehow be achieved using tf.scan?
I suppose you are looking for the split function.
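The question does not say whether each pair in I holds (start, end) or (start, size). Assuming half-open (start, end) pairs, the slicing can be sketched in NumPy as below; the helper name slice_by_pairs is hypothetical. When the slices are contiguous and cover the whole batch, a split at the interior boundaries gives the same pieces.

```python
import numpy as np

def slice_by_pairs(T, I):
    """Hypothetical helper: one slice of T along axis 0 per (start, end) pair."""
    return [T[start:end] for start, end in I]

T = np.arange(12).reshape(6, 2)          # batch of shape [n, f] = [6, 2]
I = np.array([[0, 2], [2, 3], [3, 6]])   # assumed half-open (start, end) pairs

pieces = slice_by_pairs(T, I)

# For contiguous slices covering the batch, splitting at the interior
# start indices yields the same pieces.
contiguous = np.split(T, I[1:, 0])
```

The list comprehension handles overlapping or non-contiguous pairs as well, whereas the split form only applies when the pairs partition the batch.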
I have two tensors, a of rank 4 and b of rank 1. I'd like to produce aprime, of rank 3, by "contracting" the last axis of a away, by replacing it with its dot product against b. In numpy, this is as easy as np.tensordot(a, b, 1). However, I can't figure out a way to do this in Tensorflow.
How can I replace the last axis of a tensor with a value equal to that axis's dot product against another tensor (of course, of the same shape)?
UPDATE:
I see in Wikipedia that this is called the "Tensor Inner Product" https://en.wikipedia.org/wiki/Dot_product#Tensors aka tensor contraction. It seems like this is a common operation, I'm surprised that there's no explicit support for it in Tensorflow.
I believe that this may be possible via tf.einsum; however, I have not been able to find a generalized way to do this that works for tensors of any rank (this is probably because I do not understand einsum and have been reduced to trial and error)
Aren't you just using tensor in the sense of a multidimensional array? Or in some disciplines a tensor is 3d (vector 1d, matrix 2d, etc). I haven't used tensorflow, but I don't think it has much to do with tensors in the linear algebra sense. They talk about data flow graphs. I'm not sure where the tensor part of the name comes from.
I assume you are talking about an expression like:
In [293]: A=np.tensordot(np.ones((5,4,3,2)),np.arange(2),1)
resulting in a (5,4,3) shape array. The einsum equivalent is
In [294]: B=np.einsum('ijkl,l->ijk',np.ones((5,4,3,2)),np.arange(2))
np.einsum implements Einstein notation, as discussed here: https://en.wikipedia.org/wiki/Einstein_notation. I got this link from https://en.wikipedia.org/wiki/Tensor_contraction
You seem to be talking about straightforward numpy operations, not something special in tensorflow.
I would first add 3 dimensions of size 1 to b so that it can be broadcast along the 4th dimension of a.
b = tf.reshape(b, (1, 1, 1, -1))
Then you can multiply b and a and it will broadcast b along all of the other dimensions.
a_prime = a * b
Finally, sum along the 4th dimension to get rid of that dimension and replace it with the dot product.
a_prime = tf.reduce_sum(a_prime, [3])
This seems like it would work (for the first tensor being of any rank):
tf.einsum('...i,i->...', x, y)
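The three approaches in this thread (tensordot, einsum with an ellipsis, and broadcast-multiply followed by a sum) can be compared side by side. The following is a minimal NumPy sketch with made-up random inputs; it assumes the TensorFlow calls behave like their NumPy counterparts, which is not verified here.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((5, 4, 3, 2))   # rank-4 tensor
b = rng.random(2)              # rank-1 tensor matching the last axis of a

# 1) Contract the last axis of a with b.
via_tensordot = np.tensordot(a, b, 1)

# 2) Ellipsis form: works for a leading part of any rank.
via_einsum = np.einsum('...i,i->...', a, b)

# 3) Broadcast b along the last axis, multiply, then sum that axis away.
via_broadcast = (a * b.reshape(1, 1, 1, -1)).sum(axis=3)

# The ellipsis form also handles a different rank without changes.
a2 = rng.random((7, 2))
rank2_result = np.einsum('...i,i->...', a2, b)
```

All three results have shape (5, 4, 3) and agree up to floating-point error. Only the reshape in the third variant depends on the rank of a, which is why the ellipsis form is the most general of the three.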