Tensorflow shape of a tiled tensor

I have a variable a of dimension (1, 5) which I want to 'tile' as many times as the size of my mini-batch. For example, if the mini-batch size is 32 then I want to construct a tensor c of dimension (32, 5) where each row has values the same as the original (1, 5) variable a.
But I only know the mini-batch size at run time: it's the size of dimension 0 of a placeholder b: tf.shape(b)[0]
Here's my code to construct c:
import numpy as np
import tensorflow as tf

a = tf.Variable(np.random.uniform(size=(1, 5)))
b = tf.placeholder(shape=[None, 12], dtype=tf.float32)
batch_size = tf.shape(b)[0]
c = tf.tile(a, tf.pack([batch_size, 1]))
This runs fine. However, c.get_shape() returns (?, ?). I don't understand why this doesn't return (?, 5) instead.
This is causing an issue later in my code when I construct a matrix variable W with number of columns c.get_shape()[1] which I expect to return 5 rather than ?.
Any help would be appreciated. Thanks.

[EDIT: This was fixed in a commit to TensorFlow on August 10, 2016.]
This is a known limitation of TensorFlow's shape inference: when the multiples argument to tf.tile() is a computed value (such as the result of tf.pack() here), and its value is not trivially computable at graph construction time (in this case, because it depends on a tf.placeholder(), which has no value until it is fed), the current shape inference will throw its hands up and declare that the shape is unknown (but with the same rank as the input, a).
The current workaround is to use Tensor.set_shape(), which allows you as the programmer to provide additional shape information when you know more than the shape inference does. For example, you could do:
a = tf.Variable(np.random.uniform(size=(1, 5)))
b = tf.placeholder(shape=[None, 12], dtype=tf.float32)
batch_size = tf.shape(b)[0]
c = tf.tile(a, tf.pack([batch_size, 1]))
c.set_shape([None, a.get_shape()[1]]) # or `c.set_shape([None, 5])`
However, we recently added some features that make it possible to propagate partially computed values that may be used as shapes, and this can be adapted to aid the shape function for tf.tile(). I have created a GitHub issue to track this, and I have a fix being tested right now.
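In the meantime, the downstream problem mentioned in the question can also be avoided by reading the column count from a, whose static shape is fully known, instead of from c. A minimal sketch (the row count of W below is just an illustrative choice):
num_cols = int(a.get_shape()[1])  # 5, known at graph construction time
W = tf.Variable(np.random.uniform(size=(7, num_cols)))  # 7 rows chosen arbitrarily for the example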

Related

Change saved tensorflow model input shape at inference time

I've searched everywhere but couldn't find anything. It seems so weird that nobody has encountered the same problem as me... Let me explain:
I've trained a TensorFlow 2 custom model. During training I used set_shape((None, 320, 320, 14)) so that TensorFlow knows the shape (it couldn't infer it for whatever reason... -_-").
I have also saved my custom model at every 100 epochs using:
model.save(os.path.join('models', 'pb', FLAGS.task_name + '-%i' % epoch))
So for the 100th epoch I will have a folder models/pb/my_name-100 that contains
assets
variables
saved_model.pb
Now, at inference time, I just want to load the model (without all the code). So I have created another piece of code that only loads the model and makes a prediction... A basic template looks like:
class NeuralNetwork:
    def __init__(self, model):
        self.model = tf.keras.models.load_model(model)

    def predict(self, input_tensor):
        pred = self.model(input_tensor[None, ...])
        return pred[0]
Where input_tensor is of size (H, W, 14), so input_tensor[None, ...] is of size (1, H, W, 14).
The problem is that, because I set the shape during training to be (None, 320, 320, 14)... this stupid TensorFlow expects the input to be (None, 320, 320, 14) -_-"!!! My neural network is fully convolutional, so I really don't care about the input shape. I set it to (320, 320, 14) during training for memory reasons...
During prediction I'd like to be able to do prediction on any kind of shape.
Obviously, I could write a preprocessing function that extracts patches of size (320, 320) from the input image and tiles them. So, for example, my input_tensor could be of size (30, 320, 320, 14).
And then, after the prediction, I could reconstruct the image from the tiles... But I don't want to do that:
Firstly, because it takes a bit of time to create the tiles and reconstruct the image from them.
Secondly, because the result will be a bit off due to the zero padding in the convolutions, which means I need to create overlapping tiles and average the results on the overlapping parts to avoid artifacts during reconstruction.
So my question is simple:
How can I tell TensorFlow to accept any width and height at inference time? Omg, it's so bothersome. I can't believe there isn't an easy option available to do that.
I answer my own question.
Unfortunately, my answer will not satisfy everybody. There are so many convoluted things happening in TF (not to mention that when you search for help, most of it concerns the TF1 API... -_-").
Anyway, here is the "solution".
In my neural network, I have implemented a custom layer to mimic the PyTorch function AdaptiveAvgPool2d. My implementation actually uses tf.nn.avg_pool under the hood and needs to dynamically compute the kernel size as well as the stride. Here is my code, for reference:
class AdaptiveAvgPool2d(layers.Layer):
    def __init__(self, output_shape, data_format='channels_last'):
        super(AdaptiveAvgPool2d, self).__init__(autocast=False)
        assert data_format in {'channels_last', 'channels_first'}, \
            'data format parameter must be in {channels_last, channels_first}'
        if isinstance(output_shape, tuple):
            self.out_h = output_shape[0]
            self.out_w = output_shape[1]
        elif isinstance(output_shape, int):
            self.out_h = output_shape
            self.out_w = output_shape
        else:
            raise RuntimeError("output_shape should be an Integer or a Tuple2")
        self.data_format = data_format

    def call(self, inputs, mask=None):
        # input_shape = tf.shape(inputs)
        input_shape = inputs.get_shape().as_list()
        if self.data_format == 'channels_last':
            h_idx, w_idx = 1, 2
        else:  # can use else instead of elif due to assert in __init__
            h_idx, w_idx = 2, 3
        stride_h = input_shape[h_idx] // self.out_h
        stride_w = input_shape[w_idx] // self.out_w
        k_size_h = stride_h + input_shape[h_idx] % self.out_h
        k_size_w = stride_w + input_shape[w_idx] % self.out_w
        pool = tf.nn.avg_pool(
            inputs,
            ksize=[k_size_h, k_size_w],
            strides=[stride_h, stride_w],
            padding='VALID',
            data_format='NHWC' if self.data_format == 'channels_last' else 'NCHW')
        return pool
The problem is that I'm using inputs.get_shape().as_list() to recover plain int values and not a Tensor(..., dtype=int). Indeed, tf.nn.avg_pool accepts a list of ints for both the ksize and the strides parameters...
Put differently, I couldn't use tf.shape(inputs) because it returns a Tensor(..., dtype=int), and there is no way to recover an int from a Tensor besides evaluating it...
The way I implemented my function worked just fine; the problem is that TensorFlow infers the sizes under the hood and saves the shapes of all the tensors inside the .pb file when I save it.
Indeed, you can easily open a .pb file with any text editor (SublimeText) and see for yourself the expected TensorShape. In my case it was TensorShape: [null, 320, 320, 14].
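(Rather than opening the .pb in a text editor, the stored input signature can also be inspected programmatically; a small sketch, with the saved-model path taken from above:)
loaded = tf.saved_model.load('models/pb/my_name-100')
print(loaded.signatures['serving_default'].structured_input_signature)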
So, using set_shape((None, None, None, 14)) instead of set_shape((None, 320, 320, 14)), or nothing at all, doesn't actually change the problem...
The problem is that the average pooling layer does not accept dynamic kernel sizes or strides...
I then realized that there is actually a TensorFlow function for this: tfa.layers.AdaptiveAveragePooling2D. So, I might just go with it and it will be fine, right?
Well, not exactly. Under the hood, this function uses other TF functions such as tf.split. The problem with tf.split is that if the dimension you want to split has size X and you want an output of size Y, and X % Y != 0, then tf.split will throw an error... PyTorch, by contrast, is much more robust and handles cases where X % Y != 0.
Put differently, it means that in order for me to use tfa.layers.AdaptiveAveragePooling2D, I need to be sure that the size of the tensor received by this layer is divisible by the scalar I pass to it.
For example, in my case:
The input images are of size (320, 320, whatever), and the input tensor received by tfa.layers.AdaptiveAveragePooling2D is (40, 40, whatever).
So the spatial dimensions of my tensor were divided by 8 during training. In order for it to work, I should choose an output size that divides 40. Let's say I choose 5.
It means that, during prediction, my neural network will work if the input that tfa.layers.AdaptiveAveragePooling2D receives is also divisible by 5. But we already know that my input image is 8x bigger than the tensor received by tfa.layers.AdaptiveAveragePooling2D, so it means that, at prediction time, I can use whatever image size as long as:
H % (8 * 5) == 0 and W % (8 * 5) == 0
where H and W are respectively the height and the width of my input image.
To do that, we can just implement a simple function that rounds the dimensions up to the nearest multiple of 40 (in this example):
new_W = W + (40 - W % 40) % 40
new_H = H + (40 - H % 40) % 40
This stretches the image a bit, but not too much, so it should be just fine.
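For reference, here is a minimal sketch of such a resizing helper (the function name and the use of tf.image.resize for the stretching step are my own choices, not from the original post):
import tensorflow as tf

def resize_to_multiple(image, multiple=40):
    # image: a single (H, W, C) tensor; round H and W up to the nearest
    # multiple of `multiple` and stretch the image to that size.
    h = tf.shape(image)[0]
    w = tf.shape(image)[1]
    new_h = tf.cast(tf.math.ceil(h / multiple) * multiple, tf.int32)
    new_w = tf.cast(tf.math.ceil(w / multiple) * multiple, tf.int32)
    return tf.image.resize(image, tf.stack([new_h, new_w]))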
Summing up:
My AdaptiveAvgPool2d uses static shapes, but I cannot do otherwise since it uses tf.nn.avg_pool under the hood, which doesn't accept dynamic shapes.
tfa.layers.AdaptiveAveragePooling2D is a workaround, but because it relies on tf.split, which is not robust to inexact division, it is not perfect either.
The basic solution is to use tfa.layers.AdaptiveAveragePooling2D and create a preprocessing function before calling the prediction, so that the tensor satisfies the tf.split constraint.
Finally, this is not a good solution either. During training, if I receive a tensor of size (40, 40) and want an average output of size (5, 5), I basically average (8, 8) features to retrieve one feature.
The problem is that, if I do that at inference time on a bigger image, I will receive a bigger tensor. Let's say (100, 200). But since my output will always be (5, 5), this time I will average (20, 40) features to retrieve one feature...
Because of this difference between training and inference, inferring on a bigger image this way might lead to inconsistent results.
In my case, the way to go is to batch the images into tiles as I explained in my original post...
Hope it will help some of you.
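(For completeness: when the model is a plain fully convolutional stack with no layer that needs static spatial sizes, simply declaring the spatial dimensions as None gives what the question originally asked for. A minimal sketch with an assumed layer stack:)
inputs = tf.keras.Input(shape=(None, None, 14))
x = tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu')(inputs)
outputs = tf.keras.layers.Conv2D(1, 1)(x)
model = tf.keras.Model(inputs, outputs)
model.save('models/pb/flexible_fcn')  # hypothetical path; this model accepts any H and W at inference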

How to map an array of values for y_true to a single value in order to compare to y_pred in a Tensorflow loss function (Tensorflow/Tensorflow Quantum)

I am trying to implement the circuits listed on page 8 in the following paper: https://arxiv.org/pdf/1905.10876.pdf using Tensorflow Quantum (TFQ). I have done so previously for a subset of circuits using Qiskit, and ended up with accuracies that can be found on page 14 in the following paper: https://arxiv.org/pdf/2003.09887.pdf. In TFQ, my accuracies are way down. I think this delta originates because in TFQ, I only used 1 observable Pauli Z operator on the first qubit, and the circuits do not seem to "transfer all knowledge" to the first qubit. I place this in quotes, because I am sure there is a better way to describe this. In Qiskit on the other hand, 16 states (4^2) get mapped to 2 states.
My question: how can I get my accuracies back up?
Potential answer a): some method of "transferring all information" to a single qubit, potentially an ancilla qubit, and doing a readout on this qubit.
Potential answer b) placing a Pauli Z observable on all qubits (4 in total), mapping half of the 16 states to a label 0 and the other half to a label 1. I attempted this in the code below.
My attempt at answer b):
I have a Tensorflow Quantum (TFQ) circuit implemented in Tensorflow. The circuit has multiple observables, which I try to bring together in my loss function. I prefer to use as many standard components as possible, but need to map my quantum states to a label in order to determine the loss. I think what I am trying to achieve is not unique to TFQ. I define my model in the following way:
def circuit():
    data_qubits = cirq.GridQubit.rect(4, 1)
    circuit = cirq.Circuit()
    ...
    return circuit, [cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]),
                     cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])]

model_circuit, model_readout = circuit()

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(), dtype=tf.string),
    # The PQC layer returns the expected value of the readout gate, range [-1, 1].
    tfq.layers.PQC(model_circuit, model_readout),
])

# compile model
model.compile(
    loss=loss_mse,
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
    metrics=[])
In loss_mse (mean squared error), I receive a (32, 4) tensor for y_pred. One row could look like:
[-0.2, 0.33, 0.6, 0.3]
This would have to be first mapped from [-1,1] to a binarized version of [0,1], so that it looks like:
[0, 1, 1, 1]
Now, a table lookup needs to happen, which tells whether this combination is 0 or 1. Finally, the regular (y_true - y_pred)^2 can be computed for that row, followed by an np.sum over all rows. I tried to implement this:
def get_label(measurement):
    if measurement == [0, 0, 0, 0]: return 0
    ...
    elif measurement == [1, 1, 1, 1]: return 0
    else: return -1

def py_call(y_true, y_pred):
    # cast tensor to numpy
    y_pred_np = np.asarray(y_pred)
    loss = np.zeros((len(y_pred)))  # could be a single variable with += within the loop
    # evaluate all 32 samples
    for pred in range(len(y_pred_np)):
        # map, binarize and look up
        y_labelled = get_label([0 if y < 0 else 1 for y in y_pred_np[pred]])
        # regular loss comparison
        loss[pred] = (y_labelled - y_true[pred]) ** 2
    # reduce
    loss = np.sum(loss) / len(y_true)
    return loss

@tf.function
def loss_mse(y_true, y_pred):
    loss = tf.py_function(py_call, inp=[y_true, y_pred], Tout=[tf.float64])
    return loss
However, the system appears to still expect a (32,4) tensor. I would have thought I could simply provide a single loss values (float). My question: how can I map multiple values for y_true to a single number in order to compare with a single y_pred value in a tensorflow loss function?
So it looks like there are a couple of things going on here. To answer your question
how can I map multiple values for y_true to a single number in order to compare with a single y_pred value in a tensorflow loss function ?
What you might want is some kind of tf.reduce_* function like tf.reduce_mean or tf.reduce_sum. These functions allow you to apply a reduction operation across a given tensor axis, letting you convert a tensor of shape (32, 4) to a tensor of shape (32,) or a tensor of shape (4,). Here is a quick snippet:
@tf.function
def my_loss(y_true, y_pred):
    # y_true is shape (32, 4)
    # y_pred is shape (32, 4)
    # Scale from [-1, 1] to [0, 1]
    y_true += 1
    y_true /= 2
    y_pred += 1
    y_pred /= 2
    # These are now both (32,), with the mean taken along the second axis.
    reduced_true = tf.reduce_mean(y_true, axis=1)
    reduced_pred = tf.reduce_mean(y_pred, axis=1)
    # Now a scalar loss.
    loss = tf.reduce_mean((reduced_true - reduced_pred) ** 2)
    return loss
Now the above isn't exactly what you want, since it's not super clear to me at least what exact reduction rules you have in mind for taking something like [0,1,1,1] -> 0 vs [0,0,0,0] -> 1.
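If the rule really is a fixed 16-entry lookup table, one way you could express it with plain TF ops (instead of going through tf.py_function) is sketched below. The label values in the table are placeholders, y_true is assumed to be a (batch,) vector of 0/1 labels, and note that the thresholding step has zero gradient everywhere, so a loss like this cannot drive training by itself:
lookup = tf.constant([0., 1., 1., 0., 1., 0., 0., 1.,
                      1., 0., 0., 1., 0., 1., 1., 0.])  # placeholder labels for the 16 bit patterns

def lookup_loss(y_true, y_pred):
    # y_pred: (batch, 4) expectation values in [-1, 1]
    bits = tf.cast(y_pred >= 0, tf.float32)                            # binarize
    index = tf.cast(tf.reduce_sum(bits * [8., 4., 2., 1.], axis=1), tf.int32)
    labelled = tf.gather(lookup, index)                                # table lookup
    return tf.reduce_mean((labelled - tf.cast(y_true, tf.float32)) ** 2)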
Another thing I will also mention is that if you want JUST the sum of these Pauli Operators in cirq that you have term by term in the list [cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]), cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])] and all you care about is the final sum of these expectations, you could just as easily do:
my_operator = sum([cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]),
                   cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])])
print(my_operator)
Which should give something like:
cirq.PauliSum(cirq.LinearDict({frozenset({(cirq.GridQubit(0, 0), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 1), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 2), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 3), cirq.Z)}): (1+0j)}))
Which is also compatible as a readout operation in the PQC layer. Lastly, I would recommend reading through some of the snippets and examples here:
https://www.tensorflow.org/quantum/api_docs/python/tfq/layers/PQC
and here:
https://www.tensorflow.org/quantum/api_docs/python/tfq/layers/Expectation
Which give a pretty good description of how the input and output signatures of the functions look as well as the shapes you can expect from them.

Understanding TensorBoard (weight) histograms

It is really straightforward to see and understand the scalar values in TensorBoard. However, it's not clear how to understand histogram graphs.
For example, here are the histograms of my network weights.
(After fixing a bug thanks to sunside)
What is the best way to interpret these? Layer 1 weights look mostly flat; what does this mean?
I added the network construction code here.
X = tf.placeholder(tf.float32, [None, input_size], name="input_x")
x_image = tf.reshape(X, [-1, 6, 10, 1])
tf.summary.image('input', x_image, 4)

# First layer of weights
with tf.name_scope("layer1"):
    W1 = tf.get_variable("W1", shape=[input_size, hidden_layer_neurons],
                         initializer=tf.contrib.layers.xavier_initializer())
    layer1 = tf.matmul(X, W1)
    layer1_act = tf.nn.tanh(layer1)
    tf.summary.histogram("weights", W1)
    tf.summary.histogram("layer", layer1)
    tf.summary.histogram("activations", layer1_act)

# Second layer of weights
with tf.name_scope("layer2"):
    W2 = tf.get_variable("W2", shape=[hidden_layer_neurons, hidden_layer_neurons],
                         initializer=tf.contrib.layers.xavier_initializer())
    layer2 = tf.matmul(layer1_act, W2)
    layer2_act = tf.nn.tanh(layer2)
    tf.summary.histogram("weights", W2)
    tf.summary.histogram("layer", layer2)
    tf.summary.histogram("activations", layer2_act)

# Third layer of weights
with tf.name_scope("layer3"):
    W3 = tf.get_variable("W3", shape=[hidden_layer_neurons, hidden_layer_neurons],
                         initializer=tf.contrib.layers.xavier_initializer())
    layer3 = tf.matmul(layer2_act, W3)
    layer3_act = tf.nn.tanh(layer3)
    tf.summary.histogram("weights", W3)
    tf.summary.histogram("layer", layer3)
    tf.summary.histogram("activations", layer3_act)

# Fourth layer of weights
with tf.name_scope("layer4"):
    W4 = tf.get_variable("W4", shape=[hidden_layer_neurons, output_size],
                         initializer=tf.contrib.layers.xavier_initializer())
    Qpred = tf.nn.softmax(tf.matmul(layer3_act, W4))  # Bug fixed: Qpred = tf.nn.softmax(tf.matmul(layer3, W4))
    tf.summary.histogram("weights", W4)
    tf.summary.histogram("Qpred", Qpred)

# We need to define the parts of the network needed for learning a policy
Y = tf.placeholder(tf.float32, [None, output_size], name="input_y")
advantages = tf.placeholder(tf.float32, name="reward_signal")

# Loss function
# Sum (Ai*logp(yi|xi))
log_lik = -Y * tf.log(Qpred)
loss = tf.reduce_mean(tf.reduce_sum(log_lik * advantages, axis=1))
tf.summary.scalar("Q", tf.reduce_mean(Qpred))
tf.summary.scalar("Y", tf.reduce_mean(Y))
tf.summary.scalar("log_likelihood", tf.reduce_mean(log_lik))
tf.summary.scalar("loss", loss)

# Learning
train = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
It appears that the network hasn't learned anything in layers one to three. The last layer does change, so that means that either there may be something wrong with the gradients (if you're tampering with them manually), you're constraining learning to the last layer by optimizing only its weights, or the last layer really 'eats up' all the error. It could also be that only the biases are learned. The network appears to learn something though, but it might not be using its full potential. More context would be needed here, but playing around with the learning rate (e.g. using a smaller one) might be worth a shot.
In general, histograms display the number of occurrences of a value relative to other values. Simply speaking, if the possible values are in a range of 0..9 and you see a spike of amount 10 on the value 0, this means that 10 inputs assume the value 0; in contrast, if the histogram shows a plateau of 1 for all values of 0..9, it means that for 10 inputs, each possible value 0..9 occurs exactly once.
You can also use histograms to visualize probability distributions when you normalize all histogram values by their total sum; if you do that, you'll intuitively obtain the likelihood with which a certain value (on the x axis) will appear (compared to other inputs).
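For instance, a tiny sketch of that normalization (the counts are made up):
import numpy as np

counts = np.array([10, 25, 40, 25, 10], dtype=np.float64)  # per-bucket occurrence counts
probabilities = counts / counts.sum()                       # sums to 1.0: one probability per bucket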
Now for layer1/weights, the plateau means that:
most of the weights are in the range of -0.15 to 0.15
it is (mostly) equally likely for a weight to have any of these values, i.e. they are (almost) uniformly distributed
Said differently, almost the same number of weights have the values -0.15, 0.0, 0.15 and everything in between. There are some weights having slightly smaller or higher values.
So in short, this simply looks like the weights have been initialized using a uniform distribution with zero mean and value range -0.15..0.15 ... give or take. If you do indeed use uniform initialization, then this is typical when the network has not been trained yet.
In comparison, layer1/activations forms a bell curve (gaussian)-like shape: The values are centered around a specific value, in this case 0, but they may also be greater or smaller than that (equally likely so, since it's symmetric). Most values appear close around the mean of 0, but values do range from -0.8 to 0.8.
I assume that the layer1/activations is taken as the distribution over all layer outputs in a batch. You can see that the values do change over time.
The layer 4 histogram doesn't tell me anything specific. From the shape, it's just showing that some weight values around -0.1, 0.05 and 0.25 tend to occur with a higher probability; a reason could be that different parts of each neuron there actually pick up the same information and are basically redundant. This can mean that you could actually use a smaller network, or that your network has the potential to learn more distinguishing features in order to prevent overfitting. These are just assumptions though.
Also, as already stated in the comments below, do add bias units. By leaving them out, you are forcefully constraining your network to a possibly invalid solution.
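A minimal sketch of what adding a bias to the first layer could look like (names follow the question's code; zero initialization is just one common choice):
with tf.name_scope("layer1"):
    W1 = tf.get_variable("W1", shape=[input_size, hidden_layer_neurons],
                         initializer=tf.contrib.layers.xavier_initializer())
    b1 = tf.Variable(tf.zeros([hidden_layer_neurons]), name="b1")
    layer1 = tf.matmul(X, W1) + b1
    layer1_act = tf.nn.tanh(layer1)
    tf.summary.histogram("biases", b1)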
Here I will indirectly explain the plot by giving a minimal example. The following code produces a simple histogram plot in TensorBoard.
from datetime import datetime
import tensorflow as tf

filename = datetime.now().strftime("%Y%m%d-%H%M%S")
fw = tf.summary.create_file_writer(f'logs/fit/{filename}')
with fw.as_default():
    for i in range(10):
        t = tf.random.uniform((2, 2), maxval=1000)
        tf.summary.histogram(
            "train/hist",
            t,
            step=i
        )
        print(t)
We see that generating a 2x2 matrix with a maximum value of 1000 will produce values from 0 to 1000. To show how these tensors might look, I am putting a log of a few of them here.
tf.Tensor(
[[398.65747 939.9828 ]
[942.4269 59.790222]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[869.5309 980.9699 ]
[149.97845 454.524 ]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[967.5063 100.77594 ]
[ 47.620544 482.77008 ]], shape=(2, 2), dtype=float32)
We logged to TensorBoard 10 times. To the right of the plot, a timeline is generated to indicate timesteps. The depth of the histogram indicates which values are new: the lighter/front values are newer and the darker/far values are older.
Values are gathered into buckets, which are indicated by the triangle structures. The x-axis indicates the range of values where the bucket lies.

Dynamic tensor shape for tensorflow RNN

I'm trying a very simple example for tensorflow RNN.
In that example, I use dynamic rnn. The code is as follows:
data = tf.placeholder(tf.float32, [None, 10,1]) #Number of examples, number of input, dimension of each input
target = tf.placeholder(tf.float32, [None, 11])
num_hidden = 24
cell = tf.nn.rnn_cell.LSTMCell(num_hidden,state_is_tuple=True)
val, _ = tf.nn.dynamic_rnn(cell, data, dtype=tf.float32)
val = tf.transpose(val, [1, 0, 2])
last = tf.gather(val, int(val.get_shape()[0]) - 1)
weight = tf.Variable(tf.truncated_normal([num_hidden, int(target.get_shape()[1])]))
bias = tf.Variable(tf.constant(0.1, shape=[target.get_shape()[1]]))
prediction = tf.nn.softmax(tf.matmul(last, weight) + bias)
cross_entropy = -tf.reduce_sum(target * tf.log(tf.clip_by_value(prediction,1e-10,1.0)))
optimizer = tf.train.AdamOptimizer()
minimize = optimizer.minimize(cross_entropy)
mistakes = tf.not_equal(tf.argmax(target, 1), tf.argmax(prediction, 1))
error = tf.reduce_mean(tf.cast(mistakes, tf.float32))
Actually, the code is taken from this tutorial.
The input to this RNN network is a sequence of binary numbers. Each number is put into an array. For example, a sequence has the format:
[[1],[0],[0],[1],[1],[0],[1],[1],[1],[0]]
The shape of the input is [None, 10, 1], which corresponds to batch size, sequence length and embedding size, respectively. Now, because dynamic_rnn can accept variable input shapes, I change the code as follows:
data = tf.placeholder(tf.float32, [None, None,1])
Basically, I want to use variable-length sequences (of course same length for all sequences in the same batch, but different between batches). However, it throws the error:
Traceback (most recent call last):
File "rnn-lstm-variable-length.py", line 48, in <module>
last = tf.gather(val, int(val.get_shape()[0]) - 1)
TypeError: __int__ returned non-int (type NoneType)
I understand that the second dimension is None, which cannot be used in get_shape()[0]. However, I believe that there must be a way to overcome this, because RNNs accept variable-length inputs in general.
How can I do it?
tl;dr: try using tf.train.batch(..., dynamic_pad=True) to batch your data.
@chris_anderson's comment is correct. Ultimately your network needs a dense matrix of numbers to work with, and there are a couple of strategies to convert variable-length data into hyperrectangles:
Pad all batches to a fixed size (e.g. assume a maximum length of say 500 items per input and every item in every batch is padded to 500). There is nothing dynamic about this strategy.
Apply padding per-batch to the length of the longest item in the batch (dynamic padding).
Bucket your input based on length and apply padding per-batch. This is the same as #2, but with less overall padding.
There are other strategies that you could use too.
To do this batching, you use:
tf.train.batch - by default it does no padding, you need to implement it yourself.
tf.train.batch(..., dynamic_pad=True) (a related sketch follows this list)
tf.contrib.training.bucket_by_sequence_length
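If it helps, here is a rough, runnable sketch of the same per-batch padding idea using the tf.data API instead (the data and shapes are made up):
import tensorflow as tf

sequences = [[1.0], [1.0, 0.0, 1.0], [1.0, 1.0]]  # three sequences of different lengths
ds = tf.data.Dataset.from_generator(lambda: sequences, tf.float32, tf.TensorShape([None]))
# padded_batch pads each batch to the length of its longest element (dynamic padding).
ds = ds.padded_batch(batch_size=2, padded_shapes=[None])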
I suspect you're also confused by the use of tf.nn.dynamic_rnn. It's important to note that the dynamic in dynamic_rnn refers to the way that TensorFlow unrolls the recurrent part of the network. In tf.nn.rnn, the recurrence is done statically in the graph (there is no internal loop; it's unrolled at graph construction time). In dynamic_rnn, however, TensorFlow uses tf.while_loop to iterate inside the graph at run time. Using dynamic padding requires dynamic unrolling, but the padding does not happen automatically.
tf.gather expects a tensor, so you can use tf.shape(val) to get a tensor, calculated at run-time, for the shape of val - e.g. tf.gather(val, tf.shape(val)[0] - 1)
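Putting that last point together with the question's code, the fix might look like this (a minimal sketch, same variable names):
data = tf.placeholder(tf.float32, [None, None, 1])   # variable sequence length
val, _ = tf.nn.dynamic_rnn(cell, data, dtype=tf.float32)
val = tf.transpose(val, [1, 0, 2])                   # (time, batch, hidden)
last = tf.gather(val, tf.shape(val)[0] - 1)          # last time step, resolved at run time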

Why can't tensorflow determine the shape of this expression?

I have the following expression which is giving me problems. I have defined the batch size as batch_size = tf.shape(input_tensor)[0], which dynamically determines the size of the batch based on the size of the input tensor to the model. I have used it elsewhere in the code without issue. What I am confused about is that when I run the following line of code, it says the shape is (?, ?); I would expect it to be (?, 128) because it knows the second dimension.
print(tf.zeros((batch_size, 128)).get_shape())
I want to know the shape since I am trying to do the following and I am getting an error.
rnn_input = tf.reduce_sum(w * decoder_input, 1)
last_out = decoder_outputs[t - 1] if t else tf.zeros((batch_size, 128))
rnn_input = tf.concat(1, (rnn_input, last_out))
This code needs to set last_out to zero on the first time step.
Here is the error ValueError: Linear expects shape[1] of arguments: [[None, None], [None, 1024]]
I am doing something similar when I determine my initial state vector for the RNNs.
state = tf.zeros((batch_size, decoder_multi_rnn.state_size), tf.float32)
I also get (?, ?) when I try to print the size of state but it does not really throw any exceptions when I try to use it.
You are mixing static shapes and dynamic shapes. The static shape is what you get from tensor.get_shape(), which is a best-effort attempt to obtain the shape, while the dynamic shape comes from sess.run(tf.shape(tensor)) and is always defined.
To be more precise, tf.shape(tensor) creates an op in the graph that will produce the shape tensor on the run call. If you do aop = tf.shape(tensor)[0], there's some magic through _SliceHelper that adds extra ops that will extract the first element of the shape tensor on the run call.
This means that myval = tf.zeros((aop, 128)) has to run aop to obtain the dimensions, and that means the first dimension of myval is undefined until you issue the run call. I.e., your run call could look like sess.run(myval, feed_dict={aop: 2}), where feed_dict overrides aop with 2. Hence static shape inference reports ? for that dimension.
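A small sketch of that distinction (the variable names are mine):
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 128])
print(x.get_shape())  # static shape: (?, 128)
with tf.Session() as sess:
    # the dynamic shape only exists once a concrete value is fed on the run call
    print(sess.run(tf.shape(x), feed_dict={x: np.zeros((3, 128))}))  # [  3 128]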
(EDIT: I rewrote my answer, as what I wrote before was not to the point.)
The quick fix to your issue is to use set_shape() to update the static (inferred) shape of the Tensor:
input_tensor = tf.placeholder(tf.float32, [None, 32])
batch_size = tf.shape(input_tensor)[0]
res = tf.zeros((batch_size, 128))
print res.get_shape() # prints (?, ?) WHEREAS one could expect (?, 128)
res.set_shape([None, 128])
print res.get_shape() # prints (?, 128)
As for why TensorFlow loses the information about the second dimension being 128, I don't really know.
Maybe @Yaroslav will be able to answer.
EDIT:
The incorrect behavior was corrected following this issue.