I've searched everywhere but couldn't find anything. It looks so weird that nobody have already encountered the same problem as I... Let me explain:
I've trained a Tensorflow 2 custom model. During the training I have used set_shape((None, 320, 320, 14)) so that Tensorflow knows the shape (It couldn't infer it for whatever reason... -_-").
I have also saved my custom model at every 100 epochs using:
model.save(os.path.join('models', 'pb', FLAGS.task_name + '-%i' % epoch))
So for the 100th epoch I will have a folder models/pb/my_name-100 that contains
assets
variables
saved_model.pb
Now, at inference time, I just want to load the model (without all the code). So I have created another piece of code that only loads the model and make a prediction... A basic template looks like:
class NeuralNetwork:
def __init__(self, model):
self.model = tf.keras.models.load_model(model)
def predict(self, input_tensor):
pred = self.model(input_tensor[None, ...])
return pred[0]
Where input_tensor is of size (H, W, 14) and so input_tensor[None, ...] is of size:
(None, H, W, 14).
The problem is that, because I have set the shape during training to be (None, 320, 320, 14)... This stupid Tensorflow expects the input to be (None, 320, 320, 14) -_-"!!!. My Neural Network is a fully convolutional neural network, so I really don't care about the input shape. I set it to be (320, 320, 14) during training for memory reason...
During prediction I'd like to be able to do prediction on any kind of shape.
Obviously, I could do a preprocessing function that extracts patch of size (320, 320) from the input image and tiles them. So for example my input_tensor could be of size (30, 320, 320, 14)
And then after the prediction, I could reconstruct the image from the tiles... But I don't want to do that.
Firstly because It takes a bit of time to create the tiles and reconstruct the image from the tile
Secondly because the result will be a bit off due to 0 padding in the convolution. Which means that I need to create overlapping tiles and average the results on the overlapping part to avoid having artifacts during the reconstruction
So my question is simple:
How can I tell tensorflow to accept any width and height at inference time? Omg it's so bothersome. I can't believe that there are not an easy options available to do that
I answer my own question.
Unfortunately, my answer will not satisfy everybody. There are so many convoluted things happening in TF (Not to mention that when you search for help, most of it concern the 1st API... -_-").
Anyway, here is the "solution"
In my Neural Network, I have implemented a custom layer to mimic the pytorch function AdaptiveAvgPool2D. My implementation actually use tf.nn_avg_pool under the hood and need to dynamically compute the kernel_size as well as the stride. Here is my code, for reference:
class AdaptiveAvgPool2d(layers.Layer):
def __init__(self, output_shape, data_format='channels_last'):
super(AdaptiveAvgPool2d, self).__init__(autocast=False)
assert data_format in {'channels_last', 'channels_first'}, \
'data format parameter must be in {channels_last, channels_first}'
if isinstance(output_shape, tuple):
self.out_h = output_shape[0]
self.out_w = output_shape[1]
elif isinstance(output_shape, int):
self.out_h = output_shape
self.out_w = output_shape
else:
raise RuntimeError(f"""output_shape should be an Integer or a Tuple2""")
self.data_format = data_format
def call(self, inputs, mask=None):
# input_shape = tf.shape(inputs)
input_shape = inputs.get_shape().as_list()
if self.data_format == 'channels_last':
h_idx, w_idx = 1, 2
else: # can use else instead of elif due to assert in __init__
h_idx, w_idx = 2, 3
stride_h = input_shape[h_idx] // self.out_h
stride_w = input_shape[w_idx] // self.out_w
k_size_h = stride_h + input_shape[h_idx] % self.out_h
k_size_w = stride_w + input_shape[w_idx] % self.out_w
pool = tf.nn.avg_pool(
inputs,
ksize=[k_size_h, k_size_w],
strides=[stride_h, stride_w],
padding='VALID',
data_format='NHWC' if self.data_format == 'channels_last' else 'NCHW')
return pool
The problem is that, I'm using inputs.get_shape().as_list()
to recover int values and not a Tensor(..., type=int). Indeed, the tf.nn.avg_pool accept a list of Int for both the ksize and the strides parameters...
Put it differently, I couldn't use tf.shape(inputs) because it returns a Tensor(..., type=int) and there is no way to recover a int from a Tensor beside by evaluating it...
The way I have implemented my function worked just fine, the problem is that, Tensorflow infers the size under the hood and save the size of all the tensors inside the .pb file when I save it.
Indeed, you can easily open a .pb file with any TextEditor (SublimeText) and see by ourself the expected TensorShape. In my case it was `TensorShape: [null, 320, 320, 14]
So, using set_shape((None, None, None, 14)) instead of set_shape((None, 320, 320, 14)) or nothing actually doesn't change the problem...
The problem is that the average pooling layer does not accept a dynamic kernel size/strides....
I then realizes that there is a tensorflow function for this actually tfa.layers.AdaptiveAveragePooling2D. So, I might just go with it and it will be fine, right?
Well not exactly, under the hood, this tensorflow function use other tf.function like tf.split. The problem with tf.split is that, if your dimension you want to split is of size X and you want to output a tensor of size Y. If X % Y != 0, when tf.split will throw an error... While Pytorch is much more robust and handle cases were X % Y != 0.
Put it differently, it means that, in order for me to use tfa.layers.AdaptiveAveragePooling2D, I need to be sure that the size of the tensor received by this function is divisible by the scalar I pass to the function.
For example, in my case,
The input image are of size: (320, 320, whatever), the input tensor received by tfa.layers.AdaptiveAveragePooling2D is: (40, 40, whatever).
So it means, that the spatial dimension of my tensor was divided by 8 during training. In order for it to work, I should choose a size that can divide 40. Let's say I choose 5.
It means that during the prediction, my neural network will work if the input dimension that the tfa.layers.AdaptiveAveragePooling2D receives is also divisible by 5. But we already know that my input image is 8x bigger then the tensor receives by tfa.layers.AdaptiveAveragePooling2D, so it means that, at prediction times, I can use whatever image size as long as:
H % (8*5) == 0 and W % (8 * 5) == 0
Where H and W are respectively the height and the width of my input image.
To do that, we can just implement a simple function:
new_W = W + W % 40 (40 in this example...)
new_H = H + H % 40 (40 in this example...)
This function will stretch a bit the image but not to much so that it should be just fine.
Summing up:
My AdaptivePooling uses static shape, but I cannot do otherwise since it uses tf.nn.avg_pool under the hood that doesn't accept dynamic shape
tfa.layers.AdaptiveAveragePooling2D is a work around, but because it relies on tf.split that is not robust to inexact divide, it is not perfect either
The basic solution is to use tfa.layers.AdaptiveAveragePooling2D and create a preprocessing function before calling the prediction so that the tensor will work just fine with tf.split constraint
Finally, this is not a good solution either. Because, during training if I receive a tensor of size (40, 40) and want an avg output of size (5, 5), it means that I basically average (8, 8) features to retrieve one features.
The problem is that, if I do that during inference time on a bigger image, I will receive a bigger tensor. Let's say: (100, 200). But since my output will always be (5, 5), it means that I will, this time, average (20, 40) features to retrieve one feature...
Because of this difference between training and inference if I go with this way of doing, inferring on a bigger image might lead to inconsistent results
In my case, the way to go is to batch the images as I have explained in my first post...
Hope it will help some of you.
Related
I'm working with padded sequences of maximum length 50. I have two types of sequence data:
1) A sequence, seq1, of integers (1-100) that correspond to event types (e.g. [3,6,3,1,45,45....3]
2) A sequence, seq2, of integers representing time, in minutes, from the last event in seq1. So the last element is zero, by definition. So for example [100, 96, 96, 45, 44, 12,... 0]. seq1 and seq2 are the same length, 50.
I'm trying to run the LSTM primarily on the event/seq1 data, but have the time/seq2 strongly influence the forget gate within the LSTM. The reason for this is I want the LSTM to tend to really penalize older events and be more likely to forget them. I was thinking about multiplying the forget weight by the inverse of the current value of the time/seq2 sequence. Or maybe (1/seq2_element + 1), to handle cases where it's zero minutes.
I see in the keras code (LSTMCell class) where the change would have to be:
f = self.recurrent_activation(x_f + K.dot(h_tm1_f,self.recurrent_kernel_f))
So I need to modify keras' LSTM code to accept multiple inputs. As an initial test, within the LSTMCell class, I changed the call function to look like this:
def call(self, inputs, states, training=None):
time_input = inputs[1]
inputs = inputs[0]
So that it can handle two inputs given as a list.
When I try running the model with the Functional API:
# Input 1: event type sequences
# Take the event integer sequences, run them through an embedding layer to get float vectors, then run through LSTM
main_input = Input(shape =(max_seq_length,), dtype = 'int32', name = 'main_input')
x = Embedding(output_dim = embedding_length, input_dim = num_unique_event_symbols, input_length = max_seq_length, mask_zero=True)(main_input)
## Input 2: time vectors
auxiliary_input = Input(shape=(max_seq_length,1), dtype='float32', name='aux_input')
m = Masking(mask_value = 99999999.0)(auxiliary_input)
lstm_out = LSTM(32)(x, time_vector = m)
# Auxiliary loss here from first input
auxiliary_output = Dense(1, activation='sigmoid', name='aux_output')(lstm_out)
# An abitrary number of dense, hidden layers here
x = Dense(64, activation='relu')(lstm_out)
# The main output node
main_output = Dense(1, activation='sigmoid', name='main_output')(x)
## Compile and fit the model
model = Model(inputs=[main_input, auxiliary_input], outputs=[main_output, auxiliary_output])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'], loss_weights=[1., 0.2])
print(model.summary())
np.random.seed(21)
model.fit([train_X1, train_X2], [train_Y, train_Y], epochs=1, batch_size=200)
However, I get the following error:
An `initial_state` was passed that is not compatible with `cell.state_size`. Received `state_spec`=[InputSpec(shape=(None, 50, 1), ndim=3)]; however `cell.state_size` is (32, 32)
Any advice?
You can't pass a list of inputs to default recurrent layers in Keras. The input_spec is fixed and the recurrent code is implemented based on single tensor input also pointed out in the documentation, ie it doesn't magically iterate over 2 inputs of same timesteps and pass that to the cell. This is partly because of how the iterations are optimised and assumptions made if the network is unrolled etc.
If you like 2 inputs, you can pass constants (doc) to the cell which will pass the tensor as is. This is mainly to implement attention models in the future. So 1 input will iterate over timesteps while the other will not. If you really like 2 inputs to be iterated like a zip() in python, you will have to implement a custom layer.
I would like to throw in a different ideas here. They don't require you to modify the Keras code.
After the embedding layer of the event types, stack the embeddings with the elapsed time. The Keras function is keras.layers.Concatenate(axis=-1). Imagine this, a single even type is mapped to a n dimensional vector by the embedding layer. You just add the elapsed time as one more dimension after the embedding so that it becomes a n+1 vector.
Another idea, sort of related to your problem/question and may help here, is 1D convolution. The convolution can happen right after the concatenated embeddings. The intuition for applying convolution to event types and elapsed time is actually 1x1 convolution. In such a way that you linearly combine the two together and the parameters are trained. Note in terms of convolution, the dimensions of the vectors are called channels. Of course, you can also convolve more than 1 event at a step. Just try it. It may or may not help.
I have the following situation:
I want to deploy a face detector model using Tensorflow Serving: https://www.tensorflow.org/serving/.
In Tensorflow Serving, there is a command line option called --enable_batching. This causes the model server to automatically batch the requests to maximize throughput. I want this to be enabled.
My model takes in a set of images (called images), which is a tensor of shape (batch_size, 640, 480, 3).
The model has two outputs: (number_of_faces, 4) and (number_of_faces,). The first output will be called faces. The last output, which we can call partitions is the index in the original batch for the corresponding face. For example, if I pass in a batch of 4 images and get 7 faces, then I might have this tensor as [0, 0, 1, 2, 2, 2, 3]. The first two faces correspond to the first image, the third face for the second image, the 3rd image has 3 faces, etc.
My issue is this:
In order for the --enable_batching flag to work, the output from my model needs to have the 0th dimension the same as the input. That is, I need a tensor with the following shape: (batch_size, ...). I suppose this is so that the model server can know which grpc connection to send each output in the batch towards.
What I want to do is to convert my output tensor from the face detector from this shape (number_of_faces, 4) to this shape (batch_size, None, 4). That is, an array of batches, where each batch can have a variable number of faces (e.g. one image in the batch may have no faces, and another might have 3).
What I tried:
tf.dynamic_partition. On the surface, this function looks perfect. However, I ran into difficulties after realizing that the num_partitions parameter cannot be a tensor, only an integer:
tensorflow_serving_output = tf.dynamic_partition(faces, partitions, batch_size)
If the tf.dynamic_partition function were to accept tensor values for num_partition, then it seems that my problem would be solved. However, I am back to square one since this is not the case.
Thank you all for your help! Let me know if anything is unclear
P.S. Here is a visual representation of the intended process:
I ended up finding a solution to this using TensorArray and tf.while_loop:
def batch_reconstructor(tensor, partitions, batch_size):
"""
Take a tensor of shape (batch_size, 4) and a 1-D partitions tensor as well as the scalar batch_size
And reconstruct a TensorArray that preserves the original batching
From the partitions, we can get the maximum amount of tensors within a batch. This will inform the padding we need to use.
Params:
- tensor: The tensor to convert to a batch
- partitions: A list of batch indices. The tensor at position i corresponds to batch # partitions[i]
"""
tfarr = tf.TensorArray(tf.int32, size=batch_size, infer_shape=False)
_, _, count = tf.unique_with_counts(partitions)
maximum_tensor_size = tf.cast(tf.reduce_max(count), tf.int32)
padding_tensor_index = tf.cast(tf.gather(tf.shape(tensor), 0), tf.int32)
padding_tensor = tf.expand_dims(tf.cast(tf.fill([4], -1), tf.float32), axis=0) # fill with [-1, -1, -1, -1]
tensor = tf.concat([tensor, padding_tensor], axis=0)
def cond(i, acc):
return tf.less(i, batch_size)
def body(i, acc):
partition_indices = tf.reshape(tf.cast(tf.where(tf.equal(partitions, i)), tf.int32), [-1])
partition_size = tf.gather(tf.shape(partition_indices), 0)
# concat the partition_indices with padding_size * padding_tensor_index
padding_size = tf.subtract(maximum_tensor_size, partition_size)
padding_indices = tf.reshape(tf.fill([padding_size], padding_tensor_index), [-1])
partition_indices = tf.concat([partition_indices, padding_indices], axis=0)
return (tf.add(i, 1), acc.write(i, tf.gather(tensor, partition_indices)))
_, reconstructed = tf.while_loop(
cond,
body,
(tf.constant(0), tfarr),
name='batch_reconstructor'
)
reconstructed = reconstructed.stack()
return reconstructed
I have a TensorFlow CNN model that is performing well and we would like to implement this model in hardware; i.e., an FPGA. It's a relatively small network but it would be ideal if it were smaller. With that goal, I've examined the kernels and find that there are some where the weights are quite strong and there are others that aren't doing much at all (the kernel values are all close to zero). This occurs specifically in layer 2, corresponding to the tf.Variable() named, "W_conv2". W_conv2 has shape [3, 3, 32, 32]. I would like to freeze/lock the values of W_conv2[:, :, 29, 13] and set them to zero so that the rest of the network can be trained to compensate. Setting the values of this kernel to zero effectively removes/prunes the kernel from the hardware implementation thus achieving the goal stated above.
I have found similar questions with suggestions that generally revolve around one of two approaches;
Suggestion #1:
tf.Variable(some_initial_value, trainable = False)
Implementing this suggestion freezes the entire variable. I want to freeze just a slice, specifically W_conv2[:, :, 29, 13].
Suggestion #2:
Optimizer = tf.train.RMSPropOptimizer(0.001).minimize(loss, var_list)
Again, implementing this suggestion does not allow the use of slices. For instance, if I try the inverse of my stated goal (optimize only a single kernel of a single variable) as follows:
Optimizer = tf.train.RMSPropOptimizer(0.001).minimize(loss, var_list = W_conv2[:,:,0,0]))
I get the following error:
NotImplementedError: ('Trying to optimize unsupported type ', <tf.Tensor 'strided_slice_2228:0' shape=(3, 3) dtype=float32>)
Slicing tf.Variables() isn't possible in the way that I've tried it here. The only thing that I've tried which comes close to doing what I want is using .assign() but this is extremely inefficient, cumbersome, and caveman-like as I've implemented it as follows (after the model is trained):
for _ in range(10000):
# get a new batch of data
# reset the values of W_conv2[:,:,29,13]=0 each time through
for m in range(3):
for n in range(3):
assign_op = W_conv2[m,n,29,13].assign(0)
sess.run(assign_op)
# re-train the rest of the network
_, loss_val = sess.run([optimizer, loss], feed_dict = {
dict_stuff_here
})
print(loss_val)
The model was started in Keras then moved to TensorFlow since Keras didn't seem to have a mechanism to achieve the desired results. I'm starting to think that TensorFlow doesn't allow for pruning but find this hard to believe; it just needs the correct implementation.
A possible approach is to initialize these specific weights with zeros, and modify the minimization process such that gradients won't be applied to them. It can be done by replacing the call to minimize() with something like:
W_conv2_weights = np.ones((3, 3, 32, 32))
W_conv2_weights[:, :, 29, 13] = 0
W_conv2_weights_const = tf.constant(W_conv2_weights)
optimizer = tf.train.RMSPropOptimizer(0.001)
W_conv2_orig_grads = tf.gradients(loss, W_conv2)
W_conv2_grads = tf.multiply(W_conv2_weights_const, W_conv2_orig_grads)
W_conv2_train_op = optimizer.apply_gradients(zip(W_conv2_grads, W_conv2))
rest_grads = tf.gradients(loss, rest_of_vars)
rest_train_op = optimizer.apply_gradients(zip(rest_grads, rest_of_vars))
tf.group([rest_train_op, W_conv2_train_op])
I.e,
Preparing a constant Tensor for canceling the appropriate gradients
Compute gradients only for W_conv2, then multiply element-wise with the constant W_conv2_weights to zero the appropriate gradients and only then apply gradients.
Compute and apply gradients "normally" to the rest of the variables.
Group the 2 train ops to a single training op.
I didn't convert the weights by myself, instead I used vgg16_weights.npz from www(dot)cs(dot)toronto(dot)edu/~frossard/post/vgg16/. There, it is mentioned
We convert the Caffe weights publicly available in the author’s GitHub profile (gist(dot)github(dot)com/ksimonyan/211839e770f7b538e2d8#file-readme-md) using a specialized tool (github(dot)com/ethereon/caffe-tensorflow).
But, in that page, there is no validation code, so I made it referring to tensorflow MNIST and inception code.
How I create TFRecords of Imagenet
I use build_imagenet_data.py from inception. I changed the
label_index = 0 #originally label_index = 1
because inception use label_index 0 as background class (so in total there are 1001 classes). Caffe format doesn't use that as the number of output is 1000. I prefer to use TFRecord format as I will change process the weight and retrain.
How I load the weights
inference function taken from MNIST's mnist.py was modified so the Variable is taken from the vgg16_weights.npz
How I load the weights:
weights = np.load('/the_path/vgg16_weights.npz')
How I put the variable in conv1_1:
with tf.name_scope('conv1_1') as scope:
kernel = tf.Variable(tf.constant(weights['conv1_1_W']), name='weights')
conv = tf.nn.conv2d(images, kernel, [1, 1, 1, 1], padding='SAME')
biases = tf.Variable(tf.constant(weights['conv1_1_b']), name='biases')
out = tf.nn.bias_add(conv, biases)
conv1_1 = tf.nn.relu(out, name=scope)
sess.run(conv1_1)
How I read the TFRecords
I took inception's image_processing.py, dataset.py, and ImagenetData.py with no change. Then, I run inception's inception_eval.py evaluate function with changing in inference code and deleting the restoring moving variable from checkpoint (as I already restore manually in variable initialization). However, the accuracy is not same with the VGG-16 in caffe. Top-5 accuracy is around 9%.
Closing
What is the problem of this method? There are several part of code that I still don't understand though:
How TFReader move to the next batch of images after processing 1 batch of images? The output of inception's image_processing.py size is only the number of batch size. To be complete, this is the output based on documentation:
images: Images. 4D tensor of size [batch_size, FLAGS.image_size,
image_size, 3].
labels: 1-D integer Tensor of [FLAGS.batch_size].
Do I need softmax the logits before tf.in_top_k ? (Well, I don't think it is matter as the value sequence is same)
Thank you for the help. Sorry if the link is messy as I can only post 2 links in 1 post because of my reputation.
UPDATE
I tried myself by changing the caffe weight. Reverse the channel input dimension of conv1_1 (because caffe receive BGR, so the weight is for BGR instead of RGB in tensorflow) and get the same accuracy with the weight from website: around 9% in top-5.
I found out that there is no mean image subtraction in tensorflow inception's image_processing.py. I add mean subtraction (in eval_image function) with tf.reduce_mean and got 11% accuracy.
Then I tried to change the eval_image function with
# source: https://github.com/ethereon/caffe-tensorflow/blob/master/examples/imagenet/dataset.py
img_shape = tf.to_float(tf.shape(image)[:2])
min_length = tf.minimum(img_shape[0], img_shape[1])
new_shape = tf.to_int32((256 / min_length) * img_shape) #isotropic case
# new_shape = tf.pack([256,256]) #non isotropic case
image = tf.image.resize_images(image, [new_shape[0], new_shape[1]])
offset = tf.to_int32((new_shape - 224) / 2)
image = tf.slice(image, begin=tf.pack([offset[0], offset[1], 0]), size=tf.pack([224, 224, -1]))
# mean_subs_image = tf.reduce_mean(image,axis=[0,1],keep_dims=True)
return image - mean_subs_image
and I got 13%. Increased but still lack a lot. Seems it is one of the problem. I am not sure what is the other problems.
In general porting whole model weights across libraries will be hard. You pointed out some differences from caffe, but there could be others. It might be easier to retrain the model in TensorFlow.
I'm trying a very simple example for tensorflow RNN.
In that example, I use dynamic rnn. The code is as follows:
data = tf.placeholder(tf.float32, [None, 10,1]) #Number of examples, number of input, dimension of each input
target = tf.placeholder(tf.float32, [None, 11])
num_hidden = 24
cell = tf.nn.rnn_cell.LSTMCell(num_hidden,state_is_tuple=True)
val, _ = tf.nn.dynamic_rnn(cell, data, dtype=tf.float32)
val = tf.transpose(val, [1, 0, 2])
last = tf.gather(val, int(val.get_shape()[0]) - 1)
weight = tf.Variable(tf.truncated_normal([num_hidden, int(target.get_shape()[1])]))
bias = tf.Variable(tf.constant(0.1, shape=[target.get_shape()[1]]))
prediction = tf.nn.softmax(tf.matmul(last, weight) + bias)
cross_entropy = -tf.reduce_sum(target * tf.log(tf.clip_by_value(prediction,1e-10,1.0)))
optimizer = tf.train.AdamOptimizer()
minimize = optimizer.minimize(cross_entropy)
mistakes = tf.not_equal(tf.argmax(target, 1), tf.argmax(prediction, 1))
error = tf.reduce_mean(tf.cast(mistakes, tf.float32))
Actually, the code is taken from this tutorial.
The input to this RNN network is a sequence of binary numbers. Each number is put into an array. For example, a seuquence has format:
[[1],[0],[0],[1],[1],[0],[1],[1],[1],[0]]
The shape of the input is [None,10,1] which are batch size, sequence size and embedding size, respectively. Now because dynamic rnn can accept variable input shape, I change the code as follows:
data = tf.placeholder(tf.float32, [None, None,1])
Basically, I want to use variable-length sequences (of course same length for all sequences in the same batch, but different between batches). However, it throws the error:
Traceback (most recent call last):
File "rnn-lstm-variable-length.py", line 48, in <module>
last = tf.gather(val, int(val.get_shape()[0]) - 1)
TypeError: __int__ returned non-int (type NoneType)
I understand that the second dimension is None, which cannot be used in get_shape()[0]. However, I believe that there must be a way to overcome this because RNN accepts variable lenth inputs, in general.
How can I do it?
tl;dr: try using tf.batch(..., dynamic_pad=True) to batch your data.
#chris_anderson's comment is correct. Ultimately your network needs a dense matrix of numbers to work with and there are a couple of strategies to convert variable length data into hyperrectangles:
Pad all batches to a fixed size (e.g. assume a maximum length of say 500 items per input and every item in every batch is padded to 500). There is nothing dynamic about this strategy.
Apply padding per-batch to the length of the longest item in the batch (dynamic padding).
Bucket your input based on length and apply padding per-batch. This is the same as #2, but with less overall padding.
There are other strategies that you could use too.
To do this batching, you use:
tf.train.batch - by default it does no padding, you need to implement it yourself.
tf.train.batch(..., dynamic_pad=True)
tf.contrib.training.bucket_by_sequence_length
I suspect you're also confused by the use of tf.nn.dynamic_rnn. It's important to note that the dynamic in dynamic_rnn refers to the way that TensorFlow unrolls the recurrent part of the network. in tf.nn.rnn, the recurrence is done statically in the graph (there is no internal loop, it's unrolled at graph construction time). In dynamic_rnn however, TensorFlow uses tf.while_loop to iterate inside the graph at run time. To use dynamic padding, you need to use dynamic unrolling, but it does not do it automatically.
tf.gather expects a tensor, so you can use tf.shape(val) to get a tensor, calculated at run-time, for the shape of val - e.g. tf.gather(val, tf.shape(val)[0] - 1)