RETURNN Custom Layer Search Mode Assertion Error - tensorflow

I've implemented a custom RETURNN layer (HMM Factorization), which works as intended during training, but throws an assertion error when used in search mode. The output of the layer is identical to that of a softmax layer.
Here's the config that was used: transformer + HMM Factorization
This was tested using the latest version of RETURNN.
The exact line that fails is (code link):
assert fixed_seq_len is not None
Here's the full error log (too large to paste here)
Here's the training initialisation
Does anybody have any ideas what the error could be?
Thanks!

This is actually a bug in RETURNN. I created a pull request here which should fix it, and I have now merged it.
The problem was not with your custom layer, but rather with a layer inside your RecLayer, which was actually totally independent, i.e. this one:
'encoder_int': {'activation': None,
                'class': 'linear',
                'from': ['base:encoder'],
                'n_out': 1000,
                'with_bias': False}
It just depends on one base layer ("base:encoder"), nothing else. So RETURNN (correctly) optimizes this layer out of the recurrent loop, because it is independent of the loop.
However, it then sees that you are accessing this layer inside the loop, and as this is a loop over time, it assumes that the loop is over the time dimension of "base:encoder". It then tries to unroll "base:encoder" (TensorArray.unroll) given the seq len of the rec layer, but this fails because at that point it does not know the seq len of the rec layer.
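A rough TensorFlow sketch of why the unroll needs a known sequence length (TF1-style; the names here are illustrative, not RETURNN's actual code):

import tensorflow as tf

# Stand-in for the "base:encoder" output: (batch, time, feat), dynamic time.
encoder = tf.placeholder(tf.float32, [None, None, 512])
# Feeding a loop-invariant tensor back into a rec loop per time step means
# unstacking it into a TensorArray along the loop's time axis, which
# requires that axis' length to be known at this point:
seq_len = tf.shape(encoder)[1]
ta = tf.TensorArray(tf.float32, size=seq_len)
ta = ta.unstack(tf.transpose(encoder, [1, 0, 2]))  # to time-major entries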
My fix now does a more advanced check of whether this assumption is correct, i.e. whether the loop is really over the same time dimension. The check is a bit fragile, though, and I am not sure it works correctly in all cases. However, I created a test case which reproduces your problem, and this is fixed now.

Related

Does `tf.data.Dataset.take()` return random sample?

Different calls of tf.data.Dataset.take() return different batches from the given dataset. Are those samples chosen randomly, or is there another mechanism at play?
This is all the more confusing given that the documentation makes no reference to the randomness of the sampling.
Most probably, you are using data.shuffle() before tf.data.Dataset.take().
Commenting that out should make the iterator behave as intended: it takes the same results over and over for each iterator run.
Alternatively, you may have used an API that shuffles automatically by default, such as image_dataset_from_directory:
shuffle: Whether to shuffle the data. Default: True.
If set to False, sorts the data in alphanumeric order.
You would have to explicitly set shuffle=False when creating the dataset.
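A minimal sketch of both behaviors (assuming TF 2.x):

import tensorflow as tf

ds = tf.data.Dataset.range(10)
print(list(ds.take(3).as_numpy_iterator()))        # deterministic: [0, 1, 2]

shuffled = ds.shuffle(buffer_size=10)
print(list(shuffled.take(3).as_numpy_iterator()))  # differs between runs

# For image_dataset_from_directory, disable its default shuffling:
# ds = tf.keras.utils.image_dataset_from_directory(path, shuffle=False)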
I am a newbie to this domain, but from what I have seen in my notebook, take() does pick random samples. For instance, in the image shown here, I had just called image_dataset_from_directory() before calling take(), so no shuffling preceded the take op, yet I still see different samples on every run. Please correct me if I am wrong; that will help my understanding as well.

CNTK Asymmetric padding warning

When creating a model in CNTK, with a convolutional layer, I get the following warning:
WARNING: Detected asymmetric padding issue with even kernel size and lowerPad (9) < higherPad (10) (i=2), cuDNN will not be able to produce correct result. Switch to reference engine (VERY SLOW).
I have tried increasing the kernel size from 4x4 to 5x5, so that the kernel size is not even, without result.
I have also tried adjusting lowerPad, upperPad (the parameter named in the docs), and higherPad (the parameter listed in the message).
Setting autoPadding=false does not affect this message.
Is it just a warning that I should ignore? The VERY SLOW part concerns me, as my models are already quite slow.
I figured this out, in case anyone else is interested in the answer.
I stated in the question that I tried setting "autopadding=false". This is the incorrect format for the autopadding parameter; it must actually be a set of boolean values, with the value corresponding to the InputChannels dimension set to false.
So the correct form of the parameter would be "autopadding=(true:true:false)", and everything works correctly.
You have a layer with lower pad 9 and upper pad 10 in the depth direction. Are you doing 3D convolution?

When should we use the place_pruned_graph config?

As the title says, I wonder when we should use the place_pruned_graph config in GraphOptions. What's the purpose of this config?
I'm not clear on the comment about this config:
// Only place the subgraphs that are run, rather than the entire graph.
//
// This is useful for interactive graph building, where one might
// produce graphs that cannot be placed during the debugging
// process. In particular, it allows the client to continue work in
// a session after adding a node to a graph whose placement
// constraints are unsatisfiable.
We know that TensorFlow will normally partition an entire graph into several subgraphs. And the following code from CreateGraphs of direct_session.cc normally takes the else branch (as far as I can see, I have never found a case that takes the if branch, so I don't know when we should trigger it):
if (options_.config.graph_options().place_pruned_graph()) {
  // Because we are placing pruned graphs, we need to create a
  // new SimpleGraphExecutionState for every new unseen graph,
  // and then place it.
  SimpleGraphExecutionStateOptions prune_options;
  prune_options.device_set = &device_set_;
  prune_options.session_options = &options_;
  prune_options.stateful_placements = stateful_placements_;
  TF_RETURN_IF_ERROR(SimpleGraphExecutionState::MakeForPrunedGraph(
      execution_state_->original_graph_def().library(), prune_options,
      execution_state_->original_graph_def(), subgraph_options,
      &temp_exec_state_holder, &client_graph));
  execution_state = temp_exec_state_holder.get();
} else {
  execution_state = execution_state_.get();
  TF_RETURN_IF_ERROR(
      execution_state->BuildGraph(subgraph_options, &client_graph));
}
The short answer? Never. The longer answer requires me to explain why this option exists at all.
So why does TensorFlow include this convoluted configuration option and logic to handle it? It's a historical accident that came about when tensorflow::DirectSession and tensorflow::GrpcSession had different internal implementations:
The tensorflow::GrpcSession used a single SimpleGraphExecutionState for the entire graph in a session. The net effect of this was that the placer—which is responsible for assigning devices to each node in the graph—would run before the graph was pruned.
The tensorflow::DirectSession originally used one SimpleGraphExecutionState for each pruned subgraph, with some special logic for sharing the placements of stateful nodes between invocations. Therefore the placer would run after the graph was pruned, and could make different decisions about where to place stateful nodes.
The benefit of the tensorflow::GrpcSession approach (place_pruned_graph = false) is that it takes into account all of the colocation constraints in the graph when running the placement algorithm, even if they don't occur in the subgraph being executed. For example, if you had an embedding matrix, and wanted to optimize it using the SparseApplyAdagrad op (which only has a CPU implementation), TensorFlow would figure out that the embedding matrix should be placed on CPU.
By contrast, if you specified no device for the embedding matrix and set place_pruned_graph = true, the matrix would (most likely) be placed on GPU when you ran its initializer, because all of the ops in the initialization subgraph would be runnable on GPU. And, since variables cannot move between devices, TensorFlow would not be able to issue the subgraph that ran SparseApplyAdagrad on the matrix. This was a real issue in the earliest version of TensorFlow.
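A hedged TF1-style sketch of that scenario (all names are illustrative):

import tensorflow as tf

# Embedding variable with no explicit device assignment.
embedding = tf.Variable(tf.random_uniform([10000, 128]))
ids = tf.placeholder(tf.int64, [None])
loss = tf.reduce_sum(tf.nn.embedding_lookup(embedding, ids))
# The sparse gradient from embedding_lookup makes the optimizer emit
# SparseApplyAdagrad (CPU-only), so whole-graph placement must pin the
# variable to CPU up front for the training step to be runnable.
train_op = tf.train.AdagradOptimizer(0.1).minimize(loss)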
So why support place_pruned_graph = true at all? It turns out that it is useful when using TensorFlow interactively. The place_pruned_graph = false option is unforgiving: once the graph for a session contains a node that cannot be placed, that session is useless, because the placement algorithm runs on the whole graph; it would fail every time it is invoked, and therefore no steps could run. When you use a tf.InteractiveSession, we assume that you are using a REPL (or Jupyter notebook) and that it's beneficial to allow you to continue after making such a mistake. Therefore, in a tf.InteractiveSession we set place_pruned_graph = true, so that you can continue to use the session after adding an unplaceable node (as long as you don't try to run that node in a pruned subgraph).
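For reference, a sketch of setting the option explicitly in TF1 (this mirrors what tf.InteractiveSession does internally):

import tensorflow as tf

config = tf.ConfigProto(
    graph_options=tf.GraphOptions(place_pruned_graph=True))
sess = tf.Session(config=config)
# ...or simply: sess = tf.InteractiveSession()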
There is probably a better approach than place_pruned_graph = true for interactive use, but we haven't investigated adding one. Suggestions are always welcome on the GitHub issues page.

Can I change Inv operation into Reciprocal in an existing graph in Tensorflow?

I am working on an image classification problem with tensorflow. I have 2 different CNNs trained separately (in fact 3 in total, but I will deal with the third later), for different tasks and on an AWS (Amazon) machine. One tells if there is text in the image and the other one tells if the image is safe for work or not. Now I want to use them in a single script on my computer, so that I can put an image as input and get the results of both networks as output.
I load the two graphs in a single tensorflow Session, using the import_meta_graph API with the import_scope argument, putting each subgraph in a separate scope. Then I just use the restore method of the created saver, giving it the common Session as argument.
Then, in order to run inference, I retrieve the placeholders and final output with graph=tf.get_default_graph() and my_var=graph.get_operation_by_name('name').outputs[0] before using it in sess.run (I think I could just have put 'name' in sess.run instead of fetching the output tensor and putting it in a variable, but this is not my problem).
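A condensed sketch of that loading pattern (checkpoint paths and op names are placeholders, not the actual ones):

import numpy as np
import tensorflow as tf

sess = tf.Session()
text_saver = tf.train.import_meta_graph('text_model.meta', import_scope='text')
nsfw_saver = tf.train.import_meta_graph('nsfw_model.meta', import_scope='nsfw')
text_saver.restore(sess, 'text_model.ckpt')
nsfw_saver.restore(sess, 'nsfw_model.ckpt')

graph = tf.get_default_graph()
text_in = graph.get_operation_by_name('text/input').outputs[0]
text_out = graph.get_operation_by_name('text/predictions').outputs[0]
my_image = np.zeros((1, 224, 224, 3), dtype=np.float32)  # placeholder input
print(sess.run(text_out, feed_dict={text_in: my_image}))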
My problem is that the text CNN works perfectly fine, but the nsfw detector always gives me the same output, no matter the input (even with np.zeros()). I have tried both separately, and it's the same story: text works but not nsfw. So I don't think the problem comes from using two networks simultaneously.
I also tried on the original AWS machine I trained it on, and this time the nsfw CNN worked perfectly.
Both networks are very similar. I checked on Tensorboard if everything was fine and I think it is ok. The differences are in the number of hidden units and the fact that I use batch normalization in the nsfw model and not in the text one. Now, why this title? I observed that I had a warning when running the nsfw model that I didn't have when using only the text model:
W tensorflow/core/framework/op_def_util.cc:332] Op Inv is deprecated. It will cease to work in GraphDef version 17. Use Reciprocal.
So I thought maybe this was the reason, everything else being equal. I checked my GraphDef version, which seems to be 11, so Inv should still work in theory. By the way, the AWS machine uses tensorflow version 0.10 and I use version 0.12.
I noticed that the text network only had one Inv operation (via a filtering on the names of the operations given by graph.get_operations()), and that the nsfw model had the same operation plus multiple Inv operations due to the batch normalization layers. As noted in the release notes, tf.inv has simply been renamed to tf.reciprocal, so I tried to change the names of the operations to Reciprocal with tf.group(), as proposed here, but it didn't work. I have seen that using tf.identity() and changing the name could also work, but from what I understand, tensorflow graphs are an append-only structure, so we can't really modify their operations (which seem to be immutable anyway).
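For reference, a minimal sketch of that kind of filtering (TF 0.x/1.x-style API):

import tensorflow as tf

graph = tf.get_default_graph()
inv_ops = [op for op in graph.get_operations() if op.type == 'Inv']
for op in inv_ops:
    print(op.name)  # e.g. names under 'gradients/...' scopes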
The thing is:
as I said, the Inv operation should still work in my GraphDef version;
this is only a warning;
the Inv operations only appear under name scopes that begin with 'gradients', so, from my understanding, they shouldn't be used for inference;
the text model also has an Inv operation.
For these reasons, I have serious doubts about my diagnosis. So my final questions are:
do you have another diagnosis?
if mine is correct, is it possible to replace Inv operations with Reciprocal operations, or do you have any other solution?
After a thorough examination of the output of relevant nodes, with the help of Tensorboard, I am now pretty certain that the renaming of Inv to Reciprocal has nothing to do with my problem.
It appears that the last batch normalization layer eliminates almost all variance in its output when the input varies. I will ask why elsewhere.

Has anyone managed to make Asynchronous advantage actor critic work with Mujoco experiments?

I'm using an open source version of an A3C implementation in Tensorflow which works reasonably well for Atari 2600 experiments. However, when I modify the network for Mujoco, as outlined in the paper, the network refuses to learn anything meaningful. Has anyone managed to make any open source implementation of A3C work with continuous domain problems, for example Mujoco?
I have implemented a continuous action version for Pendulum and it works well.
First, you build your neural network so that it outputs a mean (mu) and a standard deviation (sigma) for selecting an action.
The essential part of the continuous action case is to include a normal distribution. I'm using tensorflow, so the code looks like:
normal_dist = tf.contrib.distributions.Normal(mu, sigma)
log_prob = normal_dist.log_prob(action)        # log-probability of the taken action
exp_v = log_prob * td_error                    # weight by the TD error (advantage signal)
entropy = normal_dist.entropy()                # encourage exploration
exp_v = tf.reduce_sum(0.01 * entropy + exp_v)
actor_loss = -exp_v                            # maximize exp_v by minimizing its negative
When you want to sample an action, use the function tensorflow provides:
sampled_action = normal_dist.sample(1)
The full code of Pendulum can be found on my GitHub: https://github.com/MorvanZhou/tutorials/blob/master/Reinforcement_learning_TUT/10_A3C/A3C_continuous_action.py
I was hung up on this for a long time, hopefully this helps someone in my shoes:
Advantage Actor-critic in discrete spaces is easy: if your actor does better than you expect, increase the probability of doing that move. If it does worse, decrease it.
In continuous spaces though, how do you do this? The entire vector your policy function outputs is your move -- if you are on-policy, and you do better than expected, there's no way of saying "let's output that action even more!" because you're already outputting exactly that vector.
That's where Morvan's answer comes into play. Instead of outputting just an action, you output a mean and a std-dev for each output-feature. To choose an action, you pass your inputs in to create a mean/stddev for each output-feature, and then sample each feature from this normal distribution.
If you do well, you adjust the weights of your policy network to change the mean/stddev to encourage this action. If you do poorly, you do the opposite.
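A compact sketch of such a policy head (TF1-style; the dimensions assume Pendulum-v0, and the layer sizes are arbitrary):

import tensorflow as tf

obs_dim, act_dim = 3, 1  # Pendulum-v0: 3 observations, 1 action dimension
state = tf.placeholder(tf.float32, [None, obs_dim])
hidden = tf.layers.dense(state, 64, tf.nn.relu)
mu = tf.layers.dense(hidden, act_dim)                     # mean per output-feature
sigma = tf.layers.dense(hidden, act_dim, tf.nn.softplus)  # stddev, kept positive
dist = tf.contrib.distributions.Normal(mu, sigma)
action = dist.sample(1)  # one action vector sampled per state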