CNTK Asymmetric padding warning

When creating a model in CNTK, with a convolutional layer, I get the following warning:
WARNING: Detected asymmetric padding issue with even kernel size and lowerPad (9) < higherPad (10) (i=2), cuDNN will not be able to produce correct result. Switch to reference engine (VERY SLOW).
I have tried increasing the kernel size from 4x4 to 5x5 so that the kernel size is no longer even, but without success.
I have also tried adjusting lowerPad, upperPad (the parameter named in the docs), and higherPad (the parameter listed in the message).
Setting autoPadding=false does not affect this message.
Is it just a warning that I should ignore? The VERY SLOW part concerns me, as my models are already quite slow.

I figured this out if anyone else is interested in the answer.
I stated in the question that I tried setting "autopadding=false". This is the incorrect format for the autopadding parameter; it must actually be a set of boolean values, with the value corresponding to the InputChannels dimension being false.
So the correct form of the parameter would be "autopadding=(true:true:false)", and everything works correctly.

You have a layer with lower pad 9 and upper pad 10 in the depth direction. Are you doing 3D convolution?


Question about input_dim in keras embedding layer

From the documentation on tf.keras.layers.Embedding :
input_dim: Integer. Size of the vocabulary, i.e. maximum integer index + 1.
mask_zero: Boolean, whether or not the input value 0 is a special “padding” value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1).
I was reading this answer but I'm still confused. If my vocabulary size is n but they are encoded with index values from 1 to n (0 is left for padding), is input_dim equal to n or n+1?
If the inputs are padded with zeroes, what are the consequences of leaving mask_zero = False?
If mask_zero = True, based on the documentation, I would have to increment the answer from my first question by one? What is the expected behaviour if this was not done?
I am basically just trying to rephrase parts of the linked answer to make it a bit more understandable in the current context, and also to address your other subquestions (which technically should be their own questions, according to How to Ask).
It does not matter whether you actually use 0 for padding or not: Keras assumes that indexing starts from zero and has to "brace itself" for an input value of 0 in your data. Therefore, you need to choose input_dim = n+1, because you are essentially just adding one specific value to your vocabulary that you previously didn't consider.
I think it is out of scope for this question to discuss this in detail, but, depending on the exact model, with masking enabled the loss values on padded positions do not affect backpropagation. If you choose mask_zero = False, however, your model will essentially have to correctly predict padding on all those positions (where the padding then also affects the training).
This relates to my illustration: Essentially, you are adding a new vocabulary index. If you do not adjust your dimension, there will likely be an indexing error (out of range) for the vocabulary entry with the highest index (n). Otherwise, you would likely not notice any different behavior.
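To illustrate the indexing argument, here is a minimal sketch in plain Python. It is not Keras code: make_embedding_table and lookup are hypothetical stand-ins for the embedding matrix (one row per index 0 .. input_dim - 1) and its row lookup, and the sequence is made-up example data.

```python
def make_embedding_table(input_dim, output_dim=2):
    """Stand-in for an embedding matrix: one row per index 0 .. input_dim - 1."""
    return [[float(row)] * output_dim for row in range(input_dim)]


def lookup(table, token_ids):
    """Return the row for each id; raise IndexError for ids outside the table."""
    for token_id in token_ids:
        if not 0 <= token_id < len(table):
            raise IndexError("index %d is out of range for %d rows" % (token_id, len(table)))
    return [table[token_id] for token_id in token_ids]


n = 5                              # vocabulary size, words encoded as 1 .. n
padded_sequence = [3, 1, 5, 0, 0]  # 0 is used for padding

# input_dim = n + 1 covers the indices 0 .. n, so every lookup succeeds.
rows = lookup(make_embedding_table(n + 1), padded_sequence)

# input_dim = n only covers 0 .. n - 1, so the word with index n has no row.
try:
    lookup(make_embedding_table(n), padded_sequence)
    too_small_ok = True
except IndexError:
    too_small_ok = False

print(len(rows), too_small_ok)
```

With input_dim = n, only sequences that never contain the highest index would pass, which is why the problem can go unnoticed for a while.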

RETURNN Custom Layer Search Mode Assertion Error

I've implemented a custom RETURNN layer (HMM Factorization), which works as intended during training, but throws an assertion error when used in search mode. The output of the layer is identical to that of a softmax layer.
Here's the config that was used : transformer + HMM Factorization
This was tested using the latest version of RETURNN.
The exact line that fails is (code link):
assert fixed_seq_len is not None
Here's the full error log (too large to paste here)
Here's the training initialisation
Does anybody have any ideas what the error could be?
This is actually a bug in RETURNN. I created a pull request here which should fix it, and it is merged now.
The problem was not with your custom layer, but rather with a layer inside your RecLayer, which was actually totally independent, i.e. this one:
'encoder_int': {'activation': None,
                'class': 'linear',
                'from': ['base:encoder'],
                'n_out': 1000,
                'with_bias': False}
It depends on just one base layer ("base:encoder") and nothing else. So RETURNN (correctly) optimizes this layer out of the recurrent loop, because it is independent.
However, it then sees that you are accessing this layer inside the loop, and as this is a loop over time, it assumes that the loop is over the time dimension of "base:encoder". It then tries to unroll "base:encoder" (TensorArray.unroll) given the seq len of the rec layer, but fails because at this point it does not know the seq len of the rec layer.
My fix now does a more advanced check of whether this assumption is correct, i.e. that the loop is really over the same time dimension. The check is a bit fragile though, and I am not sure whether it works correctly in all cases. However, I created a test case which reproduces your problem, and this is fixed now.

TensorFlow shape checker

Unlike most programming languages, TensorFlow does not regard the shape of an array as part of the type. The downside of this is that, if you make a mistake and accidentally provide data of the wrong shape, it may silently give a wrong answer (see, e.g., "Slightly different shape converges to wrong number - why?"), which makes debugging difficult.
Does there exist a shape checker for TF? That is, a function or program that can analyze a graph (with sample feed_dict if need be) and raise the alarm if there is a shape mismatch?
TensorFlow does offer a shape-checking mechanism: the shape argument you can specify when declaring placeholders. By default, shape is None, which means a tensor of any shape can be fed. But if you do specify the shape when declaring your placeholders, TensorFlow will raise a shape error whenever data of an incorrect or conflicting shape is fed. For example, let's say I declared a placeholder named X and specified its shape argument:
X=tf.placeholder(dtype=tf.float32, shape=[None,256])
This means that the number of rows of X can vary, but the number of features will always be 256. If I now mistakenly feed data with, say, 1000 rows and 20 features, a shape error will be raised.
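The kind of comparison behind such a shape error can be sketched in plain Python. This is only an illustration of the idea, not TensorFlow's actual implementation; is_compatible and check_feed are hypothetical names, and None plays the role of an unknown dimension as in the placeholder above.

```python
def is_compatible(declared_shape, actual_shape):
    """True if actual_shape matches declared_shape, where None matches any size."""
    if len(declared_shape) != len(actual_shape):
        return False
    return all(d is None or d == a for d, a in zip(declared_shape, actual_shape))


def check_feed(declared_shape, actual_shape):
    """Raise ValueError when the fed shape conflicts with the declared one."""
    if not is_compatible(declared_shape, actual_shape):
        raise ValueError(
            "Cannot feed value of shape %r for a placeholder of shape %r"
            % (actual_shape, declared_shape)
        )


declared = (None, 256)              # any number of rows, exactly 256 features
check_feed(declared, (1000, 256))   # fine: the row count is free

try:
    check_feed(declared, (1000, 20))  # wrong number of features
except ValueError as err:
    print(err)
```

Note that a dimension declared as None is never checked, so the more dimensions you pin down, the more mistakes are caught.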
You can also print a tensor's static shape to inspect it:
print(test_tensor.get_shape())  # test_tensor is whatever your tensor's name is

Why initialize TensorFlow variables with a small stddev?

I have a question about why TensorFlow variables are initialized with a small stddev.
I guess many people run the MNIST example code from TensorFlow's beginner's guide.
Following it, the first layer's weights are initialized using truncated_normal with stddev 0.1.
I assumed that initializing them with a much larger value would eventually give the same result, i.e. the same accuracy.
But even when I increase the number of epochs, it doesn't work.
Does anybody know the reason?
original :
W_layer = tf.Variable(tf.truncated_normal([inp.get_shape()[1].value, size],stddev=0.1), name='w_'+name)
#result : (990, 0.93000001, 0.89719999)
modified :
W_layer = tf.Variable(tf.truncated_normal([inp.get_shape()[1].value, size],stddev=200), name='w_'+name)
#result : (99990, 0.1, 0.098000005)
The reason is that you want to keep the variances (or standard deviations) of all layers approximately the same, and sane. It has to do with the error backpropagation step of the learning process and the activation functions used.
In order to learn the network's weights, the backpropagation step requires knowledge of the network's gradient, a measure of how strongly each weight influences the final output; the variance of a layer's weights directly influences the propagation of gradients.
Say, for example, that the activation function is sigmoidal (e.g. tf.nn.sigmoid or tf.nn.tanh); this implies that all input values are squashed into a fixed output value range. For the sigmoid, it is the range 0..1, where essentially all inputs z > 4 give outputs very close to one, all inputs z < -4 give outputs very close to zero, and only inputs between -4 and 4 produce some meaningful "change".
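The saturation of the sigmoid can be checked numerically. The following is a small sketch in plain Python (no TensorFlow involved); sigmoid and sigmoid_grad are helper names chosen for this illustration, and the derivative s * (1 - s) is the standard derivative of the logistic function.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sigmoid_grad(z):
    """Derivative of the sigmoid with respect to its input z."""
    s = sigmoid(z)
    return s * (1.0 - s)


# Print the output and the local gradient for inputs inside and outside [-4, 4].
for z in (-1000, -4, 0, 4, 5, 1000):
    print(z, sigmoid(z), sigmoid_grad(z))
```

Outside of roughly -4..4 the gradient is close to zero, which is what makes learning slow for units that start out there.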
Now the difference between the values sigmoid(5) and sigmoid(1000) is barely noticeable. Because of that, weights that produce very large or very small pre-activation values will optimize very slowly, since changing them has an extremely small influence on the result y = sigmoid(W*x+b). The pre-activation value z = W*x+b (where x is the input) depends on the actual input x and the current weights W. If either of them is large, e.g. because the weights were initialized with a high variance (i.e. standard deviation), the result will necessarily be (relatively) large in magnitude, leading to said problem. This is also the reason why truncated_normal is used rather than a plain normal distribution: the latter only guarantees that most of the values are close to the mean, with a chance of less than 5% that a value lies more than two standard deviations away, while truncated_normal discards and re-draws every value that is more than two standard deviations from the mean, guaranteeing that all weights are in the same range while still being (approximately) normally distributed.
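How the initial standard deviation affects the pre-activation z = W*x+b can be estimated with a rough simulation. This sketch makes several assumptions that are not in the question: 784 inputs drawn uniformly from [0, 1] as a stand-in for MNIST pixels, 200 units, zero bias, and a plain (not truncated) normal distribution for the weights. saturated_fraction is a hypothetical helper that counts units with |z| > 4.

```python
import random


def saturated_fraction(stddev, n_inputs=784, n_units=200, seed=0):
    """Fraction of units whose pre-activation z = sum(w * x) has |z| > 4."""
    rng = random.Random(seed)
    # Assumed input: values in [0, 1], standing in for normalized pixels.
    x = [rng.random() for _ in range(n_inputs)]
    saturated = 0
    for _ in range(n_units):
        # One unit: weights drawn from a normal distribution with the given stddev.
        z = sum(rng.gauss(0.0, stddev) * xi for xi in x)
        if abs(z) > 4.0:
            saturated += 1
    return saturated / n_units


print("stddev 0.1:", saturated_fraction(0.1))
print("stddev 200:", saturated_fraction(200.0))
```

Under these assumptions, almost every unit starts deep in the flat part of the sigmoid with stddev=200, while only a small fraction does with stddev=0.1.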
To make matters worse, in a typical neural network, especially in deep learning, each network layer is followed by one or many others. If in each layer the output value range is big, the gradients will get bigger and bigger as well; this is known as the exploding gradients problem (the counterpart of the vanishing gradients problem, where gradients get smaller and smaller).
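The compounding across layers can be seen in a deliberately simplified model: a chain of scalar linear layers y = weight * x without activation functions (an assumption made only for this illustration). By the chain rule, the gradient of the output with respect to the input is the product of the per-layer derivatives. chain_gradient is a hypothetical helper name.

```python
def chain_gradient(weight, depth):
    """Gradient of the output w.r.t. the input of `depth` stacked layers y = weight * x."""
    grad = 1.0
    for _ in range(depth):
        grad *= weight  # chain rule: each layer multiplies by its local derivative
    return grad


for weight in (0.5, 1.0, 1.5):
    print(weight, chain_gradient(weight, depth=20))
```

A per-layer factor only slightly above one already explodes over 20 layers, and one slightly below one vanishes, which is why keeping the scale of each layer close to constant matters.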
This is a problem because learning starts at the very last layer and each weight is adjusted depending on how much it contributed to the error. If the gradients are indeed getting very big towards the end, the very last layer is the first one to pay a high toll for this: its weights get adjusted very strongly, likely overcorrecting the actual problem, and then only the "remaining" error gets propagated further back, or up, the network. Here, since the last layer was already "fixed a lot" regarding the measured error, only smaller adjustments will be made. This may lead to the problem that the first layers are corrected only by a tiny bit or not at all, effectively preventing all learning there. The same basically happens if the learning rate is too big.
Finding the best weight initialization is a topic by itself, and there are somewhat more sophisticated methods such as Xavier initialization or layer-sequential unit variance; however, small normally distributed values are usually simply a good guess.

Can I change Inv operation into Reciprocal in an existing graph in Tensorflow?

I am working on an image classification problem with TensorFlow. I have two different CNNs trained separately (in fact three in total, but I will deal with the third later), for different tasks and on an AWS (Amazon) machine. One tells whether there is text in the image and the other tells whether the image is safe for work. Now I want to use them in a single script on my computer, so that I can give an image as input and get the results of both networks as output.
I load the two graphs in a single tensorflow Session, using the import_meta_graph API and the import_scope argument and putting each subgraph in a separate scope. Then I just use the restore method of the created saver, giving it the common Session as argument.
Then, in order to run inference, I retrieve the placeholders and final output with graph=tf.get_default_graph() and my_var=graph.get_operation_by_name('name').outputs[0] before using them in the session's run call (I think I could just have passed 'name' instead of fetching the output tensor and storing it in a variable, but this is not my problem).
My problem is that the text CNN works perfectly fine, but the nsfw detector always gives me the same output, no matter the input (even with np.zeros()). I have tried both separately and it is the same story: text works but not nsfw. So I don't think the problem comes from using two networks simultaneously.
I also tried on the original AWS machine I trained it on, and this time the nsfw CNN worked perfectly.
Both networks are very similar. I checked on Tensorboard if everything was fine and I think it is ok. The differences are in the number of hidden units and the fact that I use batch normalization in the nsfw model and not in the text one. Now, why this title? I observed a warning when running the nsfw model that I didn't get when using only the text model:
W tensorflow/core/framework/] Op Inv is deprecated. It will cease to work in GraphDef version 17. Use Reciprocal.
So I thought maybe this was the reason, everything else being equal. I checked my GraphDef version, which seems to be 11, so Inv should still work in theory. By the way, the AWS machine uses TensorFlow version 0.10 and I use version 0.12.
I noticed that the text network only had one Inv operation (found by filtering the names of the operations given by graph.get_operations()), and that the nsfw model had the same operation plus multiple Inv operations due to the batch normalization layers. As stated in the release notes, tf.inv has simply been renamed to tf.reciprocal, so I tried to change the names of the operations to Reciprocal, as proposed here, but it didn't work. I have seen that using tf.identity() and changing the name could also work, but from what I understand, a TensorFlow graph is an append-only structure, so we can't really modify its operations (which seem to be immutable anyway).
The thing is:
as I said, the Inv operation should still work in my GraphDef version;
this is only a warning;
the Inv operations only appear under name scopes that begin with 'gradients' so, from my understanding, this shouldn't be used for inference;
the text model also has an Inv operation.
For these reasons, I have serious doubts about my diagnosis. So my final questions are:
do you have another diagnosis?
if mine is correct, is it possible to replace Inv operations with Reciprocal operations, or do you have any other solution?
After a thorough examination of the output of relevant nodes, with the help of Tensorboard, I am now pretty certain that the renaming of Inv to Reciprocal has nothing to do with my problem.
It appears that the last batch normalization layer eliminates almost all variance in its output when the input varies. I will ask why elsewhere.