Tensorflow limiting batch size when learning embeddings - tensorflow

I'm trying to learn state embeddings for a sequence of states produced by an HMM, similar to how the TensorFlow Vector Representation of Words tutorial does this for text sequences.
My issue is that the "vocabulary" of this HMM has only 12 distinct states. TensorFlow doesn't seem to like it when I run my code with batches larger than this vocabulary size. For example, attempting to train with a batch size of 14 gives the error:
F tensorflow/core/kernels/range_sampler.cc:86] Check failed: batch_size + avoided_values.size() <= range_ (14 vs. 12)
Abort trap: 6
What is the motivation behind this check?

If you are following the example from the tutorial, this error actually occurs when you set num_sampled > len(vocabulary):
num_sampled = 64  # Number of negative examples to sample.
You cannot sample indices (for the negative examples in word2vec) beyond the vocabulary size, so num_sampled must be at most the number of distinct states.
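For reference, a minimal sketch of the relevant part (TF 1.x, mirroring the word2vec tutorial's NCE setup; the variable names and sizes here are illustrative, not the asker's code). With a vocabulary of only 12 states, num_sampled must be no larger than 12:
import tensorflow as tf  # TF 1.x, as in the tutorial

vocabulary_size = 12   # only 12 HMM states
embedding_size = 8     # illustrative value
num_sampled = 8        # must be <= vocabulary_size, otherwise the sampler aborts

# Embedding matrix and NCE parameters, as in the word2vec tutorial.
embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / embedding_size ** 0.5))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

train_inputs = tf.placeholder(tf.int32, shape=[None])
train_labels = tf.placeholder(tf.int32, shape=[None, 1])

embed = tf.nn.embedding_lookup(embeddings, train_inputs)
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))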

Related

How to batch an object detection dataset?

I am working on implementing a face detection model on the WIDER FACE dataset. I learned that it is available in TensorFlow Datasets, and I am using it.
However, I am facing an issue while batching the data. Since an image can have multiple faces, the number of bounding boxes differs from image to image. For example, an image with 2 faces has 2 bounding boxes, whereas one with 4 has 4, and so on.
The problem is that these unequal numbers of bounding boxes cause the Dataset's tensors to have different shapes, and as far as I know TensorFlow cannot batch tensors of unequal shapes (source: Tensorflow Datasets: Make batches with different shaped data). So I am unable to batch the dataset.
After loading the dataset and batching with the following code:
ds, info = tfds.load('wider_face', split='train', shuffle_files=True, with_info=True)
ds1 = ds.batch(12)
for step, (x, y, z) in enumerate(ds1):
    print(step)
    break
I am getting an error when this runs.
In general, any help on how I can batch TensorFlow object detection datasets would be very helpful.
It might be a bit late, but I thought I should post this anyway. The padded_batch transformation ought to do the trick here. It works around the issue by padding with zeros so that every element in a batch has the same shape:
ds, info = tfds.load('wider_face', split='train', shuffle_files=True, with_info=True)
ds1 = ds.padded_batch(12)
for step, (x, y, z) in enumerate(ds1):
    print(step)
    break
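If your TF version complains that padded_shapes is required (older releases did not infer it), or if you want to be explicit about which dimensions get padded, a sketch along these lines should work; the feature keys follow the TFDS wider_face structure, and None means "pad to the longest element in the batch":
import tensorflow as tf
import tensorflow_datasets as tfds

ds, info = tfds.load('wider_face', split='train', shuffle_files=True, with_info=True)

# Keep only the fields we need so the padded shapes are simple to describe.
pairs = ds.map(lambda x: (x['image'], x['faces']['bbox']))

# Images are padded to the largest height/width in the batch, bounding boxes
# are padded along the per-face dimension.
batched = pairs.padded_batch(12, padded_shapes=([None, None, 3], [None, 4]))

for step, (images, bboxes) in enumerate(batched):
    print(step, images.shape, bboxes.shape)
    break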
Another solution would be to skip batch altogether and build the batches yourself with for loops, but that somewhat defeats the purpose. Just for posterity, here is sample code for that simple workaround:
ds, info = tfds.load('wider_face', split='train', shuffle_files=True, with_info=True)
batch_size = 12
image_annotations_pair = [(x['image'], x['faces']['bbox'])
                          for n, x in enumerate(ds) if n < batch_size]
Then use a train_step modified for this.
For details, one may refer to https://www.kite.com/python/docs/tensorflow.contrib.autograph.operators.control_flow.dataset_ops.DatasetV2.padded_batch

Understanding 2D convolution output size

I am a beginner in convolutional deep learning. I saw the following architecture in the paper Simultaneous Feature Learning and Hash Coding with Deep Neural Networks, for images of size 256*256.
I do not understand the output size of the first 2D convolution: 96*54*54. 96 seems fine, as the number of filters is 96. But if we apply the formula for the output size: size = [(W − K + 2P)/S] + 1 = [(256 − 11 + 2*0)/4] + 1 = 62.25 ~ 62. I have assumed the padding P to be 0, as it is not mentioned anywhere in the paper. The Keras Conv2D API produces the same 96*62*62 output. So why does the paper say 96*54*54? What am I missing?
Well, it reminded me of the AlexNet paper, where there was a similar mistake. Your calculation is correct. I think they mistakenly wrote 256x256 instead of 224x224, in which case the calculation for the input layer is
(224-11+2*0)/4 + 1 = 54.25 ~ 54
It is highly possible that the authors mistakenly wrote 256x256 while the real input size of the architecture was 224x224 (that was also the case in AlexNet), or, the far less likely option, that 256x256 really was the input size but they did the calculations for 224x224. The latter can be ignored, since it would be a very silly mistake and I don't think it is even an option.
Thus, I believe the true input size was 224x224, not 256x256.
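A quick way to sanity-check this is to let Keras compute the output shapes; a minimal sketch (padding assumed to be 'valid', i.e. P = 0, as in the calculation above):
import tensorflow as tf

def first_conv_output(input_size):
    # First conv layer as discussed above: 96 filters, 11x11 kernel, stride 4.
    inputs = tf.keras.Input(shape=(input_size, input_size, 3))
    outputs = tf.keras.layers.Conv2D(filters=96, kernel_size=11,
                                     strides=4, padding='valid')(inputs)
    return outputs.shape

print(first_conv_output(256))  # (None, 62, 62, 96)
print(first_conv_output(224))  # (None, 54, 54, 96)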

How to understand masked multi-head attention in transformer

I'm currently studying the code of the Transformer, but I cannot understand the masked multi-head attention in the decoder. The paper says it is there to prevent you from seeing the word being generated, but I don't understand: if the words after the current one have not been generated yet, how could they be seen at all?
I tried to read the code of a Transformer implementation (link: https://github.com/Kyubyong/transformer). The masking code is shown below. It uses a lower-triangular matrix to mask, and I cannot understand why.
padding_num = -2 ** 32 + 1
diag_vals = tf.ones_like(inputs[0, :, :]) # (T_q, T_k)
tril = tf.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense() # (T_q, T_k)
masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(inputs)[0], 1, 1]) # (N, T_q, T_k)
paddings = tf.ones_like(masks) * padding_num
outputs = tf.where(tf.equal(masks, 0), paddings, inputs)
I had the very same question after reading the Transformer paper. I found no complete and detailed answer to it on the Internet, so I'll try to explain my understanding of masked multi-head attention.
The short answer is: we need masking to make the training parallel. And parallelization is good, as it allows the model to train faster.
Here's an example explaining the idea. Let's say we train to translate "I love you" to German. The encoder works in parallel mode - it can produce vector representation of the input sequence ("I love you") within a constant number of steps (i.e. the number of steps doesn't depend on the length of the input sequence).
Let's say the encoder produces the numbers 11, 12, 13 as the vector representations of the input sequence. In reality these vectors will be much longer, but for simplicity we use short ones. Also for simplicity we ignore the special service tokens, like the beginning-of-sequence token, the end-of-sequence token, and others.
During the training we know that the translation should be "Ich liebe dich" (we always know the expected output during the training). Let's say the expected vector representations of the "Ich liebe dich" words are 21, 22, 23.
If we train the decoder in sequential mode, it will look like the training of a Recurrent Neural Network. The following sequential steps will be performed:
Sequential operation #1. Input: 11, 12, 13.
Trying to predict 21.
The predicted output won't be exactly 21, let's say it'll be 21.1.
Sequential operation #2. Input: 11, 12, 13, and also 21.1 as the previous output.
Trying to predict 22.
The predicted output won't be exactly 22, let's say it'll be 22.3.
Sequential operation #3. Input 11, 12, 13, and also 22.3 as the previous output.
Trying to predict 23.
The predicted output won't be exactly 23, let's say it'll be 23.5.
This means we'll need to make 3 sequential operations (in the general case, one sequential operation per token). We will also accumulate error with each iteration. And we don't use attention, since we only look at a single previous output.
As we actually know the expected outputs we can adjust the process and make it parallel. There's no need to wait for the previous step output.
Parallel operation #A. Inputs: 11, 12, 13.
Trying to predict 21.
Parallel operation #B. Inputs: 11, 12, 13, and also 21.
Trying to predict 22.
Parallel operation #C. Inputs: 11, 12, 13, and also 21, 22.
Trying to predict 23.
This algorithm can be executed in parallel and it doesn't accumulate error. It also uses attention (i.e. it looks at all previous inputs), so it has more information about the context to consider while making the prediction.
And here is where we need the masking. The training algorithm knows the entire expected output (21, 22, 23). It hides (masks) a part of this known output sequence for each of the parallel operations.
When it executes #A - it hides (masks) the entire output.
When it executes #B - it hides 2nd and 3rd outputs.
When it executes #C - it hides 3rd output.
Masking itself is implemented as follows (quoting the original paper):
We implement this inside of scaled dot-product attention by masking
out (setting to −∞) all values in the input of the softmax which
correspond to illegal connections
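To make this concrete, here is a small self-contained sketch (TF 2.x eager mode, not the repository code quoted in the question) of what the lower-triangular mask does for a toy 3x3 score matrix; positions above the diagonal get a large negative value, so after the softmax their attention weights are effectively zero:
import numpy as np
import tensorflow as tf

# Toy attention logits for 3 target positions attending to 3 source positions.
scores = tf.constant([[1.0, 2.0, 3.0],
                      [1.0, 2.0, 3.0],
                      [1.0, 2.0, 3.0]])                  # (T_q, T_k)

tril = tf.linalg.band_part(tf.ones_like(scores), -1, 0)  # lower-triangular mask
masked = tf.where(tf.equal(tril, 0),
                  tf.fill(tf.shape(scores), -1e9),       # stands in for minus infinity
                  scores)
weights = tf.nn.softmax(masked, axis=-1)

print(np.round(weights.numpy(), 2))
# Row 0 attends only to position 0, row 1 to positions 0-1, row 2 to all three.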
Note: during inference (not training) the decoder works in sequential (not parallel) mode, as it doesn't know the output sequence initially. But it's different from the RNN approach, because Transformer inference still uses self-attention and looks at all previous outputs (not only the immediately preceding one).
Note 2: I've seen in some materials that masking can be used differently for non-translation applications. For example, for language modeling the masking can be used to hide some words from the input sentence and the model will try to predict them during the training using other, non-masked words (i.e. learn to understand the context).
The decoder is auto-regressive and can't see future words.
The decoder in the Transformer is auto-regressive;
which means it predicts the next token based on the previous ones;
so its input x can't see future words;
we use masked multi-head attention to enforce this.

Limited range for TensorFlow Universal Sentence Encoder Lite embeddings?

Starting from the universal-sentence-encoder in TensorFlow.js, I noticed that the range of the numbers in the embeddings wasn't what I expected. I was expecting some distribution between [0, 1] or [-1, 1], but I don't see either of these.
For the sentence "cats are great!" here's a visualization, where each dimension is projected onto a scale from [-0.5, 0.5]:
Here's the same kind of visualization for "i wonder what this sentence's embedding will be" (the pattern is similar for the first ~10 sentences I tried):
To debug, I looked at whether the same kind of thing comes up in the demo Colab notebook, and it seems like it does. Here's what I see when I check the range of the embeddings for those two sentences:
# NEW: added this, with different messages
messages = ["cats are great!", "sometimes models are confusing"]
values, indices, dense_shape = process_to_IDs_in_sparse_format(sp, messages)

with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    message_embeddings = session.run(
        encodings,
        feed_dict={input_placeholder.values: values,
                   input_placeholder.indices: indices,
                   input_placeholder.dense_shape: dense_shape})

    for i, message_embedding in enumerate(np.array(message_embeddings).tolist()):
        print("Message: {}".format(messages[i]))
        print("Embedding size: {}".format(len(message_embedding)))
        message_embedding_snippet = ", ".join(
            (str(x) for x in message_embedding[:3]))
        print("Embedding: [{}, ...]\n".format(message_embedding_snippet))
        # NEW: added this, to show the range of the embedding output
        print("Embedding range: [{}, {}]".format(
            min(message_embedding), max(message_embedding)))
And the output shows:
Message: cats are great!
Embedding range: [-0.05904272198677063, 0.05903803929686546]
Message: sometimes models are confusing
Embedding range: [-0.060731519013643265, 0.06075377017259598]
So this again isn't what I'm expecting: the range is narrower than I'd expect. I thought this might be a TF convention I had missed, but I couldn't see it in the TFHub page, the guide to text embeddings, or the paper, so I'm not sure where else to look without digging into the training code.
The colab notebook example code has an example sentence that says:
Universal Sentence Encoder embeddings also support short paragraphs.
There is no hard limit on how long the paragraph is. Roughly, the
longer the more 'diluted' the embedding will be.
But the range of the embedding is roughly the same for all the other examples in the colab, even one-word examples.
I'm assuming this range is not just arbitrary, and it does make sense to me that the range is centered in zero and small, but I'm trying to understand how this scale came to be.
The output of the Universal Sentence Encoder is a vector of length 512 with an L2 norm of (approximately) 1.0. You can check this by calculating the inner product:
ip = 0
for i in range(512):
    ip += message_embeddings[0][i] * message_embeddings[0][i]
print(ip)
> 1.0000000807544893
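(Equivalently, a quicker check with NumPy, assuming the message_embeddings array produced by the snippet above:)
import numpy as np
print(np.linalg.norm(message_embeddings, axis=1))  # roughly [1.0, 1.0]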
The implications are that:
Most values are likely to be in a narrow range centered around zero
The largest possible single value in the vector is 1.0 - and this would only happen if all other values are exactly 0.
Similarly the smallest possible value is -1.
If we take a random vector of length 512, with values distributed uniformly, and then normalize it to unit magnitude, we expect to see values in a range similar to what you see.
import numpy as np
import matplotlib.pyplot as plt

rand_uniform = np.random.uniform(-1, 1, 512)
l2 = np.linalg.norm(rand_uniform)
plt.plot(rand_uniform / l2, 'b.')
axes = plt.gca()
axes.set_ylim([-0.5, 0.5])
Judging visually, the distribution of excitations does not look uniform, but rather is biased toward extremes.

Avoiding exhausting GPU resources in convNN Tensorflow

I'm trying to run a hyperparameter-optimization script for a ConvNN using TensorFlow.
As you may know, TF's handling of GPU memory isn't that fancy (and I don't think it ever will be, thanks to the TPU). So my question is: how do I choose the filter dimensions and the batch size so that the GPU memory doesn't get exhausted?
Here's the equation that I'm thinking of:
image_shape = 128x128x3 (3 color channels)
batch_size = 20 (the smallest possible batch size, since I have 20 classes)
filter_shape = fw x fh x fd [filter_width=4, filter_height=4, filter_depth=32]
As far as I understand, using the tf.nn.conv2d function will need the following amount of memory:
image_width * image_height * num_of_channels * batch_size * filter_height * filter_width * filter_depth * 32 bits
since we use the tf.float32 type for each pixel.
In the given example, the needed memory will be:
128 x 128 x 3 x 20 x 4 x 4 x 32 x 32 = 16,106,127,360 bits, which is almost 16 GB of memory.
I'm not sure the formula is correct, so I hope to get a validation or a correction of what I'm missing.
Actually, this will take only about 44MB of memory, mostly taken by the output.
Your input is 20x128x128x3
The convolution kernel is 4x4x3x32
The output is 20x128x128x32
When you sum up the total, you get
(20*128*128*3 + 4*4*3*32 + 20*128*128*32) * 4 / 1024**2 ≈ 44MB
(In the above, 4 is for the size in bytes of float32 and 1024**2 is to get the result in MB).
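For what it's worth, a small sketch of that estimate as a helper function (assuming stride 1 and 'same' padding, so the output keeps the 128x128 spatial size, as in the calculation above):
def conv_memory_mb(batch, h, w, c_in, k, c_out, bytes_per_value=4):
    """Rough input + kernel + output memory for one conv layer, in MB."""
    input_size = batch * h * w * c_in
    kernel_size = k * k * c_in * c_out
    output_size = batch * h * w * c_out   # stride 1, 'same' padding assumed
    return (input_size + kernel_size + output_size) * bytes_per_value / 1024 ** 2

print(conv_memory_mb(batch=20, h=128, w=128, c_in=3, k=4, c_out=32))  # ~43.75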
Your batch size can be smaller than your number of classes. Think about ImageNet and its 1000 classes: people are training with batch sizes 10 times smaller.
EDIT
Here is a tensorboard screenshot of the net — it reports 40MB rather than 44MB, probably because it excludes the input — and you also have all the tensor sizes I mentioned earlier.