Getting each example exactly once - tensorflow

For monitoring my model's performance on my evaluation dataset, I'm using tf.train.string_input_producer for the filenames queue on .tfr files, then I feed the parsed examples to the tf.train.batch function, that produces batches of a fixed size.
Assume my evaluation dataset contains exactly 761 examples (a prime number). To read all the examples exactly once, I have to have a batch size that divides 761, but there is no such, except 1 that will be too slow and 761 that will not fit in my GPU. Any standard way for reading each example exactly once?
Actually, my dataset size is not 761, but there is no number in the reasonable range of 50-300 that divides it exactly. Also I'm working with many different datasets, and finding a number that approximately divides the number of examples in each dataset can be a hassle.
Note that using the num_epochs parameter to tf.train.string_input_producer does not solve the issue.
Thanks!

You can use reader.read_up_to as in this example. Your last batch will be smaller, so you need to make sure your network doesn't hard-wire batch-size anywhere

Related

Tensorflow Shuffle Batch Non Deterministic

I am trying to get deterministic behaviour from tf.train.shuffle_batch(). I could, instead, use tf.train.batch() which works fine (always the same order of elements), but I need to get examples from multiple tf-records and so I am stuck with shuffle_batch().
I am using:
random.seed(0)
np.random.seed(0)
tf.set_random_seed(0)
data_entries = tf.train.shuffle_batch(
[data], batch_size=batch_size, num_threads=1, capacity=512,
seed=57, min_after_dequeue=32)
But every time I restart my script I get slightly different results (not completely different, but about 20% of the elements are in the wrong order).
Is there anything I am missing?
Edit: Solved it! See my answer below!
Maybe I misunderstood something, but you can collect multiple tf-records in a queue with tf.train.string_input_producer(), then read the examples into tensors and finally use tf.train.batch().
Take a look at CIFAR-10 input.
Answering my own question:
First the reason shuffle_batch is non deterministic:
The time until I request a batch is inherently random.
In that time, a random number of tensors are available.
Tensorflow calls a shuffle operation that is seeded but depending on the number of items, it will return a different order.
So no matter the seeding, the order is always different unless the number of elements is constant. So the solution is to keep the number of elements constant, but how we do it?
By setting capacity=min_after_dequeue+batch_size. This will force Tensorflow to fill up the queue until it reaches full capacity before dequeuing an item. Therefore, at the time of the shuffle operation, we have capacity many items which is a constant number.
So why are we doing this? Because one tf.record contains many examples but we want examples from multiple tf.records. With a normal batch we would first get all the examples of one record and then of the next one. This also means we should set min_after_dequeue to something larger than the number of items in one tf.record. In my example, I have 50 examples in one file so I set min_after_dequeue=2048.
Alternatively, we can also shuffle the examples before creating the tf.records, but this was not possible for me because I read tf.records from multiple directories (each with their own dataset).
Last Note: You should also use a batch size of 1 to be super save.

Tensorflow batching without extra None dimension?

Is it possible to do batching in tensorflow without expanding the placeholder size by an extra dimension of None? Specifically I'd just like to feed multiple samples via the placeholders through feed_dict. The code base I'm working on would require a large amount of change to the code to account for adding an extra dimension for the batch size.
eg:
sess.run(feed_dict={var1:val1values, var2: val2values, ...})
Where val1values would represent a batch of size X instead of just one training sample.
The shape information including the number of dimensions is available to Python code to do arbitrary things with, and does affect the ops added to the graph (like which matmul kernel is used), so there's no general safe way to automatically add a batch dimension. Something like labeled_tensor may make code slightly less confusing to refactor.

In a dynamic_rnn do I need to bucket samples in order to balance batches?

I am implementing a bidirectional dynamic rnn. Now I face the question whether I need to bucket my training samples.
My thought (and fear) is that if I don't bucket I might face the following situation: In a batch with 32 samples and maybe all but one samples being below 500 characters long and one samples being say 10.000 characters long the backprop will behave basically as if I had only a batch size of 1 and might result in NANs quickly or throw off the learned weights pretty badly every time that situation occurs.
Any experiences before I write code and check for days of training and debugging? Thx

Why embedding_lookup_sparse and string_to_hash_bucket in tensorflow slow with large number of rows of embeddings

In tensorflow embedding_lookup_sparse lookup the row of embeddings according the sp_ids. I think it's similar to random access. However when the shape of embeddings is large, i.e 10M rows, the inference spent more time than when the embeddings only has about 1M rows. As I think, the lookup phase and is similar to random access and the hash function spent constant time which is all fast and less sensitive with the size. Is there any wrong with my thought? Is there any way to optimize so that the inference can be faster? Thank you!
Are you sure it is caused by the embedding_lookup? In my case I also have millions of rows to lookup. It is very fast if I use GradientDecend optimizer. It is very slow if I use Adam or the others. Probably it is not the embedding_lookup opr slows down your training but other oprs that depend on the total number of params.
It is true that "embedding_lookup" works slowly when there are many rows in table.
And you may figure out why by reading its source code. Here is the source code in "embedding_lookup":
image of the source code: variable "np" is the length of table
image of the source code: loop with np
As you see there is a loop with a time complexity of O(table length) appearing here. In fact "embedding_lookup" use dynamic partition to separate input data into several partition of ids, and then use this loop to embed words vectors to each id's partition. In my opinion, this trick can fix the time complexity to O(table length) no matter how big the input data is.
So I think the best way for you to increase training speed is to input more samples in each batch.

How do I have to train a HMM with Baum-Welch and multiple observations?

I am having some problems understanding how the Baum-Welch algorithm exactly works. I read that it adjusts the parameters of the HMM (the transition and the emission probabilities) in order to maximize the probability that my observation sequence may be seen by the given model.
However, what does happen if I have multiple observation sequences? I want to train my HMM against a huge lot of observations (and I think this is what is usually done).
ghmm for example can take both a single observation sequence and a full set of observations for the baumWelch method.
Does it work the same in both situations? Or does the algorithm have to know all observations at the same time?
In Rabiner's paper, the parameters of GMMs (weights, means and covariances) are re-estimated in the Baum-Welch algorithm using these equations:
These are just for the single observation sequence case. In the multiple case, the numerators and denominators are just summed over all observation sequences, and then divided to get the parameters. (this can be done since they simply represent occupation counts, see pg. 273 of the paper)
So it's not required to know all observation sequences during an invocation of the algorithm. As an example, the HERest tool in HTK has a mechanism that allows splitting up the training data amongst multiple machines. Each machine computes the numerators and denominators and dumps them to a file. In the end, a single machine reads these files, sums up the numerators and denominators and divides them to get the result. See pg. 129 of the HTK book v3.4