Understanding Valgrind "loss record" output

When I run Valgrind on my process, I get the output below after the process exits. What is the meaning of "loss record 33,118 of 34,156"?
==4215== 2,048 bytes in 128 blocks are definitely lost in loss record 33,118 of 34,156

It means this is the 33,118th loss record out of a total of 34,156 records.
As described in the Memory Leak Detection section of the Valgrind documentation,
... it merges results for all blocks that have the same leak kind and sufficiently similar stack traces into a single "loss record".
... The loss records are not presented in any notable order, so the loss record numbers aren't particularly meaningful. The loss record numbers can be used in the Valgrind gdbserver to list the addresses of the leaked blocks and/or give more details about how a block is still reachable.
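For example (a sketch, assuming the process is run with Valgrind's gdbserver enabled, e.g. valgrind --vgdb-error=0, and GDB is attached via "target remote | vgdb"), you can feed that record number to Memcheck's monitor commands; 33118 below is just the number from the question:
(gdb) monitor leak_check full reachable any
(gdb) monitor block_list 33118
block_list then lists the addresses and allocation stacks of the blocks behind loss record 33118 of the preceding leak search.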

Related

Tensorflow: how to stop small values leaking through pruning?

The documentation for PolynomialDecay suggests that by default, frequency=100, so that pruning is only applied every 100 steps. This presumably means that the parameters which are pruned to 0 will drift away from 0 during the other 99 out of every 100 steps. So at the end of the pruning process, unless you are careful to have an exact multiple of 100 steps, you will end up with a model that is not perfectly pruned but which has a large number of near-zero values.
How does one stop this happening? Do you have to tweak frequency to be a divisor of the number of steps? I can't find any code samples that do that...
As per this example in the docs, the tfmot.sparsity.keras.UpdatePruningStep() callback must be registered during training:
callbacks = [
    tfmot.sparsity.keras.UpdatePruningStep(),
    …
]
model_for_pruning.fit(…, callbacks=callbacks)
This will ensure that the mask is applied (and thus the pruned weights are set to zero) when training ends.
https://github.com/tensorflow/model-optimization/blob/master/tensorflow_model_optimization/python/core/sparsity/keras/pruning_callbacks.py#L64
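To make that concrete, here is a minimal end-to-end sketch (the model, data, and schedule values are made up for illustration); tfmot.sparsity.keras.strip_pruning is called at the end so the pruned (zero) weights are baked into the exported model rather than left drifting inside the pruning wrappers:
import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# toy data and model, for illustration only
x_train = np.random.rand(256, 20).astype('float32')
y_train = np.random.rand(256, 10).astype('float32')
base_model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(20,))])

pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.5,
    begin_step=0, end_step=1000, frequency=100)
model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(
    base_model, pruning_schedule=pruning_schedule)
model_for_pruning.compile(optimizer='adam', loss='mse')

callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]  # keeps the pruning step and mask in sync during fit()
model_for_pruning.fit(x_train, y_train, epochs=2, callbacks=callbacks)

# remove the pruning wrappers; the masked weights stay at exactly zero in the final model
final_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning)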

Explanation of parallel arguments of tf.while_loop in TensorFlow

I want to implement an algorithm which allows a parallel implementation in TensorFlow. My question is what the arguments parallel_iterations, swap_memory and maximum_iterations actually do, and what their appropriate values are depending on the situation. Specifically, the documentation on TensorFlow's site https://www.tensorflow.org/api_docs/python/tf/while_loop says that parallel_iterations is the number of iterations allowed to run in parallel. Is this the number of threads? When should someone enable CPU-GPU memory swapping, and for what reason? What are the advantages and disadvantages of this choice? What is the purpose of maximum_iterations? Can it be combined with parallel_iterations?
swap_memory is used when you want to have extra memory on the GPU device. Usually when you are training a model, some activations are saved in GPU memory for later use. With swap_memory you can store those activations in CPU memory instead and use the GPU memory to fit, e.g., larger batch sizes - and this is the advantage. You would choose it if you need big batch sizes or have long sequences and want to avoid OOM exceptions. The disadvantage is computation time, since you need to transfer the data from CPU memory to GPU memory.
maximum_iterations works something like this:
num_iter = 0
while num_iter < 100 and some_condition:
    do_something()
    num_iter += 1
So it is useful when you loop on a condition but also want an upper bound (for example, checking whether your model has converged; if it hasn't, you still want to stop after k iterations).
As for parallel_iterations, I am not sure, but it sounds like multiple threads, yes. You can try it in a sample script and see the effect.
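To show where these arguments go, here is a minimal sketch (the loop body is made up and just accumulates random numbers):
import tensorflow as tf

def cond(i, total):
    return total < 100.0                           # the "<some condition>" part

def body(i, total):
    return i + 1, total + tf.random.uniform([])    # "do something"

i0 = tf.constant(0)
total0 = tf.constant(0.0)
i_final, total_final = tf.while_loop(
    cond, body, loop_vars=[i0, total0],
    parallel_iterations=10,    # iterations allowed to run in parallel (per the docs)
    maximum_iterations=100,    # hard upper bound, AND-ed with cond by TensorFlow
    swap_memory=False)         # set True to allow swapping tensors kept for backprop to CPU memory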

What is fraction_of_32_full in TensorFlow

As can be seen in this picture, my graph has a "fraction_of_32_full" output attached to it. I haven't done this explicitly, and nowhere in my code have I set any limit of 32.
The only data I am explicitly adding to my summary is the cost of each batch, yet when I view my TensorBoard visualisation, I see this:
As you can see, it contains three things: the cost, which I asked for, and two other variables which I haven't asked for - the fraction of 25,000 full, and the fraction of 32 full.
My Questions Are:
What are these?
Why are they added to my summaries without me explicitly asking?
I can actually answer my own question here. I did some digging, and found the answers.
What are these?
These are measures of how full your queues are. I have two queues: a string input producer queue which reads my files, and a batch queue which batches my records. Both summaries are in the format "fraction_of_x_full", where x is the capacity of the queue.
The reason the string input producer reports fraction of 32 is that, if you look at the documentation here, its default capacity is 32.
Why are these added to my summaries without me explicitly asking?
This was a little trickier. If you look at the source code for the string input producer here, you'll see that although string_input_producer doesn't explicitly ask for a summary name, it returns an input_producer, which has a default summary name of summary_name="fraction_of_%d_full" % capacity. Check line 235 for this. Something similar happens here for the batch queue.
The reason these are being recorded without me explicitly asking is that they were created without me explicitly asking, and then the line of code:
merged_summaries = tf.summary.merge_all()
merged all these summaries together, so when I called:
sess.run([optimizer, cost, merged_summaries], ..... )
writer.add_summary(s, batch)
I was actually asking for these to be recorded too.
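For anyone who wants to reproduce this, here is a hedged sketch (the file names, batch size, and the 25,000 capacity are placeholders standing in for my setup) of a TF 1.x queue pipeline where both summaries appear automatically:
import tensorflow as tf

filenames = ['train_0.tfrecords']                  # placeholder file list

# default capacity=32 -> adds a "fraction_of_32_full" summary internally
filename_queue = tf.train.string_input_producer(filenames)

reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)

# the batching queue adds a "fraction_of_25000_full" summary for its capacity
example_batch = tf.train.batch([serialized_example], batch_size=64, capacity=25000)

cost = tf.constant(0.0)                            # stand-in for the real cost tensor
tf.summary.scalar('cost', cost)                    # the summary I explicitly asked for
merged_summaries = tf.summary.merge_all()          # also sweeps up the two queue summaries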
I hope this answer helped some people.

Is my training data really being randomized? Error rates are wildly oscillating

So I set the randomization window to 100,000. In my log I can see that it's oscillating between 0 errors and a lot of errors, which makes me wonder if the data is truly random. The training data is made up of sequences where the input is typically about 50 tokens and the output is 6 tokens for about 99% of the sequences, and maybe about 400 tokens in the other 1% (and these sequences are the most important to learn how to output, of course). It seems like more than one of the longer sequences may be getting clumped together, and that's why the error rate might go up all of a sudden. Is that possible?
Please try to specify a larger randomization window if your samples are small, e.g. randomizationWindow=100000000. It can be that your current window covers only a single chunk - then the data will only be randomized inside that chunk, not between chunks.
(You can see how the data is split if you specify verbosity=4 in the reader section; the log then shows the randomized window ranges.)
The more data you can put in memory, the better. This also helps performance: after the initial load, the readers can prefetch new chunks while the current data is being processed, so your GPU won't be I/O bound.
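For reference, a sketch of what this could look like in the reader section of a CNTK config file (the reader type, file name, and stream dimensions are placeholders for your actual setup):
reader = [
    readerType = "CNTKTextFormatReader"
    file = "train.ctf"                  # placeholder training file
    randomize = true
    randomizationWindow = 100000000     # much larger than a single chunk
    verbosity = 4                       # logs how the data is split into randomized windows
    input = [
        features = [ dim = 50 ; format = "dense" ]
        labels   = [ dim = 6  ; format = "dense" ]
    ]
]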

How do I have to train a HMM with Baum-Welch and multiple observations?

I am having some problems understanding how the Baum-Welch algorithm exactly works. I read that it adjusts the parameters of the HMM (the transition and the emission probabilities) in order to maximize the probability that my observation sequence is generated by the given model.
However, what happens if I have multiple observation sequences? I want to train my HMM against a large set of observations (and I think this is what is usually done).
ghmm for example can take both a single observation sequence and a full set of observations for the baumWelch method.
Does it work the same in both situations? Or does the algorithm have to know all observations at the same time?
In Rabiner's paper, the parameters of the GMMs (weights, means and covariances) are re-estimated in the Baum-Welch algorithm using re-estimation formulas that each take the form of a numerator divided by a denominator (see the equations in the paper).
Those formulas are stated for the single observation sequence case. In the multiple-sequence case, the numerators and denominators are simply summed over all observation sequences and then divided to get the parameters. (This can be done because they simply represent occupation counts; see pg. 273 of the paper.)
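Written out for the transition probabilities (a standard formulation, with gamma and xi denoting the state and transition occupation counts computed on the k-th observation sequence), the multiple-sequence re-estimate is:

\hat{a}_{ij} = \frac{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \xi_t^{(k)}(i, j)}{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \gamma_t^{(k)}(i)}

The GMM weights, means and covariances are accumulated the same way: each sequence contributes to a shared numerator and denominator, and the division happens once at the end.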
So it's not required to know all observation sequences during an invocation of the algorithm. As an example, the HERest tool in HTK has a mechanism that allows splitting the training data up amongst multiple machines. Each machine computes the numerators and denominators and dumps them to a file. In the end, a single machine reads these files, sums up the numerators and denominators, and divides them to get the result. See pg. 129 of the HTK Book, v3.4.