What is actually meant by parallel_iterations in tfp.mcmc.sample_chain? - tensorflow

I am not able to get what does the parameter parallel_iterations stand for in sampling multiple chains during MCMC.
The documentation for mcmc.sample_chain() doesn't give much details, it just says that
The parallel iterations are the number of iterations allowed to run in parallel. It must be a positive integer.
I am running a NUTS sampler with multiple chains while specifying parallel_iterations=8.
Does it mean that the chains are strictly run in parallel? Is the parallel execution dependent on multi-core support? If so, what is a good value (based on the number of cores) to set parallel_iterations? Should I naively set it to some higher value?

TensorFlow can unroll iterations of while loops to execute in parallel, when some parts of the data flow (I.e. iteration condition) can be computed faster than other parts. If you don't have a special preference (i.e. reproducibility with legacy stateful samplers), leave it at default.

Related

Gloabl Seed and Operation Seed in Tensorflow2

What is the difference between Global Seed and Operation Seed in TensorFlow.
According to the tensorflow documentation
While explaining Global Seed, they mention this
If the global seed is set but the operation seed is not set, we get
different results for every call to the random op, but the same
sequence for every re-run of the program:
and while explaining Operation Seed, they again state, something similiar
If the operation seed is set, we get different results for every call
to the random op, but the same sequence for every re-run of the
program:
what are the main differences between the two...and how do they operate at a intuitive level.
Thanks.
Here is good decription of the differences: https://www.kite.com/python/docs/tensorflow.set_random_seed
In short, tf.random.set_seed or tf.set_random_seed will guarantee that all operations will produce repeatable results across sessions. It will set the operation-seed deterministically for every operation.
Setting operation seed makes sense only as part of operation definition tf.random_uniform([1], seed=1) and will also lead to same sequences produced by this op across sessions.
What is the difference?
graph-seed makes all ops repeatedly deterministic. Use it if you want to fix all ops. Different ops will still produce different sequences (but repeated across sessions)
operation-seed makes single operation deterministic. You can create 2 ops that will produce same sequences.

Explanation of parallel arguments of tf.while_loop in TensorFlow

I want to implement an algorithm which allows a parallel implementation in TensorFlow. My question is what the arguments parallel_iterations, swap_memory and maximum_iterations actually do and which are their appropriate values according the situation. Specifically, in the documentation on TensorFlow's site https://www.tensorflow.org/api_docs/python/tf/while_loop says that parallel_iterations are the number of iterations allowed to run in parallel. Is this number the number of threads? When someone should allow CPU-GPU swap memory and for what reason? What are the advantages and disadvantages from this choice? What is the purpose of maximum_iterations? Can it be combined with parallel_iterations?
swap_memory is used when you want to have extra memory on the GPU device. Usually when you are training a model some activations are saved in the GPU mem. for later use. With swap_memory, you can store those activations in the CPU memory and use the GPU mem. to fit e.g. larger batch sizes. And this is an advantage. You would choose this if you need big batch_sizes or have long sequences and want to avoid OOM exceptions. Disadvantage is computation time since you need to transfer the data from CPU mem. to GPU mem.
The maximum iterations is smth. like this:
while num_iter < 100 and <some condition>:
do something
num_iter += 1
So it is useful when you check a condition, but also want to have an upper bound (one example is to check if your model converges. If it doesn't you still want to end after k iterations.)
As for parallel_iterations I am not sure, but it sounds like multiple threads, yes. You can try and see the effect in a sample script.

How can I get the number of iterations in AMPL?

I can get the number of variables using _nvars. Then, I tried _niters and _niterations but don't work.
I have also searched it in the manual unsuccessfully.
Is there a simple way to get the number of iterations, other than extracting it from solve_message (e.g. with regular expressions)?
To the best of my knowledge there is no built-in parameter representing a number of iterations in AMPL. In fact, this is very solver specific and different solvers and even different algorithms within a single solver may have multiple different iteration counts, such as MIP iterations, Simplex iterations, master iterations when using decomposition, etc. Your best bet is probably to parse the solver message.

Finding Optimal Parameters In A "Black Box" System

I'm developing machine learning algorithms which classify images based on training data.
During the image preprocessing stages, there are several parameters which I can modify that affect the data I feed my algorithms (for example, I can change the Hessian Threshold when extracting SURF features). So the flow thus far looks like:
[param1, param2, param3...] => [black box] => accuracy %
My problem is: with so many parameters at my disposal, how can I systematically pick values which give me optimized results/accuracy? A naive approach is to run i nested for-loops (assuming i parameters) and just iterate through all parameter combinations, but if it takes 5 minute to calculate an accuracy from my "black box" system this would take a long, long time.
This being said, are there any algorithms or techniques which can search for optimal parameters in a black box system? I was thinking of taking a course in Discrete Optimization but I'm not sure if that would be the best use of my time.
Thank you for your time and help!
Edit (to answer comments):
I have 5-8 parameters. Each parameter has its own range. One parameter can be 0-1000 (integer), while another can be 0 to 1 (real number). Nothing is stopping me from multithreading the black box evaluation.
Also, there are some parts of the black box that have some randomness to them. For example, one stage is using k-means clustering. Each black box evaluation, the cluster centers may change. I run k-means several times to (hopefully) avoid local optima. In addition, I evaluate the black box multiple times and find the median accuracy in order to further mitigate randomness and outliers.
As a partial solution, a grid search of moderate resolution and range can be recursively repeated in the areas where the n-parameters result in the optimal values.
Each n-dimensioned result from each step would be used as a starting point for the next iteration.
The key is that for each iteration the resolution in absolute terms is kept constant (i.e. keep the iteration period constant) but the range decreased so as to reduce the pitch/granular step size.
I'd call it a ‘contracting mesh’ :)
Keep in mind that while it avoids full brute-force complexity it only reaches exhaustive resolution in the final iteration (this is what defines the final iteration).
Also that the outlined process is only exhaustive on a subset of the points that may or may not include the global minimum - i.e. it could result in a local minima.
(You can always chase your tail though by offsetting the initial grid by some sub-initial-resolution amount and compare results...)
Have fun!
Here is the solution to your problem.
A method behind it is described in this paper.

How do I have to train a HMM with Baum-Welch and multiple observations?

I am having some problems understanding how the Baum-Welch algorithm exactly works. I read that it adjusts the parameters of the HMM (the transition and the emission probabilities) in order to maximize the probability that my observation sequence may be seen by the given model.
However, what does happen if I have multiple observation sequences? I want to train my HMM against a huge lot of observations (and I think this is what is usually done).
ghmm for example can take both a single observation sequence and a full set of observations for the baumWelch method.
Does it work the same in both situations? Or does the algorithm have to know all observations at the same time?
In Rabiner's paper, the parameters of GMMs (weights, means and covariances) are re-estimated in the Baum-Welch algorithm using these equations:
These are just for the single observation sequence case. In the multiple case, the numerators and denominators are just summed over all observation sequences, and then divided to get the parameters. (this can be done since they simply represent occupation counts, see pg. 273 of the paper)
So it's not required to know all observation sequences during an invocation of the algorithm. As an example, the HERest tool in HTK has a mechanism that allows splitting up the training data amongst multiple machines. Each machine computes the numerators and denominators and dumps them to a file. In the end, a single machine reads these files, sums up the numerators and denominators and divides them to get the result. See pg. 129 of the HTK book v3.4