Does batching queue (tf.train.batch) not preserve order? - tensorflow

I've set up a filename-producing queue using tf.train.string_input_producer with the shuffle option set to False, coupled to a batching queue using tf.train.batch (i.e. non-shuffling). Looking at the list of examples being read, while the ordering is almost perfectly preserved, it is not strictly so. For example, the first few samples are 4, 2, 1, 3, 5, 6, 7, 8, 9, 11, 10, ..., where the number corresponds to the position of the sample within the first file read. After that the ordering is almost perfect for several hundred samples, but it occasionally switches adjacent samples. Is this expected behavior? Is there some way to enforce that the ordering is preserved, so that one does not have to keep track of which file got read when, etc.?
I should say that I conditionally discard some samples by enqueuing either 0 or 1 sample at a time, and setting enqueue_many to True in the batching queue. None of the samples above are being skipped however and so this shouldn't in principle be an issue.

As Yaroslav has mentioned in the comments, a single thread would do the trick. In addition to a single thread, you should set num_epochs = 1. If you don't, it will keep producing batches and it may seem like order is not preserved as it loops from the start again. I hope this works.
Having said that though, I hope someone can come up with a better answer to solving this!
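
To illustrate why the number of threads matters, here is a minimal sketch that uses only the Python standard library as a stand-in for the input pipeline. It is not TensorFlow code: read_samples, the worker function and the simulated decode delay are all hypothetical. In TensorFlow 1.x terms, the single-thread setup corresponds to num_threads=1 on tf.train.batch, and the single pass to num_epochs=1 on tf.train.string_input_producer.

```python
import queue
import random
import threading
import time


def read_samples(num_samples, num_threads, seed=0):
    """Toy model of reader threads feeding one batching queue (not TensorFlow)."""
    source = queue.Queue()
    for i in range(num_samples):
        source.put(i)  # samples are handed out in their original order

    out = []
    lock = threading.Lock()

    def worker(rng):
        while True:
            try:
                item = source.get_nowait()
            except queue.Empty:
                return
            # Simulated decode time: with several workers, a later sample
            # can finish before an earlier one and be enqueued first.
            time.sleep(rng.random() * 0.001)
            with lock:
                out.append(item)

    threads = [
        threading.Thread(target=worker, args=(random.Random(seed + t),))
        for t in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


if __name__ == "__main__":
    print(read_samples(12, num_threads=1))  # always in order
    print(read_samples(12, num_threads=4))  # usually close to ordered, not strictly
```

With one worker the output order is guaranteed; with several, adjacent samples may swap from run to run, which matches the pattern described in the question.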

Related

How to find solutions randomly (nondeterministically) in SAT4J?

In the code examples from the SAT4J documentation, calling the solver multiple times on the same SAT problem always yields the same solution, even if multiple possible solutions exist - that is, the result is deterministic.
I'm looking for a way to get different solutions on multiple runs, that is, a nondeterministic/random result. For each possible solution, there should be a non-zero probability for the solution to be picked. Ideally, every solution should be picked with the same probability, but that's not a strict requirement.
I'm aware of the possibility to (deterministically) iterate over all solutions and just take a random one, but that's not a feasible solution in my case since there are too many solutions to begin with, and calculating them all takes too long.
Yes, Sat4j is by default deterministic: it will always find the same solution if you run it several times on the same problem from the command line.
The way to add some nondeterminism to the heuristics is to use the RandomWalkDecorator, as found for instance in the GreedySolver in org.sat4j.minisat.SolverFactory.
Note, however, that if you run such a solver several times from the command line:
java -jar org.sat4j.core.jar GreedySolver file.cnf
you will still get deterministic results, since the pseudo-random number generator is seeded with a constant.
Thus you need to ask for several models within your Java code.
As mentioned in your question, you can use a ModelIterator decorator with a bound for that:
ISolver solver = SolverFactory.newGreedySolver();
ModelIterator mi = new ModelIterator(solver,10); // look for 10 models

What does the number in parentheses in `np.random.seed(number)` mean?

What is the difference between np.random.seed(0), np.random.seed(42), and np.random.seed(any other number)? What is the function of the number in parentheses?
Python uses the iterative Mersenne Twister algorithm to generate pseudo-random numbers [1], as does NumPy's legacy np.random interface. The seed is simply where we start iterating.
To be clear, most computers do not have a "true" source of randomness. It is kind of an interesting thing that "randomness" is so valuable to so many applications, and is quite hard to come by (you can buy a specialized device devoted to this purpose). Since it is difficult to make random numbers, but they are nevertheless necessary, many, many, many, many algorithms have been developed to generate numbers that are not random, but nevertheless look as though they are. Algorithms that generate numbers that "look randomish" are called pseudo-random number generators (PRNGs). Since PRNGs are actually deterministic, they can't simply create a number from the aether and have it look randomish. They need an input. It turns out that using some complex operations and modular arithmetic, we can take in an input, and get another number that seems to have little or no relation to the input. Using this intuition, we can simply use the previous output of the PRNG as the next input. We then get a sequence of numbers which, if our PRNG is good, will seem to have no relation to each other.
In order to get our iterative PRNG started, we need an initial input. This initial input is called a "seed". Since the PRNG is deterministic, for a given seed, it will generate an identical sequence of numbers. Usually, there is a default seed that is, itself, sort of randomish. The most common one is the current time. However, the current time isn't a very good random number, so this behavior is known to cause problems sometimes. If you want your program to run in an identical manner each time you run it, you can provide a seed (0 is a popular option, but is entirely arbitrary). Then, you get a sequence of randomish numbers, but if you give your code to someone they can actually entirely recreate the runtime of the program as you witnessed it when you ran it.
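
The idea of feeding each output back in as the next input can be sketched with a linear congruential generator. This is a much simpler algorithm than the Mersenne Twister and is used here only for illustration; the function name lcg is hypothetical, and the constants are the widely published "Numerical Recipes" parameters.

```python
def lcg(seed, n, a=1664525, c=1013904223, m=2**32):
    """Return n pseudo-random integers from a toy linear congruential generator."""
    state = seed
    values = []
    for _ in range(n):
        # Each output is computed from the previous one; the seed is
        # only the very first input.
        state = (a * state + c) % m
        values.append(state)
    return values


if __name__ == "__main__":
    print(lcg(0, 3))
    print(lcg(0, 3))   # same seed, same sequence
    print(lcg(42, 3))  # different seed, different sequence
```

Because the whole sequence is a function of the seed, two runs with the same seed are indistinguishable.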
That would be the starting key of the generator. Typically if you want to get reproducible results you'll use the same seed over and over again throughout your simulations.
You are setting the seed of the random number generator so you can get reproducible results. For example:
np.random.seed(0)
np.random.randint(0,100,10)
Output:
array([44, 47, 64, 67, 67, 9, 83, 21, 36, 87])
Now, if you run the same code on your computer, you should get the same 10 numbers: random integers from 0 to 99 (the upper bound passed to randint is exclusive).
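
As for the difference between seed(0) and seed(42): neither is better, they simply select different sequences. A small sketch using the standard-library random module (also a Mersenne Twister, but not the same stream as NumPy's) shows this; the helper name draw is hypothetical.

```python
import random


def draw(seed, n=5):
    """Return n integers in [0, 99] from a generator started at the given seed."""
    rng = random.Random(seed)  # independent generator, leaves global state alone
    return [rng.randint(0, 99) for _ in range(n)]


if __name__ == "__main__":
    print(draw(0))
    print(draw(0))   # identical to the line above
    print(draw(42))  # a different, but equally reproducible, sequence
```

The particular number carries no meaning of its own; it only has to be the same whenever you want the same results.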

Negamax: what to do with "partial" results after canceling a search?

I'm implementing negamax with alpha/beta pruning and a transposition table, based on the pseudo code here, with roughly this algorithm:
NegaMax():
1. Transposition Table lookup
2. Loop through moves
2a. **Bail if I'm out of time**
2b. Make move, call -NegaMax, undo move
2c. Update bestvalue, alpha/beta cut if appropriate
3. Transposition table store/update
4. Return bestvalue
I'm also using iterative deepening, calling NegaMax with progressively higher depths.
My question is: when I determine I've run out of time (2a. in the beginning of move loop) what is the right thing to do? Do I bail immediately (not updating the transposition table) or do I just break the loop (saving whatever partial work I've done)?
Currently, I return null at that point, signifying that the search was canceled before "completing" that node (whether by trying every move or by an alpha/beta cut). The null gets propagated up the stack, and each node on the way up bails by returning, so step 3 never runs.
Essentially, I only store values in the TT if the node "completed". The scenario I keep seeing with the iterative deepening:
I get through depths 1-5 really quick, so the TT has a depth = 5, type = Exact entry.
The depth = 6 search is taking a long time, so I bail.
I ultimately return the best move in the transposition table, which is the move I found during the depth = 5 search. The problem is, if I start a new depth = 6 search, it feels like I'm starting it from scratch. However, if I save whatever partial results I found, I worry that I'll have corrupted my TT, potentially by overwriting the completed depth = 5 entry with an incomplete depth = 6 entry.
If the search wasn't completed, the score is inaccurate and should likely not be added to the TT. If you have a best move from the previous ply and it is still best and the score hasn't dropped significantly, you might play that.
On the other hand, if at depth 6 you discover that the opponent has a mate in 3 (oops!) or could win your queen, you might have to spend even more time to try to resolve that.
That would leave you with less time for the remaining moves (if any...), but it might be better to be slightly short on time than to get mated with plenty of time remaining. :-)
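
The "don't store incomplete nodes" approach can be sketched as follows. This is only an illustration under stated assumptions: the game is a toy Nim variant (take 1 to 3 stones, whoever takes the last stone wins), the transposition table is a plain dict keyed by pile size, and the names negamax and iterative_deepening as well as the scoring (+1, -1, 0) are hypothetical, not taken from the question's engine.

```python
import time

EXACT, LOWER, UPPER = "exact", "lower", "upper"


def negamax(pile, depth, alpha, beta, tt, deadline):
    """Score for the side to move (+1 win, -1 loss, 0 unknown), or None if cancelled."""
    alpha_orig = alpha
    entry = tt.get(pile)
    if entry is not None and entry["depth"] >= depth:  # 1. TT lookup
        if entry["flag"] == EXACT:
            return entry["value"]
        if entry["flag"] == LOWER:
            alpha = max(alpha, entry["value"])
        else:
            beta = min(beta, entry["value"])
        if alpha >= beta:
            return entry["value"]
    if pile == 0:
        return -1  # the opponent took the last stone
    if depth == 0:
        return 0  # horizon reached: no information
    best = -2
    for take in (1, 2, 3):  # 2. loop through moves
        if take > pile:
            break
        if time.monotonic() >= deadline:
            return None  # 2a. out of time: leave the TT untouched for this node
        child = negamax(pile - take, depth - 1, -beta, -alpha, tt, deadline)
        if child is None:
            return None  # propagate the cancellation upwards
        best = max(best, -child)  # 2c. update best value and window
        alpha = max(alpha, best)
        if alpha >= beta:
            break
    if best <= alpha_orig:
        flag = UPPER
    elif best >= beta:
        flag = LOWER
    else:
        flag = EXACT
    tt[pile] = {"depth": depth, "value": best, "flag": flag}  # 3. store
    return best


def iterative_deepening(pile, max_depth, deadline):
    """Return (depth, score) of the deepest search that finished, or None."""
    tt, result = {}, None
    for depth in range(1, max_depth + 1):
        score = negamax(pile, depth, -2, 2, tt, deadline)
        if score is None:
            break  # keep the result of the last completed depth
        result = (depth, score)
    return result
```

Note that only the nodes on the cancelled path skip the store. Any subtree that finished before the deadline has already written its entry with its own depth, so a restarted search at the same depth does not begin entirely from scratch, and the completed shallower root entry is never overwritten by a partial one.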

Mechanical Turk: how to keep HITs from different batches from being collapsed

I have a problem with the distribution of HITs from different batches.
The situation is the following: I have 3 batches with 17 HITs each, and I prepared 3 different templates.
What I would like to do is that whenever a worker accepts my HITs, he is shown the 17 HITs connected to a template, and only those (template 1, batch 1).
Then, if he chooses to do another 17, he is shown the other 17 HITs (template 2, batch 2), etc.
What seems to happen is that they see more than 17 HITs, in a sequence (batch 1, part of batch 2): how can I prevent batches from being collapsed? I thought it would have been enough to publish different batches via different templates.
Many thanks in advance!
Gabriella
They'll be collapsed in the system if nothing differs between the HITType characteristics of the batch. So, in order to keep them separate, change one of those properties (e.g., the title, description, keywords, etc.). This will assign each batch a distinct HITTypeId and keep them separated in the system.

Implementing Round Robin insertions to oracle rac db with the help of sequence

Problem
My system inserts records into an Oracle RAC DB at a rate of 600 tps. During the insertion procedure call, each record is assigned a sequence value, so that the records get distributed among 20 different batch ids (implementation of a round robin mechanism).
Following is the step for selecting batch
1) A record comes. Assigns nextValue from a sequence.
2) Do MOD(sequence,20). It gives values from 0 to 19.
Issue:
3 records come to the DB simultaneously and hit 3 different nodes in RAC.
They come out with sequence values 2, 102 and 1002.
MOD for all happens to be same.
All try to get into the same batch.
Round Robin fails here.
Please help to resolve the issue.
This is due to the implementation of sequences on RAC. When a node is first asked for the next value of a sequence, it gets a bunch of them (e.g. 100 to 119) and then hands them out until it needs a new lot, when it gets another bunch (e.g. 160 to 179). While Node 1 is handing out 100 then 101, Node 2 will be handing out 121, 122, etc.
The size of the 'bunch' is controlled, as I remember, by the cache size defined on the sequence. If you disable caching (NOCACHE), the values will be handed out sequentially. However, doing that will involve the nodes in management overhead while they work out what the next value actually is, and at 600 tps this might not be a good idea: you'd have to try it and see.
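
The collision can be reproduced with a toy model of per-node caching. This is a plain Python simulation, not Oracle code: the class SequenceSim, the node names and the helper first_batches are hypothetical, and a cache size of 1 stands in for an uncached sequence.

```python
class SequenceSim:
    """Toy model of a sequence whose values are cached per node in blocks."""

    def __init__(self, cache_size):
        self.cache_size = cache_size
        self.next_block_start = 1
        self.node_ranges = {}  # node -> [next value, end of block (exclusive)]

    def nextval(self, node):
        block = self.node_ranges.get(node)
        if block is None or block[0] >= block[1]:
            # The node has no cached values left: reserve a fresh block.
            start = self.next_block_start
            self.next_block_start += self.cache_size
            block = [start, start + self.cache_size]
            self.node_ranges[node] = block
        value = block[0]
        block[0] += 1
        return value


def first_batches(cache_size, nodes=("node1", "node2", "node3"), buckets=20):
    """Batch id (value MOD buckets) of the first record arriving on each node."""
    seq = SequenceSim(cache_size)
    return [seq.nextval(node) % buckets for node in nodes]


if __name__ == "__main__":
    print(first_batches(100))  # every node lands in the same batch
    print(first_batches(1))    # no caching: consecutive values, distinct batches
```

Whenever the cache size is a multiple of the number of batches, simultaneous inserts on different nodes start at values that are congruent modulo 20, which is the pattern in the question (2, 102, 1002).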