Problem when predicting via multiprocess with Tensorflow - tensorflow

I have 4 (or more) models (same structure but different training data). I want to ensemble them to make a prediction: pre-load the models, then predict one input message at a time in parallel via multiprocessing. However, the program always hangs at the "session.run" step, and I cannot figure out why.
I tried passing all arguments to the function in each process, as shown in the code below. I also tried putting all the data (except the model object) in a Queue object, and I tried setting the number of processes to 1. None of this made a difference.
with Manager() as manager:
    first_level_test_features = manager.list()
    procs = []
    for id in range(4):
        p = Process(target=predict, args=(id, (message, models, configs, vocabs, emoji_dict, first_level_test_features)))
        procs.append(p)
        p.start()
    for p in procs:
        p.join()
I did not get any error message; the program just hangs there. I expected it to start multiple processes, with each process using the model passed to it to make the prediction.

I am unsure how session sharing across different processes would work, and this is probably where your issue comes from. Given the way TensorFlow works, I would advise implementing the ensemble call as a graph operation, so that it can be run through a single session.run call, with TF handling the parallelization of computations wherever possible.
In practice, if you have symbolic tensors representing the models' predictions, you could use a TF operation to aggregate them (tf.concat, tf.reduce_mean, tf.add_n... whichever suits your design) and end up with a single symbolic tensor representing the ensemble prediction.
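As a framework-free illustration of the aggregation step, the sketch below averages per-model class probabilities element-wise, which is the arithmetic tf.reduce_mean would perform over the stacked prediction tensors. The function name `ensemble_mean` and the sample probabilities are made up for the example; in a real graph the inputs would be symbolic tensors rather than Python lists.

```python
def ensemble_mean(predictions):
    """Average per-model predictions element-wise.

    `predictions` holds one entry per model; each entry is a list of class
    probabilities of the same length (hypothetical data layout).
    """
    if not predictions:
        raise ValueError("need at least one model prediction")
    n_classes = len(predictions[0])
    if any(len(p) != n_classes for p in predictions):
        raise ValueError("all models must predict the same number of classes")
    n_models = len(predictions)
    return [sum(p[i] for p in predictions) / n_models for i in range(n_classes)]


# Made-up outputs of four models for one message, two classes each.
per_model = [[0.5, 0.5], [0.25, 0.75], [0.75, 0.25], [0.5, 0.5]]
ensemble = ensemble_mean(per_model)
```

Other aggregations (a sum, a weighted mean, a majority vote) would slot into the same place, depending on the design.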
I hope this helps; if not, please provide some more details as to what your setting is, notably which form your models have.

Related

Learning parameters of each simulated device

Does tensorflow-federated support assigning different hyper-parameters(like batch-size or learning rate) for different simulated devices?
Currently, you may find this a bit unnatural, but yes, such a thing is possible.
One approach to doing this that is supported today is to have each client take its local learning rate as a top-level parameter, and use this in the training. A dummy example here would be (eliding the model parameter in the computations below) something along the lines of:
@tff.tf_computation(tff.SequenceType(...), tf.float32)
def train_with_learning_rate(ds, lr):
    # run training with `tf.data.Dataset` ds and learning rate lr
    ...
# Each client holds a (dataset, learning rate) pair, hence the CLIENTS placement.
@tff.federated_computation(
    tff.FederatedType([tff.SequenceType(...), tf.float32], tff.CLIENTS))
def run_one_round(datasets_and_lrs):
    return tff.federated_mean(
        tff.federated_map(train_with_learning_rate, datasets_and_lrs))
Invoking the federated computation here with a list of tuples, where the first element of each tuple is the client's data and the second element is that client's learning rate, would give what you want.
Such a thing requires writing custom federated computations, and in particular likely defining your own IterativeProcess. A similar iterative process definition was recently open sourced here (the link goes to the relevant local client function definition). It allows learning rate scheduling on the clients by taking an extra integer parameter representing the round number, so it is likely a good place to look.
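To make the data flow concrete without depending on TFF, here is a plain-Python simulation of one round: each client receives a (dataset, learning rate) pair, takes one local gradient step on a scalar model, and the server averages the results. This is not the TFF API; `client_update`, `run_one_round`, and the toy quadratic loss are hypothetical stand-ins for the TensorFlow training function, the federated computation, and real training.

```python
def client_update(weight, dataset, lr):
    """One gradient step on the loss mean((weight - x)^2) over the client's data."""
    grad = sum(2.0 * (weight - x) for x in dataset) / len(dataset)
    return weight - lr * grad


def run_one_round(weight, datasets_and_lrs):
    """Run every client with its own learning rate, then average the results."""
    updated = [client_update(weight, ds, lr) for ds, lr in datasets_and_lrs]
    return sum(updated) / len(updated)


# Two simulated clients with different learning rates (made-up values).
clients = [([1.0, 3.0], 0.25), ([4.0], 0.5)]
new_weight = run_one_round(0.0, clients)
```

The list of (dataset, learning rate) tuples plays the role of the argument passed to the federated computation: one tuple per simulated device.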

How to assign value to one graph with the other graph who has the same structure in tensorflow?

I'm trying to implement DQN in TensorFlow. I have one target network and one training network that share the same structure. At the beginning of every 10000 training steps, I want to load the values from a checkpoint into both the target network and the training network, then apply stop_gradient to the target network. I tried the following approaches, and neither worked:
1. Put the two networks in one graph. However, every time I load them, I don't know how to assign the values of the training network part to the target network part. (They are saved as different variables, since one uses stop gradient.)
2. Define two graphs using tf.Graph() and run two sessions, one for each. However, I can't load the checkpoint of one graph into the other, even though they have the same structure. After all, they are two different graphs.
Can anyone give me some advice? It would be much appreciated!
The typical approach would be to put everything in one graph, put your two networks in two name scopes, create a tf.assign op from each variable in one scope to its counterpart in the other, and then use tf.group to construct a final "copying" operation. Let's assume that the function create_net() builds a single network:
import tensorflow as tf
# Note: tf.name_scope only prefixes variables created with tf.Variable. If
# create_net() uses tf.get_variable, use tf.variable_scope instead.
with tf.name_scope('main_network'):
    main_net = create_net()
with tf.name_scope('target_network'):
    target_network = create_net()
# tf.GraphKeys.GLOBAL_VARIABLES was named tf.GraphKeys.VARIABLES before TF 0.12.
main_variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='main_network')
target_variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='target_network')
# I am assuming get_collection returns variables in the same order, please double
# check this is actually happening
assign_ops = []
for main_var, target_var in zip(main_variables, target_variables):
    assign_ops.append(tf.assign(target_var, tf.identity(main_var)))
copy_operation = tf.group(*assign_ops)
Now executing copy_operation in session.run should copy your main network parameters to the target network. The above code should be considered pseudocode rather than something you can copy and paste.
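Because the snippet relies on both collections returning variables in the same order, one way to remove that assumption is to pair variables by their name relative to the scope. The sketch below shows only the pairing logic, on plain name strings; `pair_by_relative_name` is a hypothetical helper and the variable names are made up. With real variables, the same idea would key each variable on its `name` attribute.

```python
def pair_by_relative_name(main_names, target_names,
                          main_scope="main_network",
                          target_scope="target_network"):
    """Return (main, target) name pairs matched on the part after the scope."""
    def strip(name, scope):
        prefix = scope + "/"
        if not name.startswith(prefix):
            raise ValueError("%r is not under scope %r" % (name, scope))
        return name[len(prefix):]

    # Index target names by their scope-relative name for order-free lookup.
    targets = {strip(n, target_scope): n for n in target_names}
    pairs = []
    for name in main_names:
        key = strip(name, main_scope)
        if key not in targets:
            raise KeyError("no target variable matches %r" % name)
        pairs.append((name, targets[key]))
    return pairs


# Made-up variable names, deliberately listed in different orders.
main = ["main_network/dense/kernel:0", "main_network/dense/bias:0"]
target = ["target_network/dense/bias:0", "target_network/dense/kernel:0"]
pairs = pair_by_relative_name(main, target)
```

Each resulting pair would then feed one assign op, and a missing counterpart fails loudly instead of silently copying into the wrong variable.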

feed data into a tf.contrib.data.Dataset like a queue

About the tf.contrib.data.Dataset (from TensorFlow 1.2, see here and here) usage:
The way a Dataset obtains its data doesn't really fit how I usually get my data. In my case, I have a thread in which I receive data. I don't know in advance when the data will end, but I can tell when it has ended. I then wait until all the buffers have been processed, and at that point one epoch is finished. How can I implement this logic with the Dataset?
Note that I prefer the Dataset interface over the QueueBase interface because it gives me the iterator interface, which I can reinitialize and even reset to a different Dataset. This is more powerful than queues, which currently cannot be reopened after they are closed (see here and here).
Maybe a similar question, or the same question: how can I wrap a Dataset around a queue? I have a thread which reads some data from somewhere and which can feed it and queue it somehow. How do I get the data into the Dataset? I could repeat some dummy tensor infinitely many times and then use map to just return my queue.dequeue(), but that only gets me back to all the original problems with the queue, i.e. how to reopen the queue.
The new Dataset.from_generator() method allows you to define a Dataset that is fed by a Python generator. (To use this feature at present, you must download a nightly build of TensorFlow or build it yourself from source. It will be part of TensorFlow 1.4.)
The easiest way to implement your example would be to replace your receiving thread with a generator, with pseudocode as follows:
def receiver():
    while True:
        next_element = ...  # Receive next element from external source.
                            # Note that this method may block.
        end_of_epoch = ...  # Decide whether or not to stop based on next_element.
        if not end_of_epoch:
            yield next_element  # Note: you may need to convert this to an array.
        else:
            return  # Returning will signal OutOfRangeError on downstream iterators.
dataset = tf.contrib.data.Dataset.from_generator(receiver, output_types=...)
# You can chain other `Dataset` methods after the generator. For example:
dataset = dataset.prefetch(...)  # This will start a background thread
                                 # to prefetch elements from `receiver()`.
dataset = dataset.repeat(...)  # Note that each repetition will call
                               # `receiver()` again, and start from
                               # a fresh state.
dataset = dataset.batch(...)
More complicated topologies are possible. For example, you can use Dataset.interleave() to create many receivers in parallel.
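If the receiving thread has to stay, one possible pattern is to let it push elements into a queue.Queue and have the generator drain that queue until an end-of-epoch sentinel arrives. The sketch below uses only the Python standard library; `producer`, `END_OF_EPOCH`, `data_queue`, and the integer payloads are hypothetical placeholders for the real data source. A generator like `receiver` is the kind of callable that would be handed to Dataset.from_generator.

```python
import queue
import threading

END_OF_EPOCH = object()  # Sentinel marking the end of one epoch.
data_queue = queue.Queue(maxsize=8)


def producer(items):
    """Stand-in for the receiving thread: enqueue data, then the sentinel."""
    for item in items:
        data_queue.put(item)  # Blocks while the queue is full.
    data_queue.put(END_OF_EPOCH)


def receiver():
    """Yield queued elements until the producer signals end of epoch."""
    while True:
        element = data_queue.get()  # Blocks until the producer delivers something.
        if element is END_OF_EPOCH:
            return
        yield element


thread = threading.Thread(target=producer, args=([1, 2, 3],))
thread.start()
received = list(receiver())
thread.join()
```

Since the generator returns as soon as it sees the sentinel, each new epoch needs the producer to run again and enqueue a fresh sentinel at its end.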

Can I change Inv operation into Reciprocal in an existing graph in Tensorflow?

I am working on an image classification problem with tensorflow. I have 2 different CNNs trained separately (in fact 3 in total, but I will deal with the third later), for different tasks and on an AWS (Amazon) machine. One tells if there is text in the image and the other one tells if the image is safe for work or not. Now I want to use them in a single script on my computer, so that I can put an image as input and get the results of both networks as output.
I load the two graphs in a single tensorflow Session, using the import_meta_graph API and the import_scope argument and putting each subgraph in a separate scope. Then I just use the restore method of the created saver, giving it the common Session as argument.
Then, in order to run inference, I retrieve the placeholders and final output with graph=tf.get_default_graph() and my_var=graph.get_operation_by_name('name').outputs[0] before using it in sess.run (I think I could just have put 'name' in sess.run instead of fetching the output tensor and putting it in a variable, but this is not my problem).
My problem is that the text CNN works perfectly fine, but the nsfw detector always gives me the same output, no matter the input (even with np.zeros()). I have tried each network separately and got the same result: text works but nsfw does not. So I don't think the problem comes from using two networks simultaneously.
I also tried on the original AWS machine I trained it on, and this time the nsfw CNN worked perfectly.
Both networks are very similar. I checked on Tensorboard whether everything was fine, and I think it is. The differences are in the number of hidden units and the fact that I use batch normalization in the nsfw model and not in the text one. Now, why this title? I observed that I got a warning when running the nsfw model that I didn't get when using only the text model:
W tensorflow/core/framework/op_def_util.cc:332] Op Inv is deprecated. It will cease to work in GraphDef version 17. Use Reciprocal.
So I thought maybe this was the reason, everything else being equal. I checked my GraphDef version, which seems to be 11, so Inv should still work in theory. By the way, the AWS machine uses tensorflow version 0.10 and I use version 0.12.
I noticed that the text network only had one Inv operation (found by filtering on the names of the operations given by graph.get_operations()), and that the nsfw model had the same operation plus multiple Inv operations due to the batch normalization layers. As stated in the release notes, tf.inv has simply been renamed to tf.reciprocal, so I tried to change the names of the operations to Reciprocal with tf.group(), as proposed here, but it didn't work. I have seen that using tf.identity() and changing the name could also work, but from what I understand, tensorflow graphs are an append-only structure, so we can't really modify their operations (which seem to be immutable anyway).
The thing is:
- as I said, the Inv operation should still work in my GraphDef version;
- this is only a warning;
- the Inv operations only appear under name scopes that begin with 'gradients' so, from my understanding, they shouldn't be used for inference;
- the text model also has an Inv operation.
For these reasons, I have serious doubts about my diagnosis. So my final questions are:
- do you have another diagnosis?
- if mine is correct, is it possible to replace Inv operations with Reciprocal operations, or do you have any other solution?
After a thorough examination of the output of relevant nodes, with the help of Tensorboard, I am now pretty certain that the renaming of Inv to Reciprocal has nothing to do with my problem.
It appears that the last batch normalization layer eliminates almost all variance in its output when its input varies. I will ask why elsewhere.

Tensorflow--how to limit epochs with evaluation only?

Given that I train a model, save it off with metagraph/save.Saver, and then load that graph into a new script/process to test against test data, what is the best way to make sure I only iterate over the test data once?
With my training data, I want to be able to iterate over the entire data set for an arbitrary number of iterations. I use
tf.train.string_input_producer()
to drive a queue of loading files for training, so I can safely leave num_epochs as default (=None) and let other controls drive training termination.
However, when I run the graph for evaluation, I just want to evaluate the test set once (and gather the appropriate statistics).
Initial attempted solution:
Make a tensor for the number of epochs, pass that into tf.train.string_input_producer, and then tf.assign it the appropriate value depending on whether I am testing or training.
But:
tf.train.string_input_producer only takes integers as num_epochs, so this isn't possible...unless I'm missing something.
Further notes: I use
tf.train.batch()
to read-in test/train data that has been serialized into protocol buffers (https://www.tensorflow.org/versions/r0.11/how_tos/reading_data/index.html#file-formats), so I have minimal visibility into how the data is loaded and how far along it is.
tf.train.batch apparently will throw tf.errors.OutOfRangeError, but I'm not clear how to catch that successfully, or if that is even what I really want to do. I tried a very naive
try...except...finally
(like in https://www.tensorflow.org/versions/r0.11/how_tos/reading_data/index.html#creating-threads-to-prefetch-using-queuerunner-objects), which didn't catch the error from tf.train.batch.