I wrote a Python script using the TensorFlow API, including a SummaryWriter that dumps the graph definition so I can look at it in TensorBoard.
When running the script, a NotFoundError is thrown saying PruneForTargets: Some target nodes not found: Reading/data_queue_EnqueueMany_1. As its name implies, the node in question was created by an enqueue_many call on a FIFOQueue (which is then started in a QueueRunner); it does in fact exist, and can be seen clearly in TensorBoard.
What could cause TensorFlow to not find some nodes?
This is a known issue that occurs when you start threads that access the TensorFlow graph (e.g. your QueueRunner) and then continue adding nodes to the graph afterward. (The underlying tf.Graph data structure is not thread-safe for concurrent reads and writes.)
The solution is to move tf.train.start_queue_runners(sess) (and any other code that starts threads) after the last node has been constructed. One way to double-check this is to add a call to tf.get_default_graph().finalize() immediately before calling start_queue_runners(); it will raise an exception if any nodes are added to the graph after that point.
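A minimal sketch of this ordering (the queue and the dummy data below are just placeholders for your own input pipeline):

```python
import tensorflow as tf

# Build the *entire* graph first: queues, enqueue ops, model, train op, ...
queue = tf.FIFOQueue(capacity=100, dtypes=[tf.float32], name="data_queue")
enqueue_op = queue.enqueue_many([tf.random_normal([10])])
tf.train.add_queue_runner(tf.train.QueueRunner(queue, [enqueue_op]))
batch = queue.dequeue()

with tf.Session() as sess:
    # Optional safety check: any later attempt to add a node now raises an error.
    tf.get_default_graph().finalize()

    # Only start the queue-runner threads once the graph is complete.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    print(sess.run(batch))

    coord.request_stop()
    coord.join(threads)
```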
There's a fairly clear difference between a model and a frozen model. As described in the model_files guide, the relevant part is "Freezing":
...so there's the freeze_graph.py script that takes a graph definition and a set of checkpoints and freezes them together into a single file.
Is a "saved_model" most similar to a "frozen_model" (and not a saved
"model_checkpoint")?
Is this defined somewhere in docs I'm missing?
In prior versions of tensorflow we would save and restore model weights, but this seems to be in the context of a "model_checkpoint" and not a "saved_model"; is that still correct?
I'm asking more for the design overview here, not implementation specifics.
A checkpoint file only contains the variables for a specific model, and it should be loaded either with exactly the same, predefined graph or with a specific assignment_map to load only chosen variables. See https://www.tensorflow.org/api_docs/python/tf/train/init_from_checkpoint
A SavedModel is broader because it contains the graph, which can be loaded within a session so that training can be continued. A frozen graph, however, is serialized with its variables converted to constants and cannot be used to continue training.
You can find all the info here https://www.tensorflow.org/guide/saved_model
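As a rough sketch of the distinction (the variable names, paths, and tensor names below are illustrative only):

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 4], name="x")
w = tf.get_variable("w", shape=[4, 2])
logits = tf.matmul(x, w, name="logits")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # Checkpoint: variable values only; restoring requires rebuilding the same
    # graph (or passing an assignment_map to tf.train.init_from_checkpoint).
    tf.train.Saver().save(sess, "/tmp/model.ckpt")

    # SavedModel: graph plus variables; can be reloaded into a new session,
    # and training can continue from it.
    tf.saved_model.simple_save(sess, "/tmp/saved_model",
                               inputs={"x": x}, outputs={"logits": logits})

    # Frozen graph: variables folded into constants in a single GraphDef;
    # useful for serving, but not for continuing training.
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, tf.get_default_graph().as_graph_def(), ["logits"])
```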
I actually want to implement model parallelism automatically in TensorFlow.
I slightly modified TensorFlow's placement code (simple_placer.cc) in version 1.3. The placement works for MNIST, but it raises the following error on Inception:
InvalidArgumentError (see above for traceback): Trying to access resource located in device /job:worker/replica:0/task:1/cpu:0 from device /job:worker/replica:0/task:0/cpu:0
I would like some advice about this error, such as when it comes up or what conditions cause it.
Thanks.
This error typically happens when an operation attempts to read one of its inputs, but that input resides on another device. Normally, when TensorFlow places operations on different devices, it inserts send/recv nodes into the execution graph to exchange tensors between those devices. Your changes might have broken some of that logic.
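To illustrate the normal case (a toy sketch, not your modified placer; the two-CPU configuration is just a way to get two local devices):

```python
import tensorflow as tf

# Two local CPU devices, purely to demonstrate a cross-device edge.
config = tf.ConfigProto(device_count={"CPU": 2}, log_device_placement=True)

with tf.device("/cpu:0"):
    a = tf.constant([1.0, 2.0], name="a")
with tf.device("/cpu:1"):
    b = tf.add(a, a, name="b")  # reads `a` from the other device

with tf.Session(config=config) as sess:
    # The runtime inserts _Send/_Recv ops on the /cpu:0 -> /cpu:1 edge,
    # so this cross-device read succeeds.
    print(sess.run(b))
```

If the placer changes leave an input on a device without the matching send/recv pair, or place a resource on one device while its reader sits on another, you would see exactly this kind of InvalidArgumentError.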
Why does the basic static, compiled computation graph structure of TF (as opposed to a dynamic graph) necessitate a dedicated while-loop node rather than allowing the use of "regular" Python control flow expressions?
Thanks.
TensorFlow builds the computational graph and makes it static (unchangeable) for efficiency. Once it is finalized, telling the TensorFlow graph to do something is like sending input to a separate program that you can no longer change except by passing in different inputs. At that point the graph has no knowledge of your Python control flow; it just runs when called. Because of this, it needs to know explicitly, ahead of time, where you want a while loop inside the TensorFlow graph. You can, however, still use Python control flow and simply call the TensorFlow graph as though it were an ordinary function.
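A small sketch of the two styles (illustrative only):

```python
import tensorflow as tf

# Graph-side loop: the loop itself is a node in the graph, so TensorFlow
# must know about it at construction time via tf.while_loop.
i0 = tf.constant(0)
loop = tf.while_loop(lambda i: i < 10, lambda i: i + 1, [i0])

# Python-side loop: the graph only contains a single increment op;
# Python decides how many times to call it.
x = tf.placeholder(tf.int32)
inc = x + 1

with tf.Session() as sess:
    print(sess.run(loop))           # 10, computed inside the graph
    val = 0
    while val < 10:                 # control flow lives outside the graph
        val = sess.run(inc, feed_dict={x: val})
    print(val)                      # 10, computed by repeated session calls
```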
I am constructing a neural network in Tensorflow. I am using tf.layers module.
For some reason, in the graph visualisation I am seeing a 'report_uninitialized_variables' node connected to every part of my graph.
Does anyone have an explanation of this? Is it related to the get_variable and variable_scope methods?
The graph seems to work; I am just trying to understand the meaning of these nodes. I am not sure whether it is related to the fact that I am using a MonitoredTrainingSession.
It seems to be connected to all the variables, including those of the optimizer.
https://i.stack.imgur.com/ySFM5.png
There is a sort of init node, but it seems to say no-op, so I am not sure whether proper initialization is done by the MonitoredTrainingSession. The strange thing is that the graph still works and no initialization error is given. https://i.stack.imgur.com/umrRA.png
Did you use tf.train.Supervisor() in your code? I had the same situation when I used tf.train.Supervisor(). When a tf.train.Supervisor() object is created, it automatically verifies that the model is fully initialized by running the tf.report_uninitialized_variables() operation, and this is why you see a report_uninitialized_variables block in your TensorBoard graph. You can stop the Supervisor from running that verification, so the block will no longer appear in your graph.
Solution: tf.train.Supervisor(ready_op=None)
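For example (a minimal sketch; the logdir is just a placeholder):

```python
import tensorflow as tf

# ready_op=None skips the readiness check, so no
# report_uninitialized_variables node is added to the graph.
sv = tf.train.Supervisor(logdir="/tmp/train_logs", ready_op=None)
with sv.managed_session() as sess:
    pass  # run your training steps here
```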
Is it possible to share a queue between two graphs in TensorFlow? I'd like to do a kind of bootstrapping to select "hard negative" examples during training.
To speed up the process, I want separate threads for hard negative example selection, and for the training process. The hard negative selection is based on the evaluation of the current model, and it will load its graph from a checkpoint file. The training graph is run on another thread and writes the checkpoint file. The two graphs should share the same queue: the training graph will consume examples and the hard negative selection will produce them.
Currently there's no support for sharing state between different graphs in the open-source version of TensorFlow: each graph runs in a separate session, and each session uses an isolated set of devices.
However, it seems it would be possible to achieve your goal using a queue in a single graph. Simply construct a queue (using e.g. tf.FIFOQueue) and use tf.import_graph_def() to import the graph from the checkpoint file into the current graph. Using the return_elements argument to tf.import_graph_def() you can specify the name of the tensor that will contain the negative examples, and then add a q.enqueue_many() operation to add them to your queue. You would then fork a thread to run the enqueue_many operation in a loop. In your training graph, use q.dequeue_many() to get a batch of negative examples and feed them as the input to your training process.
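A rough sketch of that structure (the GraphDef path, the "hard_negatives:0" tensor name, and the shapes are all placeholders for whatever your evaluation model actually exports):

```python
import threading
import tensorflow as tf

# Single graph: one queue shared by the hard-negative miner and the trainer.
queue = tf.FIFOQueue(capacity=1000, dtypes=[tf.float32], shapes=[[128]])

# Import the evaluation model's GraphDef and pull out the tensor that holds
# the hard negative examples (assumed here to have shape [None, 128]).
graph_def = tf.GraphDef()
with tf.gfile.GFile("/tmp/eval_model.pb", "rb") as f:
    graph_def.ParseFromString(f.read())
hard_negatives, = tf.import_graph_def(
    graph_def, return_elements=["hard_negatives:0"])

enqueue_op = queue.enqueue_many(hard_negatives)
negative_batch = queue.dequeue_many(32)  # consumed by the training graph

def mining_loop(sess):
    while True:
        sess.run(enqueue_op)  # keep refilling the shared queue

with tf.Session() as sess:
    threading.Thread(target=mining_loop, args=(sess,), daemon=True).start()
    batch = sess.run(negative_batch)  # training side dequeues a batch
```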