TensorFlow model parallelism error

I want to implement model parallelism automatically in TensorFlow.
I made a small change to TensorFlow's placement code (simple_placer.cc) in version 1.3. The modified placement works for MNIST, but it fails on Inception with:
InvalidArgumentError (see above for traceback): Trying to access resource located in device /job:worker/replica:0/task:1/cpu:0 from device /job:worker/replica:0/task:0/cpu:0
I would appreciate some advice about this error, such as when it comes up or what conditions trigger it.
Thanks.

This error typically happens when an operation attempts to read one of its inputs, but that input resides on another device. Normally, when TensorFlow places operations on different devices, it inserts Send/Recv nodes into the execution graph to exchange tensors between those devices. Your changes might have broken some of that logic.
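For reference, when operations are explicitly placed on different devices, TensorFlow partitions the graph and moves ordinary tensors across the cut with Send/Recv pairs; resource handles (e.g. resource variables) are the exception, since they cannot be copied, and an op that reads a resource must be colocated with it, which is exactly what the error message complains about. A minimal placement sketch (TF 1.x, hypothetical cluster and device names):
import tensorflow as tf

with tf.device('/job:worker/replica:0/task:0/cpu:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])

with tf.device('/job:worker/replica:0/task:1/cpu:0'):
    # The value of `a` crosses a device boundary here; when TensorFlow
    # partitions the graph it inserts a Send/Recv pair on this edge.
    b = tf.matmul(a, a)

with tf.Session('grpc://localhost:2222') as sess:  # hypothetical worker address
    print(sess.run(b))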

Related

How to drop elements in dataset that can cause an error while training a TensorFlow Lite model

I am trying to train a simple image classification model using TensorFlow Lite. I am following this documentation to write my code. As specified in the documentation, in order to train my model, I have written model = image_classifier.create(train_data, model_spec=model_spec.get('mobilenet_v2'), validation_data=validation_data). After training for a few seconds, however, I get an InvalidArgumentError.
I believe the error is due to something in my dataset, but it is too difficult to eliminate all the sources of the error from the dataset manually because it consists of thousands of images. After some research, I found a potential solution: I could use tf.data.experimental.ignore_errors, which would "produce a dataset that contains the same elements as the input, but silently drop any elements that caused an error."
From the documentation (here), however, I couldn't figure out how to integrate this transformation with my code. If I place the line dataset = dataset.apply(tf.data.experimental.ignore_errors()) before training the model, the system doesn't know which elements to drop. If I place it after, the system never reaches the line because the error arises during training. Moreover, the system gives the error message AttributeError: 'ImageClassifierDataLoader' object has no attribute 'apply'.
I would appreciate it if someone could tell me how to integrate tf.data.experimental.ignore_errors() with my model, or suggest possible alternatives to the issue I am facing.
Hi, if you are following the documentation exactly, then tf.data.experimental.ignore_errors won't work for you, because you are not loading your data with tf.data; you are most likely using from tflite_model_maker.image_classifier import DataLoader.
Note: please share the complete code snippet so it is possible to help you solve the issue.
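That said, if you build the input pipeline yourself with tf.data instead of the Model Maker DataLoader, ignore_errors can drop the images that fail to decode. A rough sketch (the file pattern, image size, and label handling are placeholders, not taken from your code):
import tensorflow as tf

def load_image(path):
    data = tf.io.read_file(path)
    image = tf.io.decode_jpeg(data, channels=3)   # this is what raises InvalidArgumentError on corrupt files
    return tf.image.resize(image, [224, 224])

files = tf.data.Dataset.list_files('images/*/*.jpg')           # placeholder pattern
dataset = files.map(load_image)
dataset = dataset.apply(tf.data.experimental.ignore_errors())  # silently drops the failing elements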

Tensorflow inference run time high on first data point, decreases on subsequent data points

I am running inference using one of the models from TensorFlow's object detection module. I'm looping over my test images in the same session and calling sess.run(). However, on profiling these runs, I noticed that the first run always takes much longer than the subsequent runs.
I found an answer here as to why that happens, but it offered no solution for how to fix it.
I'm deploying the object detection inference pipeline on an Intel i7 CPU. The time for one session.run() for the 1st, 2nd, 3rd, and 4th image looks something like this (in seconds; a sketch of the timing loop follows the list):
1. 84.7132628
2. 1.495621681
3. 1.505012751
4. 1.501652718
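The loop itself looks roughly like this (detection_graph is the frozen graph already loaded into memory, test_images is a placeholder list of HxWx3 uint8 arrays, and the tensor names follow the object detection demo):
import time
import numpy as np
import tensorflow as tf

with tf.Session(graph=detection_graph) as sess:
    image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
    boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
    for i, image in enumerate(test_images):
        start = time.time()
        sess.run(boxes, feed_dict={image_tensor: np.expand_dims(image, 0)})
        print('image %d: %.6f s' % (i + 1, time.time() - start))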
Some background on what I have tried:
I tried the TFRecords approach TensorFlow gives as a sample here. I hoped it would work better because it doesn't use a feed_dict, but since more I/O operations are involved, I'm not sure it'll be ideal. I tried making it work without writing to disk, but I always got some error regarding the encoding of the image.
I tried using TensorFlow Datasets to feed the data, but I wasn't sure how to provide the input, since during inference I need to provide input for the "image tensor" key in the graph. Any ideas on how to use this to provide input to a frozen graph?
Any help will be greatly appreciated!
TLDR: Looking to reduce the run time of inference for the first image - for deployment purposes.
Even though I have seen the first inference take longer, the difference shown here (84 s vs. 1.5 s) seems a bit hard to believe. Are you also counting the time to load the model inside this metric? Could that account for the large difference? Is the topology complex enough to justify it?
My suggestions:
Try OpenVINO: see whether the topology you are working on is supported in OpenVINO. OpenVINO is known to speed up inference workloads by a substantial amount due to its ability to optimize network operations. Also, the time taken to load an OpenVINO model is comparatively lower in most cases.
Regarding the TFRecords approach, could you please share the exact error and the stage at which you got it?
Regarding TensorFlow Datasets, could you please check out https://github.com/tensorflow/tensorflow/issues/23523 and https://www.tensorflow.org/guide/datasets? Regarding the "image tensor" key in the graph, your original inference pipeline should give you some clues.
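On that last point, one pattern that may work with a frozen graph is to remap the input placeholder to a tf.data iterator via input_map when importing the graph def. A sketch (the graph path, the image_paths list, and the tensor names 'image_tensor:0' / 'detection_boxes:0' from the object detection API are assumptions):
import tensorflow as tf

image_paths = ['test1.jpg', 'test2.jpg']                        # placeholder list of test images
dataset = tf.data.Dataset.from_tensor_slices(image_paths)
dataset = dataset.map(lambda p: tf.image.decode_jpeg(tf.read_file(p), channels=3))
dataset = dataset.batch(1)
next_image = dataset.make_one_shot_iterator().get_next()

graph_def = tf.GraphDef()
with tf.gfile.GFile('frozen_inference_graph.pb', 'rb') as f:    # placeholder path
    graph_def.ParseFromString(f.read())

# Feed the dataset straight into the graph instead of going through feed_dict.
tf.import_graph_def(graph_def, input_map={'image_tensor:0': next_image}, name='')
boxes = tf.get_default_graph().get_tensor_by_name('detection_boxes:0')

with tf.Session() as sess:
    print(sess.run(boxes))                                      # each run consumes one image from the dataset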

Training a segmentation model: 4 GPUs are working, one fills up and I get "CUDA error: out of memory"

I'm trying to build a segmentation model, and I keep getting
"CUDA error: out of memory". After investigating, I realized that all 4 GPUs are working, but one of them keeps filling up.
Some technical details:
My Model:
The model is written in PyTorch and has 3.8M parameters.
My Hardware:
I have 4 GPUs (Titan V) with 12 GB of RAM each.
I'm trying to understand why one of my GPUs is filling up, and what I am doing wrong.
Evidence:
As can be seen from the screenshot below, all the GPUs are working, but one of them just keeps filling up until it reaches its limit.
Code:
I'll try to explain what I did in the code:
First my model:
model = model.cuda()
model = nn.DataParallel(model, device_ids=None)
Second, Inputs and targets:
inputs = inputs.to('cuda')
masks = masks.to('cuda')
Those are the lines that work with the GPUs; if I missed something and you need anything else, please share.
I feel like I'm missing something so basic that it will affect not only this model but also future models. I'd be more than happy for some help.
Thanks a lot!
Without knowing many of the details, I can say the following:
nvidia-smi is not the most reliable and up-to-date measurement mechanism
the PyTorch GPU allocator does not help either - it will cache blocks of memory artificially blowing up used resources (though it is not an issue here)
I believe there is still a "master" GPU which is the one data is loaded to directly (and then broadcast to other GPUs in DataParallel)
I don't know enough about PyTorch to answer reliably, but you can definitely check whether a single-GPU setup works with the batch size divided by 4, and perhaps whether you can load the model plus one batch at once (without processing it).
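For what it's worth, with the default settings nn.DataParallel scatters the batch across GPUs but gathers all outputs (and usually the loss computation) on device_ids[0], so that GPU holds extra activations, which would match the one-GPU-fills-up pattern. A sketch of the single-GPU sanity check suggested above (MyModel, train_dataset and batch_size stand in for your own code):
import torch
from torch.utils.data import DataLoader

device = torch.device('cuda:0')
model = MyModel().to(device)                     # placeholder for your segmentation network
loader = DataLoader(train_dataset, batch_size=batch_size // 4, shuffle=True)

inputs, masks = next(iter(loader))
inputs, masks = inputs.to(device), masks.to(device)
outputs = model(inputs)                          # if this fits in 12 GB, the gather on GPU 0 is the likely culprit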

Flink-like barrier for Tensorflow

Is there an equivalent of a Flink barrier for Tensorflow?
There seems to be no way to interact with the executor from any given kernel except by throwing an exception, and any deviation from a "pure" dataflow execution is not allowed, such as:
Producing no output for a given input
Producing multiple outputs for a given input (e.g. splitting a sentence into words). I get around this by having such a kernel take a queue reference and do the enqueuing itself, but this feels like a modularity violation.
Receiving some sort of "control tuple / Tensor" so that multiple kernels can synchronize at some point (e.g. to implement a barrier). In other words, the only schedulable code for each kernel is Compute() on the normal Input and Output Tensors.
Is there any way to get Tensorflow to be able to behave more like a streaming framework? Is using Tensorflow as a streaming framework an unintended / improper use of it?
TensorFlow kernels can't behave like proper units in a streaming framework: as you pointed out, they are called once per set of inputs and expected to produce exactly one set of outputs each time. There are alternatives, however.
The tf.contrib.data framework is built on the concept of a Dataset, a unit that has all the properties you listed above (maybe not the control tuple yet, but it would be easy to add).
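For example (a sketch using the current tf.data names; the tf.contrib.data API at the time was essentially the same), flat_map lets one input element produce several output elements or none at all, and filter produces no output for some inputs:
import tensorflow as tf

ds = tf.data.Dataset.range(5)
# One input element can yield several output elements (n copies of n), or none when n == 0 ...
ds = ds.flat_map(lambda n: tf.data.Dataset.from_tensors(n).repeat(n))
# ... and filter drops elements entirely, i.e. produces no output for a given input.
ds = ds.filter(lambda n: tf.equal(n % 2, 0))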
Have you considered using the recently released Flink ML "integration" with TensorFlow?
https://github.com/FlinkML/flink-tensorflow

TensorFlow's target pruning can't find nodes

I wrote a Python script using the TensorFlow API, including a SummaryWriter that dumps the graph definition so I can look at it in TensorBoard.
When running the script, a NotFoundError is thrown saying PruneForTargets: Some target nodes not found: Reading/data_queue_EnqueueMany_1. As its name implies, the node in question was created by an enqueue_many call on a FIFOQueue (which is then started in a QueueRunner); it does in fact exist, and can be seen clearly in TensorBoard.
What could cause TensorFlow to not find some nodes?
This is a known issue that occurs when you start threads that access the TensorFlow graph (e.g. your QueueRunner) and then add more nodes to the graph afterwards. (The underlying tf.Graph data structure is not thread-safe for concurrent reads and writes.)
The solution is to move tf.train.start_queue_runners(sess) (and any other code that starts threads) after the last node is constructed. One way to double-check this is to add a call to tf.get_default_graph().finalize() immediately before calling start_queue_runners().
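A sketch of that ordering (TF 1.x queue-based pipeline; the graph-building details are placeholders):
import tensorflow as tf

# ... build the entire graph first: the FIFOQueue, enqueue_many op, QueueRunner, model, summaries ...

init_op = tf.global_variables_initializer()
tf.get_default_graph().finalize()           # any later attempt to add a node now fails loudly

with tf.Session() as sess:
    sess.run(init_op)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)  # start threads only once the graph is complete
    # ... training / summary writing here ...
    coord.request_stop()
    coord.join(threads)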