Sharing a queue between two graphs in TensorFlow

Is it possible to share a queue between two graphs in TensorFlow? I'd like to do a kind of bootstrapping to select "hard negative" examples during training.
To speed up the process, I want separate threads for hard negative example selection, and for the training process. The hard negative selection is based on the evaluation of the current model, and it will load its graph from a checkpoint file. The training graph is run on another thread and writes the checkpoint file. The two graphs should share the same queue: the training graph will consume examples and the hard negative selection will produce them.

Currently there's no support for sharing state between different graphs in the open-source version of TensorFlow: each graph runs in a separate session, and each session uses an isolated set of devices.
However, it seems possible to achieve your goal using a queue in a single graph. Construct a queue (using e.g. tf.FIFOQueue) and use tf.import_graph_def() to import the graph from the checkpoint file into the current graph. Using the return_elements argument to tf.import_graph_def(), you can specify the name of the tensor that will contain the negative examples, and then add a q.enqueue_many() operation to add them to your queue. You would then fork a thread to run the enqueue_many operation in a loop. In your training graph, you can use q.dequeue_many() to get a batch of negative examples and use them as the input to your training process.
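The threading structure described above (one thread producing hard negatives, the training loop consuming them in batches) can be sketched without TensorFlow using Python's standard queue and threading modules. This is only an analogy for the producer/consumer pattern, not TensorFlow code: queue.Queue stands in for tf.FIFOQueue, and select_hard_negatives, consume_batches, score_fn and the threshold are all hypothetical names chosen for illustration.

```python
import queue
import threading


def select_hard_negatives(candidates, score_fn, threshold, out_queue):
    """Producer: enqueue every candidate that scores above the threshold."""
    for example in candidates:
        if score_fn(example) > threshold:
            out_queue.put(example)  # blocks while the queue is full
    out_queue.put(None)  # sentinel: signals that no more examples will arrive


def consume_batches(in_queue, batch_size):
    """Consumer: dequeue examples and group them into batches."""
    batches, batch = [], []
    while True:
        example = in_queue.get()  # blocks until an item is available
        if example is None:
            break
        batch.append(example)
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)  # keep the final partial batch
    return batches


def run_demo():
    q = queue.Queue(maxsize=8)
    candidates = list(range(10))
    # Hypothetical stand-in for evaluating the current model on an example.
    score_fn = lambda x: x / 10.0
    producer = threading.Thread(
        target=select_hard_negatives, args=(candidates, score_fn, 0.45, q))
    producer.start()
    batches = consume_batches(q, batch_size=2)
    producer.join()
    return batches
```

In the TensorFlow version, the producer thread would repeatedly run the enqueue_many operation in a session, and the training step would read from q.dequeue_many() instead of calling get() directly.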

Related

How to retain Entity Identifier with Batch Prediction of XGBoost Model in Vertex AI

I am wondering how we can match back predictions to the entity after executing a batch prediction using an XGBoost model via Custom Training on Prebuilt Images.
When kicking off a BatchPredictionJob it expects the input to be of the form
input_1,input_2,input_3
0.1,0.2,0.3
0.4,0.5,0.6
...
for csv or
[0.1,0.2,0.3]
[0.4,0.5,0.6]
...
for jsonl with the output predictions:
{"instance":[0.1,0.2,0.3], "prediction":0.0345}
...
The output predictions then just contain these instances of input values, without any indication of how to map the predictions back to the original entity. As the job is distributed, I do not believe I can rely on the file ordering. Does anyone have a method to do so?
Batch prediction runs as a distributed job: the data is split across an arbitrary cluster of virtual machines and processed in an unpredictable order, so the output order cannot be relied on.
In AI Platform, matching returned batch predictions with input instances requires defining an instance key. In Vertex AI this feature has not been documented.
Because instance keys for the prebuilt XGBoost container image with custom-trained models are not mentioned in the Vertex AI docs, this has been raised in the issue tracker. We cannot provide an ETA at this moment, but you can follow the progress there and star the issue to receive automatic updates and give it traction.
A feature request has also been raised about Vertex AI batch prediction outputs not being ordered, and you can track updates on that request.
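Until instance keys are supported, one possible workaround (an assumption, not something the Vertex AI docs describe) is to join on the feature values themselves, since each output line echoes its "instance". The sketch below assumes the output JSONL has the shape shown in the question and that feature values survive the round trip unchanged; match_predictions and the entity IDs are hypothetical. Entities with identical feature vectors cannot be told apart this way and simply receive the same prediction.

```python
import json
from collections import defaultdict


def match_predictions(entities, prediction_lines):
    """Map entity IDs to predictions by joining on the feature vector.

    entities: iterable of (entity_id, features) pairs.
    prediction_lines: iterable of JSONL strings with "instance" and
    "prediction" fields, in any order.
    """
    # Index entity IDs by their feature tuple; several IDs may share one tuple.
    by_features = defaultdict(list)
    for entity_id, features in entities:
        by_features[tuple(features)].append(entity_id)

    matched = {}
    for line in prediction_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        key = tuple(record["instance"])
        for entity_id in by_features.get(key, []):
            matched[entity_id] = record["prediction"]
    return matched


entities = [("entity-a", [0.1, 0.2, 0.3]), ("entity-b", [0.4, 0.5, 0.6])]
output_lines = [
    '{"instance":[0.4,0.5,0.6], "prediction":0.9}',
    '{"instance":[0.1,0.2,0.3], "prediction":0.0345}',
]
result = match_predictions(entities, output_lines)
```

If the service reformats floating-point values in its output, exact tuple matching will miss rows, so it is worth checking that every entity received a prediction.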

Tensorflow Object Detection API w/ TPU Training - Display more granular Tensorboard plots

I've been following this tutorial on the Tensorflow Object Detection API, and I've successfully trained my own object detection model using Google's Cloud TPUs.
However, the problem is that on Tensorboard, the plots I'm seeing only have 2 data points each (so it just plots a straight line), like this:
...whereas I want to see more "granular" plots like these below, which are much more detailed:
The tutorial I've been following acknowledges that this issue is caused by the fact that TPU training requires very few steps to train:
Note that these graphs only have 2 points plotted since the model
trains quickly in very few steps (if you’ve used TensorBoard before
you may be used to seeing more of a curve here)
I tried adding save_checkpoints_steps=50 in the file model_tpu_main.py (see code fragment below), and when I re-ran training, I was able to get a more granular plot, with 1 data point every 300 steps or so.
config = tf.contrib.tpu.RunConfig(
    # I added this line below:
    save_checkpoints_steps=50,
    master=tpu_grpc_url,
    evaluation_master=tpu_grpc_url,
    model_dir=FLAGS.model_dir,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=FLAGS.iterations_per_loop,
        num_shards=FLAGS.num_shards))
However, my training job is actually saving a checkpoint every 100 steps, rather than every 300 steps. Looking at the logs, my evaluation job is running every 300 steps. Is there a way I can make my evaluation job run every 100 steps (whenever there's a new checkpoint) so that I can get more granular plots on Tensorboard?
Code that addresses this issue is explained by a technical lead for Google Cloud Platform in a Medium blog post. Alternatively, go directly to the GitHub code.
The 81-line train_and_evaluate function defines a TPUEstimator, a train_input_fn and an eval_input_fn. It then iterates over the training steps and calls estimator.train and estimator.evaluate in each iteration. The metrics can be defined in the model_fn, which is called image_classifier. Note that adding tf.summary calls in the model functions currently has no effect, since the TPU does not support them:
"TensorBoard summaries are a great way see inside your model. A minimal set of basic summaries are automatically recorded by the TPUEstimator, to event files in the model_dir. Custom summaries, however, are currently unsupported when training on a Cloud TPU. So while the TPUEstimator will still run locally with summaries, it will fail if used on a TPU." (source)
If summaries are important it might be more convenient to switch to training on GPU.
Personally I think writing this code is quite a hassle for something which should be handled by the API. Please update this answer if better solutions exist! I'm looking forward to it.
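The alternating loop described in this answer, with estimator.train and estimator.evaluate called in each iteration, can be sketched roughly as follows. This is not the code from the blog post: train_and_evaluate_in_chunks and steps_per_eval are hypothetical names, and FakeEstimator is a stand-in so the sketch runs without TensorFlow. A real tf.estimator.Estimator or TPUEstimator would be passed in its place, together with real input functions.

```python
class FakeEstimator:
    """Hypothetical stand-in that mimics the train/evaluate calls of an Estimator."""

    def __init__(self):
        self.global_step = 0

    def train(self, input_fn, steps):
        # Pretend to run `steps` training steps.
        self.global_step += steps

    def evaluate(self, input_fn):
        # Placeholder metrics; a real estimator returns the metrics from model_fn.
        return {"global_step": self.global_step}


def train_and_evaluate_in_chunks(estimator, train_input_fn, eval_input_fn,
                                 total_steps, steps_per_eval):
    """Train in chunks of steps_per_eval steps, evaluating after each chunk."""
    results = []
    steps_done = 0
    while steps_done < total_steps:
        # The last chunk may be shorter than steps_per_eval.
        steps = min(steps_per_eval, total_steps - steps_done)
        estimator.train(input_fn=train_input_fn, steps=steps)
        steps_done += steps
        metrics = estimator.evaluate(input_fn=eval_input_fn)
        results.append((steps_done, metrics))
    return results


history = train_and_evaluate_in_chunks(
    FakeEstimator(), None, None, total_steps=250, steps_per_eval=100)
```

Each evaluation produces one data point, so a smaller steps_per_eval gives a more granular plot at the cost of more time spent evaluating.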
Set save_summary_steps in RunConfig to 100, so you get the statistics you want.
Also set iterations_per_loop to 100 so that each training loop does not run more steps than that.
p.s. I hope you realize that checkpointing is very slow. You are probably raising the cost of your job just for the sake of a pretty graph :)
You can try adding throttle_secs=100 to the EvalSpec constructor. The default is 600 seconds.

Distributed training of a wide and shallow model

I am working on a very wide and shallow computation graph with a relatively small number of shared parameters on a single machine. I would like to make the graph wider but am running out of memory. My understanding is that, by using Distributed Tensorflow, it is possible to split the graph between workers by using the tf.device context manager. However it's not clear how to deal with the loss, which can only be calculated by running the entire graph, and the training operation.
What would be the right strategy to train the parameters for this kind of model?
TensorFlow is based on the concept of a data-flow graph. You define a graph consisting of variables and ops, and you can place those variables and ops on different servers and/or devices. When you call session.run, you pass data into the graph, and every operation between the inputs (specified in the feed_dict) and the outputs (specified in the fetches argument to session.run) runs, regardless of where those ops reside. Of course, passing data across servers incurs communication overhead, but that overhead is often offset by having multiple workers performing computation concurrently.
In short, even if you put ops on other servers, you can still compute the loss over the full graph.
Here's a tutorial on large scale linear models: https://www.tensorflow.org/tutorials/linear
And here's a tutorial on distributed training in TensorFlow: https://www.tensorflow.org/deploy/distributed

Deep networks on Cloud ML

I am trying to train a very deep model on Cloud ML; however, I am having serious memory issues that I have not managed to work around. The model is a very deep convolutional neural network to auto-tag music.
The model for this can be found in the image below. A batch of 20 tensors of shape 12x38832x1 is fed into the network.
The music was originally 465894x1 samples, which was then split into 12 windows, hence 12x38832x1. When using the map_fn function, each loop iteration gets a separate 38832x1 window (conv1d).
Processing one window at a time yields better results than processing the whole piece of music with one CNN. The data was split prior to being stored in TFRecords in order to minimise the processing needed during training. It is loaded in a queue with a maximum queue size of 200 samples (i.e. 10 batches).
Once dequeued, it is transposed to have the 12 dimension first, so that it can be used in the map_fn function for processing the windows. It is not transposed prior to being queued, as the first dimension needs to match the batch dimension of the output, which is [20, 50], where 20 is the batch size and 50 is the number of tags.
For each window, the data is processed and the results of each map_fn are superpooled using a smaller network. The processing of the windows is done by a very deep neural network, which I am struggling to keep, as all the config options I have tried give me out-of-memory errors.
As a model I am using one similar to the Census TensorFlow model.
First and foremost, I am not sure if this is the best option, since for evaluation a separate graph is built and variables are not shared. This would require double the number of parameters.
Secondly, as a cluster setup, I have been using one complex_l master, 3 complex_l workers and 3 large_model parameter servers. I do not know if I am underestimating the amount of memory needed here.
My model has previously worked with a much smaller network. However, increasing its size started giving me out-of-memory errors.
My questions are:
The memory requirement is big, but I am sure it can be processed on Cloud ML. Am I underestimating the amount of memory needed? What are your suggestions about the cluster for such a network?
When using a tf.train.Server in the dispatch function, do you need to pass on the cluster_spec so it is used in the replica_device_setter? Or does it allocate on its own? When not using it, and setting log placement in tf.ConfigProto, all the variables seem to be on the master worker. In the Census example's task.py this is not passed on. Can I assume this is correct?
How does one calculate how much memory is needed for a model (rough estimate to select the cluster)?
Is there any other TensorFlow core tutorial on how to set up such big jobs (other than Census)?
When training a big model with distributed between-graph replication, does the whole model need to fit on the worker, or does the worker only run ops and then transmit the results to the PS? Does that mean that the workers can have low memory, just enough for individual ops?
PS: With smaller models the network trained successfully. I am trying to deepen the network for better ROC.
Questions coming up from ongoing troubleshooting:
When using the replica_device_setter with the cluster parameter, I noticed that the master has very little memory and CPU usage, and checking the log placement, there are very few ops on the master. I checked the TF_CONFIG that is loaded and it says the following for the cluster field:
u'cluster': {u'ps': [u'ps-4da746af4e-0:2222'], u'worker': [u'worker-4da746af4e-0:2222'], u'master': [u'master-4da746af4e-0:2222']}
On the other hand, the tf.train.ClusterSpec documentation only shows workers. Does that mean that the master is not considered a worker? What happens in such a case?
Is the error a memory problem or something else, such as an EOF error?

tf.train.batch_join queue leak?

I'm training a deep network with two data input pipelines, one for training and one for validation. They use shuffle_batch_join and batch_join respectively for parallel data reading. The data stream that is used in the network is chosen by a tf.cond operation on top of these two pipelines, controlled by an is_training placeholder that is set to true for a training iteration and false when doing validation. I have 4 threads for reading training data and 1 thread for validation.
However, I just added the queue summaries to TensorBoard, and observed that the validation queue's summary (showing the fraction of the queue that is full) becomes non-zero at one point during training and then drops back to 0. This seems very weird, because validation runs only after 1K iterations, and those data points should only be removed at that point. Has anyone had a similar experience, or can anyone shed some light on what might be happening?
Answered on TensorFlow Discuss Forum (https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/mLrt5qc9_uU)