I'm training a deep network with two data input pipelines, one for training and one for validation. They use shuffle_batch_join and batch_join respectively for parallel data reading. The data stream fed into the network is selected with a tf.cond operation on top of these two pipelines, controlled by an is_training placeholder that is set to true for a training iteration and false when doing validation. I have 4 threads for reading training data and 1 thread for validation.
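Roughly, the setup looks like this (a simplified sketch; the reader functions, shapes, and capacities are placeholders for the real pipeline):

```python
import tensorflow as tf

is_training = tf.placeholder(tf.bool, name='is_training')

# read_train_example()/read_val_example() are hypothetical functions that each
# return a [features, label] pair from a reader; one copy per reading thread.
train_examples = [read_train_example() for _ in range(4)]   # 4 training threads
val_examples = [read_val_example()]                         # 1 validation thread

train_batch = tf.train.shuffle_batch_join(
    train_examples, batch_size=32, capacity=2000, min_after_dequeue=1000)
val_batch = tf.train.batch_join(val_examples, batch_size=32, capacity=200)

# The network consumes whichever branch tf.cond selects.
features, labels = tf.cond(is_training,
                           lambda: train_batch,
                           lambda: val_batch)
```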
However, I just added queue summaries to TensorBoard and observed that the validation queue's summary (showing the fraction of the queue that is full) becomes non-zero at one point during training, and then drops back to 0. This seems very odd, because validation runs only after 1K iterations, and those data points should only be dequeued at that point. Does anyone have a similar experience or can shed some light on what might be happening?
Answered on TensorFlow Discuss Forum (https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/mLrt5qc9_uU)
Related
I'm trying to apply reinforcement learning to a round-based game environment. Each round I get a (self-contained, nearly Markovian) state and have to provide an action to progress in the world. Because there are some long-term strategies (develop resource "A", wait a few rounds for development, use resource "A"), I'm thinking of using an LSTM layer in my neural net. During training I can feed sequences of rounds into the network to train the LSTM; however, during the testing phase I'm only able to provide the current state (this is a hard requirement).
I'm wondering whether LSTMs are a viable option here or if they are not suitable for this usage, because I can only provide one state during testing / deployment.
Yes, LSTMs are a viable option here. In Keras this amounts to setting the layer's "stateful" field to True. What this does is keep the internal state of the cells from being reset between samples, so the layer keeps remembering the previous step(s) until you explicitly reset it.
In this case, you would simply set the LSTM to stateful, hand it one sample per step, and reset it after the episode is done. Keep in mind that you may not want it stateful during training if you can fit all the timesteps needed to capture the long-term strategies into one sample, since you'd probably be doing replays over multiple episodes.
If you're using anything other than Keras, searching for "stateful LSTM" in your framework of choice should help you further.
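For example, in Keras a deployment-time setup could look roughly like this (layer sizes and the state/action dimensions are made up, and training/weight loading is omitted):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

state_dim, n_actions = 16, 4   # hypothetical sizes for the game state/actions

# batch_input_shape=(1, 1, state_dim): one game, one round at a time.
model = Sequential([
    LSTM(64, stateful=True, batch_input_shape=(1, 1, state_dim)),
    Dense(n_actions, activation='softmax'),
])

# At test/deployment time, feed one round per call; because stateful=True the
# LSTM keeps its internal state across calls instead of resetting per sample.
state = np.zeros((1, 1, state_dim), dtype=np.float32)   # the current round's state
action_probs = model.predict(state)

# Clear the recurrent state once the episode/game is over.
model.reset_states()
```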
I've used TensorFlow but am new to distributed TensorFlow for training models. My understanding is that current best practices favor the data-parallel model with asynchronous updates:
A paper published by the Google Brain team in April 2016 benchmarked various approaches and found that data parallelism with synchronous updates using a few spare replicas was the most efficient, not only converging faster but also producing a better model. -- Chapter 12 of Hands-On Machine Learning with Scikit-Learn and TensorFlow
Now, my confusion from reading further about this architecture is figuring out which component applies the parameter updates: the workers or the parameter server?
In my illustration below, it's clear to me that the workers compute the gradients dJ/dw (the gradient of the loss J with respect to the parameter weights w). But who applies the gradient descent update rule?
What's a bit confusing is that this O'Reilly article on Distributed TensorFlow states the following:
In the more centralized architecture, the devices send their output in the form of gradients to the parameter servers. These servers collect and aggregate the gradients. In synchronous training, the parameter servers compute the latest up-to-date version of the model, and send it back to devices. In asynchronous training, parameter servers send gradients to devices that locally compute the new model. In both architectures, the loop repeats until training terminates.
The above paragraph suggests that in asynchronous training:
1. The workers compute gradients and send them to the parameter server.
2. The parameter server broadcasts the gradients to the workers.
3. Each worker receives the broadcasted gradients and applies the update rule.
Is my understanding correct? If it is, then that doesn't seem very asynchronous to me because the workers have to wait for the parameter server to broadcast the gradients. Any explanation would be appreciated.
I realize this was asked in 2018, but let's give it a shot.
1. Each worker computes gradients.
2. When a worker is done computing gradients, it sends them to the parameter server.
3. The worker then receives the new parameters from the parameter server, without waiting for the other workers.
In the synchronous case, by contrast, the workers do not continue training until every worker has sent its update to the server.
What this means in the asynchronous case is that each worker can be working with a slightly different version of the parameters (and hence compute different gradients), because it fetches them without waiting for every other worker to update the parameter server.
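To make this concrete, here is a rough sketch of how the standard TF 1.x between-graph replication setup looks (cluster addresses and the model are placeholders). With tf.train.replica_device_setter the variables live on the parameter server, and because the op produced by minimize() is colocated with those variables, the parameter update itself executes on the PS; in asynchronous training each worker simply runs its train op independently, while the synchronous variant would wrap the optimizer in tf.train.SyncReplicasOptimizer:

```python
import tensorflow as tf

# Illustrative cluster; hostnames and ports are placeholders.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0:2222"],
    "worker": ["worker0:2222", "worker1:2222"],
})

# Each worker builds its own copy of this graph (between-graph replication);
# task index 0 is hard-coded here for brevity.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    x = tf.placeholder(tf.float32, [None, 10])
    y = tf.placeholder(tf.float32, [None, 1])
    w = tf.get_variable("w", [10, 1])          # placed on /job:ps
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

    # Gradients are computed on the worker; the apply-gradients op is
    # colocated with `w`, i.e. the update runs on the parameter server.
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
```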
I am trying to train a very deep model on Cloud ML; however, I am having serious memory issues that I am not managing to get around. The model is a very deep convolutional neural network meant to auto-tag music.
The model for this can be found in the image below. A batch of 20 examples, each a tensor of 12x38832x1, is fed into the network.
The music was originally 465894x1 samples, which was then split into 12 windows; hence 12x38832x1. When using the map_fn function, each iteration processes a separate 38832x1 window (conv1d).
Processing one window at a time yields better results than running a single CNN over the whole track. The splitting was done prior to storing the data in TFRecords, in order to minimise the processing needed during training. The data is loaded in a queue with a maximum queue size of 200 samples (i.e. 10 batches).
Once dequeued, the batch is transposed to put the window dimension (12) first so it can be used in the map_fn function for per-window processing. It is not transposed prior to being queued because the first dimension needs to match the batch dimension of the output, which is [20, 50], where 20 is the batch size of the data and 50 is the number of tags.
For each window, the data is processed and the results of each map_fn call are superpooled using a smaller network, roughly as sketched below. The per-window processing is done by a very deep neural network, which is what is giving me problems: every cluster configuration I have tried results in out-of-memory errors.
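Roughly, the window processing looks like this (a simplified sketch; the real per-window network is much deeper, and the weights shown are placeholders):

```python
import tensorflow as tf

# Batch of 20 clips, each split into 12 windows of 38832 mono samples.
batch = tf.placeholder(tf.float32, shape=[20, 12, 38832, 1])

# Put the window dimension first so tf.map_fn iterates over the 12 windows,
# handing a [20, 38832, 1] slice to the per-window network at each step.
windows_first = tf.transpose(batch, perm=[1, 0, 2, 3])    # [12, 20, 38832, 1]

# Per-window weights are created once, outside the map_fn body.
conv_filter = tf.get_variable('conv_filter', shape=[8, 1, 32])  # width, in, out

def window_net(window):
    # Stand-in for the deep per-window CNN.
    x = tf.nn.conv1d(window, conv_filter, stride=4, padding='SAME')
    x = tf.nn.relu(x)
    return tf.reduce_mean(x, axis=1)                      # [20, 32] per window

per_window = tf.map_fn(window_net, windows_first)         # [12, 20, 32]

# "Superpool" across the 12 windows, then map to the 50 tags.
pooled = tf.reduce_max(per_window, axis=0)                # [20, 32]
logits = tf.layers.dense(pooled, units=50)                # [20, 50]
```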
As a template I am using one similar to the Census TensorFlow model.
First and foremost, I am not sure if this is the best option, since for evaluation a separate graph is built rather than sharing variables. This would require double the number of parameters.
Secondly, as a cluster setup I have been using one complex_l master, 3 complex_l workers, and 3 large_model parameter servers. I do not know if I am underestimating the amount of memory needed here.
The model previously worked with a much smaller network. However, increasing it in size started giving me bad out-of-memory errors.
My questions are:
The memory requirement is big, but I am sure it can be handled on Cloud ML. Am I underestimating the amount of memory needed? What are your suggestions for the cluster setup for such a network?
When using a tf.train.Server in the dispatch function, do you need to pass the cluster_spec so it is used in the replica_device_setter, or does it allocate on its own? When not passing it, and enabling log_device_placement in tf.ConfigProto, all the variables seem to be placed on the master worker. In the Census example's task.py this is not passed on. Can I assume this is correct?
How does one calculate how much memory is needed for a model (rough estimate to select the cluster)?
Is there any other TensorFlow core tutorial on how to set up such big jobs (other than Census)?
When training a big model with distributed between-graph replication, does the whole model need to fit on the worker, or does the worker only run ops and then transmit the results to the PS? Does that mean the workers can get by with low memory, enough only for individual ops?
PS: With smaller models the network trained successfully. I am trying to deepen the network to get a better ROC.
Questions coming up from ongoing troubleshooting:
When using the replica_device_setter with the cluster parameter, I noticed that the master has very little memory and CPU usage, and checking the device placement log there are very few ops on the master. I checked the TF_CONFIG that is loaded, and it says the following for the cluster field:
u'cluster': {u'ps': [u'ps-4da746af4e-0:2222'], u'worker': [u'worker-4da746af4e-0:2222'], u'master': [u'master-4da746af4e-0:2222']}
On the other hand, the tf.train.ClusterSpec documentation only shows workers. Does that mean the master is not considered a worker? What happens in that case?
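For reference, a rough sketch of the TF_CONFIG-to-ClusterSpec wiring (the Server creation, device strings, and defaults are illustrative, not a definitive Cloud ML recipe):

```python
import json
import os
import tensorflow as tf

# Parse the TF_CONFIG that Cloud ML injects (structure as shown above).
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
cluster_spec = tf.train.ClusterSpec(tf_config.get('cluster', {}))
task = tf_config.get('task', {})            # e.g. {'type': 'worker', 'index': 0}

server = tf.train.Server(cluster_spec,
                         job_name=task.get('type'),
                         task_index=task.get('index', 0))

# Passing `cluster` makes the setter place variables on the ps tasks; without
# it (ps_tasks defaulting to 0) everything stays on the local device, which
# would match the "all variables on the master" placement observed above.
device_fn = tf.train.replica_device_setter(
    worker_device='/job:%s/task:%d' % (task.get('type'), task.get('index', 0)),
    cluster=cluster_spec)

with tf.device(device_fn):
    global_step = tf.train.get_or_create_global_step()
    # ... build the model here ...
```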
Is the error actually a memory error, or something else (e.g. an EOF error)?
I am using the Inception v1 architecture for transfer learning. I have downloaded the checkpoint file, nets, and preprocessing file from the GitHub repository below:
https://github.com/tensorflow/models/tree/master/slim
I have 3700 images; for each image I pool out the last pooling layer's filters from the graph and append them to a list. With every iteration the RAM usage increases, finally killing the run at around 2000 images. Can you tell me what mistake I have made?
https://github.com/Prakashvanapalli/TensorFlow/blob/master/Transfer_Learning/inception_v1_finallayer.py
Even if I remove the list appending and just try to print the results, this still happens. I guess the mistake is in the way I am calling the graph. When I watch my RAM usage, it gets heavier with every iteration, and I don't know why this is happening, as I am not saving anything, nor is there any difference from the 1st iteration.
From my point of view, I am just sending one image, getting the outputs, and saving them, so it should work irrespective of how many images I send.
I have tried it on both GPU (6GB) and CPU (32GB).
You seem to be storing images in your graph as tf.constants. These will be persistent, and will cause memory issues like you're experiencing. Instead, I would recommend either placeholders or queues. Queues are very flexible, and can be very high performance, but can also get quite complicated. You may want to start with just a placeholder.
For a full-complexity example of an image input pipeline, you could look at the Inception model.
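As a rough illustration of the placeholder approach (build_inception_features, load_and_preprocess, the checkpoint path, and image_paths are stand-ins for your own code), the key point is that the graph is built once and only data is fed inside the loop:

```python
import tensorflow as tf

# Build the graph ONCE, with a placeholder instead of per-image constants.
image_ph = tf.placeholder(tf.float32, shape=[1, 224, 224, 3])
features = build_inception_features(image_ph)   # hypothetical: last pooling layer
saver = tf.train.Saver()

all_features = []
with tf.Session() as sess:
    saver.restore(sess, 'inception_v1.ckpt')    # path is a placeholder
    for path in image_paths:                    # your 3700 image files
        img = load_and_preprocess(path)         # hypothetical loader -> numpy array
        # Only data is fed here; no new ops are added to the graph per image,
        # so memory stays flat across iterations.
        all_features.append(sess.run(features, feed_dict={image_ph: img}))
```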
Is it possible to share a queue between two graphs in TensorFlow? I'd like to do a kind of bootstrapping to select "hard negative" examples during training.
To speed up the process, I want separate threads for hard negative example selection, and for the training process. The hard negative selection is based on the evaluation of the current model, and it will load its graph from a checkpoint file. The training graph is run on another thread and writes the checkpoint file. The two graphs should share the same queue: the training graph will consume examples and the hard negative selection will produce them.
Currently there's no support for sharing state between different graphs in the open-source version of TensorFlow: each graph runs in a separate session, and each session uses an isolated set of devices.
However, it seems like it would be possible to achieve your goal using a queue in a single graph. Simply construct a queue (using e.g. tf.FIFOQueue) and use tf.import_graph_def() to import the graph from the checkpoint file into the current graph. Using the return_elements argument to tf.import_graph_def(), you can specify the name of the tensor that will contain the negative examples, and then add a q.enqueue_many() operation to add them to your queue. You would then fork a thread to run the enqueue_many operation in a loop. In your training graph, you can use q.dequeue_many() to get a batch of negative examples and use them as the input to your training process.
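A rough sketch of what that could look like (the GraphDef file, tensor name, shapes, and capacities are all illustrative):

```python
import threading
import tensorflow as tf

# Shared queue of hard-negative examples (dtype/shape are illustrative).
queue = tf.FIFOQueue(capacity=1000, dtypes=[tf.float32], shapes=[[128]])

# Import the hard-negative selection model and pull out the tensor holding a
# batch of negatives via return_elements (file and tensor names are placeholders).
with tf.gfile.GFile('selector_graph.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
hard_negatives, = tf.import_graph_def(graph_def,
                                      return_elements=['hard_negatives:0'])

enqueue_op = queue.enqueue_many([hard_negatives])
train_batch = queue.dequeue_many(32)     # consumed by the training subgraph

def selection_loop(sess):
    while True:
        sess.run(enqueue_op)             # keep producing hard negatives

with tf.Session() as sess:
    t = threading.Thread(target=selection_loop, args=(sess,))
    t.daemon = True
    t.start()
    # ... the training loop runs here and feeds on train_batch ...
```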