I have several processes running on a server, one of which is training a model in tensorflow. Periodically, I want the trainer to send the current model to the other processes. The way I do this now is with the usual Saver class that can save to and restore from disk.
However, I think this form of IPC is rather inefficient, and it is potentially causing filesystem lockups on the server. If there were a way to serialize the variables into some blob, I could send it over a zmq broadcast pipe, but I haven't found anything like that in the docs.
Alternatively, distributed tensorflow is probably up to the task, but I don't think I need something so complicated.
You could pre-share the architecture, then use tf.get_collection(tf.GraphKeys.VARIABLES) at whatever step interval you like and run it to get the values; at the other end you can use variable.assign to load the values into the appropriate variables.
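A rough sketch of what that could look like (TF 1.x; it uses tf.GraphKeys.GLOBAL_VARIABLES, the current name of that collection, and pickle for the blob -- the transport, e.g. zmq, is up to you; both sides are assumed to have built the same graph):

```python
import pickle
import tensorflow as tf

# Sender side: pull the current values out of the graph and serialize them.
# `sess` is the training session.
def serialize_weights(sess):
    variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES)
    values = sess.run(variables)  # list of numpy arrays
    blob = pickle.dumps({v.name: val for v, val in zip(variables, values)})
    return blob  # e.g. zmq_socket.send(blob)

# Receiver side: build assign ops once (with placeholders), reuse every update.
# This only works if the receiver built the same architecture, so variable
# names match.
def build_assign_ops():
    assign_ops, feeds = [], {}
    for v in tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES):
        ph = tf.placeholder(v.dtype.base_dtype, shape=v.get_shape())
        feeds[v.name] = ph
        assign_ops.append(tf.assign(v, ph))
    return assign_ops, feeds

def load_weights(sess, blob, assign_ops, feeds):
    values = pickle.loads(blob)
    sess.run(assign_ops,
             feed_dict={feeds[name]: val for name, val in values.items()})
```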
(I have posted the question on https://github.com/tensorflow/federated/issues/793 and maybe also here!)
I have adapted my own data and model to the federated interfaces and the training converged. But I am confused about one issue: in an image classification task, the whole dataset is extremely large; it can't be stored in a single federated_train_data nor loaded into memory all at once. So I need to load the dataset from disk into memory in batches in real time and use Keras model.fit_generator instead of model.fit during training, which is the approach people use to deal with large data.
I suppose that in the iterative_process shown in the image classification tutorial, the model is fitted on a fixed set of data. Is there any way to adjust the code to let it fit a data generator? I have looked into the source code but am still quite confused. I would be incredibly grateful for any hints.
Generally, TFF considers the feeding of data to be part of the "Python driver loop", which is a helpful distinction to make when writing TFF code.
In fact, when writing TFF, there are generally three levels at which one may be writing:
TensorFlow defining local processing (i.e., processing that will happen on the clients, or on the server, or in the aggregators, or at any other placement one may want, but only a single placement).
Native TFF defining the way data is communicated across placements. For example, writing tff.federated_sum inside of a tff.federated_computation decorator; writing this line declares "this data is moved from clients to server, and aggregated via the sum operator".
Python "driving" the TFF loop, e.g. running a single round. It is the job of this final level to do what a "real" federated learning runtime would do; one example here would be selecting the clients for a given round.
If this breakdown is kept in mind, using a generator or some other lazy-evaluation-style construct to feed data in to a federated computation becomes relatively simple; it is just done at the Python level.
One way this could be done is via the create_tf_dataset_for_client method on the ClientData object; as you loop over rounds, your Python code can select from the list of client_ids, then you can instantiate a new list of tf.data.Datasets and pass them in as your new set of client data. An example of this relatively simple usage would be here, and a more advanced usage (involving defining a custom client_datasets_fn which takes client_id as a parameter, and passing it to a separately-defined training loop) would be here, in the code associated with this paper.
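Roughly, the driver loop could look like this (a sketch only: it assumes `iterative_process` was built with tff.learning.build_federated_averaging_process and `client_data` is a tff.simulation.ClientData such as the tutorial's emnist_train; exact APIs vary across TFF versions):

```python
import random
import tensorflow_federated as tff

NUM_ROUNDS = 10
CLIENTS_PER_ROUND = 5

state = iterative_process.initialize()

for round_num in range(NUM_ROUNDS):
    # Lazily "generate" this round's data at the Python level: pick clients
    # and build fresh tf.data.Dataset recipes for them. Nothing is loaded
    # into memory until the datasets are iterated during training.
    sampled_ids = random.sample(client_data.client_ids, CLIENTS_PER_ROUND)
    federated_train_data = [
        client_data.create_tf_dataset_for_client(cid) for cid in sampled_ids
    ]
    state, metrics = iterative_process.next(state, federated_train_data)
    print('round {}: {}'.format(round_num, metrics))
```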
One final note: instantiating a tf.data.Dataset does not actually load the dataset into memory; the dataset is only loaded in when it is iterated over. One helpful tip I have received from the lead author of tf.data.Dataset is to think of tf.data.Dataset more as a "dataset recipe" than a literal instantiation of the dataset itself. It has been suggested that perhaps a better name would have been DataSource for this construct; hopefully that may help the mental model on what is actually happening. Similarly, using the tff.simulation.ClientData object generally shouldn't really load anything into memory until it is iterated over in training on the clients; this should make some nuances around managing dataset memory simpler.
In a standard image pipeline, I might use a QueueRunner which will handle keeping a queue of data full by using multiple threads to read data from disk. I want to do the inverse of this process i.e. to use multiple threads to keep a queue mostly empty, by dequeueing and writing these values to disk (I'm saving many images, so it's important this is non-blocking). What would be the easiest way to do this?
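One possible shape for such a drain loop (TF 1.x), assuming a queue of (filename, encoded_bytes) string pairs that gets filled elsewhere in the graph:

```python
import threading
import tensorflow as tf

# Queue of (filename, encoded image bytes) pairs, filled by the rest of the
# pipeline; several Python threads dequeue and write to disk so the queue
# stays mostly empty and the producers are not blocked.
queue = tf.FIFOQueue(capacity=128, dtypes=[tf.string, tf.string])
dequeue_op = queue.dequeue()

def writer_loop(sess, coord):
    while not coord.should_stop():
        try:
            filename, data = sess.run(dequeue_op)
            with open(filename.decode(), 'wb') as f:
                f.write(data)
        except tf.errors.OutOfRangeError:
            break  # queue was closed and fully drained

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = [threading.Thread(target=writer_loop, args=(sess, coord))
               for _ in range(4)]
    for t in threads:
        t.start()
    # ... run the rest of the pipeline that enqueues images ...
    sess.run(queue.close())  # pending dequeues raise OutOfRangeError when empty
    coord.request_stop()
    coord.join(threads)
```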
A question about between-graph replication in distributed TensorFlow, because I didn't fully understand a few points from the tutorials. As I understand the current model:
We have a parameter server, which we just launch in a separate process and call server.join().
We have workers, each of which builds a similar computational graph containing parameter nodes linked to the parameter server (through tf.train.replica_device_setter) and computation nodes placed on the workers themselves.
What I didn't find:
How do sessions work in this model? In the examples/tutorials it is hidden behind tf.train.Supervisor.
Do we have a separate session on each worker, or just one huge session that accumulates the graphs from all the workers and the parameter server?
How are global variables initialized on the parameter server? I wonder whether I can initialize them in one of the worker processes (chosen as a "master"), given that I linked these parameters on the worker through tf.train.replica_device_setter. Is that correct?
In the following gist:
https://gist.github.com/yaroslavvb/ea1b1bae0a75c4aae593df7eca72d9ca
global variables are initialized only in the parameter server process, and all the workers consider them initialized. How is that possible, given that they run in different sessions? I could not replicate it in a simpler example.
I have a main session in the core program where I perform training of the model. Part of the training loop is data collection, which in turn requires computation on the TensorFlow cluster. So I need to create this cluster, put the current state of the trained model on the parameter server, then collect data from the computation and continue with the training loop. How can I: 1) pass the current trained model to the cluster? 2) extract the collected data from the cluster and pass it to the main program?
Thanks in advance!
EDIT:
To q.3:
It was answered previously (In tensorflow, is variable value the only context information a session stores?) that in the distributed runtime, variables are shared between sessions.
Does that mean that when I create a session with some "target", all the variables will be shared between the sessions that run on the same graph?
I guess I can try answering these questions myself; at least it may be helpful for other newbies trying to harness distributed TensorFlow, because as of now there is a lack of concise and clear blog posts on the topic.
Hope more knowledgeable people will correct me if needed.
We have separate sessions on all the servers, and these sessions share their resources (variables, queues, and readers), but only in a distributed setting (i.e. you pass server.target to the tf.Session constructor).
Ref: https://www.tensorflow.org/api_docs/python/client/session_management#Session
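A tiny illustration of that sharing (TF 1.x): two sessions connected to the same server target see the same variable state, which is not the case for two plain local sessions.

```python
import tensorflow as tf

# In-process server; in a real cluster each process would create its own
# tf.train.Server from the ClusterSpec instead.
server = tf.train.Server.create_local_server()

v = tf.Variable(0, name="shared_counter")
increment = tf.assign_add(v, 1)

sess_a = tf.Session(server.target)
sess_b = tf.Session(server.target)

sess_a.run(v.initializer)
sess_a.run(increment)   # v == 1, written through session A
print(sess_b.run(v))    # prints 1: session B sees the same variable
```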
Parameter variables are usually initialized in one "master" process. It can be the process where the parameter server is launched, but it is not strictly necessary to do it in just one process.
Because of p.1: the sessions share their resources, so variables initialized in one session appear initialized in the others :)
Thanks to ideas from @YaroslavBulatov, I came to the following approach, which appears to be the simplest possible:
Cluster: one local "calculation server" and N "workers".
"Calculation server" keeps all the parameters of global network and performs training steps. All training ops are assigned to it.
"Workers" collect data in parallel and then put it in Queue; these data are used by "calculation server" when doing training steps.
So, the high-level algorithm is:
1. launch all the units in the cluster
2. build the comp graph and training ops on the calculation server
3. build the comp graph on the workers (variables are linked to the calculation server)
4. collect data with the workers
5. perform a training step on the calculation server and update the global network
6. repeat 4-5 till convergence :)
For now I coordinate the calculation server and the workers through queues (when to start collecting data and when to start a training step), which is definitely not the most elegant solution. Any feedback is very welcome.
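For concreteness, a stripped-down sketch of this layout (TF 1.x; the job names, ports, and toy model are made up):

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    'calc':   ['localhost:2222'],
    'worker': ['localhost:2223', 'localhost:2224'],
})

# Every process builds the same graph; variables, training ops, and the
# shared data queue are pinned to the calculation server.
with tf.device('/job:calc/task:0'):
    global_step = tf.train.get_or_create_global_step()
    weights = tf.get_variable('w', shape=[10, 1])
    data_queue = tf.FIFOQueue(capacity=1000, dtypes=[tf.float32],
                              shapes=[[10]], shared_name='experience_queue')

# Worker side: produce a sample and push it to the shared queue.
sample = tf.random_normal([10])
enqueue_op = data_queue.enqueue(sample)

# Calculation-server side: pull a batch and run a training step.
batch = data_queue.dequeue_many(32)
loss = tf.reduce_mean(tf.square(tf.matmul(batch, weights)))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss, global_step=global_step)

# Each process then starts its own server and connects a session to it, e.g.:
#   server = tf.train.Server(cluster, job_name='worker', task_index=0)
#   sess = tf.Session(server.target)
#   while True: sess.run(enqueue_op)
```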
I also stumbled upon these and very similar, related questions. I tried to clarify all of this in my overview of distributed TensorFlow; maybe that is useful for some.
More specifically, let me try to answer your questions:
You say you do between-graph replication, i.e. you build a separate computation graph for every worker. This implies that you also have a separate session everywhere, because there would be no way to use that computation graph otherwise. The server (tf.distribute.Server) will not use the local computation graph; it just executes things when remote sessions (clients) connect to it. The session has the graph. If there were only a single session, there would also be only a single graph, and then you would have in-graph replication.
If you share the variables (e.g. they live on a parameter server), it is enough if one of the workers does the initialization (e.g. the parameter server itself). Otherwise it depends on the specific distribution strategy and how you synchronize the variables. E.g. mirrored variables have separate copies on every replica, and you would need to make sure in some way that they are synchronized.
There is only a single copy of the variable in this case, which lives on the parameter server. Every read and write of this variable is an RPC call to the parameter server.
I'm not exactly sure what you mean by main program. You would have multiple instances of your program, one for each worker. But you will likely mark one of the workers as the chief worker, which has some further responsibilities like saving the checkpoint. Otherwise, all the workers are equal and do the same thing (this is again for between-graph replication). What your gradient accumulation or parameter update looks like depends on your strategy (e.g. whether you do sync training or async training, etc.).
I am trying to train a DNN model using TensorFlow. My script has two variables, one dense feature and one sparse feature; each minibatch pulls the full dense feature and the specified sparse features using embedding_lookup_sparse, and the feedforward pass can only begin after the sparse features are ready. I run my script with 20 parameter servers, and increasing the worker count did not scale out. So I profiled my job using the TensorFlow timeline and found that one of the 20 parameter servers is very slow compared to the other 19. There is no dependency between the different parts of the trainable variables. I am not sure if there is a bug or some limitation, like TensorFlow only being able to queue 40 fan-out requests. Any idea how to debug it? Thanks in advance.
[Image: TensorFlow timeline profiling]
It sounds like you might have exactly 2 variables, one is stored at PS0 and the other at PS1. The other 18 parameter servers are not doing anything. Please take a look at variable partitioning (https://www.tensorflow.org/versions/master/api_docs/python/state_ops/variable_partitioners_for_sharding), i.e. partition a large variable into small chunks and store them at separate parameter servers.
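For example, a partitioned embedding variable could be declared roughly like this (TF 1.x; sizes and names are illustrative):

```python
import tensorflow as tf

NUM_PS = 20

# Shard one large embedding variable across the parameter servers instead of
# pinning the whole thing to a single PS.
with tf.device(tf.train.replica_device_setter(ps_tasks=NUM_PS)):
    embeddings = tf.get_variable(
        'embeddings',
        shape=[10000000, 128],
        partitioner=tf.fixed_size_partitioner(NUM_PS))  # one shard per PS

# embedding_lookup_sparse understands partitioned variables, so each lookup
# only talks to the parameter servers that hold the relevant shards.
sp_ids = tf.sparse_placeholder(tf.int64)
dense = tf.nn.embedding_lookup_sparse(embeddings, sp_ids, None, combiner='mean')
```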
This is a somewhat hacky way to log Send/Recv timings from the Timeline object for each iteration, but it works pretty well for analyzing the dumped JSON data (compared to visualizing it at chrome://trace).
The steps you have to perform are:
download the TensorFlow source and check out the correct branch (r0.12, for example)
modify the only place that calls the SetTimelineLabel method inside executor.cc
instead of only recording non-transferable nodes, you also want to record Send/Recv nodes
be careful to call SetTimelineLabel only once inside NodeDone, as it sets the text string of a node, which will be parsed later by a Python script
build TensorFlow from the modified source
modify the model code (for example, inception_distributed_train.py) to use Timeline and the graph metadata correctly (a minimal sketch of that part is shown below)
Then you can run the training and retrieve JSON file once for each iteration! :)
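For reference, the Timeline/graph-metadata part of the last step looks roughly like this (TF 1.x; `sess` and `train_op` are placeholders for whatever your training script defines):

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Request a full trace for one training step and dump it as Chrome-trace JSON.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

sess.run(train_op, options=run_options, run_metadata=run_metadata)

tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline_step.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())
```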
Some suggestions that were too big for a comment:
You can't see data transfers in the timeline because tracing of Send/Recv is currently turned off; some discussion here -- https://github.com/tensorflow/tensorflow/issues/4809
In the latest version (a nightly that is 5 days old or newer) you can turn on verbose logging by doing export TF_CPP_MIN_VLOG_LEVEL=1, and it shows second-level timestamps (see here about higher granularity).
So with vlog perhaps you can use messages generated by this line to see the times at which Send ops are generated.
I have four different models with the same structure, which are used as predictors in the "main" problem. Each time in the "main" problem I call one of them to provide a prediction. Also, using the new observation, I update the weights of each network.
Currently, in order to differentiate between the models, I save them in four different checkpoints, and I load the relevant one each time to make a prediction or update it. When a network is updated, I save it again.
This procedure works well. The problem is that initializing the variables, loading the model, and saving it again are too expensive. Each time I call the network to update it, the call takes about 10 seconds, of which around 1 second is training and the remainder is initializing, loading, and saving.
As another approach, I tried to keep the models in memory. But since I have one dnn.py, which I call for each of the four problems, the names of the variables, parameters, etc. are the same. So TensorFlow gets confused about them, and it just does not work.
Since I may have more than four predictors (even 22), it is not reasonable to create separate copies of dnn.py with different variable names.
I would appreciate any help to write the procedure in an efficient way.
Reading variables from disk for each training step sounds inefficient; you should reorganize your network to keep those values in memory, e.g. by using variable_scope to keep the different sets of variables separate.
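A minimal sketch of that idea (TF 1.x; `build_dnn` stands in for whatever your dnn.py constructs): build every predictor once under its own variable_scope, keep one session alive, and never touch the disk between calls.

```python
import tensorflow as tf

def build_dnn(inputs):
    # Stand-in for the network defined in dnn.py.
    hidden = tf.layers.dense(inputs, 64, activation=tf.nn.relu)
    return tf.layers.dense(hidden, 1)

NUM_MODELS = 4  # or 22 -- only this constant changes
inputs = tf.placeholder(tf.float32, shape=[None, 10])
outputs = []
for i in range(NUM_MODELS):
    # Each scope gives the copy its own variable names (model_0/..., model_1/...),
    # so the variables no longer collide.
    with tf.variable_scope('model_%d' % i):
        outputs.append(build_dnn(inputs))

sess = tf.Session()
sess.run(tf.global_variables_initializer())

# Predict with model 2, for example; all weights stay in memory between calls.
# pred = sess.run(outputs[2], feed_dict={inputs: some_batch})
```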