Tensorflow: what's the SplitByWorker exactly done in distributed version? - tensorflow

In Tensorflow, there's a placement algorithm (which named as placer.cc in master branch) used for mapping node (op) to devices.
Supposed that in distributed TF, there's
1 client graph and 2 workers(or maybe more)
Without specifying node operations to specified workers or devices on workers such as with tf.device("/job:worker/task:7"):
Without using tf.train.replica_device_setter() for model replication on workers.
Question:
What's the SplitByWorker actually done?
I've read the SplitByDevice and it seems to retrive by node->device_name() if user explicitly specified a device and placement algo places the whole worker0 subgraph to /job:worker0/gpu:0
Is there any API or tf source code that actually make partition of client graph to different subgraphs, and send to different workers?
I know that tf.train.replica_device_setter() is used to create replicas on workers and for place parameters to ps. But without using this, can tensorflow partition model for me if there's more than 1 workers? (Obviously 1 worker get the whole client graph as its "subgraph")
What's the default option for tensorflow if the client doesn't partition the graph all all?
If the ans of 2 is NO, then I must partition by myself through explicit specifying. But how would tf do to if the client doesn't partition the graph all? Like placement algorithms default option is put all nodes on GPU:0, would it place the whole client graph to worker0 without partition and let worker1 be idle?
Please give me some insight for this, any answers or source code references are appriciated (:

Related

Optimize batch transform inference on sagemaker

With current batch transform inference I see a lot of bottlenecks,
Each input file can only have close to 1000 records
Currently it is processing 2000/min records on 1 instance of ml.g4dn.12xlarge
GPU instance are not necessarily giving any advantage over cpu instance.
I wonder if this is the existing limitation of the currently available tensorflow serving container v2.8. If thats the case config should I play with to increase the performance
i tried changing max_concurrent_transforms but doesn't seem to really help
my current config
transformer = tensorflow_serving_model.transformer(
instance_count=1,
instance_type="ml.g4dn.12xlarge",
max_concurrent_transforms=0,
output_path=output_data_path,
)
transformer.transform(
data=input_data_path,
split_type='Line',
content_type="text/csv",
job_name = job_name + datetime.now().strftime("%m-%d-%Y-%H-%M-%S"),
)
Generally speaking, you should first have a performing model (steps 1+2 below) yielding a satisfactory TPS, before you move over to batch transform parallelization techniques to push your overall TPS higher with parallization nobs.
Steps:
GPU enabling - Run manual test to see that your model can utilize GPU instances to begin with (this isn't related to batch transform).
picking instance - Use SageMaker Inference recommender to find the the most cost/effective instance type to run inference on.
Batch transform inputs - Sounds like you have multiple input files which is needed if you'll want to speed up the job by adding more instances.
Batch Transform Job single instance noobs - If you are using the CreateTransformJob API, you can reduce the time it takes to complete batch transform jobs by using optimal values for parameters such as MaxPayloadInMB, MaxConcurrentTransforms, or BatchStrategy. The ideal value for MaxConcurrentTransforms is equal to the number of compute workers in the batch transform job. If you are using the SageMaker console, you can specify these optimal parameter values in the Additional configuration section of the Batch transform job configuration page. SageMaker automatically finds the optimal parameter settings for built-in algorithms. For custom algorithms, provide these values through an execution-parameters endpoint.
Batch transform cluster size - Increase the instance_count to more than 1, using the cost/effective instance you found in (1)+(2).

Local Model performance in Tensorflow Federated

I am implementing federated learning through tensorflow-federated. The tutorial and all other material available compared the accuracy of the federated (global) model after each communication round. Is there a way I can compute the accuracy of each local model to compare against federated (global) model.
Summary:
Total number of clients: 15
For each communication round: Local vs Federated Model performance
References:
(https://colab.research.google.com/github/tensorflow/federated/blob/main/docs/tutorials/federated_learning_for_image_classification.ipynb#scrollTo=blQGiTQFS9_r)
I dont know how you can achieve this with tff.learning.build_federated_averaging_process but I recommend you to take a look at this simple fedavg implementation. Here you can use test_data -the same evaluation dataset you use in server model- for each client. I would suggest you to do client_test_datasets = [test_data for x in sampled_train_ids]. Then pass this as iterative_process.next(server_state, sampled_train_data, client_test_datasets ). Here you need to change the signatures for run_one_round and client_update_fn in the simple_fedavg_tff.py. In each case the signatures from test datasets shall be same as the ones for training dataset. Dont forget actually passing appropriate test datasets as input to each. Now move on to simple_fedavg_tf.py and change your client_update. Here you basically need to write evaluation very similar to one done for server model. Thereafter print the evaluation results if you wish or change the outputs for each level (tf.function, tff.tf_computation, and tff.federated_computation) and pass the eval results as output. If you go this way dont forget to update the output of iterative_process.next
edit: I assumed you wanted the accuracy of clients when the test dataset is the same as server test dataset.

Learning parameters of each simulated device

Does tensorflow-federated support assigning different hyper-parameters(like batch-size or learning rate) for different simulated devices?
Currently, you may find this a bit unnatural, but yes, such a thing is possible.
One approach to doing this that is supported today is to have each client take its local learning rate as a top-level parameter, and use this in the training. A dummy example here would be (sliding the model parameter in the computations below) something along the lines of
#tff.tf_computation(tff.SequenceTyoe(...), tf.float32)
def train_with_learning_rate(ds, lr):
# run training with `tf.data.Dataset` ds and learning rate lr
...
#tff.federated_computation(tff.FederatedType([tff.SequenceType(...), tf.float32])
def run_one_round(datasets_and_lrs):
return tff.federated_mean(
tff.federated_map(train_with_learning_rate, datasets_and_lrs))
Invoking the federated computation here with a list of tuples with the first element of the tuple representing the clients data and the second element representing the particular client's learning rate, would give what you want.
Such a thing requires writing custom federated computations, and in particular likely defining your own IterativeProcess. A similar iterative process definition was recently open sourced here, link goes to the relevant local client function definition to allow for learning rate scheduling on the clients by taking an extra integer parameter representing the round number, it is likely a good place to look.

Problem when predicting via multiprocess with Tensorflow

I have 4 (or more) models (same structure but different training data). Now I want to ensemble them to make a prediction. I want to pre-load the models and then predict one input message (one message at a time) in parallel via multiprocess. However, the program always stops at "session.run" step. I could not figure it out why.
I tried passing all arguments to the function in each process, as shown in the code below. I also tried using a Queue object and put all the data (except the model object) in the queue. I also tried to set the number of process to 1. It made no difference.
with Manager() as manager:
first_level_test_features=manager.list()
procs =[]
for id in range(4):
p = Process(target=predict, args=(id, (message, models, configs, vocabs, emoji_dict,first_level_test_features)))
procs.append(p)
p.start()
for p in procs:
p.join()
I did not get any error message since it is just stuck there. I would expect the program can start multiple processes and each process uses the model pass to it to make the prediction.
I am unsure how session sharing along different Processes would work, and this is probably where your issue comes from. Given the way TensorFlow works, I would advise implementing the ensemble call as a graph operation, so that it can be run through a single session.run call, with TF handling the parallelization of computations wherever possible.
In practice, if you have symbolic tensors representing the models' predictions, you could use a TF operation to aggregate them (tf.concat, tf.reduce_mean, tf.add_n... whichever suits your design) and end up with a single symbolic tensor representing the ensemble prediction.
I hope this helps; if not, please provide some more details as to what your setting is, notably which form your models have.

When should we use the place_pruned_graph config?

Question:As the title saied, I wander When should we use the config place_pruned_graph in GraphOptions. What's the purpose of this config?
I'm not clear to the comment about this config:
// Only place the subgraphs that are run, rather than the entire graph.
//
// This is useful for interactive graph building, where one might
// produce graphs that cannot be placed during the debugging
// process. In particular, it allows the client to continue work in
// a session after adding a node to a graph whose placement
// constraints are unsatisfiable.
We know that Tensorflow will partition a entire graph into several subgraphs in normal. And the following code from CreateGraphs of direct_session.cc takes the else branch in normal.(as far as I can see, I never found the case taking the if branch(so I don't know when should we trigger it).
if (options_.config.graph_options().place_pruned_graph()) {
// Because we are placing pruned graphs, we need to create a
// new SimpleGraphExecutionState for every new unseen graph,
// and then place it.
SimpleGraphExecutionStateOptions prune_options;
prune_options.device_set = &device_set_;
prune_options.session_options = &options_;
prune_options.stateful_placements = stateful_placements_;
TF_RETURN_IF_ERROR(SimpleGraphExecutionState::MakeForPrunedGraph(
execution_state_->original_graph_def().library(), prune_options,
execution_state_->original_graph_def(), subgraph_options,
&temp_exec_state_holder, &client_graph));
execution_state = temp_exec_state_holder.get();
} else {
execution_state = execution_state_.get();
TF_RETURN_IF_ERROR(
execution_state->BuildGraph(subgraph_options, &client_graph));
}
The short answer? Never. The longer answer requires me to explain why this option exists at all.
So why does TensorFlow include this convoluted configuration option and logic to handle it? It's a historical accident that came about when tensorflow::DirectSession and tensorflow::GrpcSession had different internal implementations:
The tensorflow::GrpcSession used a single SimpleGraphExecutionState for the entire graph in a session. The net effect of this was that the placer—which is responsible for assigning devices to each node in the graph—would run before the graph was pruned.
The tensorflow::DirectSession originally used one SimpleGraphExecutionState for each pruned subgraph, with some special logic for sharing the placements of stateful nodes between invocations. Therefore the placer would run after the graph was pruned, and could make different decisions about where to place stateful nodes.
The benefit of the tensorflow::GrpcSession approach (place_pruned_graph = false) is that it takes into account all of the colocation constraints in the graph when running the placement algorithm, even if they don't occur in the subgraph being executed. For example, if you had an embedding matrix, and wanted to optimize it using the SparseApplyAdagrad op (which only has a CPU implementation), TensorFlow would figure out that the embedding matrix should be placed on CPU.
By contrast, if you specified no device for the embedding matrix and set placed_pruned_graph = true the matrix would (most likely) be placed on GPU when you ran its initializer, because all of the ops in the initialization subgraph would be runnable on GPU. And, since variables cannot move between devices, TensorFlow would not be able to issue the subgraph that ran SparseApplyAdagrad on the matrix. This was a real issue in the earliest version of TensorFlow.
So why support place_pruned_graph = true at all? It turns out that it is useful when using TensorFlow interactively. The placed_pruned_graph = false option is unforgiving: once the graph for a session contains a node that cannot be placed, that session is useless, because the placement algorithm runs on the whole graph, it would fail every time it is invoked, and therefore no steps could run. When you use a tf.InteractiveSession, we assume that you are using a REPL (or Jupyter notebook) and that it's beneficial to allow you to continue after making such a mistake. Therefore in a tf.InteractiveSession we set place_pruned_graph = true so that you can continue to use the session after adding an unplaceable node (as long as you don't try to run that node in a pruned subgraph).
There is probably a better approach than place_pruned_graph = true for interactive use, but we haven't investigated adding one. Suggestions are always welcome on the GitHub issues page.