How to properly use ShardedByS3Key in a distributed training scenario?

Following the API reference, one way to optimize data ingestion for distributed training is using ShardedByS3Key.
Does anyone have code samples for using ShardedByS3Key in the context of distributed training? Concretely, what changes are necessary to, e.g., PyTorch's DistributedSampler (should it be used at all?) or TensorFlow's tf.data pipeline?

According to the technique of "Sharded Data Parallelism":
The standard data parallelism technique replicates the training states
across the GPUs in the data parallel group, and performs gradient
aggregation based on the AllReduce operation.
So simply leave the default mode, FullyReplicated, in the distribution parameter of your TrainingInput: the parallelism does not come from dividing the data across the upstream instances but happens later, on the GPUs.
See the guide on "How to apply Sharded data parallelism to your training work" or the full example notebook "Train GPT-2 with near-linear scaling using Sharded Data Parallelism technique in SageMaker Model Parallelism Library".
The last example sets the parameters explicitly, step by step.
For example, at a minimum you have to set the distribution dict parameter on the PyTorch (or TensorFlow) estimator to enable SageMaker distributed data parallelism:
{ "smdistributed": { "dataparallel": { "enabled": True } } }
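With FullyReplicated, every instance receives the whole dataset, so the per-worker split has to happen inside the training script, which is the job a sampler such as PyTorch's DistributedSampler does. The following is a minimal pure-Python sketch of that kind of strided split, not the library's code; shard_indices, rank, and world_size are names chosen for illustration, and shuffling and padding are left out.

```python
def shard_indices(num_samples, rank, world_size):
    """Return the sample indices one worker processes in an epoch.

    Illustrative only: every worker sees the same full dataset
    (FullyReplicated) and takes every world_size-th sample,
    starting at its own rank.
    """
    return list(range(rank, num_samples, world_size))


# Hypothetical job: 10 samples split across 4 workers.
shards = [shard_indices(10, rank, world_size=4) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard}")
```

Each index lands on exactly one worker, so together the workers cover the dataset once per epoch. If the input were sharded upstream with ShardedByS3Key instead, each instance would presumably already hold a different subset of the objects, and a split like this would only need to be applied among the workers on the same instance; treat that as an assumption to verify against the API reference rather than something the answer above states.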

Related

Different optimization with different TF versions

I'm trying to train a convolutional neural network with Keras and TensorFlow 2.6; I also trained it with TensorFlow 1.11. I think I did the migration correctly (both networks converged), but the results are very different, and worse in TF 2.6. I used the Adam optimizer in both cases with the same hyperparameters (learning_rate = 0.001), but the loss is optimized better in TF 1.11 than in TF 2.6.
I'm trying to find out where the differences could come from. What must be taken into account when working with different TF versions? Can there be important numerical differences? I know that in TF 1.x the default mode is graph and in TF 2 the default is eager; I don't know whether this could cause different behavior during training.
It surprises me how much the loss is reduced in the first epochs, reaching a lower value at the end of training.

You are right that the two versions run in different modes by default, eager and graph, but the loss function is determined by how much the optimized values need to change, as calculated by your own method or by the configured one.
You cannot directly compare one model's training history with another's. Running it several times, you may find that TF 1 is faster and reaches a lower loss; to understand why, you need to review the changelog.
Loss functions have been updated between versions. Graph mode is a powerful technique, but TF 2.x supports accessing values directly as they are computed, which is why it offers convenient mechanisms such as callbacks, dynamic functions, and updating values at runtime. (A useful exercise for a student or user is to compare both versions on the same tasks.)
Methods that are equivalent across versions should not by themselves produce different results.

What does the use_multiprocessing input argument in Keras model.fit do?

I am training an LSTM autoencoder model in Python with Keras, using only the CPU.
I can see that there is an argument called use_multiprocessing in the fit function. Could you please explain in simple terms what exactly this argument does? I read the explanation on tensorflow.org, but I cannot tell from it how my model would be affected if I set the parameter to True. I am looking for ways to speed up the training of my model, and I am wondering whether this parameter would help.

The use_multiprocessing (and workers and max_queue_size) parameters apply to batch data generation. The clue in the documentation is this: "Used for generator or keras.utils.Sequence input only" [ref https://keras.io/api/models/model_training_apis/#fit-method]. Deep under the hood, Keras uses an OrderedEnqueuer to wrap your input.
If use_multiprocessing is True and workers > 0, then keras will create multiple (number = workers) processes to run simultaneously and prepare batches from your generator/sequence. They will try to keep the queue of batches ready for training up to max_queue_size.
If use_multiprocessing is False and workers > 1, then keras will create multiple (number = workers) threads to simultaneously prepare batches, similar to above (but your input data object needs to be thread safe).
If batch data generation is the bottleneck in your training process, this can speed things up a lot. I have found that use_multiprocessing=True speeds up work bound by batch data generation linearly, e.g. 2 workers = 2x as fast (though there is overhead to start the processes). With use_multiprocessing=False and threads, I have found an unpredictable 0%-15% speed increase, without that overhead.
See also this question, which has lots of detail:
How to define max_queue_size, workers and use_multiprocessing in keras fit_generator()?
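The idea of workers filling a bounded queue of ready batches can be sketched with the standard library alone. This is not Keras's OrderedEnqueuer; make_batch and prefetching_batches are hypothetical names, and threads are used rather than processes to keep the example short.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def make_batch(index):
    """Hypothetical stand-in for an expensive batch-preparation step."""
    return [index * 10 + offset for offset in range(4)]


def prefetching_batches(num_batches, workers=2, max_queue_size=4):
    """Yield batches in order while worker threads prepare upcoming ones."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = deque()
        next_index = 0
        while next_index < num_batches or pending:
            # Keep at most max_queue_size batches queued or in progress.
            while next_index < num_batches and len(pending) < max_queue_size:
                pending.append(pool.submit(make_batch, next_index))
                next_index += 1
            # Wait for the oldest batch first, which preserves batch order.
            yield pending.popleft().result()


batches = list(prefetching_batches(num_batches=5))
print(batches[0], batches[-1])
```

The consumer (the training loop) only waits when the oldest pending batch is not finished yet, which is why the benefit depends on batch generation being the bottleneck.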

What do the TensorFlow Dataset's functions cache() and prefetch() do?

I am following TensorFlow's Image Segmentation tutorial. It contains the following lines:
train_dataset = train.cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE).repeat()
train_dataset = train_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
What does the cache() function do? The official documentation is pretty obscure and self-referencing:
Caches the elements in this dataset.
What does the prefetch() function do? The official documentation is again pretty obscure:
Creates a Dataset that prefetches elements from this dataset.

The tf.data.Dataset.cache transformation can cache a dataset, either in memory or on local storage. This will save some operations (like file opening and data reading) from being executed during each epoch. The next epochs will reuse the data cached by the cache transformation.
You can find more about the cache in tensorflow here.
Prefetch overlaps the preprocessing and model execution of a training step. While the model is executing training step s, the input pipeline is reading the data for step s+1. Doing so reduces the step time to the maximum (as opposed to the sum) of the training time and the time it takes to extract the data.
You can find more about prefetch in tensorflow here.
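A rough back-of-the-envelope model of both effects, in plain Python rather than tf.data: source_reads and step_time_ms are hypothetical helpers, and the element counts and timings are made-up numbers chosen only to show the arithmetic.

```python
def source_reads(num_elements, epochs, cached):
    """Count expensive reads (file opens, parsing) over a whole run."""
    # With a cache, only the first epoch reads from the source.
    return num_elements if cached else num_elements * epochs


def step_time_ms(load_ms, train_ms, prefetch):
    """Steady-state time per training step, in milliseconds."""
    # Prefetching overlaps loading step s+1 with training step s.
    return max(load_ms, train_ms) if prefetch else load_ms + train_ms


print(source_reads(1000, epochs=5, cached=False))  # 5000
print(source_reads(1000, epochs=5, cached=True))   # 1000
print(step_time_ms(30, 50, prefetch=False))        # 80
print(step_time_ms(30, 50, prefetch=True))         # 50
```

In this simplified model, caching removes the repeated reads after the first epoch, and prefetching hides whichever of loading or training is shorter behind the longer one.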
Hope this answers your question. Happy Learning.

How to smoothly produce Tensorflow auc summaries for training and test sets?

Tensorflow describes writing file summaries to visualize graph execution.
I envision three stages:
training the data (with optimization)
measuring accuracy on the training set (no optimization)
measuring accuracy on the test set (no optimization!)
I'd like all stages in the same script, as in the evaluate function of the wide_and_deep tutorial, but with the low-level API. I'd like three different graphs for stats like loss or AUC, one for each stage.
Suppose I use one session, and in each stage I define an AUC summary op:
# define auc
auc, auc_op = tf.metrics.auc(labels, predictions)
# summary scalar to track it
tf.summary.scalar("auc", auc_op, family=family_name)
# merge all summaries for evaluation and later writing
summary_op = tf.summary.merge_all()
...
summary_writer.add_summary(summary, step_num)
There are three graphs, but the first graph has all three runs on it, and the second graph has the last two runs (see below). What's worse, each stage starts from the previous state. This makes sense, because all the variables from the previous stages are still around.
I could use a different session for each stage, but that would throw away the model as well.
What is the smooth way to handle this?
I'd like to just clear some of the summary variables. I've tried re-initializing some variables, looked at related questions, read about name scope and variable scope and tried not to re-use variables for AUC, read about variables and sharing, looked into pruning nodes (though I don't understand it), etc. I have not made it work yet.
I am using the low-level API. I saw something like this in the high-level API in _eval_metric_ops, but I don't understand how they 'clear' the different stages. With name_scope?
Do I have to save and load the model into a new session just for this, or is there some clean way to graph each summary separately?

The metric ops will be local variables, so you could run tf.local_variables_initializer() in your Session, which will reset all of your metrics. You could also look through the local variables collection for those with "auc" in the name if you wanted to be a bit more discerning. The high-level way to do this would be to use an Estimator, which will manage metrics for you.
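Why the stages bleed into each other, and why a reset fixes it, can be shown with a toy streaming metric in plain Python. StreamingAccuracy is a hypothetical stand-in, not a TensorFlow class; its reset() plays the role that running tf.local_variables_initializer() plays for the metric's local variables.

```python
class StreamingAccuracy:
    """Toy streaming metric whose counters persist between updates."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Stand-in for re-running the local-variable initializer.
        self.correct = 0
        self.total = 0

    def update(self, labels, predictions):
        self.correct += sum(int(l == p) for l, p in zip(labels, predictions))
        self.total += len(labels)
        return self.correct / self.total


metric = StreamingAccuracy()
train_acc = metric.update([1, 0, 1, 1], [1, 0, 1, 1])   # 4 of 4 correct
leaked_acc = metric.update([1, 0, 1, 1], [0, 1, 1, 1])  # still includes train counts
metric.reset()
test_acc = metric.update([1, 0, 1, 1], [0, 1, 1, 1])    # 2 of 4 correct
print(train_acc, leaked_acc, test_acc)
```

Without the reset, the second stage reports a value that mixes in the first stage's counts, which matches the symptom of each stage starting from the previous state.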

At what stage is a tensorflow graph set up?

An optimizer typically runs the same computation graph for many steps until convergence. Does TensorFlow set up the graph at the beginning and reuse it for every step? What if I change the batch size during training? What if I make some minor change to the graph, like changing the loss function? What if I make some major change to the graph? Does TensorFlow pre-generate all possible graphs? Does TensorFlow know how to optimize the entire computation when the graph changes?

As keveman says, from the client's perspective there is a single TensorFlow graph. In the runtime, there can be multiple pruned subgraphs that contain just the nodes that are necessary to compute the values t1, t2 etc. that you fetch when calling sess.run([t1, t2, ...]).
If you call sess.run([t1, t2]), TensorFlow will prune the overall graph (sess.graph) down to the subgraph required to compute those values: i.e. the operations that produce t1 and t2 and all of their antecedents. If you subsequently call sess.run([t3, t4]), the runtime will prune the graph down to the subgraph required to compute t3 and t4. Each time you pass a new combination of values to fetch, TensorFlow will compute a new pruned graph and cache it; this is why the first sess.run() can be somewhat slower than subsequent ones.
If the pruned graphs overlap, TensorFlow will reuse the "kernel" for the ops that are shared. This is relevant because some ops (e.g. tf.Variable and tf.FIFOQueue) are stateful, and their contents can be used in both pruned graphs. This allows you, for example, to initialize your variables with one subgraph (e.g. sess.run(tf.initialize_all_variables())), train them with another (e.g. sess.run(train_op)), and evaluate your model with a third (e.g. sess.run(loss, feed_dict={x: ...})). It also lets you enqueue elements to a queue with one subgraph, and dequeue them with another, which is the foundation of the input pipelines.
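The pruning and caching described above can be illustrated with a small dependency walk in plain Python. This is only a sketch of the idea, not the TensorFlow runtime; the GRAPH dictionary, its node names, and the prune function are all hypothetical.

```python
# Hypothetical graph: each node maps to the nodes it takes as input.
GRAPH = {
    "x": [],
    "w": [],
    "init": ["w"],
    "logits": ["x", "w"],
    "loss": ["logits"],
    "train_op": ["loss", "w"],
}

_pruned_cache = {}


def prune(fetches):
    """Return the set of nodes needed to compute the fetched nodes."""
    key = frozenset(fetches)
    if key not in _pruned_cache:
        needed, stack = set(), list(fetches)
        while stack:
            node = stack.pop()
            if node not in needed:
                needed.add(node)
                stack.extend(GRAPH[node])
        # Cache per fetch combination so repeated calls skip the walk.
        _pruned_cache[key] = needed
    return _pruned_cache[key]


init_graph = prune(["init"])
loss_graph = prune(["loss"])
shared = init_graph & loss_graph
print(sorted(init_graph), sorted(loss_graph), sorted(shared))
```

Here the two pruned subgraphs overlap only in the variable node w, which mirrors the point that stateful ops are what let one subgraph initialize a value and another one use it.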

TensorFlow exposes only one graph that is visible to the user, namely the one specified by the user. The user can run the graph with Session.run() or by calling Tensor.eval() on some tensor. A Session.run() call can specify some tensors to be fed and others to be fetched. Depending on what needs to be fetched, the TensorFlow runtime could be internally constructing and optimizing various data structures, including a pruned version of the user-visible graph. However, this internal graph is not visible to the user in any way. No, TensorFlow doesn't 'pre-generate' all possible graphs. Yes, TensorFlow does perform extensive optimizations on the computation graph. And finally, changing the batch size of a tensor that is fed doesn't change the structure of the graph.
