Optimize batch transform inference on sagemaker - batch-processing

With current batch transform inference I see a lot of bottlenecks,
Each input file can only have close to 1000 records
Currently it is processing 2000/min records on 1 instance of ml.g4dn.12xlarge
GPU instance are not necessarily giving any advantage over cpu instance.
I wonder if this is the existing limitation of the currently available tensorflow serving container v2.8. If thats the case config should I play with to increase the performance
i tried changing max_concurrent_transforms but doesn't seem to really help
my current config
transformer = tensorflow_serving_model.transformer(
instance_count=1,
instance_type="ml.g4dn.12xlarge",
max_concurrent_transforms=0,
output_path=output_data_path,
)
transformer.transform(
data=input_data_path,
split_type='Line',
content_type="text/csv",
job_name = job_name + datetime.now().strftime("%m-%d-%Y-%H-%M-%S"),
)

Generally speaking, you should first have a performing model (steps 1+2 below) yielding a satisfactory TPS, before you move over to batch transform parallelization techniques to push your overall TPS higher with parallization nobs.
Steps:
GPU enabling - Run manual test to see that your model can utilize GPU instances to begin with (this isn't related to batch transform).
picking instance - Use SageMaker Inference recommender to find the the most cost/effective instance type to run inference on.
Batch transform inputs - Sounds like you have multiple input files which is needed if you'll want to speed up the job by adding more instances.
Batch Transform Job single instance noobs - If you are using the CreateTransformJob API, you can reduce the time it takes to complete batch transform jobs by using optimal values for parameters such as MaxPayloadInMB, MaxConcurrentTransforms, or BatchStrategy. The ideal value for MaxConcurrentTransforms is equal to the number of compute workers in the batch transform job. If you are using the SageMaker console, you can specify these optimal parameter values in the Additional configuration section of the Batch transform job configuration page. SageMaker automatically finds the optimal parameter settings for built-in algorithms. For custom algorithms, provide these values through an execution-parameters endpoint.
Batch transform cluster size - Increase the instance_count to more than 1, using the cost/effective instance you found in (1)+(2).

Related

Local Model performance in Tensorflow Federated

I am implementing federated learning through tensorflow-federated. The tutorial and all other material available compared the accuracy of the federated (global) model after each communication round. Is there a way I can compute the accuracy of each local model to compare against federated (global) model.
Summary:
Total number of clients: 15
For each communication round: Local vs Federated Model performance
References:
(https://colab.research.google.com/github/tensorflow/federated/blob/main/docs/tutorials/federated_learning_for_image_classification.ipynb#scrollTo=blQGiTQFS9_r)
I dont know how you can achieve this with tff.learning.build_federated_averaging_process but I recommend you to take a look at this simple fedavg implementation. Here you can use test_data -the same evaluation dataset you use in server model- for each client. I would suggest you to do client_test_datasets = [test_data for x in sampled_train_ids]. Then pass this as iterative_process.next(server_state, sampled_train_data, client_test_datasets ). Here you need to change the signatures for run_one_round and client_update_fn in the simple_fedavg_tff.py. In each case the signatures from test datasets shall be same as the ones for training dataset. Dont forget actually passing appropriate test datasets as input to each. Now move on to simple_fedavg_tf.py and change your client_update. Here you basically need to write evaluation very similar to one done for server model. Thereafter print the evaluation results if you wish or change the outputs for each level (tf.function, tff.tf_computation, and tff.federated_computation) and pass the eval results as output. If you go this way dont forget to update the output of iterative_process.next
edit: I assumed you wanted the accuracy of clients when the test dataset is the same as server test dataset.

Create round-robin sharding while generating sharded tfrecords

I am new to tensorflow and I am working on image segmentation problem in tensorflow 1.14. I have a huge dataset and generating tfrecords is very slow, when I try to generate one big tfrecord file. So, I would like to create 'n' shards of tfrecords. I could not find a way to do it online. Say I have 600 images and 600 masks. I want to generate 6 shards of tfrecords, with 100 images and 100 masks each in round robin fashion. A high level /pseudo-code of what I want is as follows -
sharded_tf_record_writer:
create n TFRecordWriter
----> for each_item in n TFRecordWriter
-----> write_example in round-robin fashion
I did search online and could not find relevant answer. I do not want to use apache beam for sharding. I appreciate any idea/help/guidance to achieve this.
I had asked the same question in one of the issues of tensorflow datasets and the user - Conchylicultor responded this -
Writing is done by _TFRecordWriter. Tfds will automatically compute the required number of shards and distribute examples across shards, However each shard is written sequentially.
You do not have control over the number of shards, it is also automatically computed.
However, the fact that examples are distributed between shards do not make the writing faster as examples are not pre-processed in parallel. If you want parallelism, then you'll have to use Apache Beam which allow to scale even to huge datasets
The link to the tensorflow/datasets issue is - https://github.com/tensorflow/datasets/issues/676
This might help.
Since you are working with object detection in tensorflow, there are some nice code in the official Tensorflow models repository that will do what you want. Note this code is for Tensorflow2 (not sure if it'll work in TF1)
See this example of writing sharded tfrecords from coco annotations. The idea is that you open up a list of TFRecordWriter in an exit stack (using contextlib2.ExitStack()), which will automatically close the TFRecords when each thread finishes writing to it.
The utility function open_sharded_output_tfrecords function creates this list of TFRecordWriter
import contextlib2
import tensorflow as tf
with contextlib2.ExitStack() as tf_record_close_stack, tf.gfile.GFile(
annotations_file, 'r'
) as fid:
output_tfrecords = tf_record_creation_util.open_sharded_output_tfrecords(
tf_record_close_stack, output_path, num_shards
)
Next you can use the ProcessPoolExecutor to write tfrecords into each shard in a round-robin fashion in parallel (4 workers in this example)
from concurrent.futures.process import ProcesPoolExecutor
with ProcessPoolExecutor(4) as executor:
for idx, image in enumerate(images):
futures = []
future = executor.submit(
_write_tf_record,
image,
idx,
num_shards,
output_tfrecords,
)
futures.append(future)
for future in futures:
future.result()
where _write_tf_record may look something like this:
def _write_tf_record(image, idx, num_shards, output_tfrecords)
tf_example = create_tf_example(image)
shard_idx = idx % num_shards
output_tfrecords[shard_idx].write(tf_example.SerializeToString())
Just make sure you have more shards than multiprocess workers, otherwise the same writer may be accessed by two different processes.

How to run a bigger batch with AWS SageMaker Batch Transform

I created an XGBoost model with AWS SageMaker. Now I'm trying to use it through Batch Transform Job, and it's all going pretty well for small batches.
However, there's a slightly bigger batch of 600.000 rows in a ~16MB file and I can't manage to run it in one go. I tried two things:
1.
Setting 'Max payload size' of the Transform job to its maximum (100 MB):
transformer = sagemaker.transformer.Transformer(
model_name = config.model_name,
instance_count = config.inference_instance_count,
instance_type = config.inference_instance_type,
output_path = "s3://{}/{}".format(config.bucket, config.s3_inference_output_folder),
sagemaker_session = sagemaker_session,
base_transform_job_name = config.inference_job_prefix,
max_payload = 100
)
However, I still get an error (through console CloudWatch logs):
413 Request Entity Too Large
The data value transmitted exceeds the capacity limit.
2.
Setting max_payload to 0, which, by specification, Amazon SageMaker should interpret as no limit on the payload size.
In that case the job finishes successfully, but the output file is empty (0 bytes).
Any ideas either what I'm doing wrong, or how to run a bigger batch?
Most of SageMaker algorithms set their own default execution parameters with 6 MB in MaxPayloadInMB, so if you are getting 413 from SageMaker algorithms, you are likely to be exceeding the maximum payload they can support. Assuming each row is less than 6 MB in the file, you can fix this by leaving MaxPayloadInMB unset to fallback to the algorithm's default size and setting SplitType to "Line" instead, so it can split the data into smaller batches (https://docs.aws.amazon.com/sagemaker/latest/dg/API_TransformInput.html#SageMaker-Type-TransformInput-SplitType).
this helped me resolve the issue by setting strategy='SingleRecord' in the transformer + you can also add a stronger instance via instance_type and distribute via instance_count.
I have tried the above solutions, but unfortunately they didn't work for me.
Here is what worked for me: https://stackoverflow.com/a/55920737/7091978
Basically, I set "max_payload" from 0 to 1.

Tensorflow: what's the SplitByWorker exactly done in distributed version?

In Tensorflow, there's a placement algorithm (which named as placer.cc in master branch) used for mapping node (op) to devices.
Supposed that in distributed TF, there's
1 client graph and 2 workers(or maybe more)
Without specifying node operations to specified workers or devices on workers such as with tf.device("/job:worker/task:7"):
Without using tf.train.replica_device_setter() for model replication on workers.
Question:
What's the SplitByWorker actually done?
I've read the SplitByDevice and it seems to retrive by node->device_name() if user explicitly specified a device and placement algo places the whole worker0 subgraph to /job:worker0/gpu:0
Is there any API or tf source code that actually make partition of client graph to different subgraphs, and send to different workers?
I know that tf.train.replica_device_setter() is used to create replicas on workers and for place parameters to ps. But without using this, can tensorflow partition model for me if there's more than 1 workers? (Obviously 1 worker get the whole client graph as its "subgraph")
What's the default option for tensorflow if the client doesn't partition the graph all all?
If the ans of 2 is NO, then I must partition by myself through explicit specifying. But how would tf do to if the client doesn't partition the graph all? Like placement algorithms default option is put all nodes on GPU:0, would it place the whole client graph to worker0 without partition and let worker1 be idle?
Please give me some insight for this, any answers or source code references are appriciated (:

In distributed tensorflow, how to write to summary from workers as well

I am using google cloud ml distributed sample for training a model on a cluster of computers. Input and output (ie rfrecords, checkpoints, tfevents) are all on gs:// (google storage)
Similarly to the distributed sample, I use an evaluation step that is called at the end, and the result is written as a summary, in order to use parameter hypertuning / either within Cloud ML, or using my own stack of tools.
But rather than performing a single evaluation on a large batch of data, I am running several evaluation steps, in order to retrieve statistics on the performance criteria, because I don't want to limited to a single value. I want to get information regarding the performance interval. In particular, the variance of performance is important to me. I'd rather select a model with lower average performance but with better worst cases.
I therefore run several evaluation steps. What I would like to do is to parallelize these evaluation steps because right now, only the master is evaluating. When using large clusters, it is a source of inefficiency, and task workers to evaluate as well.
Basically, the supervisor is created as :
self.sv = tf.train.Supervisor(
graph,
is_chief=self.is_master,
logdir=train_dir(self.args.output_path),
init_op=init_op,
saver=self.saver,
# Write summary_ops by hand.
summary_op=None,
global_step=self.tensors.global_step,
# No saving; we do it manually in order to easily evaluate immediately
# afterwards.
save_model_secs=0)
At the end of training I call the summary writer. :
# only on master, this is what I want to remove
if self.is_master and not self.should_stop:
# I want to have an idea of statistics of accuracy
# not just the mean, hence I run on 10 batches
for i in range(10):
self.global_step += 1
# I call an evaluator, and extract the accuracy
evaluation_values = self.evaluator.evaluate()
accuracy_value = self.model.accuracy_value(evaluation_values)
# now I dump the accuracy, ready to use within hptune
eval_summary = tf.Summary(value=[
tf.Summary.Value(
tag='training/hptuning/metric', simple_value=accuracy_value)
])
self.sv.summary_computed(session, eval_summary, self.global_step)
I tried to write summaries from workers as well , but I got an error : basically summary can be written from masters only. Is there any easy way to workaround ? The error is : "Writing a summary requires a summary writer."
My guess is you'd create a separate summary writer on each worker yourself, and write out summaries directly rather.
I suspect you wouldn't use a supervisor for the eval processing either. Just load a session on each worker for doing eval with the latest checkpoint, and writing out independent summaries.