How to run a bigger batch with AWS SageMaker Batch Transform - xgboost

I created an XGBoost model with AWS SageMaker. Now I'm trying to use it through Batch Transform Job, and it's all going pretty well for small batches.
However, there's a slightly bigger batch of 600.000 rows in a ~16MB file and I can't manage to run it in one go. I tried two things:
Setting 'Max payload size' of the Transform job to its maximum (100 MB):
transformer = sagemaker.transformer.Transformer(
    model_name=config.model_name,
    instance_count=config.inference_instance_count,
    instance_type=config.inference_instance_type,
    output_path="s3://{}/{}".format(config.bucket, config.s3_inference_output_folder),
    sagemaker_session=sagemaker_session,
    base_transform_job_name=config.inference_job_prefix,
    max_payload=100,  # in MB
)
However, I still get an error (through console CloudWatch logs):
413 Request Entity Too Large
The data value transmitted exceeds the capacity limit.
Setting max_payload to 0, which, by specification, Amazon SageMaker should interpret as no limit on the payload size.
In that case the job finishes successfully, but the output file is empty (0 bytes).
Any ideas either what I'm doing wrong, or how to run a bigger batch?

Most SageMaker algorithms set their own default execution parameters, with MaxPayloadInMB at 6 MB, so if you get a 413 from a SageMaker algorithm, you are likely exceeding the maximum payload it can support. Assuming each row in the file is smaller than 6 MB, you can fix this by leaving MaxPayloadInMB unset, so that it falls back to the algorithm's default size, and setting SplitType to "Line" so the data can be split into smaller batches.

Setting strategy='SingleRecord' in the transformer resolved the issue for me. You can also use a stronger instance via instance_type and distribute the work via instance_count.
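The suggestions above (leave MaxPayloadInMB unset, split the input by line, choose a batch strategy, scale the instances) all map onto fields of the CreateTransformJob request. Here is a minimal sketch that only builds the request dictionary. The helper name build_transform_request, the S3 paths, the instance type and the text/csv content type are assumptions for illustration and do not come from the question.

```python
def build_transform_request(job_name, model_name, input_s3_uri, output_s3_uri,
                            instance_type, instance_count=1,
                            batch_strategy="MultiRecord", max_payload_mb=None):
    """Build a CreateTransformJob request that splits the input file by line."""
    request = {
        "TransformJobName": job_name,
        "ModelName": model_name,
        # "MultiRecord" packs as many lines as fit into one payload,
        # "SingleRecord" sends one line per request.
        "BatchStrategy": batch_strategy,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": "text/csv",  # assumption: the input is CSV
            "SplitType": "Line",        # split the file on newlines
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri, "AssembleWith": "Line"},
        "TransformResources": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
        },
    }
    # Leave MaxPayloadInMB out entirely unless the caller asks for a value,
    # so the algorithm's own default applies.
    if max_payload_mb is not None:
        request["MaxPayloadInMB"] = max_payload_mb
    return request


# Hypothetical names and paths, for illustration only.
request = build_transform_request(
    job_name="xgboost-batch-example",
    model_name="my-xgboost-model",
    input_s3_uri="s3://my-bucket/inference/input/",
    output_s3_uri="s3://my-bucket/inference/output/",
    instance_type="ml.m5.xlarge",
)
print(sorted(request))
```

A dictionary like this could be passed to boto3 as boto3.client("sagemaker").create_transform_job(**request). In the SageMaker Python SDK the same options appear as strategy and max_payload on Transformer and as split_type on transform().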

I have tried the above solutions, but unfortunately they didn't work for me.
Here is what worked for me:
Basically, I changed "max_payload" from 0 to 1.


Optimize batch transform inference on sagemaker

With the current batch transform inference I see several bottlenecks:
Each input file can only hold close to 1000 records.
It currently processes about 2000 records/min on 1 ml.g4dn.12xlarge instance.
GPU instances do not necessarily give any advantage over CPU instances.
I wonder if this is a limitation of the currently available TensorFlow Serving container v2.8. If that's the case, which config should I tune to increase performance?
I tried changing max_concurrent_transforms, but it doesn't really seem to help.
My current config:
transformer = tensorflow_serving_model.transformer(
job_name = job_name +"%m-%d-%Y-%H-%M-%S"),
Generally speaking, you should first have a well-performing model (steps 1+2 below) that yields a satisfactory TPS before you move on to batch transform parallelization techniques to push your overall TPS higher with parallelization knobs.
1. GPU enabling - Run a manual test to check that your model can utilize GPU instances to begin with (this isn't related to batch transform).
2. Picking an instance - Use SageMaker Inference Recommender to find the most cost-effective instance type to run inference on.
3. Batch transform inputs - It sounds like you have multiple input files, which is needed if you want to speed up the job by adding more instances.
4. Batch transform job single-instance knobs - If you are using the CreateTransformJob API, you can reduce the time it takes to complete batch transform jobs by using optimal values for parameters such as MaxPayloadInMB, MaxConcurrentTransforms, or BatchStrategy. The ideal value for MaxConcurrentTransforms is equal to the number of compute workers in the batch transform job. If you are using the SageMaker console, you can specify these optimal parameter values in the Additional configuration section of the Batch transform job configuration page. SageMaker automatically finds the optimal parameter settings for built-in algorithms. For custom algorithms, provide these values through an execution-parameters endpoint.
5. Batch transform cluster size - Increase instance_count to more than 1, using the cost-effective instance you found in (1)+(2).
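The single-instance knobs interact with each other: the CreateTransformJob documentation also states that MaxConcurrentTransforms multiplied by MaxPayloadInMB may not exceed 100 MB. Here is a small sketch of a sanity check and a rough runtime estimate. The function names and the record count of 1,000,000 are hypothetical, and the estimate assumes throughput scales linearly with instance count, which real jobs only approximate.

```python
import math

# Documented cap on MaxConcurrentTransforms * MaxPayloadInMB.
MAX_TOTAL_PAYLOAD_MB = 100


def settings_are_valid(max_concurrent_transforms, max_payload_mb):
    """Return True if the two knobs together stay within the 100 MB cap."""
    return max_concurrent_transforms * max_payload_mb <= MAX_TOTAL_PAYLOAD_MB


def estimated_minutes(total_records, records_per_minute_per_instance, instance_count):
    """Rough job duration, assuming perfectly linear scaling across instances."""
    throughput = records_per_minute_per_instance * instance_count
    return math.ceil(total_records / throughput)


print(settings_are_valid(4, 6))    # 4 workers with 6 MB payloads: 24 MB in total
print(settings_are_valid(48, 6))   # 48 workers with 6 MB payloads: 288 MB in total

# 2000 records/min per instance is the figure from the question.
print(estimated_minutes(1_000_000, 2000, 1))
print(estimated_minutes(1_000_000, 2000, 4))
```

Because scaling out only helps when there are enough input files to spread across instances, the linear estimate is best read as a lower bound on the job duration.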

Pandas to Koalas does not solve spark.rpc.message.maxSize exceeded error

I have an existing Databricks job that relies heavily on pandas, and the code snippet below fails with the error "org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 101059:0 was 1449948615 bytes, which exceeds max allowed: spark.rpc.message.maxSize (268435456 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values"
The current code snippet is:
normalized_df = pd.DataFrame(data=normalized_data_remnan, columns=['idle_power_mean', 'total_eng_idle_sec', 'total_veh_idle_sec', 'total_sec_on', 'total_sec_load', 'positive_power_mean', 'time_fraction_eng_idle_pcnt', 'time_fraction_veh_idle_pcnt', 'negative_power_mean', 'mean_eng_idle_sec', 'mean_veh_idle_sec', 'mean_stand_still', 'num_start_stops', 'num_power_jump', 'positive_power_med', 'load_speed_med'])
where normalized_data_remnan is an ndarray produced by scipy.stats.zscore.
I thought moving this to Koalas would solve the issue, since Koalas uses distributed computing, so I converted the code as below.
import databricks.koalas as ks
normalized_df = ks.DataFrame(data=normalized_data_remnan, columns=['idle_power_mean', 'total_eng_idle_sec', 'total_veh_idle_sec', 'total_sec_on', 'total_sec_load', 'positive_power_mean', 'time_fraction_eng_idle_pcnt', 'time_fraction_veh_idle_pcnt', 'negative_power_mean', 'mean_eng_idle_sec', 'mean_veh_idle_sec', 'mean_stand_still', 'num_start_stops', 'num_power_jump', 'positive_power_med', 'load_speed_med'])
But even after this conversion I get the same error. Do you have any idea what causes it?
I am considering raising spark.rpc.message.maxSize to 2 GB. What is the maximum value of this parameter? My driver node has 128 GB of memory and 6 cores, and there are 8 workers in total, each with 64 GB and 32 cores.
This error usually occurs when a very large object is sent from the driver to the executors.
spark.rpc.message.maxSize is the largest message size (in MiB) allowed in "control plane" communication. If you are getting errors about the RPC message size, increase it. Its default value is 128.
Setting this property (spark.rpc.message.maxSize) in the Spark configuration when you start the cluster may resolve the error.
To reduce the size of each Spark RPC message, you can split the huge list into many smaller pieces by increasing the number of partitions.
largeList = [...] # This is a large list
partitionNum = 100 # Increase this number if necessary
rdd = sc.parallelize(largeList, partitionNum)
df = rdd.toDF()  # PySpark RDDs have no toDS(); toDF() expects row-like elements such as tuples or Rows
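Plugging the numbers from the error message into the arithmetic shows how far over the limit the task is. The setting is expressed in MiB and, as far as I know, Spark rejects values above 2047, so the 2 GB (2048 MiB) mentioned in the question would be just out of range. A short sketch using only the standard library follows; the function name is hypothetical, and the real setting needs some headroom above the computed minimum.

```python
import math

MIB = 1024 * 1024
# Largest value Spark accepts for spark.rpc.message.maxSize (in MiB).
SPARK_RPC_MAX_MIB = 2047


def min_rpc_max_size_mib(serialized_task_bytes):
    """Smallest whole-MiB setting that is at least as large as the task."""
    needed = math.ceil(serialized_task_bytes / MIB)
    if needed > SPARK_RPC_MAX_MIB:
        raise ValueError("task is too large for any allowed spark.rpc.message.maxSize")
    return needed


task_bytes = 1449948615    # "Serialized task ... was 1449948615 bytes"
limit_bytes = 268435456    # "max allowed: ... (268435456 bytes)"

print(limit_bytes // MIB)                # 256, the limit configured on this cluster
print(min_rpc_max_size_mib(task_bytes))  # 1383
```

Raising the limit to that range only postpones the problem if the data keeps growing, which is why splitting the data into more partitions is the more durable fix.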

What does "max_batch_size" mean in tensorflow-serving batching_config.txt?

I'm using tensorflow-serving on GPUs with --enable-batching=true.
However, I'm a little confused with max_batch_size in batching_config.txt.
My client sends an input tensor of shape [-1, 1000] in a single gRPC request, where dim0 ranges over (0, 200]. I set max_batch_size = 100 and receive errors such as:
"gRPC call return code: 3:Task size 158 is larger than maximum batch size 100"
"gRPC call return code: 3:Task size 162 is larger than maximum batch size 100"
It looks like max_batch_size limits dim0 of a single request. But TensorFlow Serving merges multiple requests into one batch, so I thought it referred to the total number of requests in a batch.
Here is a direct description from the docs.
max_batch_size: The maximum size of any batch. This parameter governs
the throughput/latency tradeoff, and also avoids having batches that
are so large they exceed some resource constraint (e.g. GPU memory to
hold a batch's data).
In ML, the first dimension usually represents the batch. So, as I understand it, TensorFlow Serving treats the size of the first dimension as the batch size and raises an error whenever it exceeds the allowed value. You can verify this by sending a few requests in which you keep the first dimension no larger than 100. I expect this to remove the error.
After that, you can change your inputs so that they are sent in the proper format.
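If a request can legitimately contain more rows than max_batch_size, one client-side workaround is to split it into several smaller requests before sending. Here is a minimal sketch with plain Python lists. split_into_chunks is a hypothetical helper, and the actual gRPC call is deliberately left out.

```python
def split_into_chunks(rows, max_batch_size):
    """Split a list of rows into consecutive chunks of at most max_batch_size rows."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [rows[i:i + max_batch_size] for i in range(0, len(rows), max_batch_size)]


# A request with 158 rows of 1000 features, like the one in the error message.
rows = [[0.0] * 1000 for _ in range(158)]
chunks = split_into_chunks(rows, max_batch_size=100)

# Each chunk would be sent as its own request and the results concatenated in order.
print([len(chunk) for chunk in chunks])  # [100, 58]
```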

how can we get benefit from sharding the data to speed the training time?

My main issue is this: I have 204 GB of training TFRecords for 2 million images, and 28 GB of validation TFRecords for 302,900 images. It takes 8 hours to train one epoch, so the full training will take 33 days. I want to speed this up by using multiple threads and shards, but I am a bit confused about a couple of things.
The API has a shard function, and the documentation says the following about it:
Creates a Dataset that includes only 1/num_shards of this dataset.
This dataset operator is very useful when running distributed training, as it allows each worker to read a unique subset.
When reading a single input file, you can skip elements as follows:
d = tf.data.TFRecordDataset(FLAGS.input_file)
d = d.shard(FLAGS.num_workers, FLAGS.worker_index)
d = d.repeat(FLAGS.num_epochs)
d = d.shuffle(FLAGS.shuffle_buffer_size)
d = d.map(parser_fn, num_parallel_calls=FLAGS.num_map_threads)
Important caveats:
Be sure to shard before you use any randomizing operator (such as shuffle).
Generally it is best if the shard operator is used early in the dataset pipeline. For example, when reading from a set of TFRecord files, shard before converting the dataset to input samples. This avoids reading every file on every worker. The following is an example of an efficient sharding strategy within a complete pipeline:
d = Dataset.list_files(FLAGS.pattern)
d = d.shard(FLAGS.num_workers, FLAGS.worker_index)
d = d.repeat(FLAGS.num_epochs)
d = d.shuffle(FLAGS.shuffle_buffer_size)
d = d.repeat()
d = d.interleave(tf.data.TFRecordDataset,
                 cycle_length=FLAGS.num_readers, block_length=1)
d = d.map(parser_fn, num_parallel_calls=FLAGS.num_map_threads)
So my question about the code above is: when I shard my data with the shard function and set the number of shards (num_workers) to 10, I will have 10 splits of my data. Should I then set num_readers in d.interleave to 10 to guarantee that each reader takes one of the 10 splits?
And how can I control which split the interleave function will take? If I set the shard index (worker_index) in the shard function to 1, it will give me the first split. Can anyone give me an idea of how to perform this distributed training using the functions above?
And what about num_parallel_calls? Should I set it to 10 as well?
Note that I have a single TFRecord file for training and another one for validation. I have not split the TFRecord files into multiple files.
First of all, how come the dataset is 204 GB for only 2 million images? I think your images are far too large. Try resizing them. After all, you will probably need to resize them to 224 x 224 in the end.
Second, try to reduce the size of your model. It could be either too deep or not efficient enough.
Third, try to parallelize your input reading process. It could be the bottleneck.
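On the question of which split a worker gets: Dataset.shard(num_shards, index) keeps every num_shards-th element starting at position index, and index is zero-based, so index 1 selects the second split rather than the first. The plain-Python emulation below is only a sketch of those semantics and is not the tf.data implementation.

```python
def shard(elements, num_shards, index):
    """Emulate Dataset.shard: keep elements whose position % num_shards == index."""
    if not 0 <= index < num_shards:
        raise ValueError("index must satisfy 0 <= index < num_shards")
    return [e for pos, e in enumerate(elements) if pos % num_shards == index]


records = list(range(10))  # stand-in for 10 records (or 10 file names)

print(shard(records, num_shards=3, index=0))  # [0, 3, 6, 9]
print(shard(records, num_shards=3, index=1))  # [1, 4, 7]
print(shard(records, num_shards=3, index=2))  # [2, 5, 8]
```

Each worker passes its own index, so the split is chosen in shard and not in interleave. cycle_length only controls how many of that worker's files are read concurrently, and it does not need to match the number of shards. With a single TFRecord file, sharding happens at the record level, so every worker still reads through the whole file, which is the cost the documentation's caveat about sharding file names is meant to avoid.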

In distributed tensorflow, how to write to summary from workers as well

I am using the Google Cloud ML distributed sample to train a model on a cluster of computers. Inputs and outputs (i.e. tfrecords, checkpoints, tfevents) are all on gs:// (Google Cloud Storage).
As in the distributed sample, I use an evaluation step that is called at the end, and the result is written as a summary so that it can be used for hyperparameter tuning, either within Cloud ML or with my own stack of tools.
But rather than performing a single evaluation on a large batch of data, I run several evaluation steps in order to retrieve statistics on the performance criteria, because I don't want to be limited to a single value. I want information about the performance interval. In particular, the variance of the performance is important to me. I'd rather select a model with lower average performance but better worst cases.
I therefore run several evaluation steps. What I would like to do is parallelize these evaluation steps, because right now only the master is evaluating. With large clusters this is a source of inefficiency, so I would like the workers to evaluate as well.
Basically, the supervisor is created as : = tf.train.Supervisor(
# Write summary_ops by hand.
# No saving; we do it manually in order to easily evaluate immediately
# afterwards.
At the end of training I call the summary writer:
# only on master, this is what I want to remove
if self.is_master and not self.should_stop:
    # I want to have an idea of statistics of accuracy
    # not just the mean, hence I run on 10 batches
    for i in range(10):
        self.global_step += 1
        # I call an evaluator, and extract the accuracy
        evaluation_values = self.evaluator.evaluate()
        accuracy_value = self.model.accuracy_value(evaluation_values)
        # now I dump the accuracy, ready to use within hptune
        eval_summary = tf.Summary(value=[
            tf.Summary.Value(
                tag='training/hptuning/metric', simple_value=accuracy_value)
        ]), eval_summary, self.global_step)
I tried to write summaries from the workers as well, but I got an error: basically, summaries can be written from the master only. Is there an easy workaround? The error is: "Writing a summary requires a summary writer."
My guess is that you would create a separate summary writer on each worker yourself and write out the summaries directly, rather than going through the supervisor.
I suspect you wouldn't use a supervisor for the eval processing either. Just load a session on each worker to run eval with the latest checkpoint, and write out independent summaries.
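Once each worker produces its own evaluation results, the per-batch accuracies can be pooled to get the spread the question asks about (mean, variability and worst case). Here is a sketch using only the standard library. The summarize helper is hypothetical and the accuracy values are made up for illustration.

```python
import statistics


def summarize(accuracies):
    """Summarize accuracies collected from several evaluation batches."""
    return {
        "mean": statistics.mean(accuracies),
        "stdev": statistics.stdev(accuracies),  # sample standard deviation
        "worst": min(accuracies),
    }


# Made-up accuracies, e.g. pooled from evaluation batches run on several workers.
model_a = [0.91, 0.93, 0.70, 0.94, 0.92]
model_b = [0.86, 0.87, 0.85, 0.88, 0.86]

summary_a = summarize(model_a)
summary_b = summarize(model_b)

# Prefer the model with the better worst case, as described in the question.
best = "model_a" if summary_a["worst"] > summary_b["worst"] else "model_b"
print(summary_a)
print(summary_b)
print(best)
```

With these made-up numbers, model_a has the higher mean but model_b has the better worst case, so the worst-case criterion picks model_b.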