Pandas to Koalas does not solve spark.rpc.message.maxSize exceeded error - pandas

I have an existing Databricks job that heavily uses Pandas, and the code snippet below gives the error "org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 101059:0 was 1449948615 bytes, which exceeds max allowed: spark.rpc.message.maxSize (268435456 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values".
The current code snippet is:
normalized_df = pd.DataFrame(data=normalized_data_remnan, columns=['idle_power_mean', 'total_eng_idle_sec', 'total_veh_idle_sec', 'total_sec_on', 'total_sec_load', 'positive_power_mean', 'time_fraction_eng_idle_pcnt', 'time_fraction_veh_idle_pcnt', 'negative_power_mean', 'mean_eng_idle_sec', 'mean_veh_idle_sec', 'mean_stand_still', 'num_start_stops', 'num_power_jump', 'positive_power_med', 'load_speed_med'])
where normalized_data_remnan is an ndarray returned by scipy.stats.zscore.
I thought moving this to Koalas would solve the issue, since Koalas uses distributed computing, so I converted the code as below.
import databricks.koalas as ks
normalized_df = ks.DataFrame(data=normalized_data_remnan, columns=['idle_power_mean', 'total_eng_idle_sec', 'total_veh_idle_sec', 'total_sec_on', 'total_sec_load', 'positive_power_mean', 'time_fraction_eng_idle_pcnt', 'time_fraction_veh_idle_pcnt', 'negative_power_mean', 'mean_eng_idle_sec', 'mean_veh_idle_sec', 'mean_stand_still', 'num_start_stops', 'num_power_jump', 'positive_power_med', 'load_speed_med'])
But even after this conversion, I am getting the same error. Do you have any clue about this error?
I can think of changing spark.rpc.message.maxSize to 2 GB. What is the maximum value of this parameter? My driver node has 128 GB of memory and 6 cores, and each of the 8 workers has 64 GB and 32 cores.
Thanks,
Nikesh

Usually, this error message is the result of sending very large objects from the driver to the executors.
spark.rpc.message.maxSize is the largest message (in MiB) that can be delivered in "control plane" communication. If you are getting errors about the RPC message size, increase this. Its default value is 128.
Setting this property (spark.rpc.message.maxSize) in the Spark configuration when you start the cluster might resolve this error.
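For example, a minimal sketch of setting the property at session startup (the value 1024 is illustrative; on Databricks you would instead add the equivalent line to the cluster's Spark config before starting it):
from pyspark.sql import SparkSession

# The value is in MiB and must be set before the cluster/session starts.
spark = (
    SparkSession.builder
    .appName("increase-rpc-max-size")
    .config("spark.rpc.message.maxSize", "1024")
    .getOrCreate()
)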
To lower the size of each Spark RPC message, you can break a huge list into many smaller pieces by increasing the number of partitions.
Example:
largeList = [...]  # this is a large list (e.g. of tuples or Rows)
partitionNum = 100  # increase this number if necessary
rdd = sc.parallelize(largeList, partitionNum)
df = rdd.toDF()  # in PySpark, build a DataFrame; toDS() is the Scala equivalent

Related

MaybeEncodingError when returning large arrays with pool.starmap

I'm experiencing a MaybeEncodingError when using multiprocessing when the returned list becomes too large.
I have a function that takes an image as input and produces a list of arrays as a result. Naturally, data parallelization works magic in this case. I made a wrapper for running this in parallel; however, I start receiving the MaybeEncodingError after reaching some threshold in memory, when the results need to be merged with pool.join(). I decided to go with a pool since the order of processing is very important.
from multiprocessing import Pool

def pool_starmap(function, input_list_tuple, processes=5):
    with Pool(processes=processes) as pool:
        results = pool.starmap(function, input_list_tuple)
        pool.close()
        pool.join()
    return results

results = pool_starmap(function, input_list_tuple, processes=2)
I run the code on a server with 80 cores and 512 GB of RAM. The whole code rarely exceeds 30 GB, and yet it breaks only with the pool. How do I return a list of large arrays from multiprocessing while also preserving the order of execution?
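One pattern that can help here, sketched under assumptions (process_image_to_disk is a hypothetical stand-in for the real image function): have each worker persist its arrays to disk and return only a small file path, so the parent never has to unpickle the huge result lists, while starmap still returns the paths in input order.

import os
import tempfile
from multiprocessing import Pool

import numpy as np

def process_image_to_disk(image_path, out_dir):
    # Hypothetical worker: compute a list of arrays, save them, return only the path.
    arrays = [np.random.rand(1000, 1000) for _ in range(3)]  # stand-in for the real computation
    out_path = os.path.join(out_dir, os.path.basename(image_path) + ".npz")
    np.savez(out_path, *arrays)  # a small return value avoids the giant result pickle
    return out_path

def pool_starmap_to_disk(input_list_tuple, processes=5):
    with Pool(processes=processes) as pool:
        # starmap preserves the order of input_list_tuple in its results
        return pool.starmap(process_image_to_disk, input_list_tuple)

if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    inputs = [("img_0.png", out_dir), ("img_1.png", out_dir)]
    result_paths = pool_starmap_to_disk(inputs, processes=2)
    print(result_paths)  # same order as inputs; load lazily later with np.load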

Unaccountable Dask memory usage

I am digging into Dask and (mostly) feel comfortable with it. However, I cannot understand what is going on in the following scenario. TBH, I'm sure a question like this has been asked in the past, but after searching for a while I can't seem to find one that really hits the nail on the head. So here we are!
In the code below, you can see a simple Python function with a Dask-delayed decorator on it. In my real use-case scenario this would be a "black box" type function within which I don't care what happens, so long as it stays within a 4 GB memory budget and ultimately returns a pandas dataframe. In this case I've specifically chosen the value N=1.5e8, since this results in a total memory footprint of nearly 2.2 GB (large, but still well within the budget). Finally, when executing this file as a script, I have a "data pipeline" which simply runs the black-box function for some number of IDs, and in the end builds up a result dataframe (which I could then do more stuff with).
The confusing bit comes when this is executed. I can see that only two function calls are executed at once (which is what I would expect), but I receive the warning message distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 3.16 GiB -- Worker memory limit: 3.73 GiB, and shortly thereafter the script exits prematurely. Where is this memory usage coming from? Note that if I increase memory_limit="8GB" (which is actually more than my computer has), then the script runs fine and my print statement informs me that the dataframe is indeed only utilizing 2.2 GB of memory.
Please help me understand this behavior and, hopefully, implement a more memory-safe approach.
Many thanks!
BTW:
In case it is helpful, I'm using Python 3.8.8, dask 2021.4.0, and distributed 2021.4.0.
I've also confirmed this behavior on a Linux (Ubuntu) machine, as well as a Mac M1. Both show the same behavior, although the Mac M1 fails for the same reason with far less memory usage (N=3e7, or roughly 500 MB).
import time
import pandas as pd
import numpy as np
from dask.distributed import LocalCluster, Client
import dask

@dask.delayed
def do_pandas_thing(id):
    print(f"STARTING: {id}")
    N = 1.5e8
    df = pd.DataFrame({"a": np.arange(N), "b": np.arange(N)})
    print(
        f"df memory usage {df.memory_usage().sum()/(2**30):.3f} GB",
    )
    # Simulate a "long" computation
    time.sleep(5)
    return df.iloc[[-1]]  # return the last row

if __name__ == "__main__":
    cluster = LocalCluster(
        n_workers=2,
        memory_limit="4GB",
        threads_per_worker=1,
        processes=True,
    )
    client = Client(cluster)

    # Evaluate "black box" functions with pandas inside
    results = []
    for i in range(10):
        results.append(do_pandas_thing(i))

    # compute
    r = dask.compute(results)[0]
    print(pd.concat(r, ignore_index=True))
I am unable to reproduce the warning/error with the following versions:
pandas=1.2.4
dask=2021.4.1
python=3.8.8
When the object size increases, the process does crash due to memory, but as a general rule it's a good idea to keep workloads to a fraction of the available memory:
To put it simply, we weren't thinking about analyzing 100 GB or 1 TB datasets in 2011. Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset. So if you have a 10 GB dataset, you should really have about 64, preferably 128 GB of RAM if you want to avoid memory management problems. This comes as a shock to users who expect to be able to analyze datasets that are within a factor of 2 or 3 the size of their computer's RAM.
source

Array fits comfortably within available RAM, but a memory error still occurs when calling numpy.take on it

I have three arrays of shape (1029, 1146, 8, 5): H4, rowOffsets, and colOffsets. H4 is float32 while the other two are int. Assuming 4 bytes per array element, H4 costs about 188.7 MB.
My machine has 32 GB RAM total, with 18 currently available. I used platform.architecture() to verify that the Python interpreter is 64 bit, so that RAM ought to be available.
It seems like I'm nowhere near the memory limit, yet I get a memory error when I run the following:
shifted = np.take(H4, rowOffsets, 0, mode='clip')
I further tested this by running the code up to the Take call with a much larger input of (3000,3000,8,5). This consumed 7 times more memory yet also did not cause a memory error until the Take call.
So I figure I'm using Take wrong, there's a bug with it, or it consumes a massive amount of memory while executing. Can anyone help clarify what's happening here?
With multi-dimensional index arguments, take takes a full slice of all but the axis dimension for each entry in indices. Thus, the way you use it, the result would be 1029 * 1146**2 * 8**2 * 5**2 * itemsize bytes, which is a lot and explains your memory problems.
You probably want to use take_along_axis instead.
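A small sketch of the difference, with reduced shapes standing in for (1029, 1146, 8, 5); take_along_axis has no mode='clip', so np.clip on the indices plays that role here:

import numpy as np

# Reduced shapes standing in for (1029, 1146, 8, 5)
H4 = np.random.rand(10, 6, 8, 5).astype(np.float32)
rowOffsets = np.random.randint(0, 10, size=(10, 6, 8, 5))

# np.take(H4, rowOffsets, 0) would produce a (10, 6, 8, 5, 6, 8, 5) result,
# because each index pulls in a full (6, 8, 5) slice.
# take_along_axis matches indices element-wise along the chosen axis instead,
# so the output has the same shape as rowOffsets.
shifted = np.take_along_axis(H4, np.clip(rowOffsets, 0, H4.shape[0] - 1), axis=0)
print(shifted.shape)  # (10, 6, 8, 5)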

What does "max_batch_size" mean in tensorflow-serving batching_config.txt?

I'm using tensorflow-serving on GPUs with --enable-batching=true.
However, I'm a little confused with max_batch_size in batching_config.txt.
My client sends an input tensor with shape [-1, 1000] in a single gRPC request, where dim0 ranges over (0, 200]. I set max_batch_size = 100 and receive errors:
"gRPC call return code: 3:Task size 158 is larger than maximum batch size 100"
"gRPC call return code: 3:Task size 162 is larger than maximum batch size 100"
It looks like max_batch_size limits dim0 of a single request, but since TensorFlow Serving batches multiple requests into one batch, I thought it meant the sum across requests.
Here is a direct description from the docs.
max_batch_size: The maximum size of any batch. This parameter governs the throughput/latency tradeoff, and also avoids having batches that are so large they exceed some resource constraint (e.g. GPU memory to hold a batch's data).
In ML, the first dimension most of the time represents a batch. So, based on my understanding, TensorFlow Serving treats the first dimension of your request as the batch size and issues an error whenever it is bigger than the allowed value. You can verify this by issuing some requests where you manually keep the first dimension below 100; I expect this to remove the error.
After that, you can modify your inputs to be sent in a proper format, e.g. by splitting them along the first dimension before sending.
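A minimal sketch of that splitting step (send_predict_request is a hypothetical wrapper around your existing gRPC PredictRequest code; only the chunking logic is the point here):

import numpy as np

def chunk_rows(batch, max_batch_size=100):
    # Split an [N, 1000] input into row chunks no larger than the server's max_batch_size.
    for start in range(0, batch.shape[0], max_batch_size):
        yield batch[start:start + max_batch_size]

# Usage sketch:
# responses = [send_predict_request(chunk) for chunk in chunk_rows(inputs, max_batch_size=100)]
# outputs = np.concatenate(responses, axis=0)  # reassemble in the original request order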

How to run a bigger batch with AWS SageMaker Batch Transform

I created an XGBoost model with AWS SageMaker. Now I'm trying to use it through a Batch Transform job, and it's all going pretty well for small batches.
However, there's a slightly bigger batch of 600,000 rows in a ~16 MB file, and I can't manage to run it in one go. I tried two things:
1.
Setting 'Max payload size' of the Transform job to its maximum (100 MB):
transformer = sagemaker.transformer.Transformer(
    model_name = config.model_name,
    instance_count = config.inference_instance_count,
    instance_type = config.inference_instance_type,
    output_path = "s3://{}/{}".format(config.bucket, config.s3_inference_output_folder),
    sagemaker_session = sagemaker_session,
    base_transform_job_name = config.inference_job_prefix,
    max_payload = 100
)
However, I still get an error (through console CloudWatch logs):
413 Request Entity Too Large
The data value transmitted exceeds the capacity limit.
2.
Setting max_payload to 0, which, by specification, Amazon SageMaker should interpret as no limit on the payload size.
In that case the job finishes successfully, but the output file is empty (0 bytes).
Any ideas either what I'm doing wrong, or how to run a bigger batch?
Most SageMaker algorithms set their own default execution parameters, with MaxPayloadInMB at 6 MB, so if you are getting a 413 from a SageMaker algorithm, you are likely exceeding the maximum payload it can support. Assuming each row in the file is less than 6 MB, you can fix this by leaving MaxPayloadInMB unset, so it falls back to the algorithm's default, and setting SplitType to "Line" instead, so the data can be split into smaller batches (https://docs.aws.amazon.com/sagemaker/latest/dg/API_TransformInput.html#SageMaker-Type-TransformInput-SplitType).
This helped me resolve the issue, together with setting strategy='SingleRecord' in the transformer; you can also use a stronger instance via instance_type and distribute via instance_count.
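A sketch combining the two suggestions above (parameter names are from the SageMaker Python SDK; the config.* values and the input path are placeholders carried over from the question):

import sagemaker

transformer = sagemaker.transformer.Transformer(
    model_name=config.model_name,
    instance_count=config.inference_instance_count,
    instance_type=config.inference_instance_type,
    output_path="s3://{}/{}".format(config.bucket, config.s3_inference_output_folder),
    sagemaker_session=sagemaker_session,
    base_transform_job_name=config.inference_job_prefix,
    strategy="SingleRecord",  # send one record per request
    # max_payload left unset so the algorithm's default (typically 6 MB) applies
)

# split_type="Line" lets SageMaker split the CSV into per-line mini-batches,
# so no single request exceeds the payload limit.
transformer.transform(
    data="s3://{}/{}".format(config.bucket, "input/batch.csv"),  # hypothetical input location
    content_type="text/csv",
    split_type="Line",
)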
I have tried the above solutions, but unfortunately they didn't work for me.
Here is what worked for me: https://stackoverflow.com/a/55920737/7091978
Basically, I set "max_payload" from 0 to 1.