Increase Connection Pool Size - google-cloud-python

We are running the following code to upload to GCP Buckets in parallel. It seems we are quickly using up all the connections in the pool based on the warnings we are seeing. Is there any way to configure the connection pool the library is using?
import concurrent.futures

def upload_string_to_bucket(content: str):
    blob = bucket.blob(cloud_path)
    blob.upload_from_string(content)

with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(upload_string_to_bucket, content_list)
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: www.googleapis.com
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: www.googleapis.com
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: www.googleapis.com
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: www.googleapis.com
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: www.googleapis.com
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: www.googleapis.com

I have a similar issue with downloading blobs in parallel.
This article may be informative.
https://laike9m.com/blog/requests-secret-pool_connections-and-pool_maxsize,89/
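If you do want to go that route, the idea from the article applied to google-cloud-storage would look roughly like this. This is a sketch only: it leans on the private _http argument of storage.Client and on google.auth's AuthorizedSession, so treat the exact wiring as an assumption that may break between library versions.

import google.auth
import requests.adapters
from google.auth.transport.requests import AuthorizedSession
from google.cloud import storage

credentials, project = google.auth.default()

# Mount an adapter whose urllib3 pool is at least as large as the thread pool.
session = AuthorizedSession(credentials)
adapter = requests.adapters.HTTPAdapter(pool_connections=32, pool_maxsize=32)
session.mount("https://", adapter)

# _http is private API: the client will reuse this session for all requests.
client = storage.Client(project=project, credentials=credentials, _http=session)
bucket = client.bucket("my-bucket")  # hypothetical bucket name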
Personally, I don't think that increasing the connection pool is the best solution;
I prefer to chunk the "downloads" by pool_maxsize.
from typing import Iterable

def chunker(it: Iterable, chunk_size: int):
    chunk = []
    for index, item in enumerate(it):
        chunk.append(item)
        if not (index + 1) % chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

for chunk in chunker(content_list, 10):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        executor.map(upload_string_to_bucket, chunk)
Of course, we can also submit the next download as soon as one finishes, however we like.
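A minimal sketch of that idea, assuming the same upload_string_to_bucket and content_list as above: keep at most pool-size uploads in flight and submit a new one each time a future completes.

import concurrent.futures
import itertools

POOL_MAXSIZE = 10  # assumed to match the urllib3 pool size

with concurrent.futures.ThreadPoolExecutor(max_workers=POOL_MAXSIZE) as executor:
    pending_items = iter(content_list)
    # Prime the executor with one task per connection in the pool.
    futures = {executor.submit(upload_string_to_bucket, item)
               for item in itertools.islice(pending_items, POOL_MAXSIZE)}
    while futures:
        done, futures = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        for finished in done:
            finished.result()  # re-raise any upload error here
        # Top the set back up so a new upload starts as soon as one finishes.
        for item in itertools.islice(pending_items, len(done)):
            futures.add(executor.submit(upload_string_to_bucket, item))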

Related

MaybeEncodingError when returning large arrays with pool.starmap

I'm experiencing a MaybeEncodingError when using multiprocessing and the returned list becomes too large.
I have a function that takes an image as input and produces a list of arrays as a result. Naturally, data parallelization works magic in this case. I made a wrapper for running this in parallel; however, I start receiving the MaybeEncodingError after reaching some memory threshold when the results need to be merged with pool.join(). I decided to go with Pool since the order of processing is very important.
from multiprocessing import Pool

def pool_starmap(function, input_list_tuple, processes=5):
    with Pool(processes=processes) as pool:
        results = pool.starmap(function, input_list_tuple)
        pool.close()
        pool.join()
    return results

results = pool_starmap(function, input_list_tuple, processes=2)
I run the code on a server with 80 cores and 512 GB of RAM. The whole job rarely exceeds 30 GB, and yet it breaks only for the pooled version. How do I allow returning a list of large arrays with multiprocessing while also preserving the order of execution?
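One way to keep the ordering while avoiding a single huge merged result (a sketch, not a verified fix for this exact error; it assumes each item's result can be consumed or saved independently) is to stream results back with Pool.imap, which preserves input order but yields results one at a time:

from multiprocessing import Pool

def process_item(args):
    # Module-level wrapper so it can be pickled; unpacks one argument tuple.
    return function(*args)  # 'function' is the existing per-image function

def pool_imap(input_list_tuple, processes=5):
    with Pool(processes=processes) as pool:
        # imap preserves input order but yields results lazily, so the parent
        # never has to hold the entire merged result list in memory at once.
        yield from pool.imap(process_item, input_list_tuple, chunksize=1)

for index, arrays in enumerate(pool_imap(input_list_tuple, processes=2)):
    save_result(index, arrays)  # hypothetical per-item sink (e.g. write to disk)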

Pandas to Koalas does not solve spark.rpc.message.maxSize exceeded error

I have an existing Databricks job which heavily uses Pandas. The code snippet below gives the error "org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 101059:0 was 1449948615 bytes, which exceeds max allowed: spark.rpc.message.maxSize (268435456 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values"
Current code snippet is
normalized_df = pd.DataFrame(data=normalized_data_remnan, columns=['idle_power_mean', 'total_eng_idle_sec', 'total_veh_idle_sec', 'total_sec_on', 'total_sec_load', 'positive_power_mean', 'time_fraction_eng_idle_pcnt', 'time_fraction_veh_idle_pcnt', 'negative_power_mean', 'mean_eng_idle_sec', 'mean_veh_idle_sec', 'mean_stand_still', 'num_start_stops', 'num_power_jump', 'positive_power_med', 'load_speed_med'])
where normalized_data_remnan is an ndarray output by scipy.stats.zscore.
I thought moving this to Koalas would solve the issue, as Koalas uses distributed computing, so I converted the code as below.
import databricks.koalas as ks
normalized_df = pd.DataFrame(data=normalized_data_remnan, columns=['idle_power_mean', 'total_eng_idle_sec', 'total_veh_idle_sec', 'total_sec_on', 'total_sec_load', 'positive_power_mean', 'time_fraction_eng_idle_pcnt', 'time_fraction_veh_idle_pcnt', 'negative_power_mean', 'mean_eng_idle_sec', 'mean_veh_idle_sec', 'mean_stand_still', 'num_start_stops', 'num_power_jump', 'positive_power_med', 'load_speed_med'])
But even after this conversion, I am getting the same error. Do you have any clue about this error?
I can think of changing spark.rpc.message.maxSize to 2 GB. What's the maximum value of this parameter? My driver node has 128 GB of memory and 6 cores; each worker has 64 GB and 32 cores, with 8 workers in total.
Thanks,
Nikesh
Usually, sending very large objects from the driver to the executors results in this error message.
spark.rpc.message.maxSize is the largest message size (in MiB) that can be delivered in "control plane" communication. If you are getting alerts about the RPC message size, increase it. Its default value is 128, and the maximum accepted value is 2047.
Setting this property (spark.rpc.message.maxSize) in the Spark configuration when you start the cluster might resolve this error.
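For example (a sketch; on Databricks the usual place for this is the cluster's Spark config, since the session already exists when the notebook starts, and the app name below is hypothetical):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("normalize-job")                    # hypothetical app name
         .config("spark.rpc.message.maxSize", "512")  # value in MiB; 2047 is the maximum
         .getOrCreate())

The property is read when the SparkContext is created, so it generally has no effect if changed on an already running session.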
To keep individual RPC messages small, you can break the huge list into many smaller partitions by increasing the partition count.
Example:
largeList = [...]   # this is a large list
partitionNum = 100  # increase this number if necessary
rdd = sc.parallelize(largeList, partitionNum)
df = rdd.toDF()

Multiprocessing unable to automatically recycle zombie children processes for my on-the-fly data augmentation code

System info:
Linux Ubuntu
Docker
Python 3.6.8
I am doing on-the-fly data augmentation for medical image segmentation. Inspired by lines 158–161 of faustomilletari/VNet, the code example for data augmentation is as below:
from multiprocessing import Process, Queue

trainQueue = Queue(queue_size)  # store patches
tr_dataPrep = [None] * nProc
for proc in range(nProc):
    tr_dataPrep[proc] = Process(target=data_aug_function,
                                args=(train_files, trainQueue, patch_size))
    tr_dataPrep[proc].daemon = True
    tr_dataPrep[proc].start()
The code above ran well on one server (which is no longer available) but failed on another: after a while lots of zombie child processes appeared and the training process just hung indefinitely.
It seems that some child processes were killed by the system, but we don't know why.
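There is no accepted fix here, but a sketch of one way to at least see what is happening (assuming the tr_dataPrep list from above): periodically reap finished children and report their exit codes. multiprocessing.active_children() joins any processes that have already exited, which clears the zombies, and a negative exitcode means the child was killed by that signal (for example -9 when the kernel OOM killer hits a memory-limited Docker container).

import multiprocessing
import time

def watch_workers(workers, interval=30):
    # Periodically reap dead children and report how they exited.
    while any(w.is_alive() for w in workers):
        multiprocessing.active_children()  # side effect: joins finished children
        for w in workers:
            if not w.is_alive() and w.exitcode not in (None, 0):
                # exitcode < 0 means "killed by signal abs(exitcode)", e.g. -9 = SIGKILL
                print(f"worker {w.pid} exited with code {w.exitcode}")
        time.sleep(interval)

watch_workers(tr_dataPrep)  # run in the main process or a helper thread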

How to prevent dask client from dying on worker exception?

I'm not understanding the resiliency model in dask distributed.
Problem
An exception raised in user code on a worker kills the embarrassingly parallel dask operation. All workers and the client die if any worker encounters an exception.
Expected Behavior
Reading here: http://distributed.dask.org/en/latest/resilience.html#user-code-failures
This suggests that exceptions should be contained to the failing tasks and that subsequent tasks would go on without interruption.
"When a function raises an error that error is kept and transmitted to the client on request. Any attempt to gather that result or any dependent result will raise that exception...This does not affect the smooth operation of the scheduler or worker in any way."
I was following the embarrassingly parallel use case here:
http://docs.dask.org/en/latest/use-cases.html
Reproducible example
import numpy as np
np.random.seed(0)
from dask import compute, delayed
from dask.distributed import Client, LocalCluster

def raise_exception(x):
    if x == 10:
        raise ValueError("I'm an error on a worker")
    elif x == 20:
        print("I've made it to 20")
    else:
        return x

if __name__ == "__main__":
    # Create cluster
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)
    values = [delayed(raise_exception)(x) for x in range(0, 100)]
    results = compute(*values, scheduler='distributed')
Task 20 is never accomplished. The exception on task 10 causes the scheduler and workers to die. What am I not understanding about the programming model? Why does this count as gathering? I just want to run each task and capture any exceptions for later inspection, not raise them on the client.
Use Case
Parallel image processing on a University SLURM cluster. My function has a side-effect that saves processed images to file. The processes are independent and never gathered by the scheduler. The exception causes all nodes to die on the cluster.
Cross-listed on issues, since I'm not sure if this is a bug or a feature!
https://github.com/dask/distributed/issues/2436
Answered in the repo: dask.delayed with compute() is all-or-nothing. Use the client's map from the concurrent.futures interface plus wait instead. This is by design, not a bug.
https://github.com/dask/distributed/issues/2436
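A minimal sketch of that approach, reusing raise_exception from above: map the function with the futures interface, wait for everything, and then inspect each future's status and exception on the client instead of re-raising it.

from dask.distributed import Client, LocalCluster, wait

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)

    futures = client.map(raise_exception, range(100))
    wait(futures)  # block until every task has either finished or errored

    errors = {f.key: f.exception() for f in futures if f.status == "error"}
    results = [f.result() for f in futures if f.status == "finished"]
    # Task 10's ValueError ends up in `errors` for later inspection;
    # the remaining tasks run to completion and land in `results`.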

What is the size of the default block on Hyperledger Fabric?

I'm trying to create an estimate of the size of a chain if I create a new blockchain using Hyperledger.
In order to have an idea of disk space usage, I would like to know what the average size of a default block is in Hyperledger Fabric.
Thank you beforehand,
Best regards
Below you can find the default configuration provided for the ordering service. You can actually control block size with the BatchTimeout and BatchSize parameters; also note that this is quite use-case dependent, as it relies on transaction size, i.e. the logic of your chaincode.
################################################################################
#
#   SECTION: Orderer
#
#   - This section defines the values to encode into a config transaction or
#     genesis block for orderer related parameters
#
################################################################################
Orderer: &OrdererDefaults

    # Orderer Type: The orderer implementation to start
    # Available types are "solo" and "kafka"
    OrdererType: solo

    Addresses:
        - orderer.example.com:7050

    # Batch Timeout: The amount of time to wait before creating a batch
    BatchTimeout: 2s

    # Batch Size: Controls the number of messages batched into a block
    BatchSize:

        # Max Message Count: The maximum number of messages to permit in a batch
        MaxMessageCount: 10

        # Absolute Max Bytes: The absolute maximum number of bytes allowed for
        # the serialized messages in a batch.
        AbsoluteMaxBytes: 98 MB

        # Preferred Max Bytes: The preferred maximum number of bytes allowed for
        # the serialized messages in a batch. A message larger than the preferred
        # max bytes will result in a batch larger than preferred max bytes.
        PreferredMaxBytes: 512 KB
In a concrete network the value is configured like this:
################################################################################
#   SECTION: Orderer
################################################################################
Orderer: &OrdererDefaults
    OrdererType: solo
    Addresses:
        #- orderer0.ordererorg:7050
        - orderer0:7050
    Kafka:
        Brokers:
    BatchTimeout: 2s
    BatchSize:
        MaxMessageCount: 10
        AbsoluteMaxBytes: 98 MB
        PreferredMaxBytes: 512 KB
    Organizations:
The file is configtx.yaml, and the corresponding structure is defined in config.go:
// BatchSize contains configuration affecting the size of batches.
type BatchSize struct {
    MaxMessageCount   uint32 `yaml:"MaxMessageSize"`
    AbsoluteMaxBytes  uint32 `yaml:"AbsoluteMaxBytes"`
    PreferredMaxBytes uint32 `yaml:"PreferredMaxBytes"`
}
The values are set according to the configtx.yaml file above.
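As a very rough worked example (assuming an average transaction of about 3 KB, which depends entirely on your chaincode and endorsement policy): with the defaults above, a block is cut as soon as 10 messages accumulate, PreferredMaxBytes (512 KB) is reached, or BatchTimeout (2 s) expires with at least one pending transaction, whichever comes first. On a busy channel that gives blocks of roughly 10 × 3 KB ≈ 30 KB of transaction payload plus block header and metadata, i.e. on the order of 30,000 blocks per GB of disk; on a quiet channel the 2-second timeout cuts smaller blocks, so disk usage tracks transaction volume more than block count.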