How to prevent dask client from dying on worker exception? - dask-distributed

I'm not understanding the resiliency model in dask distributed.
Problem
Exceptions raised by a worker kill the embarrassingly parallel dask operation. All workers and clients die if any worker encounters an exception.
Expected Behavior
Reading here (http://distributed.dask.org/en/latest/resilience.html#user-code-failures) suggests that exceptions should be contained to workers and that subsequent tasks would go on without interruption.
"When a function raises an error that error is kept and transmitted to the client on request. Any attempt to gather that result or any dependent result will raise that exception...This does not affect the smooth operation of the scheduler or worker in any way."
I was following the embarrassingly parallel use case here:
http://docs.dask.org/en/latest/use-cases.html
Reproducible example
import numpy as np
np.random.seed(0)
from dask import compute, delayed
from dask.distributed import Client, LocalCluster

def raise_exception(x):
    if x == 10:
        raise ValueError("I'm an error on a worker")
    elif x == 20:
        print("I've made it to 20")
    else:
        return x

if __name__ == "__main__":
    # Create cluster
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)

    values = [delayed(raise_exception)(x) for x in range(0, 100)]
    results = compute(*values, scheduler='distributed')
Task 20 is never accomplished. The exception on task 10 causes the scheduler and workers to die. What am I not understanding about the programming model? Why does this count as gathering? I just want to run each task and capture any exceptions for later inspection, not raise them on the client.
Use Case
Parallel image processing on a University SLURM cluster. My function has a side-effect that saves processed images to file. The processes are independent and never gathered by the scheduler. The exception causes all nodes to die on the cluster.
Cross-listed on issues, since I'm not sure if this is a bug or a feature!
https://github.com/dask/distributed/issues/2436

Answered in the repo: dask.delayed with compute() is all-or-nothing, so a single raised exception fails the whole call. Use client.map from the concurrent.futures-style interface together with wait instead, so each task gets its own future whose status and exception can be inspected independently. This is by design, not a bug.
https://github.com/dask/distributed/issues/2436
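For illustration, a minimal sketch of that futures-based approach (reusing raise_exception from the question above; the status/exception bookkeeping is an assumption about how one might collect errors, not code from the linked issue):
from dask.distributed import Client, LocalCluster, wait

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)

    # Submit each task individually; each gets its own future.
    futures = client.map(raise_exception, range(0, 100))

    # Block until every task has either finished or errored.
    wait(futures)

    # Inspect results and exceptions separately; one failure does not
    # prevent the other tasks from completing.
    results = [f.result() for f in futures if f.status == "finished"]
    errors = [f.exception() for f in futures if f.status == "error"]
Each future either holds a result or an exception, so the ValueError from task 10 can be inspected later without stopping the remaining tasks.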

Related

Ray Tune Stuck with multiple runs

Hi, I am trying hyperparameter optimization with Ray Tune. Below is my code implementation. However, I get stuck and can't get the result back, even though there aren't any error messages.
@ray.remote
def main(config, run):
    loss = do_something(config, run)  # placeholder for the actual training run
    return loss

def ray_pick_best_hypter(config):
    runs = 10
    loss_avg = np.mean(ray.get([main.remote(config, run=x) for x in range(runs)]))
    tune.report(loss_avg=loss_avg)

config = load_config()
analysis = ray.tune.run(ray_pick_best_hypter, config=config, progress_reporter=reporter)
The code below works fine, but I want to run multiple experiments and get the mean value.
def ray_pick_best_hypter(config):
    loss_avg = ray.get(main.remote(config, run=0))
    tune.report(loss_avg=loss_avg)
What is the problem in the code?
It seems you are starting multiple distributed training processes from within your trainable. Each call to main.remote() will start a new distributed task. Since you're starting 10 of them at the same time, they will try to run in parallel.
However, the default resource allocation for each trial is usually just 1 CPU - so the remote tasks cannot be scheduled.
What you can do to resolve this is to pass resources_per_trial={"cpu": 11} to tune.run - that way the trainable keeps one CPU and each of the ten remote tasks gets its own CPU to run on, as sketched below.
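A minimal sketch of that change (assuming the same ray_pick_best_hypter, config, and reporter as in the question):
# Reserve 11 CPUs per trial: 1 for the trainable itself plus 10 for
# the parallel main.remote() tasks it launches.
analysis = ray.tune.run(
    ray_pick_best_hypter,
    config=config,
    resources_per_trial={"cpu": 11},
    progress_reporter=reporter,
)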

Unaccountable Dask memory usage

I am digging into Dask and (mostly) feel comfortable with it. However, I cannot understand what is going on in the following scenario. TBH, I'm sure a question like this has been asked in the past, but after searching for a while I can't seem to find one that really hits the nail on the head. So here we are!
In the code below, you can see a simple Python function with a dask.delayed decorator on it. In my real use-case scenario this would be a "black box" type function within which I don't care what happens, so long as it stays within a 4 GB memory budget and ultimately returns a pandas dataframe. In this case I've specifically chosen the value N=1.5e8 since this results in a total memory footprint of nearly 2.2 GB (large, but still well within the budget). Finally, when executing this file as a script, I have a "data pipeline" which simply runs the black-box function for some number of IDs, and in the end builds up a result dataframe (which I could then do more stuff with).
The confusing bit comes in when this is executed. I can see that only two function calls are executed at once (which is what I would expect), but I receive the warning message distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 3.16 GiB -- Worker memory limit: 3.73 GiB, and shortly thereafter the script exits prematurely. Where is this memory usage coming from? Note that if I increase memory_limit="8GB" (which is actually more than my computer has), then the script runs fine and my print statement informs me that the dataframe is indeed only utilizing 2.2 GB of memory.
Please help me understand this behavior and, hopefully, implement a more memory-safe approach.
Many thanks!
BTW:
In case it is helpful, I'm using python 3.8.8, dask 2021.4.0, and distributed 2021.4.0
I've also confirmed this behavior on a Linux (Ubuntu) machine, as well as a Mac M1. They both show the same behavior, although the Mac M1 fails for the same reason with far less memory usage (N=3e7, or roughly 500 MB)
import time
import pandas as pd
import numpy as np
from dask.distributed import LocalCluster, Client
import dask

@dask.delayed
def do_pandas_thing(id):
    print(f"STARTING: {id}")
    N = 1.5e8
    df = pd.DataFrame({"a": np.arange(N), "b": np.arange(N)})
    print(
        f"df memory usage {df.memory_usage().sum()/(2**30):.3f} GB",
    )

    # Simulate a "long" computation
    time.sleep(5)

    return df.iloc[[-1]]  # return the last row

if __name__ == "__main__":
    cluster = LocalCluster(
        n_workers=2,
        memory_limit="4GB",
        threads_per_worker=1,
        processes=True,
    )
    client = Client(cluster)

    # Evaluate "black box" functions with pandas inside
    results = []
    for i in range(10):
        results.append(do_pandas_thing(i))

    # compute
    r = dask.compute(results)[0]
    print(pd.concat(r, ignore_index=True))
I am unable to reproduce the warning/error with the following versions:
pandas=1.2.4
dask=2021.4.1
python=3.8.8
When the object size increases, the process does crash due to running out of memory, but in general it's a good idea to keep each workload at a fraction of the available memory:
To put it simply, we weren't thinking about analyzing 100 GB or 1 TB datasets in 2011. Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset. So if you have a 10 GB dataset, you should really have about 64, preferably 128 GB of RAM if you want to avoid memory management problems. This comes as a shock to users who expect to be able to analyze datasets that are within a factor of 2 or 3 the size of their computer's RAM.
source
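Applied to the example above, one hedged way to respect that rule of thumb is to derive N from the worker memory limit rather than hard-coding it (the 1/5 factor comes from the quote; the byte count assumes two float64 columns, and the numbers are illustrative only):
# Illustrative sizing only: keep each task's dataframe at roughly
# one fifth of the 4 GB worker memory limit.
worker_memory_bytes = 4e9
budget_fraction = 1 / 5        # "5 to 10 times as much RAM as the dataset"
bytes_per_row = 2 * 8          # two float64 columns, 8 bytes each
N = int(worker_memory_bytes * budget_fraction / bytes_per_row)  # ~5e7 rows, ~800 MB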

feed data into a tf.contrib.data.Dataset like a queue

About the tf.contrib.data.Dataset (from TensorFlow 1.2, see here and here) usage:
The expected way of getting data into a Dataset doesn't really fit how I usually receive my data. In my case, I have a thread where I receive data; I don't know in advance when it will end, but I can see when it ends. Then I wait until I have processed all the buffers, and then I have finished one epoch. How can I express this logic with the Dataset?
Note that I prefer the Dataset interface over the QueueBase interface because it gives me the iterator interface which I can reinitialize and even reset to a different Dataset. This is more powerful compared to queues which cannot be reopened currently after they are closed (see here and here).
Maybe a similar question, or the same question: how can I wrap a Dataset around a queue? I have some thread which reads data from somewhere and which can feed and queue it somehow. How do I get the data into the Dataset? I could repeat some dummy tensor infinitely many times and then use map to just return my queue.dequeue(), but that really only gets me back to all the original problems with the queue, i.e. how to reopen the queue.
The new Dataset.from_generator() method allows you to define a Dataset that is fed by a Python generator. (To use this feature at present, you must download a nightly build of TensorFlow or build it yourself from source. It will be part of TensorFlow 1.4.)
The easiest way to implement your example would be to replace your receiving thread with a generator, with pseudocode as follows:
def receiver():
    while True:
        next_element = ...   # Receive next element from external source.
                             # Note that this method may block.
        end_of_epoch = ...   # Decide whether or not to stop based on next_element.
        if not end_of_epoch:
            yield next_element  # Note: you may need to convert this to an array.
        else:
            return  # Returning will signal OutOfRangeError on downstream iterators.

dataset = tf.contrib.data.Dataset.from_generator(receiver, output_types=...)

# You can chain other `Dataset` methods after the generator. For example:
dataset = dataset.prefetch(...)  # This will start a background thread
                                 # to prefetch elements from `receiver()`.
dataset = dataset.repeat(...)    # Note that each repetition will call
                                 # `receiver()` again, and start from
                                 # a fresh state.
dataset = dataset.batch(...)
More complicated topologies are possible. For example, you can use Dataset.interleave() to create many receivers in parallel.
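As a hedged sketch of that idea (make_receiver, num_receivers, and the tf.int32 element type are illustrative assumptions; this version round-robins with Dataset.zip and flat_map rather than calling interleave directly, since from_generator here takes a plain Python callable):
def make_receiver(source_id):
    # One generator per external source; ends its epoch by returning.
    def receiver():
        while True:
            next_element = ...  # Receive from source `source_id`; may block.
            end_of_epoch = ...  # Decide whether to stop based on next_element.
            if end_of_epoch:
                return
            yield next_element
    return receiver

num_receivers = 4  # illustrative

datasets = [
    tf.contrib.data.Dataset.from_generator(make_receiver(i), output_types=tf.int32)
    for i in range(num_receivers)
]

# Round-robin one element from each receiver, then flatten back into a
# single stream. Stops when the shortest receiver is exhausted.
dataset = tf.contrib.data.Dataset.zip(tuple(datasets)).flat_map(
    lambda *elements: tf.contrib.data.Dataset.from_tensor_slices(tf.stack(elements)))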

In distributed tensorflow, FIFOQueue.enqueue() sometimes doesn't work

I used the following lines to create a shared queue across different workers.
with tf.device("/job:ps/task:0"):
    with tf.variable_scope("global"):
        self.queue = tf.FIFOQueue(20, None)
In ps worker:
self.queue.dequeue()
In other workers,
self.queue.enqueue(somethings)
However, I found that sometimes the enqueue operation doesn't work: nothing is enqueued, and there is no error or exception. Does anyone have an idea?

Multiple queues causing TF to lock up

I'm trying to use multiple queues for reading and batching, but this is causing TF to occasionally lock up. Here's some sample code:
import tensorflow as tf
coordinator = tf.train.Coordinator()
file_queue = tf.train.string_input_producer(tf.train.match_filenames_once(...))
reader = tf.TextLineReader()
key, serialized_example = reader.read(file_queue)
keys, serialized_examples = tf.train.batch([key, serialized_example], 10)
# repeat the code snippet below multiple times, in my case 4
file_queue_i = tf.train.string_input_producer(tf.train.match_filenames_once(...))
reader_i = tf.TextLineReader()
key_i, serialized_example_i = reader_i.read(file_queue_i)
initializer = tf.initialize_all_variables()
session = tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=1, intra_op_parallelism_threads=1))
session.run(initializer)
threads = tf.train.start_queue_runners(sess=session, coord=coordinator)
session.run(keys)
TensorFlow occasionally locks up at the last line, when I actually try to run something. This behavior is rather hard to reproduce using the above code however. In 1000+ runs, I could only get it to hang once. In my real code, the actual reader is more complicated, and it's using TFRecords, but otherwise everything is the same. There it hangs up 2/3 of the time with 3 queues in total. With 5 queues it seemingly never runs, and with 1 queue it seemingly never hangs. This is on a Mac with 0.6. I have a different system running Ubuntu, also with 0.6, and I get the same problem (although the frequency of locking up is much higher on the Ubuntu system).
UPDATE: A more accurate estimate of how often the above code locks up is 1 in 5,000 trials.
This is probably caused by not having enough operation threads. If queue runner 1 depends on the work of queue runner 2, and you run them asynchronously, then you'll need at least two op threads, set through inter_op_parallelism_threads, to guarantee that progress is being made.
In your case, the queue runner that fills the batch queue depends on the string_input_producer queue being non-empty. If the queue runner associated with the string_input_producer queue runs first, then everything is fine. But if the batch queue runner is scheduled first, it will get stuck in the string_input_producer.dequeue op, waiting for the string_input_producer queue to get some filenames. Since there's only 1 thread in the TensorFlow op thread pool, the enqueue op of string_input_producer will never get allocated a thread to complete (i.e., to execute its Compute method).
The simplest solution is to have at least as many operation threads as you have simultaneous run calls (i.e., number of queues + 1). If you really want to restrict yourself to one thread, you can preload the filename queue's filenames synchronously from the main thread:
coordinator = tf.train.Coordinator()

import glob
files = glob.glob('/temp/pipeline/*')

if FLAGS.preload_filenames:
    file_queue = tf.FIFOQueue(capacity=len(files), dtypes=tf.string)
    enqueue_val = tf.placeholder(dtype=tf.string)
    enqueue_op = file_queue.enqueue(enqueue_val)
else:
    file_queue = tf.train.string_input_producer(files)

reader = tf.TextLineReader()
key, serialized_example = reader.read(file_queue)
keys, serialized_examples = tf.train.batch([key, serialized_example], 5,
                                           capacity=10)
initializer = tf.initialize_all_variables()
session = tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=1,
                                           intra_op_parallelism_threads=1))

print('running initializer')
session.run(initializer)

if FLAGS.preload_filenames:
    print('preloading filenames')
    for fn in files:
        session.run([enqueue_op], feed_dict={enqueue_val: fn})
    print('size - ', session.run([file_queue.size()]))
    session.run([file_queue.close()])

print('starting queue runners')
threads = tf.train.start_queue_runners(sess=session, coord=coordinator)

print('about to run session')
print(session.run(keys))
The code above will need some encapsulation if you have more than one filenames queue. Alternatively, here's a hacky work-around which should work if there are exactly prebuffer_amount filenames for all input_producer queues:
queue_runners = tf.get_collection(tf.GraphKeys.QUEUE_RUNNERS)
filename_queue_runners = [qr for qr in queue_runners if 'input_producer' in qr.name]
for qr in filename_queue_runners:
    # Manually run each filename queue's enqueue op from the main thread
    # before starting the queue runners.
    for k in range(prebuffer_amount):
        sess.run(qr._enqueue_ops[0])
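For completeness, a minimal sketch of the simpler fix mentioned at the start of this answer (num_queues is an illustrative value; the point is giving the op thread pool one thread per queue plus one):
# Illustrative: with 5 reader queues, give the inter-op pool at least 6 threads
# so every queue runner can be scheduled alongside the main run call.
num_queues = 5
session = tf.Session(config=tf.ConfigProto(
    inter_op_parallelism_threads=num_queues + 1,
    intra_op_parallelism_threads=1))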