I'm trying to speed up training multiple models by using Python's multiprocessing.Pool.apply_async. To save memory, I've converted my pandas DataFrame to 32-bit floats; the pickle is ~2 GB on disk. Once I've loaded the pickle into memory, Task Manager shows Python using ~4-5 GB of memory.
I then create my TensorFlow models in the main thread like so:
for i in range(n):
    self.estimators.append(DNNEstimator())

with Pool(processes=4) as pool:
    for i in range(n):
        # DNNEstimator is a wrapper over a Keras neural network with a
        # scikit-learn compatible interface (i.e., it has fit(x, y) and predict(x))
        self.results.append(pool.apply_async(self.estimators[i].fit, (x, y)))
    for i in range(n):
        self.results[i] = self.results[i].get()
My understanding of how multiprocessing works is that it pickles the data to be processed and runs it in a new Python process/instance. I noticed that each Python process was taking ~6 GB of memory during training. I suspect the subprocesses take this much memory because they recreate the variables from the main thread. So my question is: how much of the main thread's scope is recreated in each subprocess?
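For context, here is a minimal sketch (names like big_frame and train_one are hypothetical) of what actually travels to a worker with apply_async: the callable and its arguments are pickled per task, and under the default start method on Windows (spawn) the child also re-imports the main module, so module-level objects are rebuilt from scratch rather than shared. Note too that a bound method such as self.estimators[i].fit carries the instance it is bound to when pickled, so whatever that object holds travels with each task.

import os
import numpy as np
from multiprocessing import Pool

# Hypothetical module-level data, standing in for the large DataFrame.
# Under spawn, each worker re-imports this module and recreates it.
big_frame = np.zeros((1000, 1000), dtype=np.float32)

def train_one(x, y):
    # Only the pickled arguments x and y arrive here per task.
    return os.getpid(), x.nbytes + y.nbytes

if __name__ == "__main__":
    x = np.random.rand(10, 4).astype(np.float32)
    y = np.random.rand(10).astype(np.float32)
    with Pool(processes=2) as pool:
        results = [pool.apply_async(train_one, (x, y)) for _ in range(4)]
        print([r.get() for r in results])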
I am struggling with the following. I am creating a tf.data.Dataset using the from_generator method. I perform these actions on the CPU, as I don't want to overload my GPU memory.
The dataset consists of tuples containing a fixed-length 1-D tf.bool mask (tf.Tensor) and a variable-size 2-D tf.float32 matrix (tf.Tensor). The loss function is decorated with the following decorator, so I would not assume the variable size is the problem.
@tf.function(experimental_relax_shapes=True)
Ideally, the dataset is kept on the CPU, but then prefetched onto the GPU.
def gen():
    for i, j in zip(mask_list, wmat_list):
        yield i, j

dataset = tf.data.Dataset.from_generator(gen, output_types=(tf.bool, tf.float32))
The main training loop currently relies on tf.identity to move the data to the GPU, which is inefficient, as shown in the TensorBoard screenshot below. Roughly 70% of the time is spent loading the data and moving it to the GPU.
for b, (mask, wmat) in enumerate(dataset):
    with tf.GradientTape() as tape:
        mask = tf.identity(mask)
        wmat = tf.identity(wmat)
        mean_error, loss = self.model.loss(mask, wmat)
    epoch_loss += loss.numpy()
    epoch_mean_error += mean_error.numpy()
I have tried the prefetch_to_device function. However, it did not move the data onto the GPU, as verified by printing, e.g., mask.device in the training loop.
gpu_transform = tf.data.experimental.prefetch_to_device('/gpu')
dataset.apply(gpu_transform)
To me this resembles this bug: https://github.com/tensorflow/tensorflow/issues/30929 . However, it is marked as solved and is over a year old.
Running TF 2.3 using the official Docker image.
I have found the solution to my own question.
The problem was that the tuples in the dataset did not contain tf.Tensors but NumPy arrays. Therefore, the pipeline was probably limited by the functionality of py_func().
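For reference, a minimal sketch of the kind of change this amounts to, assuming mask_list and wmat_list originally held NumPy arrays (the dummy data below is purely illustrative): convert each element to a tf.Tensor before the generator yields it.

import numpy as np
import tensorflow as tf

# Hypothetical stand-ins for the original NumPy data.
mask_list = [np.random.rand(16) > 0.5 for _ in range(8)]
wmat_list = [np.random.rand(np.random.randint(4, 10), 16).astype(np.float32)
             for _ in range(8)]

# Convert to tf.Tensors up front so the generator yields tensors, not arrays.
mask_list = [tf.convert_to_tensor(m, dtype=tf.bool) for m in mask_list]
wmat_list = [tf.convert_to_tensor(w, dtype=tf.float32) for w in wmat_list]

def gen():
    for i, j in zip(mask_list, wmat_list):
        yield i, j

dataset = tf.data.Dataset.from_generator(gen, output_types=(tf.bool, tf.float32))
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)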
The screenshot below shows that the pipeline does not block on the CPU. However, there is still a considerable MemCpy. prefetch_to_device() still does not do anything. This is likely due to a known issue which should be fixed in TF 2.4:
https://github.com/tensorflow/tensorflow/issues/35563
The (unconfirmed) suggested workaround also did not work for me (see the edit below).
with tf.device("/gpu:0"):
    ds = ds.prefetch(1)
EDIT:
I have investigated this issue further and filed a bug report. It now seems that the suggested workaround does do something, but I am not sure whether it prefetches completely in time.
https://github.com/tensorflow/tensorflow/issues/43905
In my use case, I have some time-series data where at each time t, I train a new model over a rolling window. In TensorFlow 1, I had to do the following, otherwise models would accumulate in the default graph and essentially leak memory.
import tensorflow as tf
import keras.backend as K
...
tf.reset_default_graph()
K.clear_session()
In TensorFlow 2, I've found the equivalent functions tf.compat.v1.reset_default_graph() and tf.keras.backend.clear_session(). However, according to the documentation, TF2 ties graph variables to Python variables, so in theory, if a Python variable is destroyed, the corresponding graph variable should also be destroyed. Is this interpretation correct? I've tried putting the model-creation code in a loop; while memory usage still grows, it isn't the sort of explosion I witnessed in TF1.
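For concreteness, this is a minimal sketch of the pattern I've been testing, assuming a small hypothetical model per window: recreate the model inside the loop, clear the Keras session between iterations, and force garbage collection so dropped Python references are actually released.

import gc
import tensorflow as tf

def build_model():
    # Hypothetical small model standing in for the per-window estimator.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

for t in range(100):  # one model per rolling window
    model = build_model()
    # ... model.fit(x_window, y_window) ...
    tf.keras.backend.clear_session()  # drop Keras' global graph/session state
    gc.collect()                      # make sure dead Python objects are freed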
I'm trying to run the model scoring (inference graph) from the TensorFlow Object Detection API on multiple GPUs. I tried specifying the GPU number in main, but it runs only on a single GPU (GPU utilization snapshot placed here).
I'm using tensorflow-gpu==1.13.1; can you kindly point out what I'm missing here?
for i in range(2):
    with tf.device('/gpu:{}'.format(i)):
        tf_init()
        init = tf.global_variables_initializer
        with detection_graph.as_default():
            with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
                # call to run_inference_multiple_images function
The responses to this question should give you a few options for fixing this.
Usually TensorFlow will occupy all visible GPUs unless told otherwise. So if you haven't already tried, you could just remove the with tf.device line (assuming you only have the two GPUs) and TensorFlow should use them both.
Otherwise, I think the easiest option is setting the environment variable with os.environ["CUDA_VISIBLE_DEVICES"] = "0,1".
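For example (a short sketch; the variable must be set before TensorFlow initializes the GPUs, so do it at the very top of the script):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # expose both GPUs before importing TF

import tensorflow as tf  # imported only after the environment variable is set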
I have a 4-GPU machine on which I run TensorFlow (GPU) with Keras. Some of my classification problems take several hours to complete.
nvidia-smi returns Volatile GPU-Util which never exceeds 25% on any of my 4 GPUs.
How can I increase GPU Util% and speed up my training?
If your GPU utilization is below 80%, this is generally the sign of an input pipeline bottleneck. It means the GPU sits idle much of the time, waiting for the CPU to prepare the data.
What you want is for the CPU to keep preparing batches while the GPU is training, so the GPU stays fed. This is called prefetching.
Great, but if the batch preparation still takes much longer than the model training, the GPU will still sit idle, waiting for the CPU to finish the next batch. To make batch preparation faster, we can parallelize the different preprocessing operations.
We can go even further by parallelizing I/O.
Now, to implement this in Keras, you need to use the TensorFlow Data API with TensorFlow version >= 1.9.0. Here is an example:
Let's assume, for the sake of this example, that you have two NumPy arrays x and y. You can use tf.data for any type of data, but this is simpler to understand.
def preprocessing(x, y):
    # Can only contain TF operations
    ...
    return x, y

dataset = tf.data.Dataset.from_tensor_slices((x, y))  # Creates a dataset object
dataset = dataset.map(preprocessing, num_parallel_calls=64)  # Parallel preprocessing
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(None)  # Will automatically prefetch batches

...
model = tf.keras.Model(...)
model.fit(x=dataset)  # Since TF 1.9.0 you can pass a dataset object
tf.data is very flexible, but as with anything in TensorFlow (except eager execution), it uses a static graph. This can be a pain sometimes, but the speed-up is worth it.
To go further, you can have a look at the performance guide and the TensorFlow data guide.
I had a similar issue: the memory of all the GPUs was allocated by Keras, but Volatile GPU-Util stayed around 0% and training was taking almost the same amount of time as on the CPU. I was using ImageDataGenerator, which turned out to be the bottleneck. When I increased the number of workers in the fit_generator method from the default value of 1 to all available CPUs, the training time dropped dramatically.
You can also load the data into memory and then use the flow method to prepare batches with augmented images.
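A rough sketch of both ideas, assuming the Keras 2 / TF 1.x fit_generator API; the data and the tiny model below are hypothetical placeholders just to make the example self-contained, and workers should be tuned to your CPU count.

import multiprocessing
import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator

# Hypothetical in-memory data and a tiny placeholder model.
x_train = np.random.rand(1000, 64, 64, 3).astype(np.float32)
y_train = np.random.randint(0, 10, size=(1000,))

model = Sequential([
    Conv2D(8, 3, activation='relu', input_shape=(64, 64, 3)),
    Flatten(),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

datagen = ImageDataGenerator(rotation_range=15, horizontal_flip=True)

model.fit_generator(
    datagen.flow(x_train, y_train, batch_size=32),  # augmented batches from memory
    steps_per_epoch=len(x_train) // 32,
    epochs=5,
    workers=multiprocessing.cpu_count(),  # default is 1, which was the bottleneck
    use_multiprocessing=False,
)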
There is a Python list of image file names.
It is important that each image file is read and then the following steps are applied to it: taking 5 random crops and their mirror reflections.
To maintain randomness in the order of images fed to the CNN, it is also important that all the crops produced from one image do not go into the CNN together.
My thoughts
Let multiple CPU threads preprocess the images and put them into a random shuffle queue.
Let batch_size images be dequeued from the queue and used for the CNN.
My questions
a) Is the above the most optimal way of doing this?
b) Can anyone provide a code example that can be used as a reference?
This is my implementation of something that almost meets your requirements. Usage in train.py is easy:
dataset_train = Dataset("path/to/list.train.txt", subtract_mean=True, is_train=True, name='train')
dataset_val = Dataset("path/to/list.val.txt", subtract_mean=True, is_train=False, name='val')

dataset_train.shuffle()
for batch_x, batch_y in dataset_train.batches(batch_size):
    # batch_x: (batch_size, H, W, 3), batch_y: (batch_size)
    ...

# for sampling for validation
for val_step, (val_batch_x, val_batch_y) in \
        enumerate(dataset_val.sample_batches(batch_size, 256)):
    ...
Use "threading" and "concurrent.futures" for background loading, and "Queue" for a fixed size prefetch (but this doesn't use multi-CPU, decribed below, without lost of speed).
Each image is random-cropped and flipped when loaded for training. (only center-cropped for testing and validation)
A batch of images are dequeued for feeding to the CNN.
Why not use multi-CPU
Due to the GIL of CPython, Python threads created with threading can't run simultaneously on multiple CPU cores. The alternative way to use multiple cores is multiprocessing (mp).
I used to implement this with mp.Process and mp.Queue, but mp.Queue is VERY SLOW for transferring large data such as images between processes, because of the limitations of its pipe()-based implementation (on Linux). The overhead is about 0.5 seconds for a batch of 100 256x256 images on a fast workstation, where training AlexNet on a batch of 64 images takes only 0.6 seconds.
I tried threading and Queue instead, and found that since the bottleneck is I/O rather than CPU computation, and Queue.Queue transfers 100 images in no time, loading becomes much faster even without multiple CPUs.
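A condensed sketch of that threading + Queue design (not the full implementation; load_and_augment and the file list are placeholders): a background thread reads and augments images and pushes them into a bounded queue, while the training loop pulls fixed-size batches from it.

import queue
import random
import threading

import numpy as np

def load_and_augment(path):
    # Placeholder for: read the image file, take a random crop, maybe flip it.
    return np.zeros((227, 227, 3), dtype=np.float32)

def producer(file_list, q):
    random.shuffle(file_list)
    for path in file_list:
        q.put(load_and_augment(path))  # blocks when the queue is full
    q.put(None)                        # sentinel: no more images

def batches(file_list, batch_size=64, prefetch=256):
    q = queue.Queue(maxsize=prefetch)  # fixed-size prefetch buffer
    threading.Thread(target=producer, args=(file_list, q), daemon=True).start()
    batch = []
    while True:
        item = q.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            yield np.stack(batch)
            batch = []

# Usage: for batch_x in batches(image_paths): feed batch_x to the CNN.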