I implemented a 5x5 Gomoku by CNN + DQN.
Here is the github link:
https://github.com/bokidigital/CNN_DQN_5x5_Gomoku
My problem is this code has no parallelization design.
This means when running this code on an Intel Skylake server ( 2 CPU, 80 cores ), the CPU usage is just about 90%.
I think the idea CPU usage should be 8000% ( 80 cores ).
Since I have some customized rule in gaming ( not only neural network part which consumes GPU about 75% ), it consumes CPU and no parallelization.
My environment is:
Skylake CPU X 2
NVIDIA P100 X 2 ( only use 1 )
40GB RAM
Tensorflow 1.14.0
Keras
Python 3.7
Ubuntu 16.04
My idea is to run this program separately ( run many copies of this process in the different folder which then generates different weights ) then CPU usage could reach ideally to 8000% ( as long as many processes run at the same time )
Since it is the training process, it doesn't matter how each process trained their weights.
Q1. The problem is how to merge their results(the trained weights)? (A+B)/2?
Q2. It seems 1 GPU can only be used by 1 process, I tried to run 3 process at the same time, the GPU seems to hang.
Q3. If I disabled GPU, 80 core Skylake will faster than NVIDIA P100?
I expect to use more CPU usage to speed up this training process.
Since 5x5 agent trained with 5 days, I tested the same code but change grid size to 9x9, I estimated the training time needs 3 months.
Related
I have Linux-x86_64 operating system and I am running TF 2.11 on conda environment. I just got a workstation which includes NVIDIA GeForce RTX 4090 24GB GPU. I'd like to perform CNN image classification, and my dataset contains 20k images, 14k of which are for training, 3k for validation, and 3k for testing. The code also does hyperparameter tuning using the tensorboard API. So basically, I am expecting to finish around 10k experiments. My epoch numbers in the algorithm 300. Batch size varies within the range of 16, 32, 64.
Before, I was running a CNN with 2k image data using the same logic and number of experiments and honestly it was taking like 2 weeks to finish everything. Now, I was expecting for it to run super fast since I upgraded it from GeForce 2060 to 4090, however it's not the case.
As you see in the following pictures, there is no issue with running it on GPU, the problem is that why it runs very slow. it's like finishing the first Epoch 1/300 while it includes 450 substeps takes up to 2 - 2.5 hour. Afterward, it goes to 2/300. This is incredible. It means the whole process can take months.
I just got confused over GPU utilization but I am assuming using 0.9 percent makes sense. I checked all updates and CUDA things, they seem correct.
What do you think the issue could be? 20k image data is not huge for this GPU. I tried to run it through terminal or Jupyter notebook, those are the same. I feel like this tf.session command can create some issues? Should there be a specific open and close sessions?
Those are my parameters that needs to be optimized:
EDIT: if I run it on RTX 2060, it's definitely going too fast compared to Linux RTX 4090, I have not figured it out what the problem is. It's like finishing the first run 1/300 takes just 1.5 minutes, it's like 1.5 hr on linux 4090 workstation!
GPU UTILIZATION BEFORE TRAINING:
enter image description here
GPU UTILIZATION AFTER STARTING TRAINING:
enter image description here
how I generate the data:
train_data = train_datagen.flow_from_directory(directory=train_path,
target_size=(1000, 9), color_mode="grayscale", class_mode="categorical", shuffle=True)
valid_data = valid_datagen.flow_from_directory(directory=valid_path,
target_size=(1000, 9), color_mode="grayscale", class_mode="categorical", shuffle=False)
test_data = test_datagen.flow_from_directory(directory=test_path,
target_size=(1000, 9), color_mode="grayscale", class_mode="categorical", shuffle=False)
I am training a sparse logistic regression model on tensorflow. This problem is specifically about the inference part. I am trying to benchmark inference on on cpu and gpu. I am using the Nvidia P100 gpu (4 dies) on my current GCE box. I am new to gpu so sorry for naive questions.
The model is pretty big ~54k operation (is it considered big compared to dnn or imagenet models ? ) . When i log device placement , i only see gpu:0 being used , and rest of them unused ? I don't do any device placement during training , but during inference i want it to optimally place and use gpu.
Few things i observed : my input node placehoolder (feed_dict) is placed on cpu, so i assume my data is being copied from cpu to gpu ? how does feed_dict exactly work behind the scene ?
1) How can i place my data on which i want to run prediction directly on gpu ? Note : my training runs on distributed cpu with multiple terabytes so i cannot have constant or variable directly in my graph during training , but my inference i can definitely have small batches of data that i would directly like to place on gpu. Are there ways i can achieve this ?
2) Since i am using P100 gpu , i think it has unified memory with host , is it possible to have zerocopy and directly have my data loaded into gpu ? How can i do this from python , java and c++ code. Currently i use feed_dict which from various google sources i think is not at all optimal .
3) Is there some tool or profiler i can use to see when i profile code like :
for epoch_step in epochs:
start_time = time.time()
for i in range(epoch_step):
result = session.run(output, feed_dict={input_example: records_batch})
end_time = time.time()
print("Batch {} epochs {} :time {}".format(batch_size, epoch_step, str(end_time - start_time)))
how much time is being spent on 1) cpu to gpu data transfer 2) session run overhead 3) gpu utilization (currently i use nvidia-smi periodically to monitor
4) kernel call overhead on cpu vs gpu (I assume each invokation of sess.run invokes 1 kernel call right ?
my current bench marking results :
CPU :
Batch size : 10
NumberEpochs TimeGPU TimeCPU
10 5.473 0.484
20 11.673 0.963
40 22.716 1.922
100 56.998 4.822
200 113.483 9.773
Batch size : 100
NumberEpochs TimeGPU TimeCPU
10 5.904 0.507
20 11.708 1.004
40 23.046 1.952
100 58.493 4.989
200 118.272 9.912
Batch size : 1000
NumberEpochs TimeGPU TimeCPU
10 5.986 0.653
20 12.020 1.261
40 23.887 2.530
100 59.598 6.312
200 118.561 12.518
Batch size : 10k
NumberEpochs TimeGPU TimeCPU
10 7.542 0.969
20 14.764 1.923
40 29.308 3.838
100 72.588 9.822
200 146.156 19.542
Batch size : 100k
NumberEpochs TimeGPU TimeCPU
10 11.285 9.613
20 22.680 18.652
40 44.065 35.727
100 112.604 86.960
200 225.377 174.652
Batch size : 200k
NumberEpochs TimeGPU TimeCPU
10 19.306 21.587
20 38.918 41.346
40 78.730 81.456
100 191.367 202.523
200 387.704 419.223
Some notable observations:
As batch size increase i see my gpu utilization increase (reaches to 100% for the only gpu it uses , is there a way i can tell tf to use other gpu too)
at batch size 200k is the only time i see my naive benchmarking shows gpu has minor gain as compared to cpu.
Increasing batch size for a given epoch has minimal effect on time both cpu and gpu until batch size <= 10k. But after that increasing batch size from 10k -> 100k -> 200k the time also increase quite fast i.e for a given epoch let us say 10 batch size 10, 100, 1k, 10k the cpu time and gpu time remain pretty stable ~5-7 sec for gpu and 0.48-0.96 sec for cpu (meaning that sess.run has much higher overhead than computing of graph themselves ?), but increasing batch size further the compute time increase at much faster rate i.e for epoch 10 100k->200k gputime increase from 11 -> 19 sec and cpu time also doubles , why so ? It seems for larger batch size even though i have just one sess.run , but internally it splits that into smaller batch and calls sess.run twice because epoch 20 batch size 100k matches more closely with epoch 10 batch 200k ..
How can i improve my inference further , i believe i am not usding all gpus optimally.
Are there any ideas around how can i benchmark better to get better breakdowns of time for cpu-> gpu transfer and actual speedup for graph computation from moving from cpu to gpu ?
Loading data better directly if possible zero copy into gpu ?
Can i place some nodes to gpu only during inference to get better performance ?
Ideas around quantization or optimizing inference graph ?
Any more ideas to improve gpu based inference . May be xla based optimization or tensrort ? i want to have high performance inference code to run these computations on gpu while the application server crunches on cpu.
One source of information are the TensorFlow docs on performance, including Optimizing for GPU and High Performance Models.
That said, those guides tend to target training more than batch inference, though certainly some of the principles still apply.
I will note that, unless you are using DistributionStrategy, TensorFlow will not automatically put ops on more than a single GPU (source).
In your particularly case, I don't believe GPUs are yet well-tuned to do the type of sparse operation required for your model, so I don't actually expect it to do that well on a GPU (if you log the device placement there's a chance the lookup is done on the CPU). A logistic regression model has only an (sparse) input layer and an output layer, so generally there are very few math ops. GPUs excel the most when they are doing lots of matrix multiplies, convolutions, etc.
Finally, I would encourage you to use TensorRT to optimize your graph, though for your particular model there's no guarantee it does much better.
I have a 4 GPU machine on which I run Tensorflow (GPU) with Keras. Some of my classification problems take several hours to complete.
nvidia-smi returns Volatile GPU-Util which never exceeds 25% on any of my 4 GPUs.
How can I increase GPU Util% and speed up my training?
If your GPU util is below 80%, this is generally the sign of an input pipeline bottleneck. What this means is that the GPU sits idle much of the time, waiting for the CPU to prepare the data:
What you want is the CPU to keep preparing batches while the GPU is training to keep the GPU fed. This is called prefetching:
Great, but if the batch preparation is still way longer than the model training, the GPU will still remain idle, waiting for the CPU to finish the next batch. To make the batch preparation faster we can parallelize the different preprocessing operations:
We can go even further by parallelizing I/O:
Now to implement this in Keras, you need to use the Tensorflow Data API with Tensorflow version >= 1.9.0. Here is an example:
Let's assume, for the sake of this example that you have two numpy arrays x and y. You can use tf.data for any type of data but this is simpler to understand.
def preprocessing(x, y):
# Can only contain TF operations
...
return x, y
dataset = tf.data.Dataset.from_tensor_slices((x, y)) # Creates a dataset object
dataset = dataset.map(preprocessing, num_parallel_calls=64) # parallel preprocessing
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(None) # Will automatically prefetch batches
....
model = tf.keras.model(...)
model.fit(x=dataset) # Since tf 1.9.0 you can pass a dataset object
tf.data is very flexible, but as anything in Tensorflow (except eager), it uses a static graph. This can be a pain sometimes but the speed up is worth it.
To go further, you can have a look at the performance guide and the Tensorflow data guide.
I've got similar issue - the memory of all the GPUs were allocated by Keras, but Volatile was around 0% and training was taking almost the same amount of time as on CPU. I was using ImageDataGenerator, which turned out to be a bottleneck. When I increased the number of workers in fit_generator method from default value 1 to all available CPUs, then the training time dropped rapidly.
You can also load the data to the memory and then use flow method to prepare batches with augmented images.
What does 'Use CPUs to deploy clones' mean in the following snippet (slim/train_image_classifier.py):
tf.app.flags.DEFINE_boolean(
'clone_on_cpu', False,
'Use CPUs to deploy clones.'
)
Use CPUs to deploy clones' mean
In general setup model losses and gradients are calculated on GPUs, a single clone use a single GPU. For multi GPUs training multiples clones are created. If you have 4 GPUs 4 clones are created and loss for separate batches are computed simultaneously (data parallelism). That said, Now if you don't have GPUs you can use multiple CPUs to for data parallelism ( will be slower than GPU off course). USE CPUs to deploy clones option let you use CPUs for data parallelism; to compute model losses and gradients on cpus.
I tried to setup a very simple Mnist example with an Estimator.
First I used the estimator's deprecated fit() parameters x, y and batch_size. This executed very fast and utilized about 100% of my GPU while not effecting the CPU much (about 10% utilization). So it worked as expected.
Because the x, y and batch_size parameters are deprecated, I wanted to use the input_fn parameter for the fit() function. To build the input_fn, I used a tf.slice_input_producer and batched it with tf.train.batch. This is my code https://gist.github.com/andreas-eberle/11f650fca0dce4c9d3d6c0955145e80d. You should be able to just run it with tensorflow 1.0.
My problem is that the training now runs very slow and only utilizes about 30 % of my GPU (shown in nvidia-smi).
I also tried to increase the queue capacity of the slice_input_producer and to increase the number of threads used for batching. However, this only helped to get to about 45% of GPU utilization and resulted in a 100 % GPU utilization.
What am I doing wrong? Is there a better way for feeding the inputs and batching them? I do not want to create the batches manually (creating subarrays of the numpy input array) because I want to use this example for a more complex input queue where I'll be reading and preprocessing the images in the graph.
I don't think my hardware should be the problem:
List item
Windows 10
NVidia GTX 960M
i7-6700HQ
32 GB RAM