Difference in Performance between Cloud Compute VM and AI Platform - tensorflow

I have a GCP cloud compute VM, which is an n1-standard-16, with 4 P100 GPUs attached, and a solid state drive for storing data. I'll refer to this as "the VM".
I've previously used the VM to train a tensorflow based CNN. I want to move away from this to using AI Platform so I can run multiple jobs simultaneously. However I've run into some problems.
Problems
When the training is run on the VM I can set a batch size of 400, and the standard time for an epoch to complete is around 25 minutes.
When the training is running on a complex_model_m_p100 AI platform machine, which I believe to be equivalent to the VM, I can set a maximum batch size of 128, and the standard time for an epoch to complete is 1 hour 40 minutes.
Differences: the VM vs AI Platform
The VM uses TF1.12 and AI Platform uses TF1.15. Consequently there is a difference in GPU drivers (CUDA 9 vs CUDA 10).
The VM is equipped with a solid state drive, which I don't think is the case for AI platform machines.
I want to understand the cause of the reduced batch size, and bring the epoch times on AI Platform down to levels comparable to the VM. Has anyone else run into this issue? Am I running on the correct kind of AI Platform machine? Any advice would be welcome!
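To put the two measurements side by side, here is a small back-of-the-envelope calculation that uses only the numbers quoted above. The function names are made up for this sketch, and the reading of the result is a hypothesis, not a diagnosis.

```python
def slowdown(vm_minutes, ai_platform_minutes):
    """How many times longer an epoch takes on AI Platform than on the VM."""
    return ai_platform_minutes / vm_minutes


def per_gpu_batch(global_batch, num_gpus):
    """Samples each GPU receives if the batch is split evenly across GPUs."""
    return global_batch / num_gpus


vm_epoch = 25        # minutes per epoch on the VM
aip_epoch = 60 + 40  # 1 hour 40 minutes per epoch on AI Platform

print(slowdown(vm_epoch, aip_epoch))  # 4.0
print(per_gpu_batch(400, 4))          # 100.0
print(400 / 128)                      # 3.125
```

A 4x slowdown on a 4-GPU machine, together with a maximum batch size (128) close to one GPU's share of the VM batch (100), would be consistent with only one GPU doing the work on AI Platform. That is only one possible reading of the numbers.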

It could be a number of things. There are two ways to go about it. One is making the VM look more like AI Platform:
export IMAGE_FAMILY="tf-latest-gpu" # 1.15 instead of 1.12
export ZONE=...
export INSTANCE_NAME=...
gcloud compute instances create $INSTANCE_NAME \
    --zone=$ZONE \
    --image-family=$IMAGE_FAMILY \
    --image-project=deeplearning-platform-release \
    --maintenance-policy=TERMINATE \
    --metadata="install-nvidia-driver=True"
and then attach 4 GPUs after that.
... or making AI Platform look more like the VM, by picking a machine type from
https://cloud.google.com/ai-platform/training/docs/machine-types#gpus-and-tpus,
because you are using a legacy machine type right now.
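As a sketch of what a non-legacy configuration mirroring the VM (n1-standard-16 with 4 P100s) might look like, the snippet below builds a `trainingInput` fragment. The field names (`scaleTier`, `masterType`, `masterConfig.acceleratorConfig`) follow my understanding of the AI Platform Training job spec, so treat them as assumptions and check them against the page linked above. The helper name is made up, and the fragment leaves out the other fields a real job needs.

```python
import json


def custom_training_input(machine_type, gpu_type, gpu_count):
    """Build a trainingInput fragment for a custom (non-legacy) master machine.

    Field names are assumed from the AI Platform Training job spec.
    """
    return {
        "scaleTier": "CUSTOM",
        "masterType": machine_type,
        "masterConfig": {
            "acceleratorConfig": {"count": gpu_count, "type": gpu_type},
        },
    }


training_input = custom_training_input("n1-standard-16", "NVIDIA_TESLA_P100", 4)
print(json.dumps({"trainingInput": training_input}, indent=2))
```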

After following the advice of @Frederik Bode and creating a replica VM with TF 1.15 and the associated drivers installed, I've managed to solve my problem.
Rather than using the multi_gpu_model function call within tf.keras, it's actually best to use a distributed strategy and run the model within that scope.
There is a guide describing how to do it here.
Essentially now my code looks like this:
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    training_dataset, validation_dataset = get_datasets()
    model = setup_model()
    # Don't do this, it's not necessary!
    #### NOT NEEDED model = tf.keras.utils.multi_gpu_model(model, 4)
    opt = tf.keras.optimizers.Adam(learning_rate=args.learning_rate)
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=opt,
                  metrics=['accuracy'])
    steps_per_epoch = args.steps_per_epoch
    validation_steps = args.validation_steps
    model.fit(training_dataset, steps_per_epoch=steps_per_epoch, epochs=args.num_epochs,
              validation_data=validation_dataset, validation_steps=validation_steps)
I set up a small dataset so I could rapidly prototype this.
With a single P100 GPU the average epoch time was 66 seconds.
With 4 GPUs, using the code above, the average epoch time was 19 seconds.
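For reference, the two timings above translate into a speedup and a scaling efficiency as follows. This is plain arithmetic on the reported numbers; the function names are made up for the sketch.

```python
def speedup(single_gpu_seconds, multi_gpu_seconds):
    """How many times faster the multi-GPU run is."""
    return single_gpu_seconds / multi_gpu_seconds


def scaling_efficiency(single_gpu_seconds, multi_gpu_seconds, num_gpus):
    """Fraction of the ideal (linear) speedup that was achieved."""
    return speedup(single_gpu_seconds, multi_gpu_seconds) / num_gpus


s = speedup(66, 19)
e = scaling_efficiency(66, 19, 4)
print(f"speedup: {s:.2f}x, efficiency: {e:.0%}")  # speedup: 3.47x, efficiency: 87%
```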

Related

TF 2.11 CNN training with 20k Image and NVIDIA GeForce RTX 4090 GPU running too slow

I am on a Linux x86_64 machine running TF 2.11 in a conda environment. I just got a workstation with an NVIDIA GeForce RTX 4090 24GB GPU. I'd like to perform CNN image classification; my dataset contains 20k images, 14k of which are for training, 3k for validation, and 3k for testing. The code also does hyperparameter tuning using the TensorBoard API, so I am expecting to run around 10k experiments. Each run trains for 300 epochs, and the batch size varies among 16, 32, and 64.
Previously, I ran a CNN on 2k images using the same logic and number of experiments, and it took about 2 weeks to finish everything. I expected it to run much faster now that I have upgraded from a GeForce RTX 2060 to a 4090, but that's not the case.
As the following pictures show, it is running on the GPU; the problem is that it runs very slowly. Finishing the first epoch (1/300), which consists of 450 steps, takes 2 to 2.5 hours before it moves on to 2/300. At this rate the whole process could take months.
I am confused by the GPU utilization, but I am assuming that using 0.9 percent makes sense. I checked all the updates and the CUDA setup, and they seem correct.
What do you think the issue could be? 20k images is not a huge dataset for this GPU. I tried running it from the terminal and from a Jupyter notebook, with the same result. Could the tf.session command be causing issues? Should sessions be opened and closed in a specific way?
These are the parameters that need to be optimized:
EDIT: if I run it on the RTX 2060, it is far faster than on the Linux RTX 4090 machine, and I have not figured out what the problem is. Finishing the first epoch (1/300) takes just 1.5 minutes there, versus about 1.5 hours on the Linux 4090 workstation!
[Screenshot: GPU utilization before training]
[Screenshot: GPU utilization after starting training]
How I generate the data:
train_data = train_datagen.flow_from_directory(directory=train_path,
    target_size=(1000, 9), color_mode="grayscale", class_mode="categorical", shuffle=True)
valid_data = valid_datagen.flow_from_directory(directory=valid_path,
    target_size=(1000, 9), color_mode="grayscale", class_mode="categorical", shuffle=False)
test_data = test_datagen.flow_from_directory(directory=test_path,
    target_size=(1000, 9), color_mode="grayscale", class_mode="categorical", shuffle=False)
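The timings quoted in this question can be turned into per-step figures with a little arithmetic. The sketch below assumes a batch size of 32, which is one of the values listed and also the default of `flow_from_directory` when `batch_size` is not passed, as in the calls above. The function names are made up for illustration.

```python
import math


def seconds_per_step(epoch_hours, steps):
    """Average wall-clock time of one training step, in seconds."""
    return epoch_hours * 3600 / steps


def steps_for(num_images, batch_size):
    """Number of batches needed to see every image once."""
    return math.ceil(num_images / batch_size)


def run_days(epoch_hours, num_epochs):
    """Duration of one full training run, in days."""
    return epoch_hours * num_epochs / 24


print(seconds_per_step(2.0, 450))  # 16.0
print(seconds_per_step(2.5, 450))  # 20.0
print(steps_for(14_000, 32))       # 438
print(run_days(2.0, 300))          # 25.0
```

Some 16 to 20 seconds per batch of about 32 small grayscale images is far more than GPU compute alone would normally need, so the numbers hint at time being spent elsewhere (for example in the input pipeline, or in work falling back to the CPU). That is an inference from the arithmetic, not a confirmed cause.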

How to use multiple GPUs to train model in tensorflow

I've read the official Keras documentation, which says:
To do single-host, multi-device synchronous training with a Keras model, you would use the tf.distribute.MirroredStrategy API. Here's how it works:
Instantiate a MirroredStrategy, optionally configuring which specific devices you want to use (by default the strategy will use all GPUs available).
Use the strategy object to open a scope, and within this scope, create all the Keras objects you need that contain variables. Typically, that means creating & compiling the model inside the distribution scope.
Train the model via fit() as usual.
Here is what I did. Basically, I have 8 GPUs but only 3 are available for the task (5, 6 & 7). I create a strategy using these 3 GPUs and compile the model inside the scope. However, each epoch in my training process takes as much time as with a single GPU, and nvidia-smi in the terminal shows that only GPU 7 is in use. Maybe the warning message shows the problem? I am not an expert, so if it is the issue, could someone translate it into plain English or provide a solution? Thanks a lot!
strategy = tf.distribute.MirroredStrategy(["GPU:5", "GPU:6", "GPU:7"])
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
WARNING:tensorflow:Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:6,/job:localhost/replica:0/task:0/device:GPU:5,/job:localhost/replica:0/task:0/device:GPU:7
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:5', '/job:localhost/replica:0/task:0/device:GPU:6', '/job:localhost/replica:0/task:0/device:GPU:7')
Number of devices: 3
with strategy.scope():
    model_test = model_unet
    model_test.compile(loss=loss,
                       optimizer=adam_opt,
                       metrics=['accuracy', segmentation_models.metrics.IOUScore()],
                       )
model_test.fit(x_train, y_train,
               validation_data=(x_val, y_val),
               batch_size=16,
               epochs=6, verbose=1, callbacks=callbacks
               )
# an example of the first epoch
Train on 14400 samples, validate on 3600 samples
Epoch 1/6
6384/14400 [============>.................] - ETA: 5:35 - loss: 0.0045 - accuracy: 0.9833 - iou_score: 0.8918
Only GPU 7 is in use
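One common way to restrict TensorFlow to a subset of physical GPUs is to set `CUDA_VISIBLE_DEVICES` before TensorFlow is imported. The visible GPUs are then renumbered from zero, so physical GPUs 5, 6 and 7 would appear to TensorFlow as GPU:0, GPU:1 and GPU:2. Whether this is what triggers the warning above is an assumption. In the sketch below, `restrict_gpus` and `train_on` are made-up names, and `build_model` is a hypothetical function standing in for whatever constructs and compiles the U-Net. It is called inside the scope, as the quoted Keras guidance asks, instead of assigning a model that was created earlier.

```python
import os


def restrict_gpus(physical_ids):
    """Expose only the given physical GPUs to this process.

    Returns the device names TensorFlow will use for them. This must run
    before TensorFlow initializes CUDA, so call it before importing TF.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in physical_ids)
    return [f"GPU:{n}" for n in range(len(physical_ids))]


def train_on(physical_ids, build_model, x, y):
    """Train a model mirrored across the chosen GPUs (not run in this sketch)."""
    devices = restrict_gpus(physical_ids)
    import tensorflow as tf  # imported only after CUDA_VISIBLE_DEVICES is set

    strategy = tf.distribute.MirroredStrategy(devices)
    with strategy.scope():
        # build_model is hypothetical: it must create AND compile the model
        # here, so that its variables are created under the strategy.
        model = build_model()
    model.fit(x, y, batch_size=16, epochs=6)
    return model


print(restrict_gpus([5, 6, 7]))  # ['GPU:0', 'GPU:1', 'GPU:2']
```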

How to merge two or more trained weights?

I implemented a 5x5 Gomoku by CNN + DQN.
Here is the github link:
https://github.com/bokidigital/CNN_DQN_5x5_Gomoku
My problem is that this code has no parallelization design.
This means that when running this code on an Intel Skylake server (2 CPUs, 80 cores), the CPU usage is only about 90%.
I think the ideal CPU usage should be 8000% (80 cores).
Since the game has some customized rules (in addition to the neural network part, which uses about 75% of the GPU), it consumes CPU time and is not parallelized.
My environment is:
Skylake CPU X 2
NVIDIA P100 X 2 ( only use 1 )
40GB RAM
Tensorflow 1.14.0
Keras
Python 3.7
Ubuntu 16.04
My idea is to run this program separately (run many copies of this process in different folders, each generating different weights); then CPU usage could ideally reach 8000%, as long as enough processes run at the same time.
Since this is the training process, it doesn't matter how each process trained its weights.
Q1. The problem is how to merge their results (the trained weights). (A+B)/2?
Q2. It seems 1 GPU can only be used by 1 process. I tried to run 3 processes at the same time, and the GPU seemed to hang.
Q3. If I disable the GPU, will the 80-core Skylake be faster than the NVIDIA P100?
I expect higher CPU usage to speed up this training process.
The 5x5 agent took 5 days to train. I tested the same code with the grid size changed to 9x9 and estimated that training would need 3 months.
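For Q1, the "(A+B)/2" idea can be written as an element-wise mean over weight lists that share the same layout. The sketch below only shows the mechanics; averaging networks that were trained independently is not guaranteed to give a network that plays well. The function name is made up, and plain floats stand in for the per-layer arrays a real model would have.

```python
def average_weights(weight_sets):
    """Element-wise mean of several weight lists with identical layout.

    weight_sets holds one entry per trained copy; each entry is a list of
    per-layer weights (floats here, NumPy arrays in a real Keras model).
    """
    if not weight_sets:
        raise ValueError("need at least one set of weights")
    n = len(weight_sets)
    # zip(*...) groups the matching layer from every copy together.
    return [sum(layer_group) / n for layer_group in zip(*weight_sets)]


a = [1.0, 2.0]  # made-up weights from process A
b = [3.0, 4.0]  # made-up weights from process B
print(average_weights([a, b]))  # [2.0, 3.0]
```

With Keras models the inputs would come from `model.get_weights()` and the result would be loaded with `model.set_weights(...)`.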

How to prevent TPUEstimator from using GPU or TPU

I need to force TPUEstimator to use the CPU. I have a rented google machine and the GPU is already running training. Since the CPUs are idle, I want to start a second Tensorflow session for evaluation but I want to force the evaluation cycle to use CPUs only so that it does not steal GPU time.
I am assuming there is a flag in the run_config or similar for doing this but am struggling to find one in the TF documentation.
run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    master=FLAGS.master,
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=FLAGS.iterations_per_loop,
        num_shards=FLAGS.num_tpu_cores,
        per_host_input_for_training=is_per_host))
You can run a TPUEstimator locally by including two arguments: (1) use_tpu should be set to False, and (2) tf.contrib.tpu.RunConfig should be passed as the config argument.
my_tpu_estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=my_model_fn,
    config=tf.contrib.tpu.RunConfig(),
    use_tpu=False)
The majority of example TPU models can be run in local mode by setting the command line flags:
$> python mnist_tpu.py --use_tpu=false --master=''
More documentation can be found here.
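Separately from `use_tpu=False`, a general way to keep a second TensorFlow process off the GPU is to hide the GPUs from that process before TensorFlow is imported. This is a sketch of the generic CUDA environment-variable mechanism, not a TPUEstimator option, and the function name is made up.

```python
import os


def hide_gpus():
    """Hide all CUDA devices from this process.

    Call this before importing TensorFlow; once CUDA has been initialized,
    changing the variable has no effect.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ""


hide_gpus()
# A TensorFlow import placed after this point would see no GPUs,
# so the evaluation would run on the CPU.
print(repr(os.environ["CUDA_VISIBLE_DEVICES"]))  # ''
```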

Tensorflow-loss not decreasing when training

I am using the TensorFlow Object Detection API on my own dataset and I am facing some problems. I am using CentOS with a GeForce 1080 GPU (8 GB GPU memory) and TensorFlow 1.2.1. I have 500 images in the training set and 40 in the test set. I did the following steps and I have two problems.
1. I annotated my images using the LabelImg tool.
2. I created the tfrecord successfully.
3. I used ssd_inception_v2_coco.config. I modified only the paths and the number of classes, and I did not train from scratch; I used the ssd_inception_v2_coco model checkpoints.
Problem 1: from step 0 until 3000 my loss decreased dramatically, but after that it stays constant between 5 and 6. I don't understand how to reduce it, although my model is still able to detect the required object. Here are my TensorBoard samples.
I even tried a different model, e.g. faster_rcnn_inception_resnet_v2_atrous_coco; after some steps the loss stays constant between 1 and 2.
Problem 2: following the documentation I am able to run eval.py, but I get the following error:
WARNING:root:The following classes have no ground truth examples: 0
After that the program terminates.
I tried running train.py and eval.py at the same time, but I still get the same error.
Please give me a suggestion. I am a TensorFlow beginner.
The loss curve you're seeing on TensorBoard is quite normal. Initially, the loss will drop very quickly, but it will seemingly "bottom out" over time. Training is a slow process; you should see a steady drop over time after more iterations.
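To judge whether a noisy loss is still drifting downwards, it can help to smooth it. The sketch below uses a simple exponential moving average, similar in spirit to the smoothing slider in TensorBoard. The function name and the loss values are made up for illustration.

```python
def smooth(values, weight=0.6):
    """Exponential moving average; a higher weight means heavier smoothing."""
    smoothed = []
    last = None
    for v in values:
        # The first point is kept as-is; later points blend with the history.
        last = v if last is None else weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed


losses = [6.0, 5.0, 6.0, 5.0]  # made-up noisy loss values
print(smooth(losses, weight=0.5))  # [6.0, 5.5, 5.75, 5.375]
```

Here the raw values bounce between 5 and 6, while the smoothed curve makes the slow downward trend easier to see.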