Keras on GPU - low GPU utilisation - tensorflow

I am training a model using Keras 2.2.4, Python 3.5.3, and TensorFlow on a GCP virtual machine with a K80 GPU.
GPU utilisation oscillates between 25% and 50%, while the Python process keeps a CPU core at 98%.
I assume Python is too slow to feed the K80 with data.
The code is below.
There are multiple days of data for each epoch.
Each day has around 20K samples - the number differs slightly from day to day.
The batch size is fixed by the variable window_size = 900.
So I feed it around 19K batches per day. Batch 0 starts at sample 0 and takes 900 samples, batch 1 starts at sample 1 and takes 900 samples, and so on until the day ends.
So I have 3 loops - epochs, days, batches.
I feel the epoch and day loops should be preserved for clarity; I don't think they are the problem.
I think the innermost loop is the one to look at.
The implementation of the inner loop is naïve. Is there some trickery that can make the array work faster?
# d is tuple from groupby - d[0] = date, d[1] = values
for epoch in epochs:
    print('epoch: ', epoch)
    for d in days:
        print(' day: ', d[0])
        # get arrays for the day
        features = np.asarray(d[1])[:, 2:9].astype(dtype='float32')
        print(len(features), len(features[0]), features[1].dtype)
        labels = np.asarray(d[1])[:, 9:].astype(dtype='int8')
        print(len(labels), len(labels[0]), labels[1].dtype)
        for batch in range(len(features) - window_size):
            # # # can these be optimised?
            fb = features[batch:batch + window_size, :]
            lb = labels[batch:batch + window_size, :]
            fb = fb.reshape(1, fb.shape[0], fb.shape[1])
            lb = lb.reshape(1, lb.shape[0], lb.shape[1])
            # # #
            model.train_on_batch(fb, lb)
        # for batches
        # model.reset_states()
    # for days
# for epoch

Try wrapping your script with:

import tensorflow as tf

with tf.device('/device:GPU:0'):
    <your code>

Check out the TensorFlow guide on using GPUs for more information.
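As for the inner loop itself: one array trick worth trying (a sketch, not a drop-in replacement) is to build all overlapping windows up front with NumPy's sliding_window_view and then call train_on_batch on chunks of several windows at a time, so the GPU receives larger batches and Python does less per-window slicing. This assumes NumPy 1.20+, the same features/labels arrays as above, and a model whose input accepts a batch dimension larger than 1 (the commented-out reset_states() hints at a stateful model, in which case this changes the training semantics):

import numpy as np

# All overlapping windows as views - shape (n_windows, window_size, n_features).
fb_all = np.lib.stride_tricks.sliding_window_view(features, window_size, axis=0).transpose(0, 2, 1)
lb_all = np.lib.stride_tricks.sliding_window_view(labels, window_size, axis=0).transpose(0, 2, 1)

# Feed several windows per call instead of one.
chunk = 64  # hypothetical chunk size
for start in range(0, len(fb_all), chunk):
    model.train_on_batch(fb_all[start:start + chunk], lb_all[start:start + chunk])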

Related

PPO and tensorBoard: when is the training good enough?

I have my own environment, which is basically the simulation of a stochastic differential equation, and I'm trying to train it using PPO. I can see from TensorBoard that the average reward and episode length are increasing, but when is it good enough?
I thought of comparing it with the maximum length of an episode, and I was thinking this was given by the total_timesteps parameter in the command:
model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name=f"PPO")
Is this true? I'm using Stable Baselines3 and I'm following the tutorial from here, so I have similar code, but I work with a continuous action-state space. This is my code without the environment:
from stable_baselines3 import PPO
import os
from Measure_gym import *
import time

models_dir = f"models/PPO/"
logdir = f"logs/"

if not os.path.exists(models_dir):
    os.makedirs(models_dir)
if not os.path.exists(logdir):
    os.makedirs(logdir)

env = bit_flip()
env.reset()

model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=logdir)

# training
TIMESTEPS = 100000
episodes = 1000
iters = 0
for ep in range(episodes):
    iters += 1
    model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name=f"PPO")
    model.save(f"{models_dir}/{TIMESTEPS*iters}")
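For what it's worth, total_timesteps only controls how many environment steps model.learn collects for training; the maximum length of an episode is usually a property of the environment itself, for example set with Gym's TimeLimit wrapper. A minimal sketch, assuming a Gym-style bit_flip environment and an arbitrary cap of 500 steps:

import gym

# Hypothetical cap: end every episode after at most 500 steps.
env = gym.wrappers.TimeLimit(bit_flip(), max_episode_steps=500)
env.reset()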

xgboost treemethod gpu-hist outperformed by hist using rtx3060ti and amd ryzen 9 5950x

I'm doing some hyper-parameter tuning, so speed is key. I've got a nice workstation with both an AMD Ryzen 9 5950x and an NVIDIA RTX3060ti 8GB.
Setup:
xgboost 1.5.1 installed from PyPI in an Anaconda environment.
NVIDIA graphics driver 471.68
CUDA 11.0
When training an xgboost model using the scikit-learn API, I pass the tree_method = 'gpu_hist' parameter, and I notice that it is consistently outperformed by the default tree_method = 'hist'.
This is somewhat surprising: even when I open multiple consoles (I work in Spyder) and start an Optuna study in each of them, each using a different scikit-learn model, until my CPU usage is at 100%, tree_method = 'hist' is still faster than tree_method = 'gpu_hist'.
How is this possible? Do I have my drivers configured incorrectly? Is my dataset too small to benefit from tree_method = 'gpu_hist' (7000 samples, 50 features, on a 3-class classification problem)? Or is the RTX 3060 Ti simply outclassed by the AMD Ryzen 9 5950X? Or none of the above?
Any help is highly appreciated :)
Edit for @Ferdy:
I carried out this little experiment:
import time
import numpy as np
from xgboost import XGBClassifier

def fit_10_times(tree_method, X_train, y_train):
    times = []
    for i in range(10):
        model = XGBClassifier(tree_method=tree_method)
        start = time.time()
        model.fit(X_train, y_train)
        times.append(time.time() - start)
    return times

cpu_times = fit_10_times('hist', X_train, y_train)
gpu_times = fit_10_times('gpu_hist', X_train, y_train)

print(X_train.describe())
print('mean cpu training times: ', np.mean(cpu_times), 'standard deviation :', np.std(cpu_times))
print('all training times :', cpu_times)
print('----------------------------------')
print('mean gpu training times: ', np.mean(gpu_times), 'standard deviation :', np.std(gpu_times))
print('all training times :', gpu_times)
Which yielded this output:
mean cpu training times: 0.5646213531494141 standard deviation : 0.010005875058323703
all training times : [0.5690040588378906, 0.5500047206878662, 0.5700047016143799, 0.563004732131958, 0.5570034980773926, 0.5486617088317871, 0.5630037784576416, 0.5680046081542969, 0.57651686668396, 0.5810048580169678]
----------------------------------
mean gpu training times: 2.0273998022079467 standard deviation : 0.05105794761358874
all training times : [2.0265607833862305, 2.0070691108703613, 1.9900789260864258, 1.9856727123260498, 1.9925382137298584, 2.0021069049835205, 2.1197071075439453, 2.1220884323120117, 2.0516715049743652, 1.9765043258666992]
The peak in CPU usage corresponds to the CPU training runs, and the peak in GPU usage to the GPU training runs.
7000 samples is too small to fill the GPU pipeline; your GPU is likely starving. We usually work with millions of samples when using GPU acceleration.
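To check whether dataset size alone explains it, one option (a sketch on an assumed synthetic dataset, not from the original post) is to repeat the timing on a much larger dataset; with millions of rows, gpu_hist would be expected to pull ahead:

import time
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Hypothetical larger dataset: 1M samples, 50 features, 3 classes.
X_big, y_big = make_classification(n_samples=1_000_000, n_features=50,
                                   n_informative=10, n_classes=3, random_state=0)

for method in ('hist', 'gpu_hist'):
    model = XGBClassifier(tree_method=method)
    start = time.time()
    model.fit(X_big, y_big)
    print(method, round(time.time() - start, 2), 'seconds')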

How to read audio files using tf.data.Dataset.from_generator

Tensorflow Version : 2.1.0
Model built using tf.keras
Graphics Card : Nvidia GTX 1660TI 6GB DDR6
CPU : Intel i7 9th Gen
Ram : 16 GB DDR4
Storage Disk : SSD (NVME)
I wrote code to read audio files in batches in a multithreaded manner using tf.keras.Sequence with multiple workers, but the issue with that code is that the CPU does not read the next set of audio batches concurrently while the GPU is training, so the GPU is only utilized up to 30 percent of its capacity (training time for an epoch is around 25 minutes).
So I decided to move to tf.data.Dataset.from_generator to reuse the existing generator function and read the batches more efficiently. But that input pipeline performs even worse (taking 47 minutes to train an epoch). I have attached the code that I used to create the input pipeline. I read the file names and their categories from an Excel file and fed them to the generator to create the pipeline.
Even after applying prefetch, the pipeline was still performing badly.
Since this is the first time I am using the tf.data API, I would like some insight into whether I have made any mistakes.
This is my code to generate the batches.
import numpy as np
import soundfile as sf
import tensorflow as tf

# Function to read the audio files
def get_x(file):
    data = []
    for i in file:
        audio, fs = sf.read(i, dtype="float32")
        data.append(audio[::2])
    data = np.array(data, dtype=np.float32)
    data = np.expand_dims(data, axis=-1)
    return data

def data_generator(files, labels, batchsize):
    while True:
        start = 0
        end = batchsize
        while start < len(files):
            x = get_x(files[start:end])
            y = np.array(tf.keras.utils.to_categorical(labels[start:end], num_classes=2),
                         dtype=np.float32)
            yield x, y
            start += batchsize
            end += batchsize

# Get the tensorflow dataset object to generate batches
def tf_data_dataset(files, labels, batch_size):
    autotune = tf.data.experimental.AUTOTUNE
    dataset = tf.data.Dataset.from_generator(
        data_generator,
        output_types=(np.float32, np.float32),
        output_shapes=(tf.TensorShape([None, 16000, 1]),
                       tf.TensorShape([None, 2])),
        args=(files, labels, batch_size))
    dataset = dataset.prefetch(buffer_size=autotune)
    return dataset
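One direction worth trying (a sketch, not from the original post): let tf.data read and decode the audio itself with a parallel map, so the single-threaded Python generator is no longer the bottleneck. This assumes WAV files, fixed-length clips that yield 16000 samples after the 2x downsampling, and integer labels, with the same files/labels lists as above:

import tensorflow as tf

def load_wav(path, label):
    # Read and decode one file inside the tf.data pipeline (runs in TF's thread pool).
    waveform, _ = tf.audio.decode_wav(tf.io.read_file(path), desired_channels=1)
    waveform = waveform[::2]                      # same 2x downsampling as get_x
    return waveform, tf.one_hot(label, depth=2)

autotune = tf.data.experimental.AUTOTUNE
dataset = (tf.data.Dataset.from_tensor_slices((files, labels))
           .map(load_wav, num_parallel_calls=autotune)
           .batch(32)
           .prefetch(autotune))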

Gluon code not much faster on GPU than on CPU

We're training a network for a recommender system on triplets. The core code of the fit method is as follows:
for e in range(epochs):
    start = time.time()
    cumulative_loss = 0
    for i, batch in enumerate(train_iterator):
        # Forward + backward.
        with autograd.record():
            output = self.model(batch.data[0])
            loss = loss_fn(output, batch.label[0])
        # Calculate gradients
        loss.backward()
        # Update parameters of the network.
        trainer_fn.step(batch_size)
        # Calculate training metrics. Sum losses of every batch.
        cumulative_loss += nd.mean(loss).asscalar()
    train_iterator.reset()
where train_iterator is a custom iterator class that inherits from mx.io.DataIter and returns the data (triples) already in the appropriate context, as:
data = [mx.nd.array(data[:, :-1], self.ctx, dtype=np.int)]
labels = [mx.nd.array(data[:, -1], self.ctx)]
return mx.io.DataBatch(data, labels)
self.model.initialize(ctx=mx.gpu(0)) was also called before running the fit method, and loss_fn = gluon.loss.L1Loss().
The trouble is that nvidia-smi reports that the process is correctly allocated to the GPU. However, running fit on the GPU is not much faster than running it on the CPU. In addition, increasing batch_size from 50,000 to 500,000 increases the time per batch by a factor of 10 (which I was not expecting, given GPU parallelization).
Specifically, for a 50k batch:
* output = self.model(batch.data[0]) takes 0.03 seconds on GPU, and 0.08 on CPU.
* loss.backward() takes 0.11 seconds on GPU, and 0.39 on CPU.
Both were measured with nd.waitall() to avoid asynchronous calls leading to incorrect measurements.
In addition, very similar code running on plain MXNet took less than 0.03 seconds for the corresponding part, which leads to a full epoch taking from slightly above one minute with MXNet up to 15 minutes with Gluon.
Any ideas on what might be happening here?
Thanks in advance!
The problem is in the following line:
cumulative_loss += nd.mean(loss).asscalar()
When you call asscalar(), MXNet has to do an implicit synchronized call to copy the result from GPU to CPU; it is essentially the same as calling nd.waitall(). Since you do this on every iteration, it performs the sync every iteration, degrading your wall-clock time significantly.
What you can do is keep and update cumulative_loss on the GPU and copy it to the CPU only when you actually need to display it; that can be every N iterations or after the epoch is done, depending on how long each iteration takes.
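A minimal sketch of that change, reusing the names from the loop in the question (the print interval of 100 is an arbitrary choice):

cumulative_loss = mx.nd.zeros((1,), ctx=mx.gpu(0))       # accumulator stays on the GPU
for i, batch in enumerate(train_iterator):
    with autograd.record():
        output = self.model(batch.data[0])
        loss = loss_fn(output, batch.label[0])
    loss.backward()
    trainer_fn.step(batch_size)
    cumulative_loss += nd.mean(loss)                      # no asscalar(): stays asynchronous
    if (i + 1) % 100 == 0:
        print('running loss:', cumulative_loss.asscalar())   # sync only here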

Breaking down Tensorflow performance with timeline and benchmarking

Using TF 0.12.1, we are trying to understand how the performance of TensorFlow breaks down. In particular, we are looking at the Inception-v3 model and how long the forward pass takes.
The first step we looked at was to run a benchmark on just the inference step. To avoid queueing time, we set the training example to a constant tensor and ran it through the Inception model. The train method in the code is below.
def train(dataset):
    """Train on dataset for a number of steps."""
    with tf.Graph().as_default(), tf.device('/cpu:0'):
        # Create a variable to count the number of train() calls. This equals the
        # number of batches processed * FLAGS.num_gpus.
        global_step = tf.get_variable(
            'global_step', [],
            initializer=tf.constant_initializer(0), trainable=False)

        # Calculate the learning rate schedule.
        num_batches_per_epoch = (dataset.num_examples_per_epoch() /
                                 FLAGS.batch_size)
        decay_steps = int(num_batches_per_epoch * FLAGS.num_epochs_per_decay)

        # Decay the learning rate exponentially based on the number of steps.
        lr = tf.train.exponential_decay(FLAGS.initial_learning_rate,
                                        global_step,
                                        decay_steps,
                                        FLAGS.learning_rate_decay_factor,
                                        staircase=True)

        # Create an optimizer that performs gradient descent.
        opt = tf.train.RMSPropOptimizer(lr, RMSPROP_DECAY,
                                        momentum=RMSPROP_MOMENTUM,
                                        epsilon=RMSPROP_EPSILON)

        # Get images and labels for ImageNet and split the batch across GPUs.
        assert FLAGS.batch_size % FLAGS.num_gpus == 0, (
            'Batch size must be divisible by number of GPUs')
        split_batch_size = int(FLAGS.batch_size / FLAGS.num_gpus)

        num_classes = dataset.num_classes() + 1

        # Calculate the gradients for each model tower.
        tower_grads = []
        reuse_variables = None
        for i in xrange(FLAGS.num_gpus):
            with tf.device('/gpu:%d' % i):
                with tf.name_scope('%s_%d' % (inception.TOWER_NAME, i)) as scope:
                    # Force all Variables to reside on the CPU.
                    with slim.arg_scope([slim.variables.variable], device='/cpu:0'):
                        # Calculate the loss for one tower of the ImageNet model. This
                        # function constructs the entire ImageNet model but shares the
                        # variables across all towers.
                        image_shape = (FLAGS.batch_size, FLAGS.image_size, FLAGS.image_size, 3)
                        labels_shape = (FLAGS.batch_size)
                        images = tf.zeros(image_shape, dtype=tf.float32)
                        labels = tf.zeros(labels_shape, dtype=tf.int32)
                        logits = _tower_loss(images, labels, num_classes,
                                             scope, reuse_variables)
                    # Reuse variables for the next tower.
                    reuse_variables = True

        # Build an initialization operation to run below.
        init = tf.initialize_all_variables()

        # Start running operations on the Graph. allow_soft_placement must be set to
        # True to build towers on GPU, as some of the ops do not have GPU
        # implementations.
        sess = tf.Session(config=tf.ConfigProto(
            allow_soft_placement=True,
            log_device_placement=FLAGS.log_device_placement))
        sess.run(init)

        # Start the queue runners.
        tf.train.start_queue_runners(sess=sess)

        for step in xrange(FLAGS.max_steps):
            start_time = time.time()
            loss_value = sess.run(logits)
            duration = time.time() - start_time

            examples_per_sec = FLAGS.batch_size / float(duration)
            format_str = ('%s: step %d, loss =(%.1f examples/sec; %.3f '
                          'sec/batch)')
            print(format_str % (datetime.now(), step,
                                examples_per_sec, duration))
For 8 GPUs, a batch size of 32, and 1 parameter server, we observe 0.44 seconds per logits operation, which does the forward pass. However, when we run the timeline tool, we observe a much smaller inference time (see figure below). In the GPU runtime there is an initial burst, followed by a break, followed by a longer GPU burst. We assume the initial burst is the forward pass while the second burst is the backpropagation.
If the initial burst really is the forward pass, it is substantially less than 0.44 seconds. Can anyone explain the discrepancy between these results? Is it a mistake in the benchmarking app, or is the timeline tool not capturing the full picture? Additionally, there are a couple of GPU operations before the first large burst that we cannot really explain. Any insight into this would be very much appreciated!
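For reference, timelines of this kind are typically captured with TensorFlow's Chrome trace support; a minimal sketch of that instrumentation around the sess.run call above (the output filename is arbitrary):

from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
loss_value = sess.run(logits, options=run_options, run_metadata=run_metadata)

# Write a Chrome trace that can be opened at chrome://tracing.
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())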
TensorFlow has undergone a number of significant performance improvements since TF 0.12.1. If you are interested in solid performance numbers, please use the latest version of TensorFlow, or version 1.2 when it is released.
If you would like to work from a high-performance model as a starting point, I strongly recommend working from https://github.com/tensorflow/benchmarks which includes an Inception-v3 model.
As for trying to understand the detailed performance of a single step, I recommend instrumenting the C++ TensorFlow runtime. (The overhead from within Python can be significant and could introduce uncertainty into your measurements.)
Additionally, it's important to run the experiment for a number of iterations to allow the system to "warm up" and fully initialize.
One thing to note: if you are trying to tune your model, be sure to avoid setting allow_soft_placement=True. For now, it's better to ensure that all the operations you expect are truly placed on the GPUs. You can confirm this by looking at the log output controlled by the log_device_placement parameter.
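A minimal sketch of that last point (TF 1.x-style session configuration, matching the code in the question), which makes any misplaced ops visible instead of silently moving them to the CPU:

import tensorflow as tf

# Fail loudly if an op cannot be placed on the GPU, and log every placement decision.
sess = tf.Session(config=tf.ConfigProto(
    allow_soft_placement=False,
    log_device_placement=True))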