xgboost tree_method gpu_hist outperformed by hist using RTX 3060 Ti and AMD Ryzen 9 5950X - gpu

I'm doing some hyper-parameter tuning, so speed is key. I've got a nice workstation with both an AMD Ryzen 9 5950X and an NVIDIA RTX 3060 Ti 8GB.
Setup:
xgboost 1.5.1 installed from PyPI in an Anaconda environment
NVIDIA graphics driver 471.68
CUDA 11.0
When training an xgboost model using the scikit-learn API I pass the tree_method = gpu_hist parameter, and I notice that it is consistently outperformed by the default tree_method = hist.
Somewhat surprisingly, this holds even when I open multiple consoles (I work in Spyder) and start an Optuna study in each of them, each tuning a different scikit-learn model, until my CPU usage is at 100%. Even then, tree_method = hist is still faster than tree_method = gpu_hist!
How is this possible? Are my drivers configured incorrectly? Is my dataset too small to benefit from tree_method = gpu_hist (7000 samples, 50 features, 3-class classification)? Is the RTX 3060 Ti simply outclassed by the AMD Ryzen 9 5950X? Or none of the above?
Any help is highly appreciated :)
Edit (for @Ferdy):
I carried out this little experiment:
import time

import numpy as np
from xgboost import XGBClassifier

def fit_10_times(tree_method, X_train, y_train):
    times = []
    for i in range(10):
        model = XGBClassifier(tree_method=tree_method)
        start = time.time()
        model.fit(X_train, y_train)
        times.append(time.time() - start)
    return times

cpu_times = fit_10_times('hist', X_train, y_train)
gpu_times = fit_10_times('gpu_hist', X_train, y_train)

print(X_train.describe())
print('mean cpu training times: ', np.mean(cpu_times), 'standard deviation :', np.std(cpu_times))
print('all training times :', cpu_times)
print('----------------------------------')
print('mean gpu training times: ', np.mean(gpu_times), 'standard deviation :', np.std(gpu_times))
print('all training times :', gpu_times)
Which yielded this output:
mean cpu training times: 0.5646213531494141 standard deviation : 0.010005875058323703
all training times : [0.5690040588378906, 0.5500047206878662, 0.5700047016143799, 0.563004732131958, 0.5570034980773926, 0.5486617088317871, 0.5630037784576416, 0.5680046081542969, 0.57651686668396, 0.5810048580169678]
----------------------------------
mean gpu training times: 2.0273998022079467 standard deviation : 0.05105794761358874
all training times : [2.0265607833862305, 2.0070691108703613, 1.9900789260864258, 1.9856727123260498, 1.9925382137298584, 2.0021069049835205, 2.1197071075439453, 2.1220884323120117, 2.0516715049743652, 1.9765043258666992]
The peak in CPU usage corresponds to the CPU training runs, and the peak in GPU usage to the GPU training runs.

7000 samples is too small to fill the GPU pipeline; your GPU is likely starving. We usually work with millions of samples when using GPU acceleration.
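To make this concrete, here is a minimal sketch (not from the original answer) that times both tree methods on synthetic data of increasing size; the make_classification settings and sample counts are illustrative. The expectation, per the answer above, is that gpu_hist only pulls ahead once the dataset is much larger than 7000 rows.

import time

from sklearn.datasets import make_classification
from xgboost import XGBClassifier

def time_fit(tree_method, X, y):
    # Time a single fit with the given tree_method.
    model = XGBClassifier(tree_method=tree_method)
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start

for n_samples in (7_000, 70_000, 700_000):
    X, y = make_classification(n_samples=n_samples, n_features=50,
                               n_informative=20, n_classes=3, random_state=0)
    cpu = time_fit('hist', X, y)
    gpu = time_fit('gpu_hist', X, y)
    print(f'{n_samples:>8} samples  hist: {cpu:6.2f}s  gpu_hist: {gpu:6.2f}s')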

Related

TF 2.11 CNN training with 20k images and an NVIDIA GeForce RTX 4090 GPU running too slow

I have a Linux x86_64 operating system and I am running TF 2.11 in a conda environment. I just got a workstation that includes an NVIDIA GeForce RTX 4090 24GB GPU. I'd like to perform CNN image classification; my dataset contains 20k images, 14k of which are for training, 3k for validation, and 3k for testing. The code also does hyperparameter tuning using the TensorBoard API, so basically I am expecting to finish around 10k experiments. The number of epochs is 300, and the batch size varies over 16, 32, and 64.
Before, I was running a CNN with 2k images using the same logic and number of experiments, and honestly it was taking about 2 weeks to finish everything. I expected it to run much faster now that I upgraded from a GeForce RTX 2060 to a 4090, but that's not the case.
As you can see in the following pictures, there is no issue with running it on the GPU; the problem is why it runs so slowly. Finishing the first epoch (1/300), which includes 450 steps, takes up to 2 to 2.5 hours, and only then does it move on to 2/300. This is incredible: it means the whole process could take months.
I'm also confused about the GPU utilization figure, but I am assuming about 0.9 percent makes sense(?). I checked all updates and CUDA components; they seem correct.
What do you think the issue could be? 20k images is not huge for this GPU. I tried running it from the terminal and from a Jupyter notebook; both behave the same. I feel like the tf.session command could be creating some issues. Should sessions be opened and closed explicitly?
These are my parameters that need to be optimized:
EDIT: if I run it on the RTX 2060, it is definitely much faster than on the Linux RTX 4090, and I have not figured out what the problem is. Finishing the first run (1/300) takes just 1.5 minutes there, versus about 1.5 hours on the Linux 4090 workstation!
GPU UTILIZATION BEFORE TRAINING: [screenshot]
GPU UTILIZATION AFTER STARTING TRAINING: [screenshot]
how I generate the data:
train_data = train_datagen.flow_from_directory(directory=train_path,
                                                target_size=(1000, 9),
                                                color_mode="grayscale",
                                                class_mode="categorical",
                                                shuffle=True)
valid_data = valid_datagen.flow_from_directory(directory=valid_path,
                                                target_size=(1000, 9),
                                                color_mode="grayscale",
                                                class_mode="categorical",
                                                shuffle=False)
test_data = test_datagen.flow_from_directory(directory=test_path,
                                              target_size=(1000, 9),
                                              color_mode="grayscale",
                                              class_mode="categorical",
                                              shuffle=False)

tf.distribute.MirroredStrategy - suggestion for improving test mean_iou for segmentation network using distributed training

I am using TensorFlow 2.5.0 and have implemented a semantic segmentation network: DeepLab_v3_plus with a ResNet101 backbone, the Adam optimizer, and categorical cross-entropy loss. I first built the code for a single GPU and achieved a test accuracy (mean_iou) of 54% after training for 96 epochs. I then added tf.distribute.MirroredStrategy (one machine) to support multi-GPU training. Surprisingly, with 2 GPUs and 48 epochs of training the test mean_iou is just 27%, and with 4 GPUs and 24 epochs it is around 12% on the same dataset.
Here is what I modified to go from single-GPU to multi-GPU training.
Following the TensorFlow blog on distributed training, I created the mirrored strategy and placed model creation, model compilation, and the dataset_generator inside the strategy scope. As I understand it, model.fit() should then take care of synchronizing gradients and distributing the data across the GPUs. The code ran without errors, and training time dropped compared to a single GPU for the same number of images, but the test mean_iou kept getting worse as the number of GPUs increased.
Replaced BatchNormalization with SyncBatchNormalization, but no improvement.
Used a warmup learning rate with linear scaling of the learning rate by the number of GPUs, but no improvement.
In the cross-entropy loss, used both losses_utils.ReductionV2.AUTO and losses_utils.ReductionV2.NONE:
loss = ce(y_true, y_pred)
# reshape loss for each sample (BxHxWxC -> BxN)
# Normalize loss by number of non zero elements and sum for each sample and mean across all samples.
With the .AUTO/.NONE options I am not scaling the loss by global_batch_size, on the understanding that TF will take care of it and that I am already normalizing per GPU, but neither option helped (a sketch of the scaling the TF guide recommends appears after this list).
Changed the data_generator to a tf.data.Dataset object. This helped training time, but the test mean_iou got even worse.
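Not the poster's code, but a hedged sketch of the loss scaling the TensorFlow distributed-training tutorial recommends when Reduction.NONE is used (it is written primarily for custom training loops): compute the per-example loss and average it over the global batch size, so the gradient scale does not change with the number of replicas. The per-replica batch of 8 and the function names are illustrative.

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
PER_REPLICA_BATCH = 8                                    # illustrative value
GLOBAL_BATCH_SIZE = PER_REPLICA_BATCH * strategy.num_replicas_in_sync

ce = tf.keras.losses.CategoricalCrossentropy(
    reduction=tf.keras.losses.Reduction.NONE)

def segmentation_loss(y_true, y_pred):
    # Per-pixel loss, shape (batch, H, W), since reduction is NONE.
    per_pixel = ce(y_true, y_pred)
    # Average over the spatial dimensions to get one value per sample.
    per_example = tf.reduce_mean(per_pixel, axis=[1, 2])
    # Sum over the batch and divide by the *global* batch size.
    return tf.nn.compute_average_loss(per_example,
                                      global_batch_size=GLOBAL_BATCH_SIZE)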
I would appreciate any lead or suggestion for improving test mean_iou in distributed training.
Let me know if you need any additional details.
Thank you

Optimising GPU use for Keras model training

I'm training a Keras model. During the training, I'm only utilising between 5 and 20% of my CUDA cores and an equally small proportion of my NVIDIA RTX 2070 memory. Model training is pretty slow currently and I would really like to take advantage of as many of my available CUDA cores as possible to speed this up!
nvidia-smi dmon  # (during model training)
# gpu    pwr  gtemp  mtemp   sm  mem  enc  dec  mclk  pclk
# Idx      W      C      C    %    %    %    %   MHz   MHz
    0     45     49      -    9    6    0    0  6801  1605
What parameters should I look to tune in order to increase CUDA core utilisation with the aim of training the same model faster?
Here's a simplified example of my current image generation and training steps (I can elaborate / edit, if required, but I currently believe these are the key steps for the purpose of the question):
train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    r'./input_training_examples',
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary'
)
validation_generator = test_datagen.flow_from_directory(
    r'./input_validation_examples',
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary'
)

history = model.fit(
    train_generator,
    steps_per_epoch=128, epochs=30,
    validation_data=validation_generator, validation_steps=50,
)
Hardware: NVIDIA 2070 GPU
Platform: Linux 5.4.0-29-generic #33-Ubuntu x86_64, NVIDIA driver 440.64, CUDA 10.2, Tensorflow 2.2.0-rc3
GPU utilization is a tricky business; there are too many factors involved.
The first thing to try, obviously: increase the batch size.
But that alone doesn't guarantee maximum utilization; maybe your I/O is slow, so there is a bottleneck in the data_generator.
You can try loading the full data as a NumPy array if you have enough RAM (a sketch appears at the end of this answer).
You can try increasing the number of workers in the multiprocessing scheme:
model.fit(..., use_multiprocessing=True, workers=8)
Finally, it depends on your model: if your model is too light and not deep, your utilization will be low and there's no standard way to improve it any further.
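As a hedged illustration of the "load everything into RAM" suggestion above, the sketch below drains the generators already defined in the question into NumPy arrays and trains with a larger batch size. It assumes the dataset fits in memory; the helper name generator_to_arrays and the batch size of 256 are illustrative, not part of the original answer.

import numpy as np

def generator_to_arrays(generator):
    # Drain one full pass of the generator into NumPy arrays.
    xs, ys = [], []
    for _ in range(len(generator)):
        x_batch, y_batch = next(generator)
        xs.append(x_batch)
        ys.append(y_batch)
    return np.concatenate(xs), np.concatenate(ys)

x_train, y_train = generator_to_arrays(train_generator)

history = model.fit(
    x_train, y_train,
    batch_size=256,            # larger batches tend to raise GPU utilisation
    epochs=30,
    validation_data=validation_generator,
    validation_steps=50,
)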

How to fix low volatile GPU-Util with Tensorflow-GPU and Keras?

I have a 4 GPU machine on which I run Tensorflow (GPU) with Keras. Some of my classification problems take several hours to complete.
nvidia-smi returns Volatile GPU-Util which never exceeds 25% on any of my 4 GPUs.
How can I increase GPU Util% and speed up my training?
If your GPU util is below 80%, this is generally the sign of an input pipeline bottleneck. What this means is that the GPU sits idle much of the time, waiting for the CPU to prepare the data.
What you want is for the CPU to keep preparing batches while the GPU is training, so the GPU stays fed. This is called prefetching.
Great, but if the batch preparation is still much longer than the model training, the GPU will still remain idle, waiting for the CPU to finish the next batch. To make batch preparation faster we can parallelize the different preprocessing operations.
We can go even further by parallelizing I/O.
Now to implement this in Keras, you need to use the Tensorflow Data API with Tensorflow version >= 1.9.0. Here is an example:
Let's assume, for the sake of this example that you have two numpy arrays x and y. You can use tf.data for any type of data but this is simpler to understand.
def preprocessing(x, y):
    # Can only contain TF operations
    ...
    return x, y

dataset = tf.data.Dataset.from_tensor_slices((x, y))  # Creates a dataset object
dataset = dataset.map(preprocessing, num_parallel_calls=64)  # Parallel preprocessing
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(None)  # Will automatically prefetch batches

...

model = tf.keras.Model(...)
model.fit(x=dataset)  # Since tf 1.9.0 you can pass a dataset object
tf.data is very flexible, but like anything in Tensorflow (except eager mode), it uses a static graph. This can be a pain sometimes, but the speedup is worth it.
To go further, you can have a look at the performance guide and the Tensorflow data guide.
I had a similar issue: the memory of all the GPUs was allocated by Keras, but Volatile GPU-Util was around 0% and training was taking almost the same amount of time as on the CPU. I was using ImageDataGenerator, which turned out to be the bottleneck. When I increased the number of workers in the fit_generator method from the default value of 1 to all available CPUs, training time dropped dramatically.
You can also load the data into memory and then use the flow method to prepare batches of augmented images, as in the sketch below.
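A minimal sketch of that in-memory approach, assuming the images are already available as NumPy arrays x_train / y_train (hypothetical names) and that the compiled model from the question is in scope; augmentation still happens on the fly, but per-batch disk I/O disappears. The augmentation settings and batch size are illustrative.

from keras.preprocessing.image import ImageDataGenerator  # or tensorflow.keras, depending on the install

datagen = ImageDataGenerator(rescale=1. / 255, horizontal_flip=True)

# flow() builds augmented batches directly from in-memory arrays.
train_iter = datagen.flow(x_train, y_train, batch_size=64, shuffle=True)

model.fit_generator(train_iter,
                    steps_per_epoch=len(x_train) // 64,
                    epochs=30,
                    workers=8,               # more workers, as in the answer above
                    use_multiprocessing=True)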

Hardware requirement for installing cntk

Are there any recommended or minimum system requirements for the Microsoft Cognitive Toolkit (CNTK)? I cannot find this information anywhere in the Git repository.
The GPU requirement is a CUDA-enabled card with compute capability 3.0 or higher. I tried to run training on a PC with a GeForce GT 610 GPU and got this message:
The GPU (GeForce GT 610) has compute capability 2.1. CNTK is only
supported on GPUs with compute capability 3.0 or greater
You can find some references to requirements for GPU hardware here:
https://github.com/Microsoft/CNTK/wiki/Setup-CNTK-on-Windows
I tested some of the simple image recognition tutorials on an older desktop machine whose GPU had too low a compute capability (so only the CPU was used), and it took more than an hour to complete the training. On a Surface Book (1st generation) it took a few minutes. The first-generation Surface Book uses what AnandTech described as approximately equivalent to a GeForce GT 940M. I have not tested on a desktop machine with one of the newer high-end GPU cards to see how they perform, but it would be interesting to know.
I performed a little testing using this tutorial: https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_201B_CIFAR-10_ImageHandsOn.ipynb
On my Surface Book (1st generation) I get the following results for the first part of the training:
Finished Epoch [1]: [Training] loss = 2.063133 * 50000, metric = 75.6% * 50000 16.486s (3032.8 samples per second)
Finished Epoch [2]: [Training] loss = 1.677638 * 50000, metric = 61.5% * 50000 16.717s (2990.9 samples per second)
Finished Epoch [3]: [Training] loss = 1.524161 * 50000, metric = 55.4% * 50000 16.758s (2983.7 samples per second)
These are the results from running on an C6 Azure VM with one Nvidia K80 GPU:
Finished Epoch [1]: [Training] loss = 2.061817 * 50000, metric = 75.5% * 50000 9.766s (5120.0 samples per second)
Finished Epoch [2]: [Training] loss = 1.679222 * 50000, metric = 61.5% * 50000 10.312s (4848.5 samples per second)
Finished Epoch [3]: [Training] loss = 1.524643 * 50000, metric = 55.6% * 50000 8.375s (5970.1 samples per second)
As you can see, the Azure VM is about 2x faster than my Surface Book, so if you need to experiment and you don't have a machine with a powerful GPU, Azure could be an option. The K80 GPU also has a lot more onboard memory, so it can run models with higher memory requirements. The Azure VM can be started only when needed, to save cost.
On my Surface Book, I easily get memory errors like this:
RuntimeError: CUDA failure 2: out of memory ; GPU=0 ; hostname=OLAVT01 ; expr=cudaMalloc((void**) &deviceBufferPtr, sizeof(AllocatedElemType) * numElements)
This is because the Surface Book (1st generation) has only 1 GB of GPU memory.
Update: When I first ran the tests the code was running on CPU. The results above are all from using the GPU.
To check if you are running on the CPU or the GPU use the following code:
import cntk as C

if C.device.default().type() == 0:
    print('running on CPU')
else:
    print('running on GPU')
To ask CNTK to use the GPU use:
from cntk.device import set_default_device, gpu
set_default_device(gpu(0))
CNTK itself has minimal requirements. However, training some of the bigger, more demanding models can be slow, so having a GPU (or eight) can help.