I am using the code below to train an XGBoost model on a GPU.
The problem is that both the GPU (NVIDIA 1050) and the CPU cores are being used at the same time: the NVIDIA system monitor shows 85 to 90% utilization, and the Linux system monitor shows all cores working.
There are two problems here:
1. Why is xgb_cv using both, when tree_method is set to 'gpu_hist'?
2. Why does training finish in half the time with 'hist' (CPU cores only) compared to 'gpu_hist'?
Thanks
import time
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

model_xgb = XGBClassifier(verbosity=1,
                          objective='multi:softmax',
                          num_class=3,
                          tree_method='gpu_hist',
                          predictor='gpu_predictor')

xgb_cv = GridSearchCV(model_xgb,
                      {'colsample_bytree': [0.8, 0.6],
                       'min_child_weight': [0, 5],
                       'max_depth': [3, 4],
                       'n_estimators': [500],
                       'learning_rate': [0.01, 0.1]},
                      cv=2, verbose=1)

## Fit with cross validation
start_time = time.time()
xgb_cv.fit(X_train, Y_train, verbose=1)
duration = (time.time() - start_time) / 60
print("XGBOOST HyperParameter Tuning %s minutes ---" % duration)
Related
I'm doing some hyper-parameter tuning, so speed is key. I've got a nice workstation with both an AMD Ryzen 9 5950X and an NVIDIA RTX 3060 Ti (8 GB).
Setup:
xgboost 1.5.1 installed from PyPI in an Anaconda environment.
NVIDIA graphics driver 471.68
CUDA 11.0
When training an XGBoost model using the scikit-learn API, I pass the tree_method = gpu_hist parameter, and I notice that it is consistently outperformed by the default tree_method = hist.
Somewhat surprisingly, this holds even when I open multiple consoles (I work in Spyder) and start an Optuna study in each of them, each using a different scikit-learn model, until my CPU usage is at 100%. When I then compare tree_method = gpu_hist with tree_method = hist, hist is still faster!
How is this possible? Do I have my drivers configured incorrectly? Is my dataset too small to benefit from tree_method = gpu_hist (7000 samples, 50 features, a 3-class classification problem)? Is the RTX 3060 Ti simply outclassed by the Ryzen 9 5950X? Or none of the above?
Any help is highly appreciated :)
Edit (for Ferdy):
I carried out this little experiment:
import time
import numpy as np
from xgboost import XGBClassifier

def fit_10_times(tree_method, X_train, y_train):
    times = []
    for i in range(10):
        model = XGBClassifier(tree_method=tree_method)
        start = time.time()
        model.fit(X_train, y_train)
        times.append(time.time() - start)
    return times

cpu_times = fit_10_times('hist', X_train, y_train)
gpu_times = fit_10_times('gpu_hist', X_train, y_train)

print(X_train.describe())
print('mean cpu training times: ', np.mean(cpu_times), 'standard deviation :', np.std(cpu_times))
print('all training times :', cpu_times)
print('----------------------------------')
print('mean gpu training times: ', np.mean(gpu_times), 'standard deviation :', np.std(gpu_times))
print('all training times :', gpu_times)
Which yielded this output:
mean cpu training times: 0.5646213531494141 standard deviation : 0.010005875058323703
all training times : [0.5690040588378906, 0.5500047206878662, 0.5700047016143799, 0.563004732131958, 0.5570034980773926, 0.5486617088317871, 0.5630037784576416, 0.5680046081542969, 0.57651686668396, 0.5810048580169678]
----------------------------------
mean gpu training times: 2.0273998022079467 standard deviation : 0.05105794761358874
all training times : [2.0265607833862305, 2.0070691108703613, 1.9900789260864258, 1.9856727123260498, 1.9925382137298584, 2.0021069049835205, 2.1197071075439453, 2.1220884323120117, 2.0516715049743652, 1.9765043258666992]
The peak in CPU usage refers to the CPU training runs, and the peak in GPU usage the GPU training runs.
7000 samples is too small to fill the GPU pipeline; your GPU is likely starving. We usually work with millions of samples when using GPU acceleration.
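One quick way to test the dataset-size hypothesis is to rerun the same timing on progressively larger synthetic datasets. This is only a sketch (make_classification and the sample sizes are illustrative); the point is to see whether gpu_hist eventually overtakes hist as the data grows.

import time
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

for n_samples in (7000, 70000, 700000):
    X, y = make_classification(n_samples=n_samples, n_features=50,
                               n_informative=20, n_classes=3)
    for tree_method in ('hist', 'gpu_hist'):
        model = XGBClassifier(tree_method=tree_method)
        start = time.time()
        model.fit(X, y)
        print(n_samples, tree_method, round(time.time() - start, 2), 'seconds')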
I am training a model using Keras 2.2.4, Python 3.5.3 and TensorFlow on a GCP virtual machine with a K80 GPU.
GPU utilisation oscillates between 25 and 50%, while the Python process eats 98% of a CPU core.
I assume Python is too slow to feed the K80 with data.
The code is below.
There are multiple days of data for each epoch.
Each day has around 20K samples; the exact number differs a bit from day to day.
The batch size is fixed by the variable window_size = 900, so I feed it around 19K batches per day: batch 0 starts with sample 0 and takes 900 samples, batch 1 starts from sample 1 and takes 900 samples, and so on until the day ends.
So I have 3 nested loops: epochs, days, batches.
I feel the epoch and day loops should be preserved for clarity; I don't think they are the problem.
I think the innermost loop is the one to look at.
Its current implementation is naïve. Is there some trickery that can make the array work faster? (A vectorised sketch follows the code below.)
# d is a tuple from groupby - d[0] = date, d[1] = values
for epoch in epochs:
    print('epoch: ', epoch)
    for d in days:
        print(' day: ', d[0])
        # get arrays for the day
        features = np.asarray(d[1])[:, 2:9].astype(dtype='float32')
        print(len(features), len(features[0]), features[1].dtype)
        labels = np.asarray(d[1])[:, 9:].astype(dtype='int8')
        print(len(labels), len(labels[0]), labels[1].dtype)
        for batch in range(len(features) - window_size):
            # # # can these be optimised?
            fb = features[batch:batch + window_size, :]
            lb = labels[batch:batch + window_size, :]
            fb = fb.reshape(1, fb.shape[0], fb.shape[1])
            lb = lb.reshape(1, lb.shape[0], lb.shape[1])
            # # #
            model.train_on_batch(fb, lb)
        # for batches
        # model.reset_states()
    # for days
# for epochs
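One possible way to cut the Python-level overhead of the innermost loop is sketched below. It is not the original code and rests on a few assumptions: NumPy 1.20+ for sliding_window_view, a model that is not locked to a batch size of 1 (a stateful RNN with batch_input_shape=(1, ...) would need changes), and enough memory for a day's worth of window views. It reuses features, labels, window_size and model from the loop above.

from numpy.lib.stride_tricks import sliding_window_view

# Build all overlapping windows for the day at once as views (no copying).
# Shape after swapaxes: (num_windows, window_size, n_features); note this
# yields N - window_size + 1 windows, one more than the loop above.
fb_all = sliding_window_view(features, window_size, axis=0).swapaxes(1, 2)
lb_all = sliding_window_view(labels, window_size, axis=0).swapaxes(1, 2)

# Feed chunks of windows instead of one window per call, so the GPU gets
# larger batches and far fewer train_on_batch calls (chunk size is arbitrary).
chunk = 64
for start in range(0, len(fb_all), chunk):
    model.train_on_batch(fb_all[start:start + chunk],
                         lb_all[start:start + chunk])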
Try wrapping your script with:
import tensorflow as tf

with tf.device('/device:GPU:0'):
    <your code>

Check out the TensorFlow guide on using GPUs for more information.
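To confirm where the ops actually end up, one option (a sketch using the TF 1.x session API used elsewhere in this thread; the matrix sizes are arbitrary) is to enable device-placement logging:

import tensorflow as tf

with tf.device('/device:GPU:0'):
    a = tf.random_normal([1000, 1000])
    b = tf.random_normal([1000, 1000])
    c = tf.matmul(a, b)

# log_device_placement prints the device chosen for every op.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c).shape)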
We're training a network for a recommender system, on triplets. The core code for the fit method is as follows:
for e in range(epochs):
    start = time.time()
    cumulative_loss = 0
    for i, batch in enumerate(train_iterator):
        # Forward + backward.
        with autograd.record():
            output = self.model(batch.data[0])
            loss = loss_fn(output, batch.label[0])
        # Calculate gradients
        loss.backward()
        # Update parameters of the network.
        trainer_fn.step(batch_size)
        # Calculate training metrics. Sum losses of every batch.
        cumulative_loss += nd.mean(loss).asscalar()
    train_iterator.reset()
where train_iterator is a custom iterator class that inherits from mx.io.DataIter and returns the data (triplets) already in the appropriate context, as:
data = [mx.nd.array(data[:, :-1], self.ctx, dtype=np.int)]
labels = [mx.nd.array(data[:, -1], self.ctx)]
return mx.io.DataBatch(data, labels)
self.model.initialize(ctx=mx.gpu(0)) was also called before running the fit method. loss_fn = gluon.loss.L1Loss().
The trouble is that nvidia-smi reports that the process is correctly allocated on the GPU. However, running fit on the GPU is not much faster than running it on the CPU. In addition, increasing batch_size from 50000 to 500000 increases the time per batch by a factor of 10 (which I was not expecting, given GPU parallelization).
Specifically, for a 50k batch:
* output = self.model(batch.data[0]) takes 0.03 seconds on GPU, and 0.08 on CPU.
* loss.backward() takes 0.11 seconds, and 0.39 on CPU.
both assessed with nd.waitall() to avoid asynchronous calls leading to incorrect measurements.
In addition, a very similar piece of code running on plain MXNet took less than 0.03 seconds for the corresponding part, so a full epoch takes slightly over one minute with MXNet but up to 15 minutes with Gluon.
Any ideas on what might be happening here?
Thanks in advance!
The problem is in the following line:
cumulative_loss += nd.mean(loss).asscalar()
When you call asscalar(), MXNet has to make an implicit synchronized call to copy the result from the GPU to the CPU; it is essentially the same as calling nd.waitall(). Since you do this on every iteration, it performs the sync every iteration, degrading your wall-clock time significantly.
What you can do is keep and update cumulative_loss on the GPU and copy it to the CPU only when you actually need to display it - every N iterations, or after the epoch is done, depending on how long each iteration takes.
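A minimal sketch of that change, reusing the names from the training loop above (mx, nd, autograd, train_iterator, loss_fn, trainer_fn, batch_size): the running loss stays on the GPU as an NDArray and is copied to the host only once per epoch.

# Accumulate the loss on the GPU; only the final asscalar() forces a copy.
cumulative_loss = mx.nd.zeros((1,), ctx=mx.gpu(0))
for i, batch in enumerate(train_iterator):
    with autograd.record():
        output = self.model(batch.data[0])
        loss = loss_fn(output, batch.label[0])
    loss.backward()
    trainer_fn.step(batch_size)
    cumulative_loss += nd.mean(loss)                  # stays on the GPU, no sync
train_iterator.reset()
print('epoch loss:', cumulative_loss.asscalar())      # single GPU->CPU copy per epoch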
For unknown reasons, the following code runs about twice as slowly on the GPU as on the CPU. Could anyone explain why?
import time
import tensorflow as tf

with tf.device('/device:GPU:0'):        # gpu takes: 5.132448434829712 seconds
    # with tf.device('/cpu:0'):         # cpu takes: 3.440524101257324 seconds
    i = tf.constant(0)
    while_condition = lambda i: tf.less(i, 2 ** 20)
    a = tf.fill([16, 16], 1.1)
    b = tf.fill([16, 16], 2.2)

    def body(i):
        res = tf.matmul(a, b)
        # increment i
        add = tf.add(i, 1)
        return (add,)

    ini_matmul = tf.matmul(a, b)
    # do the loop:
    loop = tf.while_loop(while_condition, body, [i])

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(ini_matmul)    # force GPU to initialise anything it needs.
    t0 = time.time()
    sess.run(loop)
    t1 = time.time()
    print(t1 - t0)
    sess.close()
Note: usually the GPU version runs for 5 seconds, the CPU version runs for 3 seconds, and the CPU version using NumPy runs for only 1.5 seconds.
Hardware: the TensorFlow code runs on Google Colab; the NumPy code runs locally on an Intel Core i5-7267U.
Numpy version:
import numpy as np
import time

i = 0
a = np.full([16, 16], 1.1)
b = np.full([16, 16], 2.2)
t0 = time.time()
while i < 2**20:
    a.dot(b)
    i += 1
t1 = time.time()
print(t1 - t0)
Update
This is getting more and more weird to me, because scaling up the matrix does not really help. Here's the updated code and the data from it (running on a Titan Xp card / Intel i7 CPU). Essentially, TensorFlow is running much slower.
import time
import tensorflow as tf

dimension = 11
repeat = 2**10
use_gpu = False
# Device: /device:GPU:0, Dimension 11, Repeat: 1024, Time cost: 0.00457597 seconds.
# Device: /cpu:0,        Dimension 11, Repeat: 1024, Time cost: 0.00353599 seconds.

dev_name = '/device:GPU:0' if use_gpu else '/cpu:0'
with tf.device(dev_name):
    i = tf.constant(0)
    while_condition = lambda i: tf.less(i, repeat)
    a = tf.constant(1.1, shape=[2**dimension, 2**dimension])
    b = tf.constant(2.2, shape=[2**dimension, 2**dimension])

    def body(i):
        res = tf.matmul(a, b)
        add = tf.add(i, 1)
        return (add,)

    ini_matmul = tf.matmul(a, b)
    # do the loop:
    loop = tf.while_loop(while_condition, body, [i])

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(ini_matmul)    # force initialisation.
    t0 = time.time()
    sess.run(loop)
    t1 = time.time()
    print('Device: {dev}, Dimension {dim:d}, Repeat: {r:d}, Time cost: {t:.8f} seconds.'.format(
        dev=dev_name,
        dim=dimension, r=repeat,
        t=t1 - t0
    ))
    sess.close()
In the end, I figured out that the matmul inside the loop body is never executed by TensorFlow: its result (res) is not used by anything, so it is an orphan node that gets pruned from the graph.
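One way to verify this (a sketch: it ties the loop counter to the matmul with a control dependency so the graph can no longer prune it; a and b are the tensors defined in the snippet above):

def body(i):
    res = tf.matmul(a, b)
    # Creating the increment under a control dependency on the matmul forces
    # res to be computed on every iteration instead of being pruned.
    with tf.control_dependencies([res]):
        add = tf.add(i, 1)
    return (add,)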
This is an interesting question.
The relative slowdown you're seeing between GPU and CPU execution in the TensorFlow snippet is almost certainly due to GPU memory allocation overhead; in short, cudaMalloc is slower than malloc. This memory-allocation slowdown is offset by a speedup in the requested operation (matmul in this case) if and only if the speedup exceeds the difference in allocation times. That is always true for matmul when the matrices are large; it is not true when the matrices are small, as in your example. To validate this hypothesis, iteratively increase the size of the multiplicands and record both the CPU and GPU running times - if memory allocation is indeed the issue, the two should converge and then cross.
The difference between the NumPy running time and the CPU-only running time is likely due to a very subtle difference between the NumPy and TensorFlow code. Note that in the NumPy code you instantiate a and b only once. It looks like you do the same thing in the TensorFlow code because you only call your initialization once, but you're still populating the tensors on every iteration! To see why, note that tf.fill returns a Tensor, and Tensor objects are populated each time sess.run is called on the graph that contains them. So the two snippets actually do slightly different things; a more direct comparison would make a and b tf.constant in the TensorFlow snippet.
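A sketch of that experiment (the sizes are arbitrary; it uses the same TF 1.x session API as the snippets above and times one matmul per device after a warm-up run):

import time
import tensorflow as tf

for n in (16, 256, 1024, 4096):
    for dev in ('/cpu:0', '/device:GPU:0'):
        tf.reset_default_graph()
        with tf.device(dev):
            a = tf.random_normal([n, n])
            b = tf.random_normal([n, n])
            c = tf.matmul(a, b)
        with tf.Session() as sess:
            sess.run(c)                      # warm-up / initialisation
            t0 = time.time()
            sess.run(c)
            print(dev, n, time.time() - t0, 'seconds')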
Are there any recommended or minimum system requirements for the Microsoft Cognitive Toolkit (CNTK)? I cannot find this information anywhere in the GitHub repository.
The GPU requirement is a CUDA-enabled card with compute capability 3.0 or higher. I tried to run training on a PC with a GeForce GT 610 GPU and got this message:
The GPU (GeForce GT 610) has compute capability 2.1. CNTK is only
supported on GPUs with compute capability 3.0 or greater
You can find some references to requirements for GPU hardware here:
https://github.com/Microsoft/CNTK/wiki/Setup-CNTK-on-Windows
I tested some of the simple image recognition tutorials on an older desktop machine whose GPU has too low a compute capability (so only the CPU was used), and it took more than an hour to complete the training. On a first-generation Surface Book it took a few minutes. The first-generation Surface Book uses what AnandTech said was approximately equivalent to a GeForce GT 940M. I have not tested on a desktop machine with any of the newer high-end GPU cards to see how they perform, but it would be interesting to know.
I performed a little testing using this tutorial: https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_201B_CIFAR-10_ImageHandsOn.ipynb
On my Surface Book (first generation) I get the following results for the first part of the training:
Finished Epoch [1]: [Training] loss = 2.063133 * 50000, metric = 75.6% * 50000 16.486s (3032.8 samples per second)
Finished Epoch [2]: [Training] loss = 1.677638 * 50000, metric = 61.5% * 50000 16.717s (2990.9 samples per second)
Finished Epoch [3]: [Training] loss = 1.524161 * 50000, metric = 55.4% * 50000 16.758s (2983.7 samples per second)
These are the results from running on an C6 Azure VM with one Nvidia K80 GPU:
Finished Epoch [1]: [Training] loss = 2.061817 * 50000, metric = 75.5% * 50000 9.766s (5120.0 samples per second)
Finished Epoch [2]: [Training] loss = 1.679222 * 50000, metric = 61.5% * 50000 10.312s (4848.5 samples per second)
Finished Epoch [3]: [Training] loss = 1.524643 * 50000, metric = 55.6% * 50000 8.375s (5970.1 samples per second)
As you can see, the Azure VM is about 2x faster than my Surface Book, so if you need to experiment and you don't have a machine with a powerful GPU, Azure could be an option. The K80 GPU also has a lot more memory on board, so it can run models with higher memory requirements. The VM in Azure can be started only when needed, to save cost.
On my Surface Book, I easily get memory errors like this:
RuntimeError: CUDA failure 2: out of memory ; GPU=0 ; hostname=OLAVT01 ; expr=cudaMalloc((void**) &deviceBufferPtr, sizeof(AllocatedElemType) * numElements)
This is because the Surface Book (first generation) only has 1 GB of GPU memory.
Update: When I first ran the tests the code was running on CPU. The results above are all from using the GPU.
To check if you are running on the CPU or the GPU use the following code:
import cntk as C

if C.device.default().type() == 0:
    print('running on CPU')
else:
    print('running on GPU')
To ask CNTK to use the GPU use:
from cntk.device import set_default_device, gpu
set_default_device(gpu(0))
CNTK itself has minimal requirements. However, training some of the bigger, more demanding models can be slow, so having a GPU (or 8) can help.