Hardware requirements for installing CNTK - cntk

Are there any recommended or minimum system requirements for the Microsoft Cognitive Toolkit (CNTK)? I cannot find this information anywhere in the GitHub repository.

The GPU requirement is a CUDA-enabled card with compute capability 3.0 or higher. I tried to run training on a PC with a GeForce GT 610 GPU and got this message:
The GPU (GeForce GT 610) has compute capability 2.1. CNTK is only supported on GPUs with compute capability 3.0 or greater
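The check behind that message can be sketched in a few lines. This is only an illustration, not CNTK's own code: the helper names `parse_capability` and `is_supported` are hypothetical, and the minimum of 3.0 is taken from the error message above. Comparing (major, minor) tuples rather than floats avoids treating a version such as "3.10" as smaller than "3.9".

```python
def parse_capability(text):
    """Turn a string such as "2.1" into a comparable (major, minor) tuple."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


# Assumption: 3.0 is the minimum quoted in the error message above.
MIN_CAPABILITY = (3, 0)


def is_supported(capability, minimum=MIN_CAPABILITY):
    """Return True when the capability meets the minimum requirement."""
    return parse_capability(capability) >= minimum


# 2.1 is the value reported for the GeForce GT 610 in the message above.
print(is_supported("2.1"))
# 3.0 is the stated minimum, so it should pass.
print(is_supported("3.0"))
```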

You can find some references to requirements for GPU hardware here:
https://github.com/Microsoft/CNTK/wiki/Setup-CNTK-on-Windows
I tested some of the simple image recognition tutorials on an older desktop machine whose GPU had too low a compute capability (so only the CPU was used), and training took more than an hour to complete. On a Surface Book (1st generation) it took a few minutes. The first-generation Surface Book uses a GPU that AnandTech described as approximately equivalent to a GeForce GT 940M. I have not tested on a desktop machine with one of the newer high-end GPU cards to see how they perform, but it would be interesting to know.
I performed a little testing using this tutorial: https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_201B_CIFAR-10_ImageHandsOn.ipynb
On my Surface Book (1st generation) I get the following results for the first part of the training:
Finished Epoch [1]: [Training] loss = 2.063133 * 50000, metric = 75.6% * 50000 16.486s (3032.8 samples per second)
Finished Epoch [2]: [Training] loss = 1.677638 * 50000, metric = 61.5% * 50000 16.717s (2990.9 samples per second)
Finished Epoch [3]: [Training] loss = 1.524161 * 50000, metric = 55.4% * 50000 16.758s (2983.7 samples per second)
These are the results from running on an NC6 Azure VM with one Nvidia K80 GPU:
Finished Epoch [1]: [Training] loss = 2.061817 * 50000, metric = 75.5% * 50000 9.766s (5120.0 samples per second)
Finished Epoch [2]: [Training] loss = 1.679222 * 50000, metric = 61.5% * 50000 10.312s (4848.5 samples per second)
Finished Epoch [3]: [Training] loss = 1.524643 * 50000, metric = 55.6% * 50000 8.375s (5970.1 samples per second)
As you can see, the Azure VM is roughly 1.6x to 2x faster than my Surface Book in samples per second, so if you need to experiment and you don't have a machine with a powerful GPU, Azure could be an option. The K80 GPU also has a lot more onboard memory, so it can run models with higher memory requirements. The Azure VM can be started only when needed, to save cost.
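To make the comparison concrete, the throughput figures can be pulled straight out of the two logs above. This is a small sketch: the helper name `mean_throughput` and the regular expression are my own, and the log text is shortened to the part that matters.

```python
import re

# Throughput figures copied from the two logs above.
SURFACE_BOOK_LOG = """
Finished Epoch [1]: ... 16.486s (3032.8 samples per second)
Finished Epoch [2]: ... 16.717s (2990.9 samples per second)
Finished Epoch [3]: ... 16.758s (2983.7 samples per second)
"""

AZURE_K80_LOG = """
Finished Epoch [1]: ... 9.766s (5120.0 samples per second)
Finished Epoch [2]: ... 10.312s (4848.5 samples per second)
Finished Epoch [3]: ... 8.375s (5970.1 samples per second)
"""


def mean_throughput(log_text):
    """Average the 'samples per second' values found in a training log."""
    rates = [float(r) for r in
             re.findall(r"\(([\d.]+) samples per second\)", log_text)]
    return sum(rates) / len(rates)


ratio = mean_throughput(AZURE_K80_LOG) / mean_throughput(SURFACE_BOOK_LOG)
print(f"Azure K80 vs Surface Book: {ratio:.2f}x")
```

Averaged over the three epochs, the ratio comes out a little below 2x.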
On my Surface Book, I easily get memory errors like this:
RuntimeError: CUDA failure 2: out of memory ; GPU=0 ; hostname=OLAVT01 ; expr=cudaMalloc((void**) &deviceBufferPtr, sizeof(AllocatedElemType) * numElements)
This is because the Surface Book (1st generation) has only 1 GB of GPU memory.
Update: When I first ran the tests the code was running on CPU. The results above are all from using the GPU.
To check if you are running on the CPU or the GPU use the following code:
import cntk as C
if C.device.default().type() == 0:
    print('running on CPU')
else:
    print('running on GPU')
To ask CNTK to use the GPU use:
from cntk.device import set_default_device, gpu
set_default_device(gpu(0))

CNTK itself has minimal requirements. However, training some of the bigger, more demanding models can be slow, so having a GPU (or 8) can help.

Related

TF 2.11 CNN training with 20k Image and NVIDIA GeForce RTX 4090 GPU running too slow

I have a Linux x86_64 operating system and I am running TF 2.11 in a conda environment. I just got a workstation with an NVIDIA GeForce RTX 4090 24GB GPU. I'd like to perform CNN image classification; my dataset contains 20k images, 14k of which are for training, 3k for validation, and 3k for testing. The code also does hyperparameter tuning using the TensorBoard API, so I expect to run around 10k experiments. The number of epochs is 300. The batch size is one of 16, 32, or 64.
Previously, I was running a CNN on 2k images using the same logic and number of experiments, and it took about 2 weeks to finish everything. I expected it to run much faster now that I have upgraded from a GeForce RTX 2060 to a 4090, but that is not the case.
As the screenshots below show, it does run on the GPU; the problem is that it runs very slowly. Finishing the first epoch (1/300), which consists of 450 steps, takes 2 to 2.5 hours. Only then does it move on to 2/300. This is incredible: it means the whole process could take months.
I am confused by the GPU utilization, but I assume that using 0.9 percent makes sense. I checked all updates and the CUDA setup, and they seem correct.
What do you think the issue could be? 20k images is not a huge dataset for this GPU. I tried running it from the terminal and from a Jupyter notebook, with the same result. Could the tf.session command be causing issues? Do sessions need to be opened and closed in a specific way?
These are the parameters that need to be optimized:
EDIT: If I run it on the RTX 2060, it is far faster than on the Linux RTX 4090 workstation, and I have not figured out what the problem is. Finishing the first epoch (1/300) takes just 1.5 minutes there, versus about 1.5 hours on the Linux 4090 workstation!
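Converting those epoch times into a per-step cost makes the gap easier to reason about. The sketch below uses only numbers from the question (14k training images, batch sizes 16/32/64, about 450 steps per epoch, 1.5 minutes versus 1.5 hours per epoch). The helper names are hypothetical, and the arithmetic does not say what causes the slowdown.

```python
import math


def steps_per_epoch(num_images, batch_size):
    """Number of batches needed to see every image once."""
    return math.ceil(num_images / batch_size)


def seconds_per_step(epoch_seconds, steps):
    """Average wall-clock time spent on one training step."""
    return epoch_seconds / steps


# Steps per epoch for the three batch sizes mentioned in the question.
for batch_size in (16, 32, 64):
    print(batch_size, steps_per_epoch(14_000, batch_size))

# Per-step time on each machine, using the ~450 steps reported above.
rtx_2060 = seconds_per_step(1.5 * 60, 450)    # 1.5 minutes per epoch
rtx_4090 = seconds_per_step(1.5 * 3600, 450)  # 1.5 hours per epoch
print(f"2060: {rtx_2060:.1f} s/step, 4090: {rtx_4090:.1f} s/step")
```

With 14k images, a batch size of 32 gives 438 steps, which is in line with the roughly 450 steps reported.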
[Screenshot: GPU utilization before training]
[Screenshot: GPU utilization after starting training]
How I generate the data:
train_data = train_datagen.flow_from_directory(directory=train_path,
    target_size=(1000, 9), color_mode="grayscale", class_mode="categorical", shuffle=True)
valid_data = valid_datagen.flow_from_directory(directory=valid_path,
    target_size=(1000, 9), color_mode="grayscale", class_mode="categorical", shuffle=False)
test_data = test_datagen.flow_from_directory(directory=test_path,
    target_size=(1000, 9), color_mode="grayscale", class_mode="categorical", shuffle=False)

GPU Memory Spiking in Keras [duplicate]

I work in an environment in which computational resources are shared, i.e., we have a few server machines equipped with a few Nvidia Titan X GPUs each.
For small to moderate size models, the 12 GB of the Titan X is usually enough for 2–3 people to run training concurrently on the same GPU. If the models are small enough that a single model does not take full advantage of all the computational units of the GPU, this can actually result in a speedup compared with running one training process after the other. Even in cases where the concurrent access to the GPU does slow down the individual training time, it is still nice to have the flexibility of having multiple users simultaneously train on the GPU.
The problem with TensorFlow is that, by default, it allocates the full amount of available GPU memory when it is launched. Even for a small two-layer neural network, I see that all 12 GB of the GPU memory is used up.
Is there a way to make TensorFlow only allocate, say, 4 GB of GPU memory, if one knows that this is enough for a given model?
You can set the fraction of GPU memory to be allocated when you construct a tf.Session by passing a tf.GPUOptions as part of the optional config argument:
# Assume that you have 12GB of GPU memory and want to allocate ~4GB:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
The per_process_gpu_memory_fraction acts as a hard upper bound on the amount of GPU memory that will be used by the process on each GPU on the same machine. Currently, this fraction is applied uniformly to all of the GPUs on the same machine; there is no way to set this on a per-GPU basis.
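The 0.333 in the snippet above is simply the desired amount divided by the card's total memory. A tiny helper (hypothetical name `memory_fraction`, not part of TensorFlow) makes that explicit and guards against asking for more than the card has:

```python
def memory_fraction(desired_gb, total_gb):
    """Fraction of GPU memory to request, e.g. 4 GB out of 12 GB."""
    if not 0 < desired_gb <= total_gb:
        raise ValueError("desired_gb must be positive and at most total_gb")
    return desired_gb / total_gb


# 4 GB on a 12 GB Titan X, as in the question above.
fraction = memory_fraction(4, 12)
print(round(fraction, 3))
```

The result is the value you would pass as per_process_gpu_memory_fraction.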
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)
https://github.com/tensorflow/tensorflow/issues/1578
For TensorFlow 2.0 alpha (docs); later releases use tf.config.experimental.set_memory_growth instead:
import tensorflow as tf
tf.config.gpu.set_per_process_memory_growth(True)
For TensorFlow 2.2+ (docs):
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
The docs also list some more methods:
Set environment variable TF_FORCE_GPU_ALLOW_GROWTH to true.
Use tf.config.experimental.set_virtual_device_configuration to set a hard limit on a Virtual GPU device.
Here is an excerpt from the book Deep Learning with TensorFlow:
In some cases it is desirable for the process to only allocate a subset of the available memory, or to only grow the memory usage as it is needed by the process. TensorFlow provides two configuration options on the session to control this. The first is the allow_growth option, which attempts to allocate only as much GPU memory as runtime allocations require: it starts out allocating very little memory, and as sessions get run and more GPU memory is needed, we extend the GPU memory region needed by the TensorFlow process.
1) Allow growth: (more flexible)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, ...)
The second method is the per_process_gpu_memory_fraction option, which determines the fraction of the overall amount of memory that each visible GPU should be allocated. Note: memory is not released once allocated, since releasing it can lead to even worse memory fragmentation.
2) Allocate fixed memory:
To allocate only 40% of the total memory of each GPU:
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config, ...)
Note: This is only useful if you truly want to bound the amount of GPU memory available to the TensorFlow process.
For Tensorflow version 2.0 and 2.1 use the following snippet:
import tensorflow as tf
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(gpu_devices[0], True)
For prior versions, the following snippet used to work for me:
import tensorflow as tf
tf_config=tf.ConfigProto()
tf_config.gpu_options.allow_growth=True
sess = tf.Session(config=tf_config)
All the answers above assume execution with a sess.run() call, which is becoming the exception rather than the rule in recent versions of TensorFlow.
When using the tf.Estimator framework (TensorFlow 1.4 and above) the way to pass the fraction along to the implicitly created MonitoredTrainingSession is,
opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
conf = tf.ConfigProto(gpu_options=opts)
trainingConfig = tf.estimator.RunConfig(session_config=conf, ...)
tf.estimator.Estimator(model_fn=...,
                       config=trainingConfig)
Similarly in Eager mode (TensorFlow 1.5 and above),
import tensorflow as tf
import tensorflow.contrib.eager as tfe
opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
conf = tf.ConfigProto(gpu_options=opts)
tfe.enable_eager_execution(config=conf)
Edit: 11-04-2018
As an example, if you are to use tf.contrib.gan.train, then you can use something similar to the following:
tf.contrib.gan.gan_train(........, config=conf)
You can use
TF_FORCE_GPU_ALLOW_GROWTH=true
in your environment variables.
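If you would rather set the variable from inside a Python script than in the shell, a minimal sketch looks like this. The assumption here is that the variable has to be in place before TensorFlow initializes the GPU, so setting it before the import is the simplest way to be safe.

```python
import os

# Set the variable before TensorFlow is imported, so it is already in the
# process environment when the GPU allocator is created.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# Import TensorFlow only after the variable is set (left commented out so
# the sketch runs even where TensorFlow is not installed).
# import tensorflow as tf

print(os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])
```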
In tensorflow code:
bool GPUBFCAllocator::GetAllowGrowthValue(const GPUOptions& gpu_options) {
  const char* force_allow_growth_string =
      std::getenv("TF_FORCE_GPU_ALLOW_GROWTH");
  if (force_allow_growth_string == nullptr) {
    return gpu_options.allow_growth();
  }
  // ...
}
Tensorflow 2.0 Beta and (probably) beyond
The API changed again. It can be now found in:
tf.config.experimental.set_memory_growth(
device,
enable
)
Aliases:
tf.compat.v1.config.experimental.set_memory_growth
tf.compat.v2.config.experimental.set_memory_growth
References:
https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/config/experimental/set_memory_growth
https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth
See also:
Tensorflow - Use a GPU: https://www.tensorflow.org/guide/gpu
for Tensorflow 2.0 Alpha see: this answer
All the answers above refer to either setting the memory to a certain extent in TensorFlow 1.X versions or to allow memory growth in TensorFlow 2.X.
The method tf.config.experimental.set_memory_growth does allow dynamic growth during allocation/preprocessing. Nevertheless, one may want to allocate a specific upper limit of GPU memory from the start.
The logic behind allocating a specific amount of GPU memory is also to prevent OOM errors during training sessions. For example, if one trains while video-memory-consuming Chrome tabs or other GPU-heavy processes are open, tf.config.experimental.set_memory_growth(gpu, True) could result in OOM errors, hence the need in certain cases to allocate more memory from the start.
The recommended way to allot memory per GPU in TensorFlow 2.X is as follows:
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
Shameless plug: If you install the GPU-enabled TensorFlow, the session will first allocate all GPUs regardless of whether you set it to use only the CPU or the GPU. My tip is that even if you set the graph to use the CPU only, you should apply the same configuration (as answered above) to prevent unwanted GPU occupation.
In an interactive interface like IPython or Jupyter you should also set this configuration; otherwise it will allocate all the memory and leave almost none for others. This is sometimes hard to notice.
If you're using Tensorflow 2 try the following:
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.compat.v1.Session(config=config)
For TensorFlow 2.0 this solution worked for me (TF-GPU 2.0, Windows 10, GeForce RTX 2070):
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
tf.config.experimental.set_memory_growth(physical_devices[0], True)
# allocate 60% of GPU memory
from keras.backend.tensorflow_backend import set_session
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.6
set_session(tf.Session(config=config))
This code has worked for me:
import tensorflow as tf
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.compat.v1.InteractiveSession(config=config)
Well, I am new to TensorFlow. I have a GeForce 740M (or similar) GPU with 2 GB of RAM. I was running an MNIST-style handwritten character example for a native language, with 38700 training images and 4300 test images, and was trying to get precision, recall, and F1 using the following code, as sklearn was not giving me precise results. Once I added this to my existing code, I started getting GPU errors.
TP = tf.count_nonzero(predicted * actual)
TN = tf.count_nonzero((predicted - 1) * (actual - 1))
FP = tf.count_nonzero(predicted * (actual - 1))
FN = tf.count_nonzero((predicted - 1) * actual)
prec = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * prec * recall / (prec + recall)
Plus, my model was heavy, I guess: I was getting a memory error after 147 or 148 epochs. Then I thought, why not create functions for the tasks? I don't know if it works this way in TensorFlow, but I thought that a local variable might release its memory when it goes out of scope. So I defined the above elements for training and testing in modules, and I was able to reach 10000 epochs without any issues. I hope this helps.
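For reference, the same precision/recall/F1 arithmetic can be done on the CPU after predictions have been fetched, which keeps these extra ops out of the GPU graph. This is a plain-Python sketch for binary 0/1 labels. The function name `binary_metrics` and the sample data are made up for illustration, and unlike the snippet above it returns 0.0 instead of dividing by zero when a denominator is empty.

```python
def binary_metrics(predicted, actual):
    """Precision, recall and F1 for two equal-length lists of 0/1 labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)  # true positives
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)  # false positives
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels, only to exercise the function.
predicted = [1, 0, 1, 1, 0, 0]
actual = [1, 0, 0, 1, 1, 0]
precision, recall, f1 = binary_metrics(predicted, actual)
print(precision, recall, f1)
```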
I tried to train U-Net on the VOC dataset, but because of the huge image size the memory runs out. I tried all the above tips, even a batch size of 1, with no improvement. Sometimes the TensorFlow version also causes memory issues. Try using:
pip install tensorflow-gpu==1.8.0

xgboost treemethod gpu-hist outperformed by hist using rtx3060ti and amd ryzen 9 5950x

I'm doing some hyper-parameter tuning, so speed is key. I've got a nice workstation with both an AMD Ryzen 9 5950x and an NVIDIA RTX3060ti 8GB.
Setup:
xgboost 1.5.1 using PyPi in an anaconda environment.
NVIDIA graphics driver 471.68
CUDA 11.0
When training an XGBoost model using the scikit-learn API, I pass the tree_method = gpu_hist parameter, and I notice that it is consistently outperformed by the default tree_method = hist.
Somewhat surprisingly, this holds even when I open multiple consoles (I work in Spyder) and start an Optuna study in each of them, each using a different scikit-learn model, until my CPU usage is at 100%. When I then compare tree_method = gpu_hist with tree_method = hist, tree_method = hist is still faster!
How is this possible? Do I have my drivers configured incorrectly? Is my dataset too small to benefit from tree_method = gpu_hist (7000 samples, 50 features, on a 3-class classification problem)? Or is the RTX 3060 Ti simply outclassed by the AMD Ryzen 9 5950X? Or none of the above?
Any help is highly appreciated :)
Edit @Ferdy:
I carried out this little experiment:
import time
import numpy as np
from xgboost import XGBClassifier

def fit_10_times(tree_method, X_train, y_train):
    times = []
    for i in range(10):
        model = XGBClassifier(tree_method = tree_method)
        start = time.time()
        model.fit(X_train, y_train)
        times.append(time.time()-start)
    return times
cpu_times = fit_10_times('hist', X_train, y_train)
gpu_times = fit_10_times('gpu_hist', X_train, y_train)
print(X_train.describe())
print('mean cpu training times: ', np.mean(cpu_times), 'standard deviation :',np.std(cpu_times))
print('all training times :', cpu_times)
print('----------------------------------')
print('mean gpu training times: ', np.mean(gpu_times), 'standard deviation :',np.std(gpu_times))
print('all training times :', gpu_times)
Which yielded this output:
mean cpu training times: 0.5646213531494141 standard deviation : 0.010005875058323703
all training times : [0.5690040588378906, 0.5500047206878662, 0.5700047016143799, 0.563004732131958, 0.5570034980773926, 0.5486617088317871, 0.5630037784576416, 0.5680046081542969, 0.57651686668396, 0.5810048580169678]
----------------------------------
mean gpu training times: 2.0273998022079467 standard deviation : 0.05105794761358874
all training times : [2.0265607833862305, 2.0070691108703613, 1.9900789260864258, 1.9856727123260498, 1.9925382137298584, 2.0021069049835205, 2.1197071075439453, 2.1220884323120117, 2.0516715049743652, 1.9765043258666992]
The peak in CPU usage refers to the CPU training runs, and the peak in GPU usage the GPU training runs.
7000 samples is too few to fill the GPU pipeline, so your GPU is likely starving. We usually work with millions of samples when using GPU acceleration.

Difference in Performance between Cloud Compute VM and AI Platform

I have a GCP cloud compute VM, which is an n1-standard-16, with 4 P100 GPUs attached, and a solid state drive for storing data. I'll refer to this as "the VM".
I've previously used the VM to train a tensorflow based CNN. I want to move away from this to using AI Platform so I can run multiple jobs simultaneously. However I've run into some problems.
Problems
When the training is run on the VM I can set a batch size of 400, and the standard time for an epoch to complete is around 25 minutes.
When the training is running on a complex_model_m_p100 AI platform machine, which I believe to be equivalent to the VM, I can set a maximum batch size of 128, and the standard time for an epoch to complete is 1 hour 40 minutes.
Differences: the VM vs AI Platform
The VM uses TF1.12 and AI Platform uses TF1.15. Consequently there is a difference in GPU drivers (CUDA 9 vs CUDA 10).
The VM is equipped with a solid state drive, which I don't think is the case for AI platform machines.
I want to understand the cause of the reduced batch size, and to bring the epoch times on AI Platform down to levels comparable to the VM's. Has anyone else run into this issue? Am I running on the correct kind of AI Platform machine? Any advice would be welcome!
It could be a number of things. There are two ways to go about it: making the VM look more like AI Platform:
export IMAGE_FAMILY="tf-latest-gpu" # 1.15 instead of 1.12
export ZONE=...
export INSTANCE_NAME=...
gcloud compute instances create $INSTANCE_NAME \
    --zone=$ZONE \
    --image-family=$IMAGE_FAMILY \
    --image-project=deeplearning-platform-release \
    --maintenance-policy=TERMINATE \
    --metadata="install-nvidia-driver=True"
and then attach 4 GPUs after that.
... or making AI Platform look more like the VM:
https://cloud.google.com/ai-platform/training/docs/machine-types#gpus-and-tpus,
because you are using a Legacy Machine right now.
After following the advice of @Frederik Bode and creating a replica VM with TF 1.15 and the associated drivers installed, I've managed to solve my problem.
Rather than using the multi_gpu_model function call within tf.keras, it's actually best to use a distributed strategy and run the model within that scope.
There is a guide describing how to do it here.
Essentially now my code looks like this:
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    training_dataset, validation_dataset = get_datasets()
    model = setup_model()
    # Don't do this, it's not necessary!
    #### NOT NEEDED model = tf.keras.utils.multi_gpu_model(model, 4)
    opt = tf.keras.optimizers.Adam(learning_rate=args.learning_rate)
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=opt,
                  metrics=['accuracy'])
    steps_per_epoch = args.steps_per_epoch
    validation_steps = args.validation_steps
    model.fit(training_dataset, steps_per_epoch=steps_per_epoch, epochs=args.num_epochs,
              validation_data=validation_dataset, validation_steps=validation_steps)
I set up a small dataset so I could rapidly prototype this.
With a single P100 GPU the epoch time averaged 66 seconds.
With 4 GPUs, using the code above, the average epoch time was 19 seconds.
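Those two timings translate into a speedup and a per-GPU scaling efficiency as follows. The helper names are hypothetical, and the figures are the 66-second and 19-second epoch times quoted above.

```python
def speedup(single_gpu_seconds, multi_gpu_seconds):
    """How many times faster the multi-GPU run is."""
    return single_gpu_seconds / multi_gpu_seconds


def scaling_efficiency(single_gpu_seconds, multi_gpu_seconds, num_gpus):
    """Speedup divided by GPU count; 1.0 would be perfect linear scaling."""
    return speedup(single_gpu_seconds, multi_gpu_seconds) / num_gpus


s = speedup(66, 19)
e = scaling_efficiency(66, 19, 4)
print(f"speedup: {s:.2f}x, efficiency: {e:.0%}")
```

That is roughly a 3.5x speedup on 4 GPUs, so the mirrored strategy scales reasonably close to linearly on this small prototype dataset.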

How to merge two or more trained weights?

I implemented 5x5 Gomoku using a CNN + DQN.
Here is the github link:
https://github.com/bokidigital/CNN_DQN_5x5_Gomoku
My problem is that this code has no parallelization design.
This means that when running it on an Intel Skylake server (2 CPUs, 80 cores), the CPU usage is only about 90%.
I think the ideal CPU usage should be 8000% (80 cores).
Since I have some customized game rules (not only the neural network part, which uses about 75% of the GPU), they consume CPU time and are not parallelized.
My environment is:
Skylake CPU X 2
NVIDIA P100 X 2 ( only use 1 )
40GB RAM
Tensorflow 1.14.0
Keras
Python 3.7
Ubuntu 16.04
My idea is to run this program separately (run many copies of the process in different folders, each generating different weights), so that CPU usage could ideally reach 8000% (as long as enough processes run at the same time).
Since it is the training process, it doesn't matter how each process trained its weights.
Q1. How do I merge their results (the trained weights)? (A+B)/2?
Q2. It seems 1 GPU can only be used by 1 process. I tried to run 3 processes at the same time, and the GPU seemed to hang.
Q3. If I disable the GPU, will the 80-core Skylake be faster than the NVIDIA P100?
I want to use more of the CPU to speed up this training process.
The 5x5 agent took 5 days to train. I tested the same code with the grid size changed to 9x9 and estimate that training would need 3 months.
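For Q1, the (A+B)/2 idea amounts to an element-wise mean over corresponding weights. The sketch below only shows that arithmetic on plain nested lists; the function name `average_weights` and the toy values are hypothetical. Whether averaging independently trained networks yields a useful agent is not established here, so treat it as an experiment rather than a known-good method.

```python
def average_weights(weight_sets):
    """Element-wise mean of several models' weights.

    weight_sets is a list of models; each model is a list of layers, and
    each layer is a flat list of floats. All models must share one layout.
    """
    n = len(weight_sets)
    return [
        [sum(values) / n for values in zip(*layers)]
        for layers in zip(*weight_sets)
    ]


# Toy weights for two hypothetical copies of the same two-layer network.
model_a = [[1.0, 2.0], [0.5]]
model_b = [[3.0, 4.0], [1.5]]

merged = average_weights([model_a, model_b])
print(merged)
```

With Keras, the per-layer lists would come from model.get_weights() and go back in through model.set_weights(), using NumPy arrays instead of plain lists.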