My Keras network with multi_gpu_model uses only 1 GPU - tensorflow

I have a problem when trying to use Keras with three GPUs.
My pseudocode is as follows:
import keras
import keras.models as M
from keras.utils import multi_gpu_model

i = keras.Input(shape=(None, None, 6))
o1, o2, o3 = my_Network(i)
net = M.Model(inputs=i, outputs=[o1, o2, o3])
net = multi_gpu_model(net, gpus=3)
net.compile( ~~~~~ )
net.fit( ~~~~~ )
The code trains the network; however, only one GPU is utilised.
My configuration is as follows:
keras : 2.3.1
tensorflow : 2.1.0
Cuda : 10.0
windows : 10
GPU : Tesla 100 x 3 (VRAM: 32 GB x 3)
What is the mistake?

I solved my problem with the code below:
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1", "/gpu:2"])
with strategy.scope():
    epsnet = M.Model(inputs=[img_in, img_lv], outputs=[out_d, out_s, out_l])
    epsnet = multi_gpu_model(epsnet, gpus=3)
I hope this gives you some inspiration.
Thanks to everyone who replied.

One thing you must consider is the batch size you pass to fit. You don't show it here, but make sure the batch size is divisible by 3 so that the work can be parallelised across your 3 GPUs. If you give it a batch size of 1, for example, the training cannot be distributed across the GPUs.
You did not provide much information, but based on your call to multi_gpu_model, I don't see anything clearly wrong.
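For reference, here is a minimal sketch of using tf.distribute.MirroredStrategy on its own, the approach TensorFlow now recommends instead of multi_gpu_model; the optimizer, losses, batch size, and the x_train / y_train variables are placeholders, not details taken from the question:
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1", "/gpu:2"])
print("Replicas in sync:", strategy.num_replicas_in_sync)  # should report 3

with strategy.scope():
    # build and compile the model inside the strategy scope
    i = tf.keras.Input(shape=(None, None, 6))
    o1, o2, o3 = my_Network(i)                    # the asker's network function
    net = tf.keras.Model(inputs=i, outputs=[o1, o2, o3])
    net.compile(optimizer="adam", loss="mse")     # placeholder optimizer/losses

# keep the global batch size a multiple of the GPU count (3 here)
net.fit(x_train, y_train, batch_size=96, epochs=10)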

Related

Floating point operation results are different in Android, TensorFlow and PyTorch

I am trying to compare floating point results on Android, TensorFlow and PyTorch. What I observe is that I get the same result for TensorFlow and Android but a different one for PyTorch, as if Android and TensorFlow were rounding down. Please see the following code and results:
TensorFlow
import numpy as np
import tensorflow as tf

a = tf.convert_to_tensor(np.array([0.9764764, 0.79078835, 0.93181187]), dtype=tf.float32)
session = tf.Session()
result = session.run(a * a * a * a)
print(result)
PyTorch
import numpy as np
import torch as th

th.set_printoptions(precision=8)
a = th.from_numpy(np.array([0.9764764, 0.79078835, 0.93181187])).type(th.FloatTensor)
result = a * a * a * a
print(result)
Android:
for (index in 0 until a.size) {
    val res = a[index] * a[index] * a[index] * a[index]
    result.add(res)
}
print("r=$result")
The result is as follows:
Android: [0.9091739, 0.3910579, 0.7538986]
TensorFlow: [0.9091739, 0.3910579, 0.7538986]
PyTorch: [0.90917391, 0.39105791, 0.75389862]
You can see that the PyTorch values look different. I know the effect is minimal in this example, but during training, over 1000 rounds with different batches and epochs, the difference could accumulate and produce undesirable results. Can anyone point out how we can get the same numbers on all three platforms?
Thanks.
You are not using the same print precision, which is why you see different results. Internally, those results are identical; what you see is just an artifact of the default NumPy/TensorFlow print precision, which shows fewer digits than the precision=8 you configured for the PyTorch printout.
If we set the same print precision in NumPy as the one you set in PyTorch, we get:
import numpy as np
import tensorflow as tf

# set numpy's print precision to 8, like in your pytorch example
np.set_printoptions(precision=8, floatmode="fixed")

a = tf.convert_to_tensor(np.array([0.9764764, 0.79078835, 0.93181187]), dtype=tf.float32)
session = tf.Session()
result = session.run(a * a * a * a)
print(result)
Results in:
[0.90917391 0.39105791 0.75389862]
Exactly the same as in PyTorch.
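As an extra sanity check (my addition, not part of the original answer), the same float32 computation in plain NumPy should reproduce those values as well, since all three frameworks perform the identical IEEE 754 single-precision arithmetic:
import numpy as np

np.set_printoptions(precision=8, floatmode="fixed")
a = np.array([0.9764764, 0.79078835, 0.93181187], dtype=np.float32)
print(a * a * a * a)  # expected to match the PyTorch output above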

How to get reproducible results in Amazon SageMaker with TensorFlow Estimator?

I am currently using the AWS SageMaker Python SDK to train an EfficientNet model (https://github.com/qubvel/efficientnet) on my data. Specifically, I use the TensorFlow estimator as below. This code runs in a SageMaker notebook instance.
import sagemaker
from sagemaker.tensorflow.estimator import TensorFlow

### sagemaker version = 1.50.17, python version = 3.6

estimator = TensorFlow("train.py",
                       py_version="py3",
                       framework_version="2.1.0",
                       role=sagemaker.get_execution_role(),
                       train_instance_type="ml.m5.xlarge",
                       train_instance_count=1,
                       image_name='xxx.dkr.ecr.xxx.amazonaws.com/xxx',
                       hyperparameters={list of hyperparameters here: epochs, batch size},
                       subnets=[xxx],
                       security_group_ids=[xxx])

estimator.fit({
    'class_1': 's3_path_class_1',
    'class_2': 's3_path_class_2'
})
The code in train.py contains the usual training procedure: getting the images and labels from S3, transforming them into the right array shape for the EfficientNet input, and splitting them into train, validation, and test sets. To get reproducible results, I use the following reset_random_seeds function (If Keras results are not reproducible, what's the best practice for comparing models and choosing hyper parameters?) before building the EfficientNet model itself.
### code of train.py
import os
os.environ['PYTHONHASHSEED'] = str(1)

import numpy as np
import tensorflow as tf
import efficientnet.tfkeras as efn
import random

### tensorflow version = 2.1.0
### tf.keras version = 2.2.4-tf
### efficientnet version = 1.1.0

def reset_random_seeds():
    os.environ['PYTHONHASHSEED'] = str(1)
    tf.random.set_seed(1)
    np.random.seed(1)
    random.seed(1)

if __name__ == "__main__":
    ### code for getting training data
    ### ... (I have made sure that the training input is the same every time i re-run the code)
    ### end of code

    reset_random_seeds()
    model = efn.EfficientNetB5(include_top=False,
                               weights='imagenet',
                               input_shape=(80, 80, 3),
                               pooling='avg',
                               classes=3)
    model.compile(optimizer='Adam', loss='categorical_crossentropy')
    model.fit(X_train, Y_train, batch_size=64, epochs=30, shuffle=True, verbose=2)

    ### Prediction section here
However, each time I run the notebook instance, I get a different result from the previous run. When I switch train_instance_type to "local", I always get the same result on every run. Is the non-reproducibility therefore caused by the training instance type I chose? This instance (ml.m5.xlarge) has 4 vCPUs, 16 GiB of memory, and no GPU. If so, how can I obtain reproducible results on this training instance?
Is it possible that your inconsistent results are coming from tf.random.set_seed()?
I came across a related post here: Tensorflow: Different results with the same random seed
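Beyond seeding, another thing worth trying (my suggestion, not verified on SageMaker) is to pin TensorFlow to single-threaded, deterministic execution at the top of train.py, since multi-threaded floating point reductions on CPU can vary between runs:
import os
os.environ['TF_DETERMINISTIC_OPS'] = '1'  # deterministic kernels where available (TF 2.1+, mainly affects GPU)

import tensorflow as tf

# force single-threaded execution so reductions happen in a fixed order;
# must be called before any TensorFlow ops run
tf.config.threading.set_intra_op_parallelism_threads(1)
tf.config.threading.set_inter_op_parallelism_threads(1)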

Run TensorFlow op in graph mode in tf 2.x

I would like to benchmark some TensorFlow operations (for example against each other or against PyTorch). However, most of the time I will write something like:
import numpy as np
import tensorflow as tf

tf_device = '/GPU:0'
shape = [100000]  # example size

a = np.random.normal(scale=100, size=shape).astype(np.int64)
b = np.array(7).astype(np.int64)
with tf.device(tf_device):
    a_tf = tf.constant(a)
    b_tf = tf.constant(b)
    %timeit tf.math.floormod(a_tf, b_tf)
The problem with this approach is that it does the computation in eager mode (I think in particular that it has to perform GPU-to-CPU placement). Eventually, I want to use those ops in a tf.keras model and therefore would like to evaluate their performance in graph mode.
What is the preferred way to do it?
My Google searches have led to nothing, and I don't know how to use sessions like in TF 1.x.
What you are looking for is tf.function. Check this tutorial and the docs.
As the tutorial says, in TensorFlow 2, eager execution is turned on by default. The user interface is intuitive and flexible (running one-off operations is much easier and faster), but this can come at the expense of performance and deployability. To get performant and portable models, use tf.function to make graphs out of your programs.
Check this code:
import numpy as np
import tensorflow as tf
import timeit

tf_device = '/GPU:0'
shape = [100000]

a = np.random.normal(scale=100, size=shape).astype(np.int64)
b = np.array(7).astype(np.int64)

@tf.function
def experiment(a_tf, b_tf):
    # return the result so the op is not pruned from the traced graph
    return tf.math.floormod(a_tf, b_tf)

with tf.device(tf_device):
    a_tf = tf.constant(a)
    b_tf = tf.constant(b)

    # warm up (the first call traces and builds the graph)
    experiment(a_tf, b_tf)

    print("In graph mode:", timeit.timeit(lambda: experiment(a_tf, b_tf), number=10))
    print("In eager mode:", timeit.timeit(lambda: tf.math.floormod(a_tf, b_tf), number=10))

Keras / tensorflow - limit number of cores (intra_op_parallelism_threads not working)

I've been trying to run Keras on a CPU cluster, and for this I need to limit the number of cores used (it's a shared system). To do so, I landed on this answer. However, it simply doesn't work. I tried running this basic code:
from keras.applications.vgg16 import VGG16
from keras import backend as K
import numpy as np

conf = K.tf.ConfigProto(device_count={'CPU': 1},
                        intra_op_parallelism_threads=2,
                        inter_op_parallelism_threads=2)
K.set_session(K.tf.Session(config=conf))

model = VGG16(weights='imagenet', include_top=False)
x = np.random.randn(1000, 224, 224, 3)
features = model.predict(x)
When I run this and check htop, it uses all (128) logical cores. Is this a bug in Keras, or am I doing something wrong?
TensorFlow also warns that my CPU supports SSE4.1 and SSE4.2, which are not used because I didn't compile from source. Would compiling from source also fix the original question?
EDIT: I've found a workaround when launching the Keras script from a Unix machine:
taskset -c 0-23 python keras_script.py
This runs the script on the first 24 cores of the machine. It works, but it would still be nice if this were available from within Keras/TensorFlow.
I found this snippet of code that works for me, hope it helps:
from keras import backend as K
import tensorflow as tf

jobs = 2  # number of cores to use

config = tf.ConfigProto(intra_op_parallelism_threads=jobs,
                        inter_op_parallelism_threads=jobs,
                        allow_soft_placement=True,
                        device_count={'CPU': jobs})
session = tf.Session(config=config)
K.set_session(session)
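If you are on TensorFlow 2.x rather than the 1.x ConfigProto API used above, the equivalent knobs live under tf.config.threading; a rough sketch, assuming they are set before any ops have run:
import tensorflow as tf

jobs = 2  # number of threads to allow

# must be called before TensorFlow executes any operations
tf.config.threading.set_intra_op_parallelism_threads(jobs)
tf.config.threading.set_inter_op_parallelism_threads(jobs)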

Tensorflow MKL-DNN build in Ubuntu silently produces erroneous results

I have built TensorFlow 1.6.0-rc0 from source on Ubuntu 16.04 with MKL-DNN support, following this guide. The build proceeds without any problem. Testing it with Keras 2.1.3 on a simple convnet from this example "as is" is two times slower than with the non-MKL build.
Now, tuning the MKL parameters as recommended in the guide gives almost a 2x speedup over the non-MKL build, but it produces complete nonsense in terms of accuracy and loss (training curves omitted here).
This comes with no errors or warnings in the console. The MKL parameters were tuned as follows:
import os
from keras import backend as K

K.set_session(K.tf.Session(config=K.tf.ConfigProto(inter_op_parallelism_threads=1)))
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
The CPU is an i7-4790K.
For reference, the results from a run without tuning the MKL parameters are as expected (plot omitted here).
Has anyone come across a similar issue? I just want to check with the community before filing an issue on GitHub.
You shouldn't get such a flat accuracy curve if the parameter inter_op_parallelism_threads is set to 2.
Below is a modified version of the MKL tuning parameters that might speed up your code:
import os
from keras import backend as K
import tensorflow as tf

config = tf.ConfigProto(intra_op_parallelism_threads=<Number.of Cores>,
                        inter_op_parallelism_threads=2,
                        allow_soft_placement=True,
                        device_count={'CPU': <Number.of Cores>})
session = tf.Session(config=config)
K.set_session(session)

os.environ["OMP_NUM_THREADS"] = "<Number.of Cores>"
os.environ["KMP_BLOCKTIME"] = "30"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
For example:
import os
from keras import backend as K
import tensorflow as tf

config = tf.ConfigProto(intra_op_parallelism_threads=8,
                        inter_op_parallelism_threads=2,
                        allow_soft_placement=True,
                        device_count={'CPU': 8})
session = tf.Session(config=config)
K.set_session(session)

os.environ["OMP_NUM_THREADS"] = "8"
os.environ["KMP_BLOCKTIME"] = "30"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
Since these optimization parameters work differently for different code, you could also play around with the values above to find a better speed-up.
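One ordering detail worth keeping in mind (my assumption about how the OpenMP/MKL runtime picks up its environment, not something stated above): the KMP_*/OMP_* variables are read when the threading runtime initialises, so setting them at the very top of the script, before importing TensorFlow or Keras, is the safest arrangement:
import os

# set the MKL/OpenMP environment before TensorFlow or Keras is imported
os.environ["OMP_NUM_THREADS"] = "8"
os.environ["KMP_BLOCKTIME"] = "30"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto(intra_op_parallelism_threads=8, inter_op_parallelism_threads=2)
K.set_session(tf.Session(config=config))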