TensorFlow MKL-DNN build in Ubuntu silently produces erroneous results - tensorflow

I have built TensorFlow 1.6.0-rc0 from source on Ubuntu 16.04 with MKL-DNN support, following this guide. The build proceeds without any problem. Testing it with Keras 2.1.3 on a simple convnet from this example "as is" is about two times slower than with the non-MKL build.
Now, tuning the MKL parameters as recommended in the guide gives almost a 2x speedup over the non-MKL build, but it produces complete nonsense in terms of accuracy (and loss).
This comes with no errors or warnings in the console. The MKL parameters were tuned as follows:
import os
from keras import backend as K
# Limit TensorFlow to a single inter-op thread, as recommended in the guide
K.set_session(K.tf.Session(config=K.tf.ConfigProto(inter_op_parallelism_threads=1)))
# OpenMP/MKL thread settings
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
The CPU is an i7-4790K.
For reference, the results of a run without tuning the MKL parameters are as expected.
Has anyone come across a similar issue? I just want to check with the community before filing an issue on GitHub.

You won't get such flat accuracy if inter_op_parallelism_threads is set to 2.
Given below is a modified version of the MKL tuning parameters that might speed up your code:
import os
import tensorflow as tf
from keras import backend as K
config = tf.ConfigProto(intra_op_parallelism_threads=<number of cores>,
                        inter_op_parallelism_threads=2,
                        allow_soft_placement=True,
                        device_count={'CPU': <number of cores>})
session = tf.Session(config=config)
K.set_session(session)
os.environ["OMP_NUM_THREADS"] = "<number of cores>"
os.environ["KMP_BLOCKTIME"] = "30"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
For example, for 8 cores:
import os
import tensorflow as tf
from keras import backend as K
config = tf.ConfigProto(intra_op_parallelism_threads=8,
                        inter_op_parallelism_threads=2,
                        allow_soft_placement=True,
                        device_count={'CPU': 8})
session = tf.Session(config=config)
K.set_session(session)
os.environ["OMP_NUM_THREADS"] = "8"
os.environ["KMP_BLOCKTIME"] = "30"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
Since these optimization parameters work differently for different code, you can also play around with the values above to find a better speedup.
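One caveat worth adding (my own note, not part of the original answer): the OMP_*/KMP_* variables are read by the OpenMP runtime when it initializes, so to be safe they should be set before TensorFlow/Keras is imported. A minimal sketch of that ordering, using 4 as a stand-in for the i7-4790K's physical core count:
import os
# Set the OpenMP/MKL environment variables before TensorFlow is imported,
# so the OpenMP runtime is guaranteed to pick them up
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["KMP_BLOCKTIME"] = "30"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
import tensorflow as tf
from keras import backend as K
config = tf.ConfigProto(intra_op_parallelism_threads=4,
                        inter_op_parallelism_threads=2,
                        allow_soft_placement=True,
                        device_count={'CPU': 4})
K.set_session(tf.Session(config=config))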

Related

Converting tf.keras model to TFLite: Model is slow and doesn't work with XNN Pack

Until recently I had been training a model using TF 1.15 based on MobileNetV2.
After training I had always been able to run these commands to generate a TFLite version:
tf.keras.backend.set_learning_phase(0)
converter = tf.lite.TFLiteConverter.from_keras_model_file(tf_keras_path)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.lite.constants.FLOAT16]
tflite_model = converter.convert()
The resulting model was fast enough for our needs and when our Android developer used XNN Pack, we got an extra 30% reduction in inference time.
More recently I've developed a replacement model using TF2.4.1, based on the built-in keras implementation of efficientnet-b2.
This new model has larger input image size ((260,260) vs (224,224)) and its keras inference time is about 1.5x that of the older model.
However, when I convert to TFLite using these commands:
converter = tf.lite.TFLiteConverter.from_keras_model(newest_v3)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()
there are a number of problems:
inference time is 5x slower than the old model
on Android our developer sees this error: "Failed to apply XNNPACK delegate: Attempting to use a delegate that only supports static-sized tensors with a graph that has dynamic-sized tensors."
When I run the conversion I can see a message that says: "function_optimizer: function_optimizer did nothing. time = 9.854ms."
I have also tried saving as SavedModel and converting the saved model.
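For reference, a sketch of what that SavedModel conversion looks like (not the exact code I ran; it assumes the model was exported to a directory called newest_v3/, as in the command below):
import tensorflow as tf
# Sketch only: assumes the Keras model was exported with newest_v3.save("newest_v3/")
converter = tf.lite.TFLiteConverter.from_saved_model("newest_v3/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()
with open("newest_v3.tflite", "wb") as f:
    f.write(tflite_model)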
Another attempt I made was to use the command line tool with these arguments (and as far as I recall, pretty much every permutation of arguments possible):
tflite_convert --saved_model_dir newest_v3/ \
--enable_v1_converter \
--experimental_new_converter True \
--input-shape=1,260,260,3 \
--input-array=input_1:0 \
--post_training_quantize \
--quantize_to_float16 \
--output_file newest_v3d.tflite \
--allow-custom-ops
If anyone can shed some light onto what's going on here I'd be very grateful.
TensorFlow Lite does currently support tensors with dynamic shapes (enabled by default, and explicitly by the "experimental_new_converter True" option in your conversion), but the issue below points out that XNNPack does not:
https://github.com/tensorflow/tensorflow/issues/42491
Since XNNPack cannot optimize the graph of the EfficientNet model, you are not getting that performance boost, which makes inference 5 times slower than before instead of only around 1.5 times.
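As a possible workaround (a sketch on my side, not something I have verified against your model; it assumes newest_v3 takes a single 260x260x3 float image), you can convert from a concrete function with a fully static input signature so that no dynamic-sized tensors reach XNNPack:
import tensorflow as tf
# Hypothetical: newest_v3 is your trained tf.keras EfficientNet-B2 model
run_model = tf.function(lambda x: newest_v3(x))
concrete_func = run_model.get_concrete_function(
    tf.TensorSpec([1, 260, 260, 3], newest_v3.inputs[0].dtype))
converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_func])
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()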
Personally, I would recommend moving to EfficientNet-Lite, as it is the mobile/TPU counterpart of EfficientNet and was designed with the restricted set of operations available in TensorFlow Lite in mind:
https://blog.tensorflow.org/2020/03/higher-accuracy-on-vision-models-with-efficientnet-lite.html

Run TensorFlow op in graph mode in tf 2.x

I would like to benchmark some TensorFlow operations (for example, against each other or against PyTorch). However, most of the time I will write something like:
import numpy as np
import tensorflow as tf
tf_device = '/GPU:0'
shape = [100000]  # example size for the benchmark
a = np.random.normal(scale=100, size=shape).astype(np.int64)
b = np.array(7).astype(np.int64)
with tf.device(tf_device):
    a_tf = tf.constant(a)
    b_tf = tf.constant(b)
    %timeit tf.math.floormod(a_tf, b_tf)
The problem with this approach is that it does the computation in eager mode (in particular, I think it has to perform GPU-to-CPU placement). Eventually, I want to use those ops in a tf.keras model and would therefore like to evaluate their performance in graph mode.
What is the preferred way to do it?
My google searches have led to nothing and I don't know how to use sessions like in tf 1.x.
What you are looking for is tf.function. Check this tutorial and the docs.
As the tutorial says, in TensorFlow 2, eager execution is turned on by default. The user interface is intuitive and flexible (running one-off operations is much easier and faster), but this can come at the expense of performance and deployability. To get performant and portable models, use tf.function to make graphs out of your programs.
Check this code:
import numpy as np
import tensorflow as tf
import timeit
tf_device = '/GPU:0'
shape = [100000]
a = np.random.normal(scale=100, size=shape).astype(np.int64)
b = np.array(7).astype(np.int64)
@tf.function
def experiment(a_tf, b_tf):
    # Return the result so the op is not pruned from the traced graph
    return tf.math.floormod(a_tf, b_tf)
with tf.device(tf_device):
    a_tf = tf.constant(a)
    b_tf = tf.constant(b)
    # warm up: the first call traces the function and builds the graph
    experiment(a_tf, b_tf)
    print("In graph mode:", timeit.timeit(lambda: experiment(a_tf, b_tf), number=10))
    print("In eager mode:", timeit.timeit(lambda: tf.math.floormod(a_tf, b_tf), number=10))

My Keras network with multi_gpu_model uses only 1 GPU

I have a problem when trying to use Keras with three GPUs.
My pseudocode is as follows:
import keras
import keras.models as M
from keras.utils import multi_gpu_model
i = M.Input(shape=(None, None, 6))
o1, o2, o3 = my_Network(i)
net = M.Model(inputs=i, outputs=[o1, o2, o3])
net = multi_gpu_model(net, gpus=3)
net.compile( ~~~~~ )
net.fit( ~~~~~ )
My code is training my network, however, only one GPU is utilised.
My configuration is as follows:
keras : 2.3.1
tensorflow : 2.1.0
Cuda : 10.0
windows : 10
GPU : Tesla 100 x 3 (VRAM : 32GB x 3 )
What is the mistake?
I solved my problem by using the code below:
import tensorflow as tf
# Distribute across the three GPUs
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1", "/gpu:2"])
with strategy.scope():
    epsnet = M.Model(inputs=[img_in, img_lv], outputs=[out_d, out_s, out_l])
    epsnet = multi_gpu_model(epsnet, gpus=3)
I hope this gives you some inspiration. Thank you to everyone who replied.
Something you must consider is the batch size passed to fit. You aren't showing it here, but you need to make sure you use a batch size that is divisible by 3 in order to parallelize across your 3 GPUs. If you use a batch size of 1, for example, the training cannot be distributed across the GPUs.
You did not provide much information, but based on your use of multi_gpu_model, I don't see anything clearly wrong.
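As a side note (my own suggestion, not from the answers above): with tensorflow 2.1, tf.distribute.MirroredStrategy by itself is the recommended replacement for the deprecated multi_gpu_model, as long as the model is built and compiled inside the strategy scope. A rough sketch with tf.keras, reusing the names from the question:
import tensorflow as tf
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1", "/gpu:2"])
with strategy.scope():
    # Build and compile inside the scope; no multi_gpu_model wrapper is needed
    i = tf.keras.Input(shape=(None, None, 6))
    o1, o2, o3 = my_Network(i)                    # my_Network as in the question
    net = tf.keras.Model(inputs=i, outputs=[o1, o2, o3])
    net.compile( ~~~~~ )                          # same losses/optimizer as before
# Pass fit a global batch size divisible by 3; MirroredStrategy splits it across the GPUs
net.fit( ~~~~~ )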

How to run TensorFlow for SSE4.1 SSE4.2 AVX AVX2 FMA on Python with Spyder on MacOs

I am trying to run the code:
from keras.datasets import imdb as im
from keras.preprocessing import sequence as seq
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers import LSTM
from keras.layers import Dense
train_set, test_set = im.load_data(num_words = 10000)
X_train, y_train = train_set
X_test, y_test = test_set
X_train_padded = seq.pad_sequences(X_train, maxlen = 100)
X_test_padded = seq.pad_sequences(X_test, maxlen = 100)
model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=128))
model.add(LSTM(units=128))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
scores = model.fit(X_train_padded,y_train)
When I run the code, it gives me a message:
I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 4. Tune using inter_op_parallelism_threads for best performance.
I don't understand what the issue is or what I am supposed to do next. I installed the tensorflow package (1.14.0), but that doesn't solve the issue.
I have looked at this reference but I don't know what I am looking for:
https://stackoverflow.com/questions/41293077/how-to-compile-tensorflow-with-sse4-2-and-avx-instructions
Can someone please help me? Thanks.
My config: osx-64, macOS Mojave 10.14.6, Python 3.7 with Spyder via Anaconda, conda version 4.7.12.
You can ignore the message and everything will work fine.
As far as I can gather from https://github.com/tensorflow/tensorflow/pull/24782/commits/7faefa4bb665e115cc744d7895a407338624993f, when TensorFlow is compiled with MKL-DNN support (which it is, according to your message), MKL-DNN will take care of using all available CPU performance features. So it doesn't matter that TensorFlow wasn't compiled to use them.
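If you just want to hide those informational lines, you can raise TensorFlow's C++ logging level before importing it (this only silences the logging, it does not change any behaviour):
import os
# 1 hides INFO, 2 also hides WARNING, 3 shows only errors
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"
import tensorflow as tf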
This might not answer the exact question you asked, but I had a very similar error message when running a similar task.
In addition to the error message above, I also had the following error message:
OMP: Error #15: Initializing libiomp5.dylib, but found libiomp5.dylib already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
The error was solved with:
conda install nomkl
This is as per this Stack Overflow post.

Keras / tensorflow - limit number of cores (intra_op_parallelism_threads not working)

I've been trying to run Keras on a CPU cluster, and for this I need to limit the number of cores used (it's a shared system). To do so, I landed on this answer. However, it simply doesn't work. I tried running this basic code:
from keras.applications.vgg16 import VGG16
from keras import backend as K
import numpy as np
conf = K.tf.ConfigProto(device_count={'CPU': 1},
                        intra_op_parallelism_threads=2,
                        inter_op_parallelism_threads=2)
K.set_session(K.tf.Session(config=conf))
model = VGG16(weights='imagenet', include_top=False)
x = np.random.randn(1000, 224, 224, 3)
features = model.predict(x)
When I run this and check htop, it uses all (128) logical cores. Is this a bug in Keras, or am I doing something wrong?
Keras also says that my CPU supports SSE4.1 and SSE4.2, which are not used because I didn't compile TensorFlow from source. Will compiling from source also fix the original question?
EDIT: I've found a workaround when launching the keras script from a unix machine:
taskset -c 0-23 python keras_script.py
This will run the script on the first 24 cores of the machine. It works, but it would still be nice if this was available from within keras/tensorflow.
I found this snippet of code that works for me, hope it helps:
from keras import backend as K
import tensorflow as tf
jobs = 2  # number of cores to use
config = tf.ConfigProto(intra_op_parallelism_threads=jobs,
                        inter_op_parallelism_threads=jobs,
                        allow_soft_placement=True,
                        device_count={'CPU': jobs})
session = tf.Session(config=config)
K.set_session(session)
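For completeness (my own addition, not part of the original answer): in TensorFlow 2.x, where ConfigProto and sessions are gone, the equivalent knobs live on tf.config.threading and must be set before any op is executed:
import tensorflow as tf
jobs = 2  # number of cores to use
tf.config.threading.set_intra_op_parallelism_threads(jobs)
tf.config.threading.set_inter_op_parallelism_threads(jobs)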