How to run TensorFlow with SSE4.1 SSE4.2 AVX AVX2 FMA in Python with Spyder on macOS - tensorflow

I am trying to run the code:
from keras.datasets import imdb as im
from keras.preprocessing import sequence as seq
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers import LSTM
from keras.layers import Dense
train_set, test_set = im.load_data(num_words = 10000)
X_train, y_train = train_set
X_test, y_test = test_set
X_train_padded = seq.pad_sequences(X_train, maxlen = 100)
X_test_padded = seq.pad_sequences(X_test, maxlen = 100)
model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=128))
model.add(LSTM(units=128))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
scores = model.fit(X_train_padded,y_train)
When I run the code, it gives me a message:
I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 4. Tune using inter_op_parallelism_threads for best performance.
I don't understand what the issue is or what I am supposed to do next. I installed the "tensorflow" package (1.14.0), but that doesn't solve the issue.
I have looked at this reference but I don't know what I am looking for:
https://stackoverflow.com/questions/41293077/how-to-compile-tensorflow-with-sse4-2-and-avx-instructions
Can someone please help me? Thanks.
My config: osx-64, macOS Mojave 10.14.6, Python 3.7 with Spyder via Anaconda, conda 4.7.12.

You can ignore the message and everything will work fine.
As far as I can gather from https://github.com/tensorflow/tensorflow/pull/24782/commits/7faefa4bb665e115cc744d7895a407338624993f, when TensorFlow is compiled with MKL-DNN support (which it is, according to your message), MKL-DNN will take care of using all available CPU performance features. So it doesn't matter that TensorFlow wasn't compiled to use them.
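If you just want to silence the informational banner, TensorFlow's standard TF_CPP_MIN_LOG_LEVEL environment variable does that; it must be set before the first TensorFlow import. A minimal sketch:
import os
# Filter TensorFlow's C++ log output before TensorFlow is imported:
# 0 = all messages, 1 = hide INFO, 2 = hide INFO and WARNING, 3 = hide all but FATAL.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'
import tensorflow as tf  # the cpu_feature_guard message is no longer printed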

This might not answer the exact question you asked, but I had a very similar error message when running a similar task.
In addition to the message above, I also got the following error:
OMP: Error #15: Initializing libiomp5.dylib, but found libiomp5.dylib already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
The error was solved with:
conda install nomkl
This is as per this Stack Overflow post.
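For completeness, the unsafe workaround named in the OMP error message itself can also be set from Python before TensorFlow loads. As the message warns, it may crash or silently produce incorrect results, so conda install nomkl is the safer fix:
import os
# Unsafe, unsupported workaround quoted in the OMP error above:
# tolerate duplicate OpenMP runtimes instead of removing one of them.
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'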

Related

'use_multiprocessing=True' in Mask RCNN with Keras 2.x & Tensorflow 2.x

I am using Keras 2.9.0 and Tensorflow 2.9.2 and already managed to make the necessary changes to compile the Mask-RCNN model (there are many compatibility issues, as it is a 2017 model).
The code is running on Colab with GPU.
I get now the following warning:
WARNING:tensorflow:Using a generator with `use_multiprocessing=True` and multiple workers may duplicate your data. Please consider using the `keras.utils.Sequence` class.
This comes from the following lines in the model.py file:
self.keras_model.fit(
    train_generator,
    initial_epoch=self.epoch,
    epochs=epochs,
    steps_per_epoch=self.config.STEPS_PER_EPOCH,
    callbacks=callbacks,
    validation_data=val_generator,
    validation_steps=self.config.VALIDATION_STEPS,
    max_queue_size=100,
    workers=workers,
    use_multiprocessing=True,
)
The problem is that the training is stuck at Epoch 1/XXX before even starting to really train.
I am pretty sure this is due to the multiprocessing warning (based on other threads here).
This thread refers here, but it uses a very different approach to generating the data than Mask RCNN, and therefore I'd like to avoid making such a big change (it could potentially create many other issues).
Moreover, if I set use_multiprocessing=False (default) there is the following error:
RuntimeError: Your generator is NOT thread-safe. Keras requires a thread-safe generator when use_multiprocessing=False, workers > 1
As far as I understand, the solutions suggested here are not directly applicable to the Mask-RCNN model.
Question: is there a way to resolve the issue with Mask-RCNN, preferably keeping the option to run with multiprocessing (to be faster)?
EDIT:
Even if I reduce the original number of workers (12) to 1 (as hinted in the warning message), the model is still stuck at the same stage.
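For reference, the generic workaround usually suggested for that RuntimeError is to make the generator thread-safe by guarding it with a lock, so that use_multiprocessing=False with workers > 1 becomes viable. A minimal sketch (hypothetical wrapper; not verified against Mask R-CNN's generators):
import threading

class ThreadSafeIterator:
    """Serializes access to a generator so multiple Keras workers can share it."""
    def __init__(self, generator):
        self.generator = generator
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self.lock:
            return next(self.generator)

# Hypothetical usage in model.py, before the fit() call shown above:
# train_generator = ThreadSafeIterator(train_generator)
# val_generator = ThreadSafeIterator(val_generator)
# ...then call fit() with use_multiprocessing=False.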

Python 3.8.8 Jupyter notebook kernel dies when I call model.fit() when I try to use my GPU

My TensorFlow installation recognizes my GPU.
However, when I call model.fit() on my data, it shows Epoch 1/2 and then the kernel dies immediately.
If I run this in a separate virtual environment with no GPU, it works fine.
I have simplified the model architecture and the number of training points to only ten as a quick test, and it still fails.
Simple example:
import keras
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(4, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
opt = keras.optimizers.Adam(learning_rate=.001)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
# X_train and y_train are the (simplified) training data described above
info = model.fit(X_train, y_train, epochs=2, batch_size=2, shuffle=True, verbose=1)
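For a self-contained repro, hypothetical stand-in data (the original X_train and y_train are not shown) could look like this:
import numpy as np
# Ten samples with four features each, plus binary labels, matching the
# "only ten training points" quick test described above.
X_train = np.random.rand(10, 4)
y_train = np.random.randint(0, 2, size=10)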
Versions:
Python 3.8.8
Num GPUs Available: 1
TensorFlow 2.5.0-dev20210227
Keras 2.4.3
CUDA v11.2
I am going to answer my own question rather than deleting this, because maybe someone else is making the same simple mistake I was.
The main mistake I made was downloading the wrong CUDA version. You can check which versions are compatible at this link:
https://www.tensorflow.org/install/source#gpu
TL;DR: just follow this video:
https://www.youtube.com/watch?v=hHWkvEcDBO0
This also highlighted the importance of a virtual environment where you control the package versions to prevent incompatibilities.
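A quick way to verify what the installed binary expects (standard TF 2.x APIs; the build-info keys below are present on GPU builds) is:
import tensorflow as tf
# Does TensorFlow see the GPU at all?
print(tf.config.list_physical_devices('GPU'))
# Which CUDA/cuDNN versions was this binary built against? Compare them
# with the table at https://www.tensorflow.org/install/source#gpu
build = tf.sysconfig.get_build_info()
print(build.get('cuda_version'), build.get('cudnn_version'))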
I had the same problem. I transferred the code into a Python file and found the root cause. In my case, the fix was copying the cuDNN DLL files into C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6\bin. Check the following link as well:
Could not load dynamic library 'cudnn64_8.dll'; dlerror: cudnn64_8.dll not found

NotImplementedError: Cannot convert a symbolic Tensor to a numpy array

The code below used to work last year, but updates in keras/tensorflow/numpy broke it. It now outputs the exception below. Does anyone know how to make it work again?
I'm using:
Tensorflow 2.4.1
Keras 2.4.3
Numpy 1.20.1
Python 3.9.1
import numpy as np
from keras.layers import LSTM, Embedding, Input, Bidirectional
dim = 30
max_seq_length = 40
vecs = np.random.rand(45,dim)
input_layer = Input(shape=(max_seq_length,))
embedding_layer = Embedding(len(vecs), dim, weights=[vecs], input_length=max_seq_length, trainable=False, name="layerA")(input_layer)
lstm_nobi = LSTM(max_seq_length, return_sequences=True, activation="linear", name="layerB")
lstm = Bidirectional(lstm_nobi, name="layerC")(embedding_layer)
Complete output of the script above: https://pastebin.com/DsQNWVwz
Shortened output:
2021-02-10 17:51:13.037468: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-10 17:51:13.037899: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-02-10 17:51:13.038418: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
Traceback (most recent call last):
File "/run/media/volker/DATA/configruns/load/./test.py", line 13, in <module>
lstm = Bidirectional(lstm_nobi, name="layerC")(embedding_layer)
... omitted, see pastebin ...
File "/usr/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 852, in __array__
raise NotImplementedError(
NotImplementedError: Cannot convert a symbolic Tensor (layerC/forward_layerB/strided_slice:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported
Installing numpy 1.19.5 works for me, even with Python 3.9:
pip install -U numpy==1.19.5
My environment is Windows, and since I do not have the Visual C++ compiler installed, I rely on a third-party wheel installation:
pip install -U https://mirrors.aliyun.com/pypi/packages/bc/40/d6f7ba9ce5406b578e538325828ea43849a3dfd8db63d1147a257d19c8d1/numpy-1.19.5-cp39-cp39-win_amd64.whl#sha256=0eef32ca3132a48e43f6a0f5a82cb508f22ce5a3d6f67a8329c81c8e226d3f6e
Solution: Use Python 3.8, because Python 3.9 is not yet supported by TensorFlow (as of TF 2.4).
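Either way, it is worth confirming the interpreter and package versions actually in use, for example:
import sys
import numpy
import tensorflow as tf
print(sys.version)        # 3.8.x if following this answer
print(numpy.__version__)  # 1.19.5 after the downgrade above
print(tf.__version__)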

Keras / tensorflow - limit number of cores (intra_op_parallelism_threads not working)

I've been trying to run Keras on a CPU cluster, and for this I need to limit the number of cores used (it's a shared system). Searching for a way to do that, I landed on this answer. However, it simply doesn't work. I tried running this basic code:
from keras.applications.vgg16 import VGG16
from keras import backend as K
import numpy as np
conf = K.tf.ConfigProto(device_count={'CPU': 1},
                        intra_op_parallelism_threads=2,
                        inter_op_parallelism_threads=2)
K.set_session(K.tf.Session(config=conf))
model = VGG16(weights='imagenet', include_top=False)
x = np.random.randn(1000, 224, 224, 3)
features = model.predict(x)
When I run this and check htop, it uses all (128) logical cores. Is this a bug in Keras, or am I doing something wrong?
Keras also says that my CPU supports SSE4.1 and SSE4.2, which are not used because I didn't compile from source. Will compiling from source also fix the original question?
EDIT: I've found a workaround when launching the keras script from a unix machine:
taskset -c 0-23 python keras_script.py
This will run the script on the first 24 cores of the machine. It works, but it would still be nice if this was available from within keras/tensorflow.
I found this snippet of code that works for me, hope it helps:
from keras import backend as K
import tensorflow as tf

jobs = 2  # it means number of cores
config = tf.ConfigProto(intra_op_parallelism_threads=jobs,
                        inter_op_parallelism_threads=jobs,
                        allow_soft_placement=True,
                        device_count={'CPU': jobs})
session = tf.Session(config=config)
K.set_session(session)
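Note that ConfigProto and Session are gone from the TF 2.x API; if you are on TensorFlow 2.x, the equivalent knobs (which must be set before any ops run) are:
import tensorflow as tf
# Limit TensorFlow's own thread pools; call before any operation executes.
tf.config.threading.set_intra_op_parallelism_threads(2)
tf.config.threading.set_inter_op_parallelism_threads(2)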

Tensorflow MKL-DNN build on Ubuntu silently produces erroneous results

I have built TensorFlow 1.6.0-rc0 from source on Ubuntu 16.04 with MKL-DNN support, following this guide. The build proceeds without any problem. Testing it with Keras 2.1.3 on a simple convnet from this example "as is" is two times slower than with the non-MKL build.
Now, tuning the MKL parameters as recommended in the guide leads to an almost 2x speedup over the non-MKL build, but produces complete nonsense in terms of accuracy (and loss), with no errors or warnings in the console. The MKL parameters were tuned as follows:
import os
from keras import backend as K

K.set_session(K.tf.Session(config=K.tf.ConfigProto(inter_op_parallelism_threads=1)))
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
The CPU is an i7-4790K.
For reference, results obtained from a run without tuning the MKL parameters are as expected.
Did anyone come across a similar issue? Just checking it against the community before filing an issue on GitHub.
You won't get such flat accuracy if the parameter inter_op_parallelism_threads is set to 2.
Given below is a modified version of the MKL tuning parameters that might speed up your code:
import os
from keras import backend as K
import tensorflow as tf

config = tf.ConfigProto(intra_op_parallelism_threads=<Number of Cores>,
                        inter_op_parallelism_threads=2,
                        allow_soft_placement=True,
                        device_count={'CPU': <Number of Cores>})
session = tf.Session(config=config)
K.set_session(session)
os.environ["OMP_NUM_THREADS"] = "<Number of Cores>"
os.environ["KMP_BLOCKTIME"] = "30"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
e.g.:
import os
from keras import backend as K
import tensorflow as tf

config = tf.ConfigProto(intra_op_parallelism_threads=8,
                        inter_op_parallelism_threads=2,
                        allow_soft_placement=True,
                        device_count={'CPU': 8})
session = tf.Session(config=config)
K.set_session(session)
os.environ["OMP_NUM_THREADS"] = "8"
os.environ["KMP_BLOCKTIME"] = "30"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
Since certain optimization parameters work differently for different code, you could also play around with the values above to find a better speedup.