I'm trying to run the TensorFlow Hub version of ALBERT on multiple GPUs on the same machine. The model works perfectly on a single GPU.
This is the structure of my code:
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))  # it prints 2, which is correct

if __name__ == "__main__":
    with strategy.scope():
        run()
In the run() function, I read the data, build the model, and fit it.
I'm getting this error:
Traceback (most recent call last):
File "Albert.py", line 130, in <module>
run()
File "Albert.py", line 88, in run
model = build_model(bert_max_seq_length)
File "Albert.py", line 55, in build_model
model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
File "/home/****/py_transformers/lib/python3.5/site-packages/tensorflow_core/python/training/tracking/base.py", line 457, in _method_wrapper
result = method(self, *args, **kwargs)
File "/home/bighanem/py_transformers/lib/python3.5/site-packages/tensorflow_core/python/keras/engine/training.py", line 471, in compile
' model.compile(...)'% (v, strategy))
ValueError: Variable (<tf.Variable 'bert/embeddings/word_embeddings:0' shape=(30000, 128) dtype=float32>) was not created in the distribution strategy scope of (<tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7f62e399df60>). It is most likely due to not all layers or the model or optimizer being created outside the distribution strategy scope. Try to make sure your code looks similar to the following.
with strategy.scope():
model=_create_model()
model.compile(...)
Is it possible that this error occurs because the ALBERT model was already prepared (built and compiled) beforehand by the TensorFlow team?
Edited:
To be precise, the TensorFlow version is 2.1.
Also, this is how I load the pretrained ALBERT model:
features = {"input_ids": in_id, "input_mask": in_mask, "segment_ids": in_segment, }
albert = hub.KerasLayer(
    "https://tfhub.dev/google/albert_xxlarge/3",
    trainable=False, signature="tokens", output_key="pooled_output",
)
x = albert(features)
Following this tutorial: SavedModels from TF Hub in TensorFlow 2
Two-part answer:
1) TF Hub hosts two versions of ALBERT (each in several sizes):
https://tfhub.dev/google/albert_base/3 etc. from the Google research team that originally developed ALBERT comes in the hub.Module format for TF1. This will likely not work with a TF2 distribution strategy.
https://tfhub.dev/tensorflow/albert_en_base/1 etc. from the TensorFlow Model Garden comes in the revised TF2 SavedModel format. Please try this one for use in TF2 with a distribution strategy.
2) That said, the immediate problem appears to be what is explained in the error message (abridged):
Variable 'bert/embeddings/word_embeddings' was not created in the distribution strategy scope ... Try to make sure your code looks similar to the following.
with strategy.scope():
model = _create_model()
model.compile(...)
For a SavedModel (from TF Hub or otherwise), it's the loading that needs to happen under the distribution strategy scope, because that's what's re-creating the tf.Variable objects in the current program. Specifically, any of the following ways to load a TF2 SavedModel from TF Hub have to occur under the distribution strategy scope for distribution to work:
tf.saved_model.load();
hub.load(), which just calls tf.saved_model.load() (after downloading if necessary);
hub.KerasLayer when used with a string-valued model handle, on which it then calls hub.load().
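For example, a minimal sketch of point 2, assuming the TF2 SavedModel from point 1 is used (the exact input/output convention of albert_en_base can differ between model versions, and the sequence length of 128 and the final Dense layer are just placeholders, so check the model's documentation on tfhub.dev):
import tensorflow as tf
import tensorflow_hub as hub

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # hub.KerasLayer with a string handle calls hub.load() here,
    # so the SavedModel's variables are created inside the strategy scope.
    albert = hub.KerasLayer("https://tfhub.dev/tensorflow/albert_en_base/1",
                            trainable=False)

    input_word_ids = tf.keras.Input(shape=(128,), dtype=tf.int32, name="input_word_ids")
    input_mask = tf.keras.Input(shape=(128,), dtype=tf.int32, name="input_mask")
    segment_ids = tf.keras.Input(shape=(128,), dtype=tf.int32, name="segment_ids")

    pooled_output, sequence_output = albert([input_word_ids, input_mask, segment_ids])
    output = tf.keras.layers.Dense(2, activation="softmax")(pooled_output)

    model = tf.keras.Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=output)
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])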
Related
I am trying to use a custom optimiser to train a NN in Keras. The original algorithm has been developed to train CNNs and is based on using different adaptive learning rates for every weight of the network. The algorithm is called WAME (weight-wise adaptive learning rates with moving average estimator).
The optimiser has been developed by a former University of London student and can be found in this GitHub repo (lines 54 to 153).
As you can see in the code, it is built as a subclass of the optimizer_v2 superclass available in TensorFlow. This class is called WAMEprop.
What I am trying to do is simply:
pasting the code that defines the class into my Google Colab notebook
using the new optimizer as follows:
#building the model
wame_model = models.Sequential()
wame_model.add(layers.Dense(32, activation='relu', input_shape=(11,)))
wame_model.add(layers.Dense(32, activation='relu'))
wame_model.add(layers.Dense(1))
#compiling the model using WAMEprop as optimizer
wame_model.compile(optimizer=WAMEprop(name='wame'), loss='mse', metrics=['mae'])
#fitting the model
history = wame_model.fit(train_features, train_targets,
                         validation_data=(test_features, test_targets),
                         epochs=50, batch_size=1, verbose=1)
Now, I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-55-a13c54fc61f2> in <module>()
5 wame_model.add(layers.Dense(32, activation='relu'))
6 wame_model.add(layers.Dense(1))
----> 7 wame_model.compile(optimizer=WAMEprop(name='wame'), loss='mse', metrics=['mae'])
8
9 history = wame_model.fit(white_train_features_df_s.to_numpy(), white_train_targets.to_numpy(),
5 frames
/usr/local/lib/python3.7/dist-packages/keras/optimizers.py in get(identifier)
131 else:
132 raise ValueError(
--> 133 'Could not interpret optimizer identifier: {}'.format(identifier))
ValueError: Could not interpret optimizer identifier: <__main__.WAMEprop object at 0x7fd4bc1d01d0>
Since I am new to Keras and, unfortunately, don't know TensorFlow, I am not sure what exactly it is unable to find.
Did I get some import wrong?
Did I use the keras compile() method in the wrong way?
Also, if I don't pass the parameter name='wame' to the WAMEprop() call, I get an error message saying that the positional argument 'name' is required. Strangely enough, there is no parameter 'name' in the class constructor. Does this depend on the interaction with the superclass?
Thank you very much in advance for any help you can offer!
Cheers!
UPDATE:
The error message refers to a get() method, which takes identifier as input, in the optimizers.py file that was installed with TensorFlow. This function expects a string (I guess for the readily available optimizers), a configuration dictionary, an optimizer_v2.OptimizerV2 object, or a tf.compat.v1.train.Optimizer object.
I think the object I am passing as an optimizer is none of these.
If I run:
my_optimizer = WAMEprop(name='wame')
print(type(my_optimizer))
I get <class '__main__.WAMEprop'>.
So, I suspect the object I am dealing with is something different from what Keras is expecting.
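A quick way to check this (my own addition; it assumes the pasted WAMEprop class is already defined in the notebook) is to test whether the object is an instance of the optimizer base class that the installed Keras actually checks against:
import tensorflow as tf

my_optimizer = WAMEprop(name='wame')

# If this prints False, WAMEprop was built against a different optimizer base
# class than the one optimizers.get() checks for, which would explain the
# "Could not interpret optimizer identifier" error.
print(isinstance(my_optimizer, tf.keras.optimizers.Optimizer))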
UPDATE 2: it runs on my laptop, where I have TensorFlow installed within an Anaconda environment. Now I am convinced there is some installation or import problem in Google Colab.
Problem solved. I needed to uninstall tensorflow 2.6.0 from Google Colab and install tensorflow 2.0.0 instead.
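For completeness, the downgrade described above looks roughly like this in a Colab cell (the runtime has to be restarted afterwards so the new version is picked up):
!pip uninstall -y tensorflow
!pip install tensorflow==2.0.0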
I am running into an error after upgrading to TFF 0.17.0. The same code works perfectly in TFF 0.16.1. The training works just fine in both versions; however, when I try to copy the weights from the FL state to my model in order to evaluate it on the test dataset, I get the following error:
File "fl/main_fl.py", line 166, in keras_evaluate
loss, accuracy = self.model.evaluate(test_dataset, verbose=0)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_v1.py", line 905, in evaluate
self._assert_built_as_v1()
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer_v1.py", line 852, i$ _assert_built_as_v1
(type(self),))
ValueError: Your Layer or Model is in an invalid state. This can happen for the following cases:
1. You might be interleaving estimator/non-estimator models or interleaving models/layers made in tf.compat.v1.Graph.as_default() with model$/layers created outside of it. Converting a model to an estimator (via model_to_estimator) invalidates all models/layers made before the conv$rsion (even if they were not the model converted to an estimator). Similarly, making a layer or a model inside a a tf.compat.v1.Graph invalid$tes all layers/models you previously made outside of the graph.
2. You might be using a custom keras layer implementation with custom __init__ which didn't call super().__init__. Please check the impleme$tation of <class 'tensorflow.python.keras.engine.functional.Functional'> and its bases.
Below is my keras_evaluate method:
def keras_evaluate(self, test_dataset, mode='test', step=0):
    self.state.model.assign_weights_to(self.model)
    loss, accuracy = self.model.evaluate(test_dataset, verbose=0)
    print('Mode={}, Loss={}, Accuracy={}'.format(mode, loss, accuracy))
self.state is the state produced by the iterative process returned by tff.learning.build_federated_averaging_process (a tff.templates.IterativeProcess), test_dataset is a tf.data.Dataset, and self.model is a tf.keras.Model, i.e. a Keras functional model. I have one custom layer, but it does call super().__init__(), so point 2 in the error is misleading me.
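To make the setup concrete, the pieces fit together roughly as in the sketch below (my own reconstruction, not the actual code: create_keras_model stands in for however the functional model is built, and federated_train_data, the loss and the metrics are placeholders):
import tensorflow as tf
import tensorflow_federated as tff

def model_fn():
    # A fresh Keras model is built and wrapped for TFF on every call.
    keras_model = create_keras_model()  # placeholder for the functional model
    return tff.learning.from_keras_model(
        keras_model,
        input_spec=federated_train_data[0].element_spec,
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

iterative_process = tff.learning.build_federated_averaging_process(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.02))

state = iterative_process.initialize()  # this is what self.state holds
# training rounds: state, metrics = iterative_process.next(state, federated_train_data)
# afterwards keras_evaluate() copies the trained weights into self.model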
Any help will be appreciated.
I'm using the Mask_RCNN package from this repo: https://github.com/matterport/Mask_RCNN.
I tried to train it on my own dataset using this package, but it gives me an error right at the beginning.
2020-11-30 12:13:16.577252: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-11-30 12:13:16.587017: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-11-30 12:13:16.587075: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (7612ade969e5): /proc/driver/nvidia/version does not exist
2020-11-30 12:13:16.587479: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-11-30 12:13:16.593569: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2300000000 Hz
2020-11-30 12:13:16.593811: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1b2aa00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-11-30 12:13:16.593846: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Traceback (most recent call last):
File "machines.py", line 345, in <module>
model_dir=args.logs)
File "/content/Mask_RCNN/mrcnn/model.py", line 1837, in __init__
self.keras_model = self.build(mode=mode, config=config)
File "/content/Mask_RCNN/mrcnn/model.py", line 1934, in build
anchors = KL.Lambda(lambda x: tf.Variable(anchors), name="anchors")(input_image)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 926, in __call__
input_list)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 1117, in _functional_construction_call
outputs = call_fn(cast_inputs, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/core.py", line 904, in call
self._check_variables(created_variables, tape.watched_variables())
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/core.py", line 931, in _check_variables
raise ValueError(error_str)
ValueError:
The following Variables were created within a Lambda layer (anchors)
but are not tracked by said layer:
<tf.Variable 'anchors/Variable:0' shape=(1, 261888, 4) dtype=float32>
The layer cannot safely ensure proper Variable reuse across multiple
calls, and consquently this behavior is disallowed for safety. Lambda
layers are not well suited to stateful computation; instead, writing a
subclassed Layer is the recommend way to define layers with
Variables.
I looked up the part of the code responsible for the problem (file mrcnn/model.py, line 1935 in the repo):
anchors = KL.Lambda(lambda x: tf.Variable(anchors), name="anchors")(input_image)
If anyone has an idea how to solve it or has already solved it, please share the solution.
Go to mrcnn/model.py and add:
class AnchorsLayer(KL.Layer):
    def __init__(self, anchors, name="anchors", **kwargs):
        super(AnchorsLayer, self).__init__(name=name, **kwargs)
        self.anchors = tf.Variable(anchors)

    def call(self, dummy):
        return self.anchors

    def get_config(self):
        config = super(AnchorsLayer, self).get_config()
        return config
Then find the line:
anchors = KL.Lambda(lambda x: tf.Variable(anchors), name="anchors")(input_image)
and replace it with:
anchors = AnchorsLayer(anchors, name="anchors")(input_image)
Works like a charm in TF 2.4!
ROOT CAUSE:
The behavior of Keras' Lambda layer changed between TensorFlow 1.X and TensorFlow 2.X.
In Keras under TensorFlow 1.X, all tf.Variable and tf.get_variable calls were automatically tracked into layer.weights via a variable-creator context, so they received gradients and became trainable automatically. That approach caused problems with the autograph compilation that converts Python code into an execution graph in TensorFlow 2.X, so it was removed, and the Lambda layer now checks for variable creation and raises the error you see. In short, a Lambda layer in TensorFlow 2.X has to be stateless. If you want to create a variable, the correct way in TensorFlow 2.X is to subclass the Layer class and add the trainable weight as a class member.
SOLUTIONS:
There are 2 choices -
Switch to TensorFlow 1.X; the error will not be raised.
Replace the Lambda layer with a subclass of the Keras Layer class:
class AnchorsLayer(tensorflow.keras.layers.Layer):
    def __init__(self, anchors):
        super(AnchorsLayer, self).__init__()
        self.anchors_v = tf.Variable(anchors)

    def call(self, dummy):
        # The input is ignored; the layer only exists so Keras tracks the variable.
        return self.anchors_v

# Then replace the Lambda call with this:
anchors = AnchorsLayer(anchors)(input_image)
I will try to describe the situation completely, but due to my limited language ability there may be some unclear statements. Please let me know and I will try to explain what I mean.
Recently, I wanted to apply facenet (I mean davisking's project on GitHub) to my project. Therefore, I wrote a class:
class FacenetEmbedding:
    def __init__(self, model_path):
        self.sess = tf.InteractiveSession()
        self.sess.run(tf.global_variables_initializer())

        # Load the model
        facenet.load_model(model_path)

        # Get input and output tensors
        self.images_placeholder = tf.get_default_graph().get_tensor_by_name("input:0")
        self.tf_embeddings = tf.get_default_graph().get_tensor_by_name("embeddings:0")
        self.phase_train_placeholder = tf.get_default_graph().get_tensor_by_name("phase_train:0")

    def get_embedding(self, images):
        feed_dict = {self.images_placeholder: images, self.phase_train_placeholder: False}
        embedding = self.sess.run(self.tf_embeddings, feed_dict=feed_dict)
        return embedding

    def free(self):
        self.sess.close()
I can use this class independently in Flask:
model_path = "models/20191025-223514/"
fe = FacenetEmbedding(model_path)
But later I had different requirements. I trained two more models using Keras, and I want to use them (.h5 models) together with the facenet model above for prediction. I load them first:
modelPic = load_model('models/pp.h5')
lePic = pickle.loads(open('models/pp.pickle', "rb").read())
print(modelPic.predict(np.zeros((1, 128, 128, 3))))
modelM = load_model('models/pv.h5')
leM = pickle.loads(open('models/pv.pickle', "rb").read())
print(modelM.predict(np.zeros((1, 128, 128, 3))))
I run a fake image through the models to test them, and they seem to work normally. But when I run the Flask server and try to post an image to this API, the following messages pop up and the prediction doesn't work:
Tensor input_1_3:0, specified in either feed_devices or fetch_devices was not found in the Graph
Exception ignored in: <bound method BaseSession._Callable.__del__ of <tensorflow.python.client.session.BaseSession._Callable object at 0x7ff27d0f0dd8>>
Traceback (most recent call last):
File "/home/idgate/.virtualenvs/Line_POC/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1455, in __del__
self._session._session, self._handle, status)
File "/home/idgate/.virtualenvs/Line_POC/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: No such callable handle: 140675571821088
If I use these two Keras models in the Flask server without loading the facenet model, everything works normally. I think the facenet model must collide with something (maybe the session?), which prevents the three models from working simultaneously, but I don't know how to solve this problem. Please help me! Thanks in advance.
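For reference, one direction that is often suggested for this kind of clash (a sketch only, under the assumption that the conflict really is the shared default graph/session) is to load facenet into its own tf.Graph with a dedicated tf.Session, so it never touches the default graph in which the Keras .h5 models are built:
import tensorflow as tf
import facenet  # the same facenet module used in the class above

class FacenetEmbedding:
    def __init__(self, model_path):
        # Keep facenet isolated in its own graph and session.
        self.graph = tf.Graph()
        self.sess = tf.Session(graph=self.graph)
        with self.graph.as_default(), self.sess.as_default():
            facenet.load_model(model_path)
            self.images_placeholder = self.graph.get_tensor_by_name("input:0")
            self.tf_embeddings = self.graph.get_tensor_by_name("embeddings:0")
            self.phase_train_placeholder = self.graph.get_tensor_by_name("phase_train:0")

    def get_embedding(self, images):
        feed_dict = {self.images_placeholder: images,
                     self.phase_train_placeholder: False}
        return self.sess.run(self.tf_embeddings, feed_dict=feed_dict)

    def free(self):
        self.sess.close()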
Update: As rvinas pointed out, I had forgotten to add inputs_aux as a second input to Model. Fixed now, and it works. So ConditionalRNN can readily be used to do what I want.
I'd like to treat time-series together with non-time-series characteristics in extended LSTM cells (a requirement also discussed here). ConditionalRNN (cond-rnn) for Tensorflow in Python seems to allow this.
Can it be used in Keras Functional API (without eager execution)?
That is, does anyone have a clue how to fix my failed approach below, or a different example where ConditionalRNN (or alternatives) are used to readily combine TS and non-TS data in LSTM-style cells or any equivalent?
I've seen the bare-TF eager-execution example on Philippe Remy's ConditionalRNN GitHub page, but I did not manage to extend it to a readily fittable version in the Keras Functional API.
My code looks as follows; it works if, instead of ConditionalRNN, I use a standard LSTM cell (and adjust the model's 'x' input correspondingly). With ConditionalRNN I did not get it to execute: I receive either the "must feed a value for placeholder tensor 'in_aux'" error (see below) or various input-size complaints when I change the code, despite trying to be careful about the compatibility of the data dimensions.
(Using Python 3.6, Tensorflow 2.1, cond-rnn 2.1, on Ubuntu 16.04)
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import LSTM, Dense, Input
from cond_rnn import ConditionalRNN
inputs = Input(name='in', shape=(5,5))  # Each observation has 5 dimensions with 5 time-steps each
x = Dense(64)(inputs)
inputs_aux = Input(name='in_aux', shape=[5]) # For each of the 5 dimensions, a non-time-series observation too
x = ConditionalRNN(7, cell='LSTM')([x,inputs_aux]) # Updated Syntax for cond_rnn v2.1
# x = ConditionalRNN(7, cell='LSTM', cond=inputs_aux)(x) # Syntax for cond_rnn in some version before v2.1
predictions = Dense(1)(x)
model = Model(inputs=[inputs, inputs_aux], outputs=predictions) # With this fix, [inputs, inputs_aux], it now works, solving the issue
#model = Model(inputs=inputs, outputs=predictions)
model.compile(optimizer='rmsprop', loss='mean_squared_error', metrics=['mse'])
data = np.random.standard_normal([100,5,5])    # Sample of 100 observations, each with 5 dimensions of 5 time-steps
data_aux = np.random.standard_normal([100,5])  # Sample of 100 observations, each with 5 dimensions of a single non-time-series value
labels = np.random.standard_normal(size=[100]) # For each of the 100 obs., a corresponding (single) outcome variable
model.fit([data,data_aux], labels)
The error I get is
tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'in_aux' with dtype float and shape [?,5]
[[{{node in_aux}}]]
and the traceback is
Traceback (most recent call last):
File "/home/florian/temp_nonclear/playground/test/est1ls_bare.py", line 20, in <module>
model.fit({'in': data, 'in_aux': data_aux}, labels) #model.fit([data,data_aux], labels) # Also crashes when using model.fit({'in': data, 'in_aux': data_aux}, labels)
File "/home/florian/BB/tsgenerator/ts_wgan/venv/lib/python3.5/site-packages/tensorflow/python/keras/engine/training.py", line 643, in fit
use_multiprocessing=use_multiprocessing)
File "/home/florian/BB/tsgenerator/ts_wgan/venv/lib/python3.5/site-packages/tensorflow/python/keras/engine/training_arrays.py", line 664, in fit
steps_name='steps_per_epoch')
File "/home/florian/BB/tsgenerator/ts_wgan/venv/lib/python3.5/site-packages/tensorflow/python/keras/engine/training_arrays.py", line 383, in model_iteration
batch_outs = f(ins_batch)
File "/home/florian/BB/tsgenerator/ts_wgan/venv/lib/python3.5/site-packages/tensorflow/python/keras/backend.py", line 3353, in __call__
run_metadata=self.run_metadata)
File "/home/florian/BB/tsgenerator/ts_wgan/venv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1458, in __call__
run_metadata_ptr)
I noticed that you are not passing inputs_aux as input to your model. TF is complaining because this tensor is required to compute your output predictions and it is not being fed with any value. Defining your model as follows should solve the problem:
model = Model(inputs=[inputs, inputs_aux], outputs=predictions)