dataset from tf.data.Dataset.save(load) for GCS + TPU got NotFoundError Could not find metadata file. [Op:MakeIterator] - tensorflow

I tried this on both TF 2.8 and the nightly build (2.10.0-dev20220616) on Colab. On 2.8, I used the experimental version of this API. The context is training a Hugging Face BERT model; the dataset goes through several transformations before it becomes a proper tf.data dataset.
To reproduce:
In a session without a TPU, construct the dataset.
Save it with tf.data.Dataset.save(train_set, 'gs://ai-tests/data/train_set'). Reading it back with load in the same session works, and I can iterate through it.
In a separate TPU session, load it with train_set = tf.data.Dataset.load('gs://ai-tests/data/train_set').
Iterating over it raises the error:
for x in train_set:
    break
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py
in raise_from_not_ok_status(e, name)
7207 def raise_from_not_ok_status(e, name):
7208 e.message += (" name: " + name if name is not None else "")
-> 7209 raise core._status_to_exception(e) from None # pylint: disable=protected-access
7210
7211
NotFoundError: Could not find metadata file. [Op:MakeIterator]

Related

I'm trying to use GAN with my own dataset but I'm running into problems when I change the keras version

I'm trying to run the GAN in the link with my own dataset. First, I wanted to try it with the MNIST dataset and see the results. I am running it on Colab. With the TensorFlow and Keras versions that Colab ships, the outputs are noisy and the results are bad. An example from epoch 1400:
But when I downgrade TensorFlow to 2.2.0 and Keras to 2.3.1, the results are very good. An example from epoch 1350:
Then, when I ran it with my own dataset without changing the library versions in Colab, I still got noisy, bad results. So I downgraded the versions as before. This time, however, I get the following error:
FailedPreconditionError: Error while reading resource variable
_AnonymousVar45 from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource
localhost/_AnonymousVar45/N10tensorflow3VarE does not exist. [[node
mul_1/ReadVariableOp (defined at
/usr/local/lib/python3.7/dist-packages/keras/backend/tensorflow_backend.py:3009)
]] [Op:__inference_keras_scratch_graph_5103]
Function call stack: keras_scratch_graph
If this error was caused by tensorflow and keras versions, I think I would get the same error when I tried with MNIST. So I couldn't find the source of the error. Maybe it has to do with the way I load my data. However, existing library versions had no problems with this. Anyway I'm adding the way I load the data here:
import zipfile  # unzipping
import glob  # finding image paths
import numpy as np  # creating numpy arrays
from skimage.io import imread  # reading images
from skimage.transform import resize  # resizing images
# 1. Unzip images
path = '/content/gdrive/My Drive/gan/RealImages.zip'
with zipfile.ZipFile(path, 'r') as zip_ref:
    zip_ref.extractall('/content/gdrive/My Drive/gan/extracted')
# 2. Obtain paths of images (.jpg files here)
img_list = sorted(glob.glob('/content/gdrive/My Drive/gan/extracted/RealImages/RealImages/*.jpg'))
print(img_list)
# 3. Read images & convert to numpy arrays
## create placeholder numpy array
IMG_SIZE = 28
x_data = np.empty((len(img_list), IMG_SIZE, IMG_SIZE, 3), dtype=np.float32)
## read and convert to arrays
for i, img_path in enumerate(img_list):
    # read image
    img = imread(img_path)
    print(img_path)
    # resize image (3 channels used here; 1 for gray-scale, 3 for RGB)
    img = resize(img, output_shape=(IMG_SIZE, IMG_SIZE, 3), preserve_range=True)
    # save to numpy array
    x_data[i] = img
Then, I changed the old line:
(X_train, _), (_, _) = mnist.load_data()
to:
X_train=x_data
I couldn't find what I did wrong. I would be very happy if you help.

Issue Using tensorflow hub Universal Sentence Encoder with local runtime and GPU in Google Colab

I am working through a machine learning course to learn TensorFlow. In one of the projects I was performing text classification with a pre-trained tensorflow_hub embedding, the Universal Sentence Encoder v4. The embeddings worked fine on the Google Colab GPU, and also worked in my local runtime without my GPU. However, after I set up Colab to use my local GPU (RTX 3060), I started getting the error shown below. For reference, my Python environment is managed through Anaconda, and I used conda install to install tensorflow_gpu, cudatoolkit and cudnn. I am not sure what this error means or how to even begin debugging it. Any help would be greatly appreciated, thanks!
Code and error:
import random
import tensorflow_hub as hub
tf_hub_embedding = hub.KerasLayer('https://tfhub.dev/google/universal-sentence-encoder/4', trainable=False, name='USE')
rand_sent = random.choice(train_sents)
print(f'Random sent: {rand_sent}\n')
print(f'Embedded sent: {tf_hub_embedding([rand_sent])[0][:30]}\n')
print(f'Embed length: {len(tf_hub_embedding([rand_sent])[0])}')
Random sent: Data of a Japanese study of patients with unresectable sacral chordoma showed comparable high control rates after hypofractionated carbon ion therapy only .
---------------------------------------------------------------------------
UnknownError Traceback (most recent call last)
Input In [55], in <cell line: 3>()
1 rand_sent = random.choice(train_sents)
2 print(f'Random sent: {rand_sent}\n')
----> 3 print(f'Embedded sent: {tf_hub_embedding([rand_sent])[0][:30]}\n')
4 print(f'Embed length: {len(tf_hub_embedding([rand_sent])[0])}')
File ~\anaconda3\lib\site-packages\keras\utils\traceback_utils.py:67, in filter_traceback.<locals>.error_handler(*args, **kwargs)
65 except Exception as e: # pylint: disable=broad-except
66 filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67 raise e.with_traceback(filtered_tb) from None
68 finally:
69 del filtered_tb
File ~\anaconda3\lib\site-packages\tensorflow_hub\keras_layer.py:229, in KerasLayer.call(self, inputs, training)
223 # ...but we may also have to pass a Python boolean for `training`, which
224 # is the logical "and" of this layer's trainability and what the surrounding
225 # model is doing (analogous to tf.keras.layers.BatchNormalization in TF2).
226 # For the latter, we have to look in two places: the `training` argument,
227 # or else Keras' global `learning_phase`, which might actually be a tensor.
228 if not self._has_training_argument:
--> 229 result = f()
230 else:
231 if self.trainable:
UnknownError: Exception encountered when calling layer "USE" (type KerasLayer).
Graph execution error:
JIT compilation failed.
[[{{node EncoderDNN/EmbeddingLookup/EmbeddingLookupUnique/embedding_lookup/mod}}]] [Op:__inference_restored_function_body_36706]
Call arguments received by layer "USE" (type KerasLayer):
• inputs=["'Data of a Japanese study of patients with unresectable sacral chordoma showed comparable high control rates after hypofractionated carbon ion therapy only .'"]
• training=None
I had this issue and solved it by downgrading TensorFlow to 2.8.0 with this command:
pip install tensorflow==2.8.0

How can I use a Cloud TPU with Tensorflow Lite Model Maker?

I'm training an object detection model (EfficientDet-Lite) using Tensorflow Lite Model Maker in Colab and I'd like to use a Cloud TPU. I have all the images in a GCS bucket and provide a CSV file. When I call object_detector.create I get the following error:
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in shape(self)
1196 # `_tensor_shape` is declared and defined in the definition of
1197 # `EagerTensor`, in C.
-> 1198 self._tensor_shape = tensor_shape.TensorShape(self._shape_tuple())
1199 except core._NotOkStatusException as e:
1200 six.raise_from(core._status_to_exception(e.code, e.message), None)
InvalidArgumentError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on /tmp/tfhub_modules/db7544dcac01f8894d77bea9d2ae3c41ba90574c/variables/variables: Unimplemented: File system scheme '[local]' not implemented (file: '/tmp/tfhub_modules/db7544dcac01f8894d77bea9d2ae3c41ba90574c/variables/variables')
That looks like it's trying to process some local files in the CloudTPU, which doesn't work...
The gist of what I'm doing is:
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
train_data, validation_data, test_data = object_detector.DataLoader.from_csv(
    drive_dir + csv_name,
    images_dir = "images" if not tpu else None,
    cache_dir = drive_dir + "cub_cache",
)
spec = MODEL_SPEC(tflite_max_detections=10, strategy='tpu', tpu=tpu.master(), gcp_project="xxx")
model = object_detector.create(train_data=train_data,
                               model_spec=spec,
                               validation_data=validation_data,
                               epochs=epochs,
                               batch_size=batch_size,
                               train_whole_model=True)
I can't find any example with Model Maker that uses Cloud TPU.
Edit: the error seems to occur when the EfficientDet model gets loaded, so somehow modelmaker must be pointing to a local file that doesn't work for CloudTPU?
Yes, the error comes from TF Hub, and it seems to be a well-known issue. TF Hub loading tries to use a local cache, which the TPU cannot access (and which Colab does not even provide). See https://github.com/tensorflow/hub/issues/604, which should get you past this error.
Download from TF-Hub the model you would like to train (replace X: 0<=X<=4):
https://tfhub.dev/tensorflow/efficientdet/liteX/feature-vector/1
Extract the package twice until you get to the keras_metadata.pb and saved_model.pb files and the variables folder.
Upload these files and folders to a Google Cloud Storage bucket.
Pass the uri argument to model_spec.get (https://www.tensorflow.org/lite/tutorials/model_maker_object_detection), pointing to the bucket folder (in gs:// format).
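The download-and-extract steps can be sketched with the standard library alone. This is a minimal sketch and not part of the original answer: the function name extract_saved_model and the paths in the usage note are hypothetical, and it assumes the download is a .tar.gz archive that you trust. Python's tarfile unpacks the gzip and tar layers together, so the "extract twice" step becomes a single call.

```python
import tarfile
from pathlib import Path


def extract_saved_model(archive_path, dest_dir):
    """Unpack a .tar.gz archive and return the folder that holds saved_model.pb.

    Hypothetical helper for illustration; only use it on archives you trust.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" reads the gzip layer and the tar layer in one pass.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
    # The SavedModel folder is wherever saved_model.pb ended up.
    matches = sorted(dest.rglob("saved_model.pb"))
    if not matches:
        raise FileNotFoundError(f"no saved_model.pb found under {dest}")
    return matches[0].parent
```

For example, extract_saved_model("efficientdet_lite0.tar.gz", "extracted") (both names are placeholders) returns the local folder whose contents should be uploaded to the bucket; the gs:// location of that uploaded folder is what the uri argument should point to.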

Tf 2.0 MirroredStrategy on Albert TF Hub model (multi gpu)

I'm trying to run Albert Tensorflow hub version on multiple GPUs in the same machine. The model works perfectly on single GPU.
This is the structure of my code:
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))  # it prints 2 .. correct
if __name__ == "__main__":
    with strategy.scope():
        run()
Where in run() function, I read the data, build the model, and fit it.
I'm getting this error:
Traceback (most recent call last):
File "Albert.py", line 130, in <module>
run()
File "Albert.py", line 88, in run
model = build_model(bert_max_seq_length)
File "Albert.py", line 55, in build_model
model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
File "/home/****/py_transformers/lib/python3.5/site-packages/tensorflow_core/python/training/tracking/base.py", line 457, in _method_wrapper
result = method(self, *args, **kwargs)
File "/home/bighanem/py_transformers/lib/python3.5/site-packages/tensorflow_core/python/keras/engine/training.py", line 471, in compile
' model.compile(...)'% (v, strategy))
ValueError: Variable (<tf.Variable 'bert/embeddings/word_embeddings:0' shape=(30000, 128) dtype=float32>) was not created in the distribution strategy scope of (<tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7f62e399df60>). It is most likely due to not all layers or the model or optimizer being created outside the distribution strategy scope. Try to make sure your code looks similar to the following.
with strategy.scope():
    model=_create_model()
    model.compile(...)
Is it possible that this error occurs because the ALBERT model was already built and compiled by the TensorFlow team?
Edited:
To be precise, Tensorflow version is 2.1.
Also, this is the way I load Albert pretrained model:
features = {"input_ids": in_id, "input_mask": in_mask, "segment_ids": in_segment, }
albert = hub.KerasLayer(
    "https://tfhub.dev/google/albert_xxlarge/3",
    trainable=False, signature="tokens", output_key="pooled_output",
)
x = albert(features)
Following this tutorial: SavedModels from TF Hub in TensorFlow 2
Two-part answer:
1) TF Hub hosts two versions of ALBERT (each in several sizes):
https://tfhub.dev/google/albert_base/3 etc. from the Google research team that originally developed ALBERT comes in the hub.Module format for TF1. This will likely not work with a TF2 distribution strategy.
https://tfhub.dev/tensorflow/albert_en_base/1 etc. from the TensorFlow Model Garden comes in the revised TF2 SavedModel format. Please try this one for use in TF2 with a distribution strategy.
2) That said, the immediate problem appears to be what is explained in the error message (abridged):
Variable 'bert/embeddings/word_embeddings' was not created in the distribution strategy scope ... Try to make sure your code looks similar to the following.
with strategy.scope():
    model = _create_model()
    model.compile(...)
For a SavedModel (from TF Hub or otherwise), it's the loading that needs to happen under the distribution strategy scope, because that's what's re-creating the tf.Variable objects in the current program. Specifically, any of the following ways to load a TF2 SavedModel from TF Hub have to occur under the distribution strategy scope for distribution to work:
tf.saved_model.load();
hub.load(), which just calls tf.saved_model.load() (after downloading if necessary);
hub.KerasLayer when used with a string-valued model handle, on which it then calls hub.load().
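A minimal sketch of that rule, assuming the TF2 SavedModel version of the model. The names load_under_strategy and build_fn are hypothetical: build_fn stands for your own code that wires the layer into a Keras model and compiles it. The only point being illustrated is that hub.KerasLayer is constructed inside the scope, since constructing it from a string handle is what loads the SavedModel and creates its variables.

```python
def load_under_strategy(handle, build_fn):
    """Load a TF Hub SavedModel and build the Keras model under one strategy scope.

    handle:   string model handle, e.g. a tfhub.dev URL or a local path.
    build_fn: hypothetical callback that takes the hub layer and returns a
              compiled tf.keras.Model.
    """
    # Imported lazily so this sketch can be read without TensorFlow installed.
    import tensorflow as tf
    import tensorflow_hub as hub

    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        # Constructing the layer from a string handle loads the SavedModel,
        # so its tf.Variable objects are created inside the scope.
        encoder = hub.KerasLayer(handle, trainable=False)
        # Building and compiling the model (and creating the optimizer)
        # must also happen inside the scope.
        model = build_fn(encoder)
    return strategy, model
```

Calling model.fit on the returned model afterwards does not need to be wrapped in the scope again.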

SavedModel file does not exist at saved_model/{saved_model.pbtxt|saved_model.pb}

I'm trying to run the TensorFlow Object Detection API on TensorFlow 2 and I got this error. Does anyone have a solution?
The code :
Loader
def load_model(model_name):
    base_url = 'http://download.tensorflow.org/models/object_detection/'
    model_file = model_name + '.tar.gz'
    model_dir = tf.keras.utils.get_file(
        fname=model_name,
        origin=base_url + model_file,
        untar=True)

    model_dir = pathlib.Path(model_dir)/"saved_model"

    model = tf.saved_model.load(str(model_dir))
    model = model.signatures['serving_default']

    return model
Loading label map
Label maps map indices to category names, so that when our convolution network predicts 5, we know that this corresponds to airplane. Here we use internal utility functions, but anything that returns a dictionary mapping integers to appropriate string labels would be fine
# List of the strings that is used to add correct label for each box.
PATH_TO_LABELS = 'data/mscoco_label_map.pbtxt'
category_index = label_map_util.create_category_index_from_labelmap(PATH_TO_LABELS, use_display_name=True)
For the sake of simplicity we will test on 2 images:
# If you want to test the code with your images, just add path to the images to the TEST_IMAGE_PATHS.
PATH_TO_TEST_IMAGES_DIR = pathlib.Path('test_images')
TEST_IMAGE_PATHS = sorted(list(PATH_TO_TEST_IMAGES_DIR.glob("*.jpg")))
TEST_IMAGE_PATHS
Detection
Load an object detection model:
model_name = 'ssd_mobilenet_v1_coco_11_06_2017'
detection_model = load_model(model_name)
and I got this error:
OSError Traceback (most recent call last)
<ipython-input-7-e89d9e690495> in <module>
1 model_name = 'ssd_mobilenet_v1_coco_11_06_2017'
----> 2 detection_model = load_model(model_name)
<ipython-input-4-f8a3c92a04a4> in load_model(model_name)
9 model_dir = pathlib.Path(model_dir)/"saved_model"
10
---> 11 model = tf.saved_model.load(str(model_dir))
12 model = model.signatures['serving_default']
13
D:\Anaconda\lib\site-packages\tensorflow_core\python\saved_model\load.py in load(export_dir, tags)
515 ValueError: If `tags` don't match a MetaGraph in the SavedModel.
516 """
--> 517 return load_internal(export_dir, tags)
518
519
D:\Anaconda\lib\site-packages\tensorflow_core\python\saved_model\load.py in load_internal(export_dir, tags, loader_cls)
524 # sequences for nest.flatten, so we put those through as-is.
525 tags = nest.flatten(tags)
--> 526 saved_model_proto = loader_impl.parse_saved_model(export_dir)
527 if (len(saved_model_proto.meta_graphs) == 1
528 and saved_model_proto.meta_graphs[0].HasField("object_graph_def")):
D:\Anaconda\lib\site-packages\tensorflow_core\python\saved_model\loader_impl.py in parse_saved_model(export_dir)
81 (export_dir,
82 constants.SAVED_MODEL_FILENAME_PBTXT,
---> 83 constants.SAVED_MODEL_FILENAME_PB))
84
85
OSError: SavedModel file does not exist at: C:\Users\Asus\.keras\datasets\ssd_mobilenet_v1_coco_11_06_2017\saved_model/{saved_model.pbtxt|saved_model.pb}
I assume that you are running the detection_model_zoo tutorial. Changing the model name from ssd_mobilenet_v1_coco_11_06_2017 to ssd_mobilenet_v1_coco_2017_11_17 solved the problem in my test.
The contents of the two archives are listed below; only the second one contains a saved_model folder:
# ssd_mobilenet_v1_coco_11_06_2017
frozen_inference_graph.pb model.ckpt.data-00000-of-00001 model.ckpt.meta
graph.pbtxt model.ckpt.index
# ssd_mobilenet_v1_coco_2017_11_17
checkpoint model.ckpt.data-00000-of-00001 model.ckpt.meta
frozen_inference_graph.pb model.ckpt.index saved_model
Reference:
Where to find tensorflow pretrained models (list or download link)
detect_model_zoo
Using the SavedModel format official blog
Do not link all the way to the model name. Use the pathname to the folder containing the model.
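That advice can be sketched as a small path check that runs before the loader is called. This is an illustrative sketch, not code from the answer: the helper name resolve_saved_model_dir is hypothetical, and it only inspects the file system (it does not load anything). If the path points at the .pb or .pbtxt file itself, it steps up to the containing folder; if the folder holds neither file, it raises an error that lists what is actually there.

```python
from pathlib import Path

# File names the SavedModel loader looks for, as shown in the error message.
SAVED_MODEL_FILES = ("saved_model.pb", "saved_model.pbtxt")


def resolve_saved_model_dir(path):
    """Return the folder to pass to the loader, or raise if it is not a SavedModel."""
    candidate = Path(path)
    # If the caller pointed at the model file itself, use its folder instead.
    if candidate.name in SAVED_MODEL_FILES:
        candidate = candidate.parent
    if not any((candidate / name).is_file() for name in SAVED_MODEL_FILES):
        found = sorted(p.name for p in candidate.iterdir()) if candidate.is_dir() else []
        raise FileNotFoundError(
            f"{candidate} has no saved_model.pb or saved_model.pbtxt; contents: {found}"
        )
    return candidate
```

The returned folder is what would then be passed on, for example as str(resolve_saved_model_dir(model_dir)) in the call to tf.saved_model.load.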
In my case, this code worked. I passed the path of the folder that contains my .pb file, which was created by the model checkpoint module:
import tensorflow as tf
if __name__ == '__main__':
    # Update the path for your Keras model. The folder contains:
    # assets (folder), variables (folder), keras_metadata.pb, saved_model.pb
    input_keras_model = 'my path/weights/my_trained_model'
    model = tf.keras.models.load_model(input_keras_model)
I was getting exactly this error when trying to use the saved_model.pb file.
I had gotten the .pb file along with a pre-trained model following some tutorial.
It can happen for the following reasons:
First, your existing saved_model.pb file might be corrupt.
Second, as the user Mark Silla mentioned, you may be giving the wrong path: pass the path of the folder containing the .pb file, without the file name.
Third, it might be due to TensorFlow versioning issues.
I had to go through all of the above and upgraded Tensorflow from v2.3 to v2.3, and it finally created a new saved_model.pb which was not corrupt and I could run it.