How to debug TF2 in general and the transformer in particular - tensorflow2.0

Looking for advice about a debugger for a TF2 model.
I would like to train the following transformer model
https://github.com/tensorflow/models/blob/master/official/transformer/v2
on my own data.
My problem is that I have difficulty figuring out the shape of the data returned by the _parse_example function in the data_pipeline.py file. To start with, PyDev doesn't break inside the _parse_example function, and neither does PyCharm, which appears to use PyDev internally. The options offered by TensorBoard 2.0.0 seem to apply to TF1, not TF2:
# Option 1: wrap a session
sess = tf.Session()
sess = tf_debug.TensorBoardDebugWrapperSession(sess, "localhost:6064")
sess.run(my_fetches)

# Option 2: attach a hook to an Estimator
hook = tf_debug.TensorBoardDebugHook("localhost:6064")
my_estimator.fit(x=x_data, y=y_data, steps=1000, monitors=[hook])

# Option 3: wrap the Keras backend session
keras.backend.set_session(
    tf_debug.TensorBoardDebugWrapperSession(tf.Session(), "localhost:6064"))
model.fit(...)
So what tool can I use to see a tensor's data and its shape? Option 2 from the list above seems to make sense, except that I don't see a call to my_estimator.fit in the transformer's TF2 implementation.
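For reference, in TF2 a tf.data.Dataset can be iterated eagerly, so element data and shapes are visible with plain Python, without a session or debugger wrapper. A minimal sketch, with a toy dataset standing in for the one built by data_pipeline.py:

import tensorflow as tf

# Toy stand-in for the parsed dataset returned via _parse_example.
dataset = tf.data.Dataset.from_tensor_slices(([[1, 2, 3]], [[4, 5, 6]]))

# Iterating eagerly exposes each element's shape and values directly.
for inputs, targets in dataset.take(1):
    print(inputs.shape, inputs.numpy())
    print(targets.shape, targets.numpy())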
My environment was created on Ubuntu 18.04 using Anaconda:
conda create -n mytest tensorflow-gpu
I use Eclipse with PyDev plugin.
Thanks.

Related

Can't use Keras CSVLogger callbacks in SageMaker script mode. It fails to write the log file to S3 (error: No such file or directory)

I have a script where I want to write the Keras callbacks to a separate CSV file in a SageMaker custom-script Docker container. But when I try to run it in local mode, it fails with the error above. I have a hyperparameter tuning job (HPO) to run, and this keeps giving me errors. I need to get this local-mode run working correctly before doing the HPO.
In the notebook I use the following code.
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(entry_point='lstm_model.py',
                          role=role,
                          code_location=custom_code_upload_location,
                          output_path=model_artifact_location + '/',
                          train_instance_count=1,
                          train_instance_type='local',
                          framework_version='1.12',
                          py_version='py3',
                          script_mode=True,
                          hyperparameters={'epochs': 1},
                          base_job_name='hpo-lstm-local-test')

tf_estimator.fit({'training': training_input_path, 'validation': validation_input_path})
In my lstm_model.py script, the following code is used.

lgdir = os.path.join(model_dir, 'callbacks_log.csv')
csv_logger = CSVLogger(lgdir, append=True)

regressor.fit(x_train, y_train,
              batch_size=batch_size,
              validation_data=(x_val, y_val),
              epochs=epochs,
              verbose=2,
              callbacks=[csv_logger])
I tried creating the file beforehand, as shown below, using the TensorFlow backend (K: the Keras TensorFlow backend, tf: TensorFlow), but it doesn't create a file.
filename = tf.Variable(lgdir, dtype=tf.string)
content = tf.Variable("", dtype=tf.string)
sess = K.get_session()
# tf.io.write_file only adds an op to the graph here; it is never executed
# through sess, so no file is produced.
tf.io.write_file(filename, content)
I can't use other packages like pandas to create the file, because the TensorFlow Docker container that SageMaker uses for custom scripts doesn't provide them; only a limited set of packages is available.
Is there a way to write the CSV file to the S3 bucket location before the fit method tries to write the callback output? Or would that even solve the problem? I'm not sure.
If you can suggest another way to capture the callbacks, I would accept that answer as well, but it should be worth the effort.
This Docker image is really narrowing the scope.
Well, for starters, you can always build your own Docker image using the TensorFlow image as a base. I work in TensorFlow 2.0, so this will be slightly different for you, but here is an example of my image pattern:
# Downloads the TensorFlow library used to run the Python script.
# Use the tag that matches your TF version.
FROM tensorflow/tensorflow:2.0.0a0

# Contains the common functionality necessary to create a container
# compatible with Amazon SageMaker.
RUN pip install sagemaker-containers -q

# Wandb allows us to customize and centralize logging while maintaining
# open-source agility (here you would install pandas instead).
RUN pip install wandb -q

# Copies the training code inside the container to the design pattern
# created by the TensorFlow estimator (here you could copy over a callbacks CSV).
COPY mnist-2.py /opt/ml/code/mnist-2.py
COPY callbacks.py /opt/ml/code/callbacks.py
COPY wandb_setup.sh /opt/ml/code/wandb_setup.sh

# Set the login script as the entry point
# (here you would instead launch lstm_model.py).
ENV SAGEMAKER_PROGRAM wandb_setup.sh
I believe you are looking for a pattern similar to this, but I prefer to log all of my model data using Weights and Biases. Their SageMaker integration is a little out of date, but I'm actually in the midst of writing an updated tutorial for them. It should be finished this month and will include logging and comparing runs from hyperparameter tuning jobs.
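Separately from the custom image: if the "No such file or directory" error comes from CSVLogger being handed an S3 URI (in script mode, model_dir is typically an s3:// path, which Keras cannot open as a local file), a possible workaround is to log to a local directory inside the container that SageMaker uploads to S3 after training. The environment variable and fallback path below are assumptions to verify against your container:

import os
from tensorflow.keras.callbacks import CSVLogger

# Local directory that SageMaker uploads to the output S3 location when
# training ends (assumes the container sets SM_OUTPUT_DATA_DIR).
log_dir = os.environ.get('SM_OUTPUT_DATA_DIR', '/opt/ml/output/data')
csv_logger = CSVLogger(os.path.join(log_dir, 'callbacks_log.csv'), append=True)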

Running a TensorFlow model inference script on multiple GPUs

I'm trying to run model scoring (the inference graph) from the TensorFlow Object Detection API on multiple GPUs. I tried specifying the GPU number in main, but it runs on only a single GPU. (A GPU utilization snapshot was placed here.)
I'm using tensorflow-gpu==1.13.1; can you kindly point out what I'm missing here?
for i in range(2):
    with tf.device('/gpu:{}'.format(i)):
        tf_init()
        init = tf.global_variables_initializer()  # build the initializer op
        with detection_graph.as_default():
            with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
                run_inference_multiple_images(...)  # call to the inference function
The responses to this question should give you a few options for fixing this.
Usually TensorFlow will occupy all visible GPUs unless told otherwise, so if you haven't already tried it, you could just remove the with tf.device line (assuming you only have the two GPUs) and TensorFlow should use them both.
Otherwise, I think the easiest fix is setting the environment variable with os.environ["CUDA_VISIBLE_DEVICES"] = "0,1", as sketched below.
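A minimal sketch of that approach; note the variable must be set before TensorFlow initializes CUDA (in practice, before importing TensorFlow), or it has no effect:

import os

# Make both GPUs visible; must happen before TensorFlow touches the GPUs.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import tensorflow as tf  # imported after setting the variable on purpose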

"Unkown (custom) loss function" when using tflite_convert on a {TF 2.0.0-beta1 ; Keras} model

Summary
My question is composed of:
A context in which I present my project, my working environment and my workflow
The detailed problem
The concerned parts of my code
The solutions I tried in order to solve my problem
A restatement of the question
Context
I've written a Python Keras implementation of a downgraded version of the original Super-Resolution GAN. Now I want to test it using Google Firebase Machine Learning Kit, by hosting it on Google's servers. That's why I have to convert my Keras program to a TensorFlow Lite one.
Environment and workflow (with the problem)
I'm training my program in the Google Colab environment; there, I've installed TF 2.0.0-beta1 (this choice is motivated by this incorrect answer: https://datascience.stackexchange.com/a/57408/78409).
Workflow (and problem):
I write my Python Keras program locally, keeping in mind that it will run on TF 2. So I use TF 2 imports, for example: from tensorflow.keras.optimizers import Adam and also from tensorflow.keras.layers import Conv2D, BatchNormalization
I send my code to my Drive
I run my Google Colab Notebook without any problem: TF 2 is used.
I get the output model in my Drive, and I download it.
I try to convert this model to the TFLite format by executing the following CLI command: tflite_convert --output_file=srgan.tflite --keras_model_file=srgan.h5. Here the problem appears.
The problem
Instead of outputting the TFLite-converted model from the TF (Keras) model, the previous command outputs this error:
ValueError: Unknown loss function:build_vgg19_loss_network
The function build_vgg19_loss_network is a custom loss function that I've implemented and that must be used by the GAN.
Parts of the code that raise this problem
Presenting the custom loss function
The custom loss function is implemented like this:

def build_vgg19_loss_network(ground_truth_image, predicted_image):
    # mean and square presumably come from the Keras backend;
    # Vgg19Loss and high_resolution_shape are defined elsewhere in the project.
    loss_model = Vgg19Loss.define_loss_model(high_resolution_shape)
    return mean(square(loss_model(ground_truth_image) - loss_model(predicted_image)))
Compiling the generator network with my custom loss function
generator_model.compile(optimizer=the_optimizer, loss=build_vgg19_loss_network)
What I've tried to do in order to solve the problem
As I read on StackOverflow (link at the beginning of this question), TF 2 was supposed to be sufficient to output a Keras model that would be correctly processed by the tflite_convert CLI. But obviously it's not.
As I read on GitHub, I tried to manually register my custom loss function among Keras' loss functions by adding these lines:
import tensorflow.keras.losses
tensorflow.keras.losses.build_vgg19_loss_network = build_vgg19_loss_network
It didn't work.
I read on GitHub that I could use custom objects with the load_model Keras function, but I only want to use the compile Keras function, not load_model.
My final question
I want to make only minor changes to my code, since it works fine. So I don't want, for example, to replace compile with load_model. With this constraint, could you please help me make the tflite_convert CLI work with my custom loss function?
Since TFLite conversion is failing because of a custom loss function, you can save the model file without keeping the optimizer details. To do that, set the include_optimizer parameter to False, as shown below:
model.save('model.h5', include_optimizer=False)
Now, if all the layers inside your model are convertible, they should get converted into a TFLite file.
Edit:
You can then convert the h5 file like this:
import tensorflow as tf

model = tf.keras.models.load_model('model.h5')  # srgan.h5 for you
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open("converted_model.tflite", "wb") as f:
    f.write(tflite_model)
The usual practice for overcoming unsupported operators in TFLite conversion is documented here; see the sketch below.
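For what it's worth, one common way to handle unsupported operators is to let the converter fall back to select TensorFlow ops; a minimal sketch continuing from the converter above (whether this applies to your ops is an assumption to verify):

# Allow ops without a native TFLite kernel to run via the TF fallback.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,
                                       tf.lite.OpsSet.SELECT_TF_OPS]
tflite_model = converter.convert()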
I had the same error. I recommend changing the loss to "mse", since you already have a well-trained model and you don't need to train with the .tflite file.
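A minimal sketch of that swap, reusing generator_model and the_optimizer from the question, combined with the include_optimizer trick from the previous answer:

# Recompile with a built-in loss so nothing custom is serialized.
generator_model.compile(optimizer=the_optimizer, loss='mse')
generator_model.save('srgan.h5', include_optimizer=False)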

Why do I get AttributeError: module 'tensorflow' has no attribute 'placeholder'?

I was able to run my Python program three weeks ago, but now every time I try to run it, I get the following error:
AttributeError: module 'tensorflow' has no attribute 'placeholder'
I have tensorflow installed (version '2.0.0-alpha0').
I have read a couple of posts related to this issue; they say I should uninstall TensorFlow and reinstall it. The problem is that I am running this on a cluster computer and I do not have sudo permissions.
Any idea?
In TensorFlow 2.0, there is no placeholder. You need to update your TF1.x code to TF2.0 code and then run it on your cluster. Please take a look at the official doc on converting your TF1.x code to TF2.0.
In TF1.x code, you build a static TensorFlow graph with placeholders, constants and variables, then run it inside a session created with tf.Session(). During that session, you provide values for the placeholders and execute the static graph.
In TF2.0, models run eagerly as you enter commands, which is more Pythonic. Check more details about TF 2.0 here. Thanks!
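A minimal sketch of the contrast: where TF1.x feeds a placeholder through session.run, TF2 code simply calls a function on real tensors (optionally traced with tf.function):

import tensorflow as tf

# TF2: no placeholder, no session; the function traces a graph on first call.
@tf.function
def scale(x):
    return x * 2.0

print(scale(tf.constant([1.0, 2.0])))  # executes immediately and prints the result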
After importing the TensorFlow compat.v1 module:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

use the v1 syntax like this (tf is already aliased to compat.v1, so the plain names suffice):

X = tf.placeholder(dtype="float", shape=[None, n_H0, n_W0, n_C0])
Y = tf.placeholder(dtype="float", shape=[None, n_y])
In addition to @Vishnuvardhan Janapati's answer, you can update whole folders ("*TREE") and/or individual files to TensorFlow 2. The upgrade tool tf_upgrade_v2 is automatically included in TensorFlow 1.13 and later.
tf_upgrade_v2 [-h] [--infile INPUT_FILE] [--outfile OUTPUT_FILE]
              [--intree INPUT_TREE] [--outtree OUTPUT_TREE]
              [--copyotherfiles COPY_OTHER_FILES] [--inplace]
              [--reportfile REPORT_FILENAME] [--mode {DEFAULT,SAFETY}]
              [--print_all]
As an illustration of how the conversion fixes the "placeholder" error, see the before/after sketch below.
Note: this fixes similar complaints of the form module 'tensorflow' has no attribute 'xxxxx' (not just "placeholder").
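The original answer showed a screenshot at this point; a representative before/after of the kind of rewrite the tool performs (the shape here is illustrative):

# Before (TF1.x code):
x = tf.placeholder(tf.float32, shape=[None, 784])

# After running tf_upgrade_v2:
x = tf.compat.v1.placeholder(tf.float32, shape=[None, 784])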
Calling the disable_v2_behavior() function is not necessary. Just:
import tensorflow as tf
tf.compat.v1.placeholder()
Changing the library import worked for me:
# libraries
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
If this doesn't work, you may need to reinstall TensorFlow.
I hope it helps.

Exporting a frozen graph .pb file in Tensorflow 2

I've been trying out the TensorFlow 2 alpha and I have been trying to freeze and export a model to a .pb GraphDef file.
In Tensorflow 1 I could do something like this:
# Freeze the graph.
frozen_graph_def = tf.graph_util.convert_variables_to_constants(
    sess,
    sess.graph_def,
    output_node_names)

# Save the frozen graph to a .pb file.
with open('model.pb', 'wb') as f:
    f.write(frozen_graph_def.SerializeToString())
However, this doesn't seem possible anymore, as convert_variables_to_constants has been removed and the use of sessions is discouraged.
I looked around and found the freeze_graph util
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/freeze_graph.py that works with SavedModel exports.
Is there still some way to do it within Python, or am I meant to switch to using this tool now?
I also faced this problem while migrating from TensorFlow 1.x to TensorFlow 2.0 beta.
This problem can be solved in two ways:
The first is to go through the TensorFlow 2.0 docs, search for each method you have used, and change the syntax line by line.
The second is to use Google's tf_upgrade_v2 script:
tf_upgrade_v2 --infile your_tf1_script_file --outfile converted_tf2_file
Run the above command to convert your TensorFlow 1.x script to TensorFlow 2.0; it should solve most of these problems.
Alternatively, you can rename the method manually (by referring to the documentation):
Rename tf.graph_util.convert_variables_to_constants to tf.compat.v1.graph_util.convert_variables_to_constants, as sketched below.
The main issue is that in TensorFlow 2.0 much of the syntax and many functions have changed; refer to the TensorFlow 2.0 docs or use Google's tf_upgrade_v2 script.
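A minimal sketch of that rename, assuming sess and output_node_names exist as in the TF1 snippet from the question:

import tensorflow.compat.v1 as tf

# Same freezing logic, reached through the compat.v1 endpoint.
frozen_graph_def = tf.graph_util.convert_variables_to_constants(
    sess, sess.graph_def, output_node_names)

with open('model.pb', 'wb') as f:
    f.write(frozen_graph_def.SerializeToString())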
Not sure if you've seen this Tensorflow 2.0 issue, but this response seems to be a work-around:
https://github.com/tensorflow/tensorflow/issues/29253#issuecomment-530782763
Note: this hasn't worked for my NLP model, but maybe it will work for you. The suggested workaround is to use model.save_weights('weights.h5') while in the TF 2.0 environment, then create a new environment with TF 1.14 and do all the following steps in the TF 1.14 env: build your model with model = create_model(), load the weights back with model.load_weights('weights.h5'), and save the entire model with model.save('final_model.h5'). If you manage to have success with the above steps, follow the rest of the steps in the link to use freeze_graph.
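A minimal sketch of that workaround, assuming create_model() rebuilds the identical architecture in both environments:

# --- In the TF 2.0 environment ---
model.save_weights('weights.h5')

# --- In a fresh TF 1.14 environment ---
model = create_model()            # same architecture as trained in TF 2.0
model.load_weights('weights.h5')
model.save('final_model.h5')      # full model, usable with freeze_graph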