TFX custom config argument in trainer not working - tensorflow

This question is based on the TFX recommender tutorial. Please note that the code is being orchestrated by LocalDagRunner rather than run interactively in a notebook.
In the Trainer, we pass in a custom_config with the transformed ratings and movies:
trainer = tfx.components.Trainer(
    module_file=os.path.abspath(_trainer_module_file),
    examples=ratings_transform.outputs['transformed_examples'],
    transform_graph=ratings_transform.outputs['transform_graph'],
    schema=ratings_transform.outputs['post_transform_schema'],
    train_args=tfx.proto.TrainArgs(num_steps=500),
    eval_args=tfx.proto.EvalArgs(num_steps=10),
    custom_config={
        'epochs': 5,
        'movies': movies_transform.outputs['transformed_examples'],
        'movie_schema': movies_transform.outputs['post_transform_schema'],
        'ratings': ratings_transform.outputs['transformed_examples'],
        'ratings_schema': ratings_transform.outputs['post_transform_schema']
    })
The problem is that all of the outputs passed into custom_config seem to be empty. This results in errors, for example
class MovielensModel(tfrs.Model):
    def __init__(self, user_model, movie_model, tf_transform_output, movies_uri):
        super().__init__()
        self.movie_model: tf.keras.Model = movie_model
        self.user_model: tf.keras.Model = user_model
        movies_artifact = movies_uri.get()[0]
complains that movies_uri.get() is empty. The same is true for ratings. However, the ratings passed in through the examples parameter are not empty (the artifact URI is available), so it seems as though custom_config is 'breaking things'.
I have tried debugging it but to no avail. I did notice that arguments in custom_config are serialised and deserialised, but this didn't seem to be the cause of the problem. Does anyone know why this happens and how to resolve this?
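For reference, this is roughly how the trainer module reads those values: a minimal sketch assuming the run_fn signature from the tutorial, with illustrative names.
# Hypothetical excerpt from _trainer_module_file (names are illustrative).
from tfx import v1 as tfx

def run_fn(fn_args: tfx.components.FnArgs):
    # custom_config arrives on fn_args as a plain dict.
    epochs = fn_args.custom_config['epochs']
    movies_uri = fn_args.custom_config['movies']
    # This is where the failure surfaces when orchestrated by LocalDagRunner:
    # the channel resolves to an empty artifact list.
    movies_artifact = movies_uri.get()[0]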

Related

Loading dataset/dataloader object onto GPU

I am running code from another repository, but my issue is general so I am posting it here. Running their code, I get the error along the lines of Expected all tensors to be on the same device, found two: cpu and cuda:0. I have already verified that the model is on cuda:0; the issue is that the dataloader object used is not set to the device. Also, the dataset/models I use here are huggingface-transformers models and huggingface datasets.
Here is the relevant block of code where the issue arises:
eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset
eval_dataloader = self.get_eval_dataloader(eval_dataset)
eval_examples = self.eval_examples if eval_examples is None else eval_examples
compute_metrics = self.compute_metrics
self.compute_metrics = None
eval_loop = (self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop)
try:
    # this is where the error occurs
    output = eval_loop(
        eval_dataloader,
        description="Evaluation",
        prediction_loss_only=True if compute_metrics is None else None,
        ignore_keys=ignore_keys,
    )
For context, this occurs inside an evaluate() method of a class inheriting from Seq2SeqTrainer from huggingface. I have tried using something like
for i, (inputs, labels) in eval_dataloader:
    inputs, labels = inputs.to(device), labels.to(device)
But that doesn't work (it gives the error Too many values to unpack (expected 2)). Is there any other way I can send this dataloader to the GPU? In particular, is there any way I can edit the evaluation_loop method of the Transformers Trainer to move the batches to the GPU?
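For what it's worth, a minimal sketch of moving the batches manually, assuming the dataloader yields dicts of tensors (which is what the unpacking error suggests), might look like this. Note that the stock evaluation loop should already move each batch to the model's device, so the root cause may lie elsewhere.
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

for step, batch in enumerate(eval_dataloader):
    # Hugging Face dataloaders typically yield a dict of tensors rather than
    # an (inputs, labels) tuple, hence the "too many values to unpack" error.
    batch = {k: (v.to(device) if isinstance(v, torch.Tensor) else v)
             for k, v in batch.items()}
    # ... feed `batch` to the model here ...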

How to use tf.data.Dataset.ignore_errors to ignore errors in a Tensorflow Dataset?

When loading images from a directory in Tensorflow, you use something like:
dataset = tf.keras.utils.image_dataset_from_directory(
    "S:\\Images",
    batch_size=32,
    image_size=(128, 128),
    label_mode=None,
    validation_split=0.20,  # Reserve 20% of images for validation
    subset='training',      # If we specify a validation_split, we *must* specify subset
    seed=619                # If using validation_split we *must* specify a seed to ensure there is no overlap between training and validation data
)
But of course some of the images (.jpg, .png, .gif, .bmp) will be invalid. So we want to ignore those errors; just skip them (and ideally log the filenames so they can be repaired, removed, or deleted).
There have been some ideas along the way of how to ignore invalid images:
Method 1: tf.contrib.data.ignore_errors (Tensorflow 1.x only)
Warning: The tf.contrib module will not be included in TensorFlow 2.0.
Sample usage:
dataset = dataset.apply(tf.contrib.data.ignore_errors())
The only downside of this method is that it was only available in TensorFlow 1. Trying to use it today simply won't work, as the tf.contrib namespace no longer exists. That led to a built-in method:
Method 2: tf.data.experimental.ignore_errors(log_warning=False) (deprecated)
From the documentation:
Creates a Dataset from another Dataset and silently ignores any errors. (deprecated)
Deprecated: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: Use tf.data.Dataset.ignore_errors instead.
Sample usage:
dataset = dataset.apply(tf.data.experimental.ignore_errors(log_warning=True))
And this method works. It works great. And it has the advantage of working.
But it's apparently deprecated, and the documentation says we should use Method 3:
Method 3 - tf.data.Dataset.ignore_errors(log_warning=False, name=None)
Drops elements that cause errors.
Sample usage:
dataset = dataset.ignore_errors(log_warning=True, name="Loading images from directory")
Except it doesn't work
The dataset.ignore_errors attribute doesn't work, and gives the error:
AttributeError: 'BatchDataset' object has no attribute 'ignore_errors'
Which means:
- the thing that works is deprecated,
- they tell us to use this other thing instead,
- and "provide the instructions for updating",
- but the other thing doesn't work.
So we ask Stack Overflow:
How do I use tf.data.Dataset.ignore_errors to ignore errors?
Bonus Reading
TensorFlow Dataset `.map` - Is it possible to ignore errors?
TensorFlow: How to skip broken data
Untested Workaround
Not only is it not what I was asking, but people are not allowed to read this:
It looks like the tf.data.Dataset.ignore_errors() method is not
available in the BatchDataset object, which is what you are using in
your code. You can try using tf.data.Dataset.filter() to filter out
elements that cause errors when loading the images. You can use a
try-except block inside the lambda function passed to filter() to
catch the errors and return False for elements that cause errors,
which will filter them out. Here's an example of how you can use
filter() to achieve this:
def filter_fn(x):
    try:
        # Load the image and do some processing
        # Return True if the image is valid, False otherwise
        return True
    except:
        return False

dataset = dataset.filter(filter_fn)
Alternatively, you can use the tf.data.experimental.ignore_errors()
method, which is currently available in TensorFlow 2.x. This method
will silently ignore any errors that occur while processing the
elements of the dataset. However, keep in mind that this method is
experimental and may be removed or changed in a future version.
tf.data.Dataset.ignore_errors was introduced in TensorFlow 2.11. You can use tf.data.experimental.ignore_errors for older versions like so:
dataset.apply(tf.data.experimental.ignore_errors())
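So a version-tolerant way to write it, assuming the dataset from the question, might be:
if hasattr(tf.data.Dataset, "ignore_errors"):
    # TensorFlow 2.11 and later
    dataset = dataset.ignore_errors(log_warning=True)
else:
    # Older TensorFlow 2.x
    dataset = dataset.apply(tf.data.experimental.ignore_errors(log_warning=True))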

Add file name and line number to tensorflow op name during debug mode

I am interested in a feature or hacky solution that allows every tensorflow (specifically tf1.x) op name to include the file name and line number where the op is defined, in an automated fashion across the entire code base. This will greatly facilitate tracking down the place where an op raises an error, such as the situation below:
File "tensorflow/contrib/distribute/python/mirrored_strategy.py", line 633, in _update
assert isinstance(var, values.DistributedVariable), var
AssertionError: Tensor("floordiv_2:0", shape=(), dtype=int64, device=/job:chief/replica:0/task:0/device:GPU:0)
Right now the best I can do is to take a wild guess where a floordiv might occur, but honestly I have no clue at the moment.
The easiest way might be to show the graph on TensorBoard and then search for the failing op by name. From the names of parent scopes or preceding operations, you would probably be able to tell which layer is failing.
If not that, wrap your layer calls and model constructions in scopes, as sketched below. Hopefully, if it is your tensor that is failing and not part of the optimizer or something else, you will see where to look.
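For instance, a minimal sketch of wrapping a single block in an explicit scope (the layer and names here are purely illustrative) could look like:
# Illustrative TF 1.x snippet: give a block of ops a recognizable prefix
# so a failing op shows up under "encoder_block_3/..." in the traceback
# or in TensorBoard.
with tf.name_scope("encoder_block_3"):
    hidden = tf.layers.dense(inputs, units=128, activation=tf.nn.relu)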
If you are dedicated to wrapping every op in a file name and line number, you can try to monkey-patch the tf.Operation constructor with a scope, which should be something along these lines:
from inspect import getframeinfo, stack
import tensorflow as tf

def scopify_ops(func):
    def wrapper(*args, **kwargs):
        # Record the file and line of the code that is creating the op.
        caller = getframeinfo(stack()[1][0])
        path = "%s:%d" % (caller.filename, caller.lineno)
        print("Caller info:", path)
        with tf.name_scope(path):
            return func(*args, **kwargs)
    return wrapper

tf.Operation.__init__ = scopify_ops(tf.Operation.__init__)

How to initialize tf.contrib.lookup.HashTable used in Tensorflow Estimator model_fn?

I have a tf.contrib.lookup.HashTable declared inside a Tensorflow Estimator model_fn. As the session is not directly available to us in Estimators, I am stuck with not being able to initialize the table. I am aware that, when not using Estimators, the table can be initialized with table.init.run() using the session.
I tried to initialize the table by using a SessionRunHook, which I was already using for some other purpose. I pass the table init op as an argument to session run in the before_run function, but the table is still not initialized. I also tried passing tf.tables_initializer() instead, but that did not work either. Another option I tried without success is the tf.add_to_collection(tf.GraphKeys.TABLE_INITIALIZERS, ...) call.
# sessionRunHook code below
class SaveToCSVHook(tf.train.SessionRunHook):
    def begin(self):
        samples_weights_table = session.graph.get_tensor_by_name('samples_weights_table:0')
        self.samples_weights_table_init_op = samples_weights_table.init
        self.table_init_op = tf.tables_initializer()  # also tried passing this to self.args instead - same result though
        tf.add_to_collection(tf.GraphKeys.TABLE_INITIALIZERS, samples_weights_table.init)

    def after_create_session(self, session, coord):
        self.args = {'table_init_op': self.samples_weights_table_init_op}

    def before_run(self, run_context):
        return tf.train.SessionRunArgs(self.args)

    def after_run(self, run_context, run_values):
        print(f"Got Values: {run_values.results}")

# Estimator model_fn code below
def model_fn(..):
    samples_weights_table = tf.contrib.lookup.HashTable(
        tf.contrib.lookup.KeyValueTensorInitializer(
            keysb, values,
            key_dtype=tf.string, value_dtype=tf.float32,
            name='samples_weights_table_init_op'),
        -1.0,
        name='samples_weights_table')
I get error:
FailedPreconditionError (see above for traceback): Table not initialized
which obviously means the table is not getting initialized
If anyone is interested in the answer: the hash table need not be explicitly initialized when used with Estimators; tables are initialized by default by high-level APIs like Estimators. The error goes away when the initializer code is removed, and the table works as expected.
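In other words, a minimal sketch of the working version (with illustrative keys, values, and feature names) just declares the table and uses it inside model_fn, with no hook and no explicit init op:
def model_fn(features, labels, mode, params):
    # Estimators run table initializers automatically, so no explicit
    # init op or SessionRunHook is needed here.
    samples_weights_table = tf.contrib.lookup.HashTable(
        tf.contrib.lookup.KeyValueTensorInitializer(
            keys, values, key_dtype=tf.string, value_dtype=tf.float32),
        default_value=-1.0,
        name='samples_weights_table')
    # Illustrative lookup; 'sample_id' is a hypothetical feature column.
    weights = samples_weights_table.lookup(features['sample_id'])
    # ... build predictions/loss/train_op and return a tf.estimator.EstimatorSpec ...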

Retrieving an unnamed variable in tensorflow

I've trained up a model and saved it in a checkpoint, but only just realized that I forgot to name one of the variables I'd like to inspect when I restore the model.
I know how to retrieve named variables from tensorflow (g = tf.get_default_graph() and then g.get_tensor_by_name(name)). In this case, I know the variable's scope, but it is unnamed. I've tried looking in tf.GraphKeys.GLOBAL_VARIABLES, but it doesn't appear there, for some reason.
Here's how it's defined in the model:
with tf.name_scope("contrastive_loss") as scope:
    l2_dist = tf.cast(tf.sqrt(1e-4 + tf.reduce_sum(tf.subtract(pred_left, pred_right), 1)), tf.float32)  # the variable I want
    # I use it here when calculating another named tensor, if that helps.
    con_loss = contrastive_loss(l2_dist)
    loss = tf.reduce_sum(con_loss, name="loss")
Is there any way of finding the variable without a name?
First of all, following up on my first comment, it makes sense that tf.get_collection given a name scope is not working. From the documentation, if you provide a scope, only variables or operations with assigned names will be returned. So that's out.
One thing you can try is to list the name of every node in your Graph with:
print([node.name for node in tf.get_default_graph().as_graph_def().node])
Or possibly, when restoring from a checkpoint:
saver = tf.train.import_meta_graph('/path/to/meta/graph')
sess = tf.Session()
saver.restore(sess, '/path/to/checkpoints')
graph = sess.graph
print([node.name for node in graph.as_graph_def().node])
Another option is to display the graph using TensorBoard, or in a Jupyter Notebook with a show_graph helper (originally defined in a separate git repository, though there might be a built-in show_graph now). You will then have to search for your operation in the graph and then probably retrieve it with:
my_op = tf.get_collection('full_operation_name')[0]
If you want to set it up in the future so that you can retrieve it by name, you need to add it to a collection using tf.add_to_collection:
my_op = tf.some_operation(stuff, name='my_op')
tf.add_to_collection('my_op_name', my_op)
Then retrieve it by restoring your graph and then using:
my_restored_op = tf.get_collection('my_op_name')[0]
You might also be able to get by just naming it and then specifying its scope in tf.get_collection instead, but I am not sure. More information and a helpful tutorial can be found here.
tf.get_collection does not work with unnamed variables. So list the operations with:
graph = sess.graph
print(graph.get_operations())
... find your tensor in the list and then:
global_step_tensor = graph.get_tensor_by_name('complete_operation_name:0')
And I found this tutorial very helpful to understand the mechanism behind these.