Add file name and line number to tensorflow op name during debug mode - tensorflow

I am interested in a feature or hacky solution that allows every tensorflow (specifically tf1.x) op name to include the file name and line number where the op is defined, in an automated fashion across the entire code base. This will greatly facilitate tracking down the place where an op raises an error, such as the situation below:
File "tensorflow/contrib/distribute/python/mirrored_strategy.py", line 633, in _update
assert isinstance(var, values.DistributedVariable), var
AssertionError: Tensor("floordiv_2:0", shape=(), dtype=int64, device=/job:chief/replica:0/task:0/device:GPU:0)
Right now the best I can do is to take a wild guess where a floordiv might occur, but honestly I have no clue at the moment.

The easiest way might be to show the graph on tensorboard and then search for failing op by name. Among the names of parent scopes or preceding operations you would probably be able to tell which layer is failing.
If not that, wrap your layer calls, model constructions in scopes. Hopefully, if this is your tensor failing and not a part of optimizer or else, you would see the direction where to look at.
If you are dedicated to wrap every op in a file-name/line-number, you can try to monkey-patch tf.Operation constructor with a scope. Which should be something along the next lines:
from inspect import getframeinfo, stack
import tensorflow as tf
def scopify_ops(func):
def wrapper(*args, **kwargs):
caller = getframeinfo(stack()[1][0])
path = "%s:%d - %s" % (caller.filename, caller.lineno, message)
print("Caller info:", path)
with tf.name_scope("path")
return func(*args, **kwargs)
return wrapper
tf.Operation.__init__ = scopify_ops(tf.Operation.__init__)

Related

How to use tf.data.Dataset.ignore_errors to ignore errors in a Tensorflow Dataset?

When loading images from a directory in Tensorflow, you use something like:
dataset = tf.keras.utils.image_dataset_from_directory(
"S:\\Images",
batch_size=32,
image_size=(128,128),
label_mode=None,
validation_split=0.20, #Reserve 20% of images for validation
subset='training', #If we specify a validation_split, we *must* specify subset
seed=619 #If using validation_split we *must* specify a seed to ensure there is no overlap between training and validation data
)
But of course some of the images (.jpg, .png, .gif, .bmp) will be invalid. So we want to ignore those errors; just skip them (and ideally log the filenames so they can be repaired, removed, or deleted).
There have been some ideas along the way of how to ignore invalid images:
Method 1: tf.contrib.data.ignore_errors (Tensorflow 1.x only)
Warning: The tf.contrib module will not be included in TensorFlow 2.0.
Sample usage:
dataset = dataset.apply(tf.contrib.data.ignore_errors())
The only down-side of this method is that it was only available in Tensorflow 1. Trying to use it today simply won't work, as the tf.contib namespace no longer exists. That led to a built-in method:
Method 2: tf.data.experimental.ignore_errors(log_warning=False) (deprecated)
From the documentation:
Creates a Dataset from another Dataset and silently ignores any errors. (deprecated)
Deprecated: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: Use tf.data.Dataset.ignore_errors instead.
Sample usage:
dataset = dataset.apply(tf.data.experimental.ignore_errors(log_warning=True))
And this method works. It works great. And it has the advantage of working.
But it's apparently deprecated, and they documentation says we should use method 3:
Method 3 - tf.data.Dataset.ignore_errors(log_warning=False, name=None)
Drops elements that cause errors.
Sample usage:
dataset = dataset.ignore_errors(log_warning=True, name="Loading images from directory")
Except it doesn't work
The dataset.ignore_errors attribute doesn't work, and gives the error:
AttributeError: 'BatchDataset' object has no attribute 'ignore_errors'
Which means:
the thing that works is deprecated
they tell us to use this other thing instead
and "provide the instructions for updating"
but the other thing doesn't work
So we ask Stackoverflow:
How do i use tf.data.Dataset.ignore_errors to ignore errors?
Bonus Reading
TensorFlow Dataset `.map` - Is it possible to ignore errors?
TensorFlow: How to skip broken data
Untested Workaround
Not only is it not what i was asking, but people are not allowed to read this:
It looks like the tf.data.Dataset.ignore_errors() method is not
available in the BatchDataset object, which is what you are using in
your code. You can try using tf.data.Dataset.filter() to filter out
elements that cause errors when loading the images. You can use a
try-except block inside the lambda function passed to filter() to
catch the errors and return False for elements that cause errors,
which will filter them out. Here's an example of how you can use
filter() to achieve this:
def filter_fn(x):
try:
# Load the image and do some processing
# Return True if the image is valid, False otherwise
return True
except:
return False
dataset = dataset.filter(filter_fn)
Alternatively, you can use the tf.data.experimental.ignore_errors()
method, which is currently available in TensorFlow 2.x. This method
will silently ignore any errors that occur while processing the
elements of the dataset. However, keep in mind that this method is
experimental and may be removed or changed in a future version.
tf.data.Dataset.ignore_errors was introduced in TensorFlow 2.11. You can use tf.data.experimental.ignore_errors for older versions like so:
dataset.apply(tf.data.experimental.ignore_errors())

TFX custom config argument in trainer not working

This question is based on the TFX recommender tutorial. Please note that the code is being orchestrated by LocalDagRunner rather than run interactively in a notebook.
In the Trainer, we pass in a custom_config with the transformed ratings and movies:
trainer = tfx.components.Trainer(
module_file=os.path.abspath(_trainer_module_file),
examples=ratings_transform.outputs['transformed_examples'],
transform_graph=ratings_transform.outputs['transform_graph'],
schema=ratings_transform.outputs['post_transform_schema'],
train_args=tfx.proto.TrainArgs(num_steps=500),
eval_args=tfx.proto.EvalArgs(num_steps=10),
custom_config={
'epochs':5,
'movies':movies_transform.outputs['transformed_examples'],
'movie_schema':movies_transform.outputs['post_transform_schema'],
'ratings':ratings_transform.outputs['transformed_examples'],
'ratings_schema':ratings_transform.outputs['post_transform_schema']
})
The problem is that all of the outputs passed into custom_config seem to be empty. This results in errors, for example
class MovielensModel(tfrs.Model):
def __init__(self, user_model, movie_model, tf_transform_output, movies_uri):
super().__init__()
self.movie_model: tf.keras.Model = movie_model
self.user_model: tf.keras.Model = user_model
movies_artifact = movies_uri.get()[0]
complains that movie_uri.get() is empty. The same is true for ratings. Ratings passed in through the examples parameter however are not empty (the artefact uri is available), so it seems as though this custom_config is 'breaking things'.
I have tried debugging it but to no avail. I did notice that arguments in custom_config are serialised and deserialised, but this didn't seem to be the cause of the problem. Does anyone know why this happens and how to resolve this?

Error evaluating a TensorArray in a while loop

I've built the following TensorArray:
ta = tf.TensorArray(
dtype=tf.float32,
size=0,
dynamic_size=True,
element_shape=tf.TensorShape([None, None])
)
and called ta = ta.write(idx, my_tensor) inside a while_loop.
When evaluating the output = ta.stack() tensor in a session, I receive this error message:
ValueError: Cannot use '.../TensorArrayWrite/TensorArrayWriteV3' as
input to '.../TensorArrayStack_1/TensorArraySizeV3' because
'.../TensorArrayWrite/TensorArrayWriteV3' is in a while loop. See info
log for more details.
I don't understand this error message, could you please help me ?
Update: A minimal example might be difficult to come up with, but this is what I am doing: I am using the reference to the ta TensorArray inside the cell_input_fn of AttentionWrapper. This callback is used in AttentionWrapper's call method, where another TensorArray named alignment_history is being written. Therefore the while_loop code is not designed by me, it's part of the TF dynamic RNN computation tf.nn.dynamic_rnn.
Not sure if this is what's biting you, but you have to make sure your while_loop function takes the tensor array as input and emits an updated one as output; and you have to use the final version of the TensorArray at the end of the while_loop:
def fn(ta_old):
return ta_old.write(...)
ta_final = while_loop(..., body=fn, [tf.TensorArray(...)])
values = ta_final.stack()
specifically you should never access ta_old outside of fn().

Declaring theano variables for pymc3

I am having issues replicating a pymc2 code using pymc3.
I believe it is due to the fact pymc3 is using the theano type variables which are not compatible with the numpy operations I am using. So I am using the #theano.decorator:
I have this function:
with pymc3.Model() as model:
z_stars = pymc3.Uniform('z_star', self.z_min_ssp_limit, self.z_max_ssp_limit)
Av_stars = pymc3.Uniform('Av_star', 0.0, 5.00)
sigma_stars = pymc3.Uniform('sigma_star',0.0, 5.0)
#Fit observational wavelength
ssp_fit_output = self.ssp_fit_theano(z_stars, Av_stars, sigma_stars,
self.obj_data['obs_wave_resam'],
self.obj_data['obs_flux_norm_masked'],
self.obj_data['basesWave_resam'],
self.obj_data['bases_flux_norm'],
self.obj_data['int_mask'],
self.obj_data['normFlux_obs'])
#Define likelihood
like = pymc.Normal('ChiSq', mu=ssp_fit_output,
sd=self.obj_data['obs_fluxEr_norm'],
observed=self.obj_data['obs_fluxEr_norm'])
#Run the sampler
trace = pymc3.sample(iterations, step=step, start=start_conditions, trace=db)
where:
#theano.compile.ops.as_op(itypes=[t.dscalar,t.dscalar,t.dscalar,t.dvector,
t.dvector,t.dvector,t.dvector,t.dvector,t.dscalar],
otypes=[t.dvector])
def ssp_fit_theano(self, input_z, input_sigma, input_Av, obs_wave, obs_flux_masked,
rest_wave, bases_flux, int_mask, obsFlux_mean):
...
...
The first three variables are scalars (from the pymc3 uniform distribution). The
remaining variables are numpy arrays and the last one is a float. However, I am
getting this "'numpy.ndarray' object has no attribute 'type'" error:
File "/home/user/anaconda/lib/python2.7/site-packages/theano/gof/op.py", line 615, in __call__
node = self.make_node(*inputs, **kwargs)
File "/home/user/anaconda/lib/python2.7/site-packages/theano/gof/op.py", line 963, in make_node
if not all(inp.type == it for inp, it in zip(inputs, self.itypes)):
File "/home/user/anaconda/lib/python2.7/site-packages/theano/gof/op.py", line 963, in <genexpr>
if not all(inp.type == it for inp, it in zip(inputs, self.itypes)):
AttributeError: 'numpy.ndarray' object has no attribute 'type'
Please any advice in the right direction will be most welcomed.
I had a bunch of time-wasting-stops when I went from pymc2 to pymc3. The problem, I think, is that the doc is quite bad. I suspect they neglect the doc as far as the code is still evolving. 3 comments/advises:
I wish you could find some help using '#theano.compile.ops.as_op' here: failure to adapt pymc2 into pymc3 or here how to fit a method belonging to an instance with pymc3?
The drawback of '#theano.compile.ops.as_op' is that you implicitly exclude any analysis related to the gradient of your function. To have access to the gradient, I think you need to define your function in a more complex way presented here how to fit a method belonging to an instance with pymc3?
warning: for the moment, using theano seems to be a source of problem if you want to distribute your code under Windows. See build a .exe for Windows from a python 3 script importing theano with pyinstaller, but I am not sure whether it is just a personal clumsiness or really a problem. Personally I had to give up theano to be able to distribute my code...

How can I reroute the training input pipeline to test pipeline in tensorflow using tf.contrib.graph_editor?

Suppose now I have a training input pipeline which finally generate train_x and train_y using tf.train.shuffle_batch. I export meta graph and re-import the graph in another code file. Now I want to detach the input pipeline, i.e., the train_x and train_y, and connect a new test_x and test_y. How can I make accomplish this using tf.contrib.graph_editor?
EDIT: As suggested by #iga, I change my input directory using input_map
filenames = tf.train.match_filenames_once(FLAGS.data_dir + '*', name='matching_filenames')
if FLAGS.ckpt != '':
latest = FLAGS.log_dir + FLAGS.ckpt
else:
latest = tf.train.latest_checkpoint(FLAGS.log_dir)
if not latest or not os.path.exists(latest+'.meta'):
print("checkpoint " + latest + " does not exist")
sys.exit(1)
saver = tf.train.import_meta_graph(latest+'.meta',
input_map={'matching_filenames:0':filenames},
import_scope='import')
g = tf.get_default_graph()
but I get the following error:
ValueError: graph_def is invalid at node u'matching_filenames/Assign':
Input tensor 'matching_filenames:0' Cannot convert a tensor of type
string to an input of type string_ref.
Are there any elegant way to resolve this?
For this task, you should be able to just use input_map argument to https://www.tensorflow.org/api_docs/python/tf/import_graph_def. If you are using import_meta_graph, you can pass the input_map into its kwargs and it will get passed down to import_graph_def.
RESPONSE TO EDIT: I am assuming that your original graph (the one you are deserializing) had the same matching_filenames variable. Quite confusingly, the tensor name "matching_filenames:0" actually refers to the tensor going from the VariableV2 op to the Assign op. The type of this edge is string_ref and you don't really want to break that edge.
The output from a variable typically goes through an identity op called matching_filenames/read. This is what you want to use as the key in your input_map. For the value, you want the same tensor in your new filenames. So, your call should probably look like:
tf.train.import_meta_graph(latest+'.meta',
input_map={'matching_filenames/read': filenames.read_value()},
import_scope='import')
In general, variables are fairly complicated. If this does not work, you can use some placeholder op and feed the names into it manually.