How to keep vectors in a dictionary in TensorFlow?

It seems that tf.lookup.experimental.DenseHashTable cannot hold vectors and I could not find examples of how to use it.

Below you can find a simple implementation of a dictionary of vectors in TensorFlow. It also serves as an example of how to use tf.lookup.experimental.DenseHashTable and tf.TensorArray.
As said, vectors cannot be kept in tf.lookup.experimental.DenseHashTable, so tf.TensorArray is used to hold the actual vectors.
Of course, this is a simple example, and it does not include deletion of entries from the dictionary, an operation that would require some management of the free cells of the array. Also, you should read the respective API pages of tf.lookup.experimental.DenseHashTable and tf.TensorArray to see how to tune them for your needs.
import tensorflow as tf

class DictionaryOfVectors:
    def __init__(self, dtype):
        empty_key = tf.constant('')
        deleted_key = tf.constant('deleted')
        self.ht = tf.lookup.experimental.DenseHashTable(key_dtype=tf.string,
                                                        value_dtype=tf.int32,
                                                        default_value=-1,
                                                        empty_key=empty_key,
                                                        deleted_key=deleted_key)
        self.ta = tf.TensorArray(dtype, size=0, dynamic_size=True, clear_after_read=False)
        self.inserts_counter = 0

    @tf.function
    def insertOrAssign(self, key, vec):
        # Insert the vector into the TensorArray. The write() method returns a new
        # TensorArray object with flow that ensures the write occurs. It should be
        # used for subsequent operations.
        with tf.init_scope():
            self.ta = self.ta.write(self.inserts_counter, vec)
            # Insert the same counter value into the hash table
            self.ht.insert_or_assign(key, self.inserts_counter)
            self.inserts_counter += 1

    @tf.function
    def lookup(self, key):
        with tf.init_scope():
            index = self.ht.lookup(key)
            return self.ta.read(index)

dictionary_of_vectors = DictionaryOfVectors(dtype=tf.float32)
dictionary_of_vectors.insertOrAssign('first', [1, 2, 3, 4, 5])
print(dictionary_of_vectors.lookup('first'))
The example is a bit more sophisticated, as the insert and lookup methods are decorated with @tf.function. Because the methods change variables defined outside of them, tf.init_scope() is used. You might ask what changes in the lookup() method, as it only reads from the hash table and the array. The reason is that in graph mode, the index returned from the lookup() call is a Tensor, and the TensorArray implementation contains a line with if index < 0: which fails with:
OperatorNotAllowedInGraphError: using a tf.Tensor as a Python bool is not allowed.
When we use tf.init_scope(), as explained in its API documentation, "code inside an init_scope block runs with eager execution enabled even when tracing a tf.function". So in that case the index is not a symbolic Tensor but a concrete scalar.
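To make the tracing-versus-eager distinction concrete, here is a minimal sketch (not part of the original answer) showing that the same lookup yields a symbolic Tensor while tracing but a concrete eager value inside tf.init_scope():

import tensorflow as tf

table = tf.lookup.experimental.DenseHashTable(
    key_dtype=tf.string, value_dtype=tf.int32, default_value=-1,
    empty_key=tf.constant(''), deleted_key=tf.constant('deleted'))
table.insert_or_assign(tf.constant('first'), tf.constant(0))

@tf.function
def lookup_in_both_modes():
    traced_index = table.lookup(tf.constant('first'))
    print('while tracing:', traced_index)          # a symbolic Tensor, no concrete value yet
    with tf.init_scope():
        eager_index = table.lookup(tf.constant('first'))
        print('inside init_scope:', eager_index)   # an EagerTensor holding the concrete index

lookup_in_both_modes()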

Related

What is the intuition behind the Iterator.get_next method?

The name of the method get_next() is a little bit misleading. The documentation says
Returns a nested structure of tf.Tensors representing the next element.
In graph mode, you should typically call this method once and use its result as the input to another computation. A typical loop will then call tf.Session.run on the result of that computation. The loop will terminate when the Iterator.get_next() operation raises tf.errors.OutOfRangeError. The following skeleton shows how to use this method when building a training loop:
dataset = ...  # A `tf.data.Dataset` object.
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

# Build a TensorFlow graph that does something with each element.
loss = model_function(next_element)
optimizer = ...  # A `tf.compat.v1.train.Optimizer` object.
train_op = optimizer.minimize(loss)

with tf.compat.v1.Session() as sess:
    try:
        while True:
            sess.run(train_op)
    except tf.errors.OutOfRangeError:
        pass
Python also has a built-in function called next, which needs to be called every time we want the next element of an iterator. However, according to the documentation of get_next() quoted above, get_next() should be called only once and its result should be evaluated by calling the session's run method. This is a little unintuitive, because I was used to Python's built-in next. In this script, get_next() is also called only once, and the result of the call is evaluated at every step of the computation.
What is the intuition behind get_next(), and how is it different from next()? I think that the next element of the dataset (or feedable iterator), in the second example I linked above, is retrieved every time the result of the first call to get_next() is evaluated by calling run, but this is a little unintuitive. I don't get why we do not need to call get_next at every step of the computation (to get the next element of the feedable iterator), even after reading this note in the documentation:
NOTE: It is legitimate to call Iterator.get_next() multiple times, e.g. when you are distributing different elements to multiple devices in a single step. However, a common pitfall arises when users call Iterator.get_next() in each iteration of their training loop. Iterator.get_next() adds ops to the graph, and executing each op allocates resources (including threads); as a consequence, invoking it in every iteration of a training loop causes slowdown and eventual resource exhaustion. To guard against this outcome, we log a warning when the number of uses crosses a fixed threshold of suspiciousness.
In general, it is not clear how the Iterator works.
The idea is that get_next adds some operations to the graph such that, every time you evaluate them, you get the next element in the dataset. On each iteration you just need to run the operations that get_next created; you do not need to create them over and over again.
Maybe a good way to get an intuition is to try to write an iterator yourself. Consider something like the following:
import tensorflow as tf
tf.compat.v1.disable_v2_behavior()

# Make an iterator, returns next element and initializer
def iterator_next(data):
    data = tf.convert_to_tensor(data)
    i = tf.Variable(0)
    # Check we are not out of bounds
    with tf.control_dependencies([tf.assert_less(i, tf.shape(data)[0])]):
        # Get next value
        next_val_1 = data[i]
    # Update index after the value is read
    with tf.control_dependencies([next_val_1]):
        i_updated = tf.compat.v1.assign_add(i, 1)
    with tf.control_dependencies([i_updated]):
        next_val_2 = tf.identity(next_val_1)
    return next_val_2, i.initializer

# Test
with tf.compat.v1.Graph().as_default(), tf.compat.v1.Session() as sess:
    # Example data
    data = tf.constant([1, 2, 3, 4])
    # Make operations that give you the next element
    next_val, iter_init = iterator_next(data)
    # Initialize iterator
    sess.run(iter_init)
    # Iterate until exception is raised
    while True:
        try:
            print(sess.run(next_val))
        # assert throws InvalidArgumentError
        except tf.errors.InvalidArgumentError:
            break
Output:
1
2
3
4
Here, iterator_next gives you something comparable to what get_next in an iterator would give you, plus an initializer operation. Every time you run next_val you get a new element from data; you don't need to call the function every time (which is how next works in Python), you call it once and then evaluate its result multiple times.
EDIT: The function iterator_next above could also be simplified to the following:
def iterator_next(data):
    data = tf.convert_to_tensor(data)
    # Start from -1
    i = tf.Variable(-1)
    # First increment i
    i_updated = tf.compat.v1.assign_add(i, 1)
    with tf.control_dependencies([i_updated]):
        # Check i is not out of bounds
        with tf.control_dependencies([tf.assert_less(i, tf.shape(data)[0])]):
            # Get next value
            next_val = data[i]
    return next_val, i.initializer
Or even simpler:
def iterator_next(data):
    data = tf.convert_to_tensor(data)
    i = tf.Variable(-1)
    i_updated = tf.compat.v1.assign_add(i, 1)
    # Using i_updated directly as a value is equivalent to using i with
    # a control dependency to i_updated
    with tf.control_dependencies([tf.assert_less(i_updated, tf.shape(data)[0])]):
        next_val = data[i_updated]
    return next_val, i.initializer
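For comparison, here is a rough sketch (in the same compat style, not part of the original answer) of the same loop with an actual tf.data iterator: get_next() is likewise called once, outside the loop, and only the resulting tensor is evaluated repeatedly:

import tensorflow as tf
tf.compat.v1.disable_v2_behavior()

with tf.compat.v1.Graph().as_default(), tf.compat.v1.Session() as sess:
    dataset = tf.compat.v1.data.Dataset.from_tensor_slices([1, 2, 3, 4])
    iterator = tf.compat.v1.data.make_initializable_iterator(dataset)
    next_element = iterator.get_next()  # called once; it only adds ops to the graph
    sess.run(iterator.initializer)
    while True:
        try:
            # Evaluating the same tensor repeatedly advances the iterator,
            # just like running next_val in iterator_next above.
            print(sess.run(next_element))
        except tf.errors.OutOfRangeError:
            break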

How to initialize tf.contrib.lookup.HashTable used in Tensorflow Estimator model_fn?

I have a tf.contrib.lookup.HashTable declared inside a TensorFlow Estimator model_fn. As the session is not directly available to us in Estimators, I am stuck and unable to initialize the table. I am aware that, when not used with Estimators, the table can be initialized with table.init.run() using the session.
I tried to initialize the table by using a SessionRunHook which I was already using for some other purpose. I pass the table init op as an argument to the session run in the before_run function, but the table is still not initialized. I also tried passing tf.tables_initializer() instead, but that did not work either. Another option I tried without success is the tf.add_to_collection(tf.GraphKeys.TABLE_INITIALIZERS, ...) command.
# SessionRunHook code below
class SaveToCSVHook(tf.train.SessionRunHook):
    def begin(self):
        samples_weights_table = session.graph.get_tensor_by_name('samples_weights_table:0')
        self.samples_weights_table_init_op = samples_weights_table.init
        self.table_init_op = tf.tables_initializer()  # also tried passing this to self.args instead - same result though
        tf.add_to_collection(tf.GraphKeys.TABLE_INITIALIZERS, samples_weights_table.init)

    def after_create_session(self, session, coord):
        self.args = {'table_init_op': self.samples_weights_table_init_op}

    def before_run(self, run_context):
        return tf.train.SessionRunArgs(self.args)

    def after_run(self, run_context, run_values):
        print(f"Got Values: {run_values.results}")

# Estimator model_fn code below
def model_fn(..):
    samples_weights_table = tf.contrib.lookup.HashTable(
        tf.contrib.lookup.KeyValueTensorInitializer(keysb, values,
                                                    key_dtype=tf.string,
                                                    value_dtype=tf.float32,
                                                    name='samples_weights_table_init_op'),
        -1.0, name='samples_weights_table')
I get error:
FailedPreconditionError (see above for traceback): Table not initialized
which obviously means the table is not getting initialized
If anyone is interested in the answer: the hash table need not be explicitly initialized when used with Estimators. Tables are initialized by default by high-level APIs like Estimator, so the error goes away when the initializer code is removed, and the table works as expected.
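For illustration, a minimal hypothetical sketch (TF 1.x style, made-up keys and feature names) of a model_fn that uses such a table with no explicit initialization anywhere:

import tensorflow as tf

def model_fn(features, labels, mode):
    # The table is created inside model_fn; no table.init.run() and no hook is needed,
    # because the Estimator framework runs table initializers when it sets up the session.
    keys = tf.constant(['a', 'b'], dtype=tf.string)
    values = tf.constant([0.3, 0.7], dtype=tf.float32)
    samples_weights_table = tf.contrib.lookup.HashTable(
        tf.contrib.lookup.KeyValueTensorInitializer(keys, values), -1.0,
        name='samples_weights_table')
    weight = samples_weights_table.lookup(features['key'])
    return tf.estimator.EstimatorSpec(mode, predictions={'weight': weight})

def input_fn():
    return tf.data.Dataset.from_tensor_slices({'key': ['a', 'b', 'c']}).batch(1)

estimator = tf.estimator.Estimator(model_fn=model_fn)
for prediction in estimator.predict(input_fn):
    print(prediction['weight'])  # the default value -1.0 comes back for the unknown key 'c'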

Tensorflow Dataset API: parallelising tf.data.Dataset.from_generator with parallel_interleave

In a production environment I have data coming in from N producers that has to go through a network. I found this comment on parallelising tf.data.Dataset.from_generator which really describes what I want.
def generator(n):
    # returns the n-th generator function
    ...

def dataset(n):
    return tf.data.Dataset.from_generator(generator(n))

ds = tf.data.Dataset.range(N).apply(
    tf.contrib.data.parallel_interleave(dataset, cycle_length=N))
# where N is the number of generators you use
However, what should the generator(n) function look like? Because when I run this sample with
def generator(n):
    """Returns the n-th generator function (for consumer n)."""
    consumer = self.consumers[n]
    def gen():
        for item in consumer:
            yield item
    return gen
with self.consumers being a Python list, I get the error:
TypeError: list indices must be integers or slices, not Tensor
The implementation is almost correct, but you're getting an error because the n argument in dataset(n) is a "symbolic" tf.Tensor, and not an actual value that can be used to look up a consumer in self.consumers.
Fortunately, there is a workaround, which involves passing n through the optional args argument to tf.data.Dataset.from_generator():
def dataset(n):
    return tf.data.Dataset.from_generator(generator, args=(n,))
Under the covers, from_generator() inserts some code to convert n to a Python integer before each call to generator.
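Putting the pieces together, an end-to-end sketch might look roughly like this (hypothetical stand-in consumers, and output_types added because from_generator requires it; TF 1.x style):

import tensorflow as tf

N = 4
consumers = [range(i * 10, i * 10 + 3) for i in range(N)]  # hypothetical stand-in producers

def generator(n):
    # n arrives here as a NumPy scalar, so it can index a plain Python list
    for item in consumers[int(n)]:
        yield item

def dataset(n):
    return tf.data.Dataset.from_generator(generator, output_types=tf.int32, args=(n,))

ds = tf.data.Dataset.range(N).apply(
    tf.contrib.data.parallel_interleave(dataset, cycle_length=N))

next_element = ds.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    try:
        while True:
            print(sess.run(next_element))
    except tf.errors.OutOfRangeError:
        pass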

optimize variable with dynamic shape

I want to use a variable where the shape is unknown in advance and it will change from time to time (although ndim is known and fixed).
I declare it like:
initializer = tf.random_uniform_initializer()
shape = (s0, s1, s2) # these are symbolic vars
foo_var = tf.Variable(initializer(shape=shape), name="foo", validate_shape=False)
This seems to work when I create the computation graph up to the point where I want to optimize w.r.t. this variable, i.e.:
optimizer = tf.train.AdamOptimizer(learning_rate=0.1, epsilon=1e-4)
optim = optimizer.minimize(loss, var_list=[foo_var])
That fails in the optimizer in some function create_zeros_slot where it seems to depend on the static shape information (it uses primary.get_shape().as_list()). (I reported this upstream here.)
So, using the optimizer works only with variables with static shape?
I.e. for every change of the shape of the variable, I need to rebuild the computation graph?
Or is there any way to avoid the recreation?
What you are doing does not make any sense. How would you optimize the values of a dynamic variable if its shape is changing? Sometimes a value would be there and sometimes it would not. When you go to save the graph, which shape would the variable be in? The Adam optimizer also needs to store information about each parameter in a variable between executions, which it cannot do without knowing the size.
Now, if what you are looking to do is only use part of the variable at a time, you can take slices of it and use them. This will work fine as long as the variable has a fixed shape of the maximum bounds of the slice. For instance...
initializer = tf.random_uniform_initializer()
shape = (S0, S1, S2) # these are now constants for the max bounds of si
foo_var = tf.Variable(initializer(shape=shape), name="foo")
foo_var_sub = foo_var[:s0, :s1, :s2]
In this case the optimizer will only act on the parts of foo_var which contributed to the slice.
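For instance, a rough sketch of how that slice plugs into the optimizer (hypothetical loss and sizes; only the sliced entries receive gradient updates):

import tensorflow as tf

S0, S1, S2 = 8, 8, 8  # fixed maximum bounds
s0, s1, s2 = 3, 4, 5  # extent actually used right now

initializer = tf.random_uniform_initializer()
foo_var = tf.Variable(initializer(shape=(S0, S1, S2)), name="foo")
foo_var_sub = foo_var[:s0, :s1, :s2]

loss = tf.reduce_sum(tf.square(foo_var_sub))  # hypothetical loss built from the slice only
optim = tf.train.AdamOptimizer(learning_rate=0.1, epsilon=1e-4).minimize(loss, var_list=[foo_var])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(3):
        _, loss_val = sess.run([optim, loss])
        print(loss_val)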
My solution at the moment is a bit ugly but works.
def _tf_create_slot_var(primary, val, scope):
    """Helper function for creating a slot variable."""
    from tensorflow.python.ops import variables
    slot = variables.Variable(val, name=scope, trainable=False,
                              validate_shape=primary.get_shape().is_fully_defined())
    # pylint: disable=protected-access
    if isinstance(primary, variables.Variable) and primary._save_slice_info:
        # Primary is a partitioned variable, so we need to also indicate that
        # the slot is a partitioned variable. Slots have the same partitioning
        # as their primaries.
        real_slot_name = scope[len(primary.op.name + "/"):-1]
        slice_info = primary._save_slice_info
        slot._set_save_slice_info(variables.Variable.SaveSliceInfo(
            slice_info.full_name + "/" + real_slot_name,
            slice_info.full_shape[:],
            slice_info.var_offset[:],
            slice_info.var_shape[:]))
    # pylint: enable=protected-access
    return slot

def _tf_create_zeros_slot(primary, name, dtype=None, colocate_with_primary=True):
    """Create a slot initialized to 0 with same shape as the primary object.

    Args:
      primary: The primary `Variable` or `Tensor`.
      name: Name to use for the slot variable.
      dtype: Type of the slot variable. Defaults to the type of `primary`.
      colocate_with_primary: Boolean. If True the slot is located
        on the same device as `primary`.

    Returns:
      A `Variable` object.
    """
    if dtype is None:
        dtype = primary.dtype
    from tensorflow.python.ops import array_ops
    val = array_ops.zeros(
        primary.get_shape().as_list() if primary.get_shape().is_fully_defined() else tf.shape(primary),
        dtype=dtype)
    from tensorflow.python.training import slot_creator
    return slot_creator.create_slot(primary, val, name, colocate_with_primary=colocate_with_primary)

def monkey_patch_tf_slot_creator():
    """
    The TensorFlow optimizers cannot handle variables with unknown shape.
    We hack this.
    """
    from tensorflow.python.training import slot_creator
    slot_creator._create_slot_var = _tf_create_slot_var
    slot_creator.create_zeros_slot = _tf_create_zeros_slot
Then I need to call monkey_patch_tf_slot_creator() at some point.
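Usage would then look roughly like this (a hypothetical sketch; the patch has to be installed before minimize() creates the slot variables):

import tensorflow as tf

monkey_patch_tf_slot_creator()  # patch slot creation before any optimizer slots exist

s0 = tf.placeholder(tf.int32, shape=())  # dynamic size, hypothetical
initializer = tf.random_uniform_initializer()
foo_var = tf.Variable(initializer(shape=tf.stack([s0, 4])), name="foo", validate_shape=False)
loss = tf.reduce_sum(tf.square(foo_var))  # hypothetical loss
optimizer = tf.train.AdamOptimizer(learning_rate=0.1, epsilon=1e-4)
optim = optimizer.minimize(loss, var_list=[foo_var])  # slots are now created without static shapes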

How to filter tensor from queue based on some predicate in tensorflow?

How can I filter data stored in a queue using a predicate function? For example, let's say we have a queue that stores tensors of features and labels and we just need those that meet the predicate. I tried the following implementation without success:
feature, label = queue.dequeue()
if (predicate(feature, label)):
    enqueue_op = another_queue.enqueue(feature, label)
The most straightforward way to do this is to dequeue a batch, run it through the predicate test, use tf.where to produce a dense vector of the indices that match the predicate, use tf.gather to collect the results, and enqueue that batch into the second queue. If you want that to happen automatically, you can start a queue runner on the second queue; the easiest way to do that is to use tf.train.batch:
Example:
import numpy as np
import tensorflow as tf

a = tf.constant(np.array([5, 1, 9, 4, 7, 0], dtype=np.int32))

q = tf.FIFOQueue(6, dtypes=[tf.int32], shapes=[])
enqueue = q.enqueue_many([a])
dequeue = q.dequeue_many(6)
predmatch = tf.less(dequeue, [5])
selected_items = tf.reshape(tf.where(predmatch), [-1])
found = tf.gather(dequeue, selected_items)

secondqueue = tf.FIFOQueue(6, dtypes=[tf.int32], shapes=[])
enqueue2 = secondqueue.enqueue_many([found])
dequeue2 = secondqueue.dequeue_many(3)  # XXX, hardcoded

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(enqueue)   # Fill the first queue
    sess.run(enqueue2)  # Filter, push into queue 2
    print(sess.run(dequeue2))  # Pop items off of queue2
The predicate produces a boolean vector; the tf.where produces a dense vector of the indexes of the true values, and the tf.gather collects items from your original tensor based upon those indexes.
A lot of things are hardcoded in this example that you'd need to parameterize in reality, of course, but hopefully it shows the structure of what you're trying to do (create a filtering pipeline). In practice, you'd want QueueRunners on there to keep things churning automatically; tf.train.batch is very useful for handling that -- see Threading and Queues for more detail.
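As a rough idea of what that could look like (a sketch, not from the original answer; it reuses a, q, enqueue and found from the example above and lets tf.train.batch own the second queue and its QueueRunner):

# tf.train.batch creates a queue plus a QueueRunner that repeatedly enqueues `found`;
# enqueue_many=True treats the first dimension of `found` as a batch of items.
filtered_batch = tf.train.batch([found], batch_size=3, enqueue_many=True)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(enqueue)    # Fill the first queue
    sess.run(q.close())  # No more input, so the runner can shut down cleanly
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    print(sess.run(filtered_batch))  # The queue runner pushed the filtered items for us
    coord.request_stop()
    coord.join(threads)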