How to use apply and map function on tf dataset with custom function - tensorflow

I am trying to understand how to use the map and apply functions in TensorFlow. The objective is to perform all the basic data pre-processing steps while reading the data into TensorFlow, since map gives an option to parallelize the operation.
a_list = [b"THis is for Testing"]
Converting a_list into a tf.data.Dataset:
a_dataset = tf.data.Dataset.from_tensors(a_list)
print(list(a_dataset.as_numpy_iterator()))
[array([b'THis is for Testing'], dtype=object)]
Applying map to calculate len of list
a_dataset_len = a_dataset.map(lambda x: len(x))
print(list(a_dataset_len.as_numpy_iterator()))
[1]
Applying map to convert all strings into lowercase
a_dataset_lower = a_dataset.map(lambda x: x.lower())
AttributeError: 'Tensor' object has no attribute 'lower'
Using the apply function on a_dataset
a_dataset.apply(lambda x:len(x))
TypeError: object of type 'TensorDataset' has no len()
Please help me understand:
Why am I failing to use the apply function with len(x), while map is able to execute it?
How do I create my own custom function, or use a pre-built one, and pass it to Dataset.map and Dataset.apply?
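For reference, a minimal sketch (assuming TensorFlow 2.x, consistent with the as_numpy_iterator calls above) of the two different signatures: map receives each element as a Tensor, so TensorFlow ops like tf.strings.lower work where Python's str.lower does not, while apply receives the whole Dataset and must return a Dataset, which is also why len(x) fails there.
import tensorflow as tf

a_list = [b"THis is for Testing"]
a_dataset = tf.data.Dataset.from_tensors(a_list)

# map(): the function sees one element (a string Tensor), so use tf.strings ops.
a_dataset_lower = a_dataset.map(tf.strings.lower)
print(list(a_dataset_lower.as_numpy_iterator()))
# [array([b'this is for testing'], dtype=object)]

# apply(): the function sees the whole Dataset and must return a Dataset.
def lowercase_transform(ds):
    return ds.map(tf.strings.lower)

a_dataset_applied = a_dataset.apply(lowercase_transform)
print(list(a_dataset_applied.as_numpy_iterator()))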

Related

TypeError: Failed to convert object of type <class 'tuple'> to Tensor. When calling a model with tf.data.dataset.map

I am calling a model in a function detect:
def detect(img):
    detector_output = detector(tf.reshape(img, (1, img.shape[0], img.shape[1], img.shape[2])))
    classes = detector_output['detection_classes'][0].numpy()
    most_likely = tf.convert_to_tensor(classes[0])
    box = detector_output['detection_boxes'][0][0]
    box = tf.math.multiply(box, [img.shape[0], img.shape[1], img.shape[0], img.shape[1]])
    box = tf.cast(box, tf.int16)
    return (box, most_likely)
This is called in another function, reads, which is used via the tf.data.Dataset map API:
dataset = dataset.map(reads, num_parallel_calls = AUTO).batch(32)
I think the issue is that this tensorflow hub model (or all object detection models I could find) does not support batching.
Calling the function via reads by itself works fine.
except if I use the tf.function decorator, in which case, weirdly, even detect(img) by itself throws the same error.
I tried with several models from here with the same result.
detector needs the shape with the 1 dimension up front.
I know there should be some reverse flatten() or squeeze() but I couldn't find it, apologies for the bad style!
The issue is also likely here in the reshaping.
the full error:
TypeError: Failed to convert object of type <class 'tuple'> to Tensor. Contents: (1, None, None, 3). Consider casting elements to a supported type.
Edit: I fixed the error by using tf.expand_dims instead of reshaping above.
I'd still be glad for a good explanation to understand better what went wrong.
Thank you for your help!
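For what it's worth, a sketch of why tf.expand_dims works where tf.reshape failed, assuming the images have statically unknown height and width inside Dataset.map (which is what the (1, None, None, 3) in the error suggests):
import tensorflow as tf

@tf.function
def add_batch_dim(img):
    # Inside Dataset.map / tf.function, img.shape can be (None, None, 3), so the
    # tuple (1, img.shape[0], img.shape[1], img.shape[2]) contains Nones and
    # cannot be converted to a Tensor -- hence the TypeError from tf.reshape.
    # If a reshape is really needed, tf.shape(img) gives the dynamic shape as a
    # Tensor: tf.reshape(img, tf.concat([[1], tf.shape(img)], axis=0))
    # tf.expand_dims only inserts a new axis and never needs the static sizes:
    return tf.expand_dims(img, axis=0)

print(add_batch_dim(tf.zeros([480, 640, 3])).shape)  # (1, 480, 640, 3)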

How to use constants in loss function?

I know this is dumb, but I need the equivalent of np.sqrt(2.0*np.pi) in my loss function. How can I get it? Statements like this give the error 'float object has no attribute dtype':
pi = np.pi
def myLoss(...):
    k = K.sqrt(2.0*pi)
    ...
Even K.sqrt(2.0*3.14159) is disallowed.
Use it like this:
k = K.sqrt(tf.constant([2.0*np.pi]))
This works because it expects an object which has a dtype; one option is a Tensor.
Another option is not to use the Keras backend, but NumPy:
k = np.sqrt(2.0*np.pi)
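Putting it together, a minimal sketch of a loss using the constant (the loss formula itself is just a placeholder; the point is only how the constant is handled):
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K

def myLoss(y_true, y_pred):
    # A plain NumPy/Python float is fine once it is only combined with tensors.
    k = np.sqrt(2.0 * np.pi)
    # Equivalent, staying in the backend: k = K.sqrt(tf.constant(2.0 * np.pi))
    return K.mean(K.square(y_true - y_pred)) / k

# Quick check with dummy tensors:
print(myLoss(tf.constant([1.0, 2.0]), tf.constant([1.5, 2.5])))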

Get python string from Tensor without the numpy function

I'm following the tutorial on https://storage.googleapis.com/tensorflow_docs/docs/site/en/tutorials/text/word_embeddings.ipynb
TextVectorization by default splits on whitespace, but I want to implement a custom split. I want to keep punctuation (which I have implemented in custom_standardization), and split between words and punctuation.
For instance, "fn(1,2)=1+2=3" needs to split to ["fn","(","1",",","2",")","=","1","+","2","=","3"].
def custom_split(input_data: tf.Tensor):
    assert input_data.dtype.name == 'string'
    assert hasattr(input_data, 'numpy') == False
    ???
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    split=custom_split,
    output_mode='int',
    output_sequence_length=sequence_length)
I'm confident doing such splitting given a standard Python string. However, the input is a tf.Tensor, and following the aforementioned tutorial, input_data does not have a numpy() function.
What's the proper way to do such splitting? Is it possible to retrieve a Python string from a string Tensor?
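One possible graph-mode approach (a sketch, not the only way): put spaces around punctuation with tf.strings.regex_replace and then split on whitespace, so no .numpy() call is ever needed.
import tensorflow as tf

def custom_split(input_data: tf.Tensor):
    # Surround every non-alphanumeric, non-space character with spaces.
    # (r" \0 " inserts the whole regex match back, padded with spaces.)
    spaced = tf.strings.regex_replace(input_data, r"[^\w\s]", r" \0 ")
    # Splitting on whitespace returns a RaggedTensor, which is what a
    # TextVectorization custom split callable is expected to return.
    return tf.strings.split(spaced)

print(custom_split(tf.constant(["fn(1,2)=1+2=3"])))
# <tf.RaggedTensor [[b'fn', b'(', b'1', b',', b'2', b')', b'=', b'1', b'+', b'2', b'=', b'3']]>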

Dataset API 'flat_map' method producing error for same code which works with 'map' method

I am trying to create a pipeline to read multiple CSV files using the TensorFlow Dataset API and Pandas. However, using the flat_map method produces errors, whereas with the map method I am able to build the code and run it in a session. This is the code I am using. I already opened issue #17415 in the TensorFlow GitHub repository, but apparently it is not an error and they asked me to post here.
folder_name = './data/power_data/'
file_names = os.listdir(folder_name)
def _get_data_for_dataset(file_name, rows=100):
    print(file_name.decode())
    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    X_data = df_input.as_matrix()
    X_data.astype('float32', copy=False)
    return X_data
dataset = tf.data.Dataset.from_tensor_slices(file_names)
dataset = dataset.flat_map(lambda file_name: tf.py_func(_get_data_for_dataset,
                                                        [file_name], tf.float64))
dataset = dataset.batch(2)
iter = dataset.make_one_shot_iterator()
get_batch = iter.get_next()
I get the following error: map_func must return a Dataset object. The pipeline works without error when I use map, but it doesn't give the output I want. For example, if Pandas is reading N rows from each of my CSV files, I want the pipeline to concatenate data from B files and give me an array with shape (N*B, 2). Instead, it is giving me (B, N, 2), where B is the batch size. map is adding another axis instead of concatenating on the existing axis. From what I understood in the documentation, flat_map is supposed to give a flattened output. In the documentation, both map and flat_map return type Dataset. So how is my code working with map and not with flat_map?
It would also great if you could point me towards code where Dataset API has been used with Pandas module.
As mikkola points out in the comments, Dataset.map() and Dataset.flat_map() expect functions with different signatures: Dataset.map() takes a function that maps a single element of the input dataset to a single new element, whereas Dataset.flat_map() takes a function that maps a single element of the input dataset to a Dataset of elements.
If you want each row of the array returned by _get_data_for_dataset() to
become a separate element, you should use Dataset.flat_map() and convert the output of tf.py_func() to a Dataset, using Dataset.from_tensor_slices():
folder_name = './data/power_data/'
file_names = os.listdir(folder_name)
def _get_data_for_dataset(file_name, rows=100):
    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    X_data = df_input.as_matrix()
    return X_data.astype('float32', copy=False)
dataset = tf.data.Dataset.from_tensor_slices(file_names)
# Use `Dataset.from_tensor_slices()` to make a `Dataset` from the output of
# the `tf.py_func()` op.
dataset = dataset.flat_map(lambda file_name: tf.data.Dataset.from_tensor_slices(
    tf.py_func(_get_data_for_dataset, [file_name], tf.float32)))
dataset = dataset.batch(2)
iter = dataset.make_one_shot_iterator()
get_batch = iter.get_next()
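If you are on TensorFlow 2.x, the same idea works with tf.numpy_function in place of tf.py_func and plain iteration instead of a one-shot iterator. A sketch, assuming the same folder layout and column names (note that pandas has since replaced as_matrix() with to_numpy()):
import os
import pandas as pd
import tensorflow as tf

folder_name = './data/power_data/'
file_names = os.listdir(folder_name)

def _get_data_for_dataset(file_name, rows=100):
    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    return df_input.to_numpy().astype('float32')

dataset = tf.data.Dataset.from_tensor_slices(file_names)
dataset = dataset.flat_map(lambda file_name: tf.data.Dataset.from_tensor_slices(
    tf.numpy_function(_get_data_for_dataset, [file_name], tf.float32)))
dataset = dataset.batch(2)

for batch in dataset:       # eager iteration; no one-shot iterator needed
    print(batch.shape)      # (2, 2): rows from all files are concatenated first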

tf.scatter_nd_update Variable Requirement vs RNN.__call__ method

I am developing an RNN and am using TensorFlow 1.1. I got the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: The node 'model/att_seq2seq/encode/pocmru_rnn_encoder/rnn/while/Variable/Assign' has inputs from different frames. The input 'model/att_seq2seq/encode/pocmru_rnn_encoder/rnn/while/Identity_3' is in frame 'model/att_seq2seq/encode/pocmru_rnn_encoder/rnn/while/model/att_seq2seq/encode/pocmru_rnn_encoder/rnn/while/'. The input 'model/att_seq2seq/encode/pocmru_rnn_encoder/rnn/while/Variable' is in frame ''.
The error is caused by the lambda function in the dynamic RNN method combined with a piece of code in my RNN.
TensorFlow's rnn.py ("dynamic_rnn / _dynamic_rnn_loop / _time_step") uses a lambda function to call the RNN.__call__ method to loop through all inputs.
my code :
if type(myObject) != tf.Variable:
    tp = tf.Variable(myObject, validate_shape=False)
else:
    tp = myObject
Logically, I repeatedly use tf.scatter_nd_update to update myObject. The pseudo-code would be something like myObject = scatter_nd_update(myObject, indices, updates). Since tf.scatter_nd_update requires a Variable as its argument and returns a tensor, I need to wrap the tensor into a Variable, hence the code above (test for a Variable and then wrap). How should I modify my code to make it work? Thanks!
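For what it's worth, in later TensorFlow releases (1.13+/2.x, so not the 1.1 used here) tf.tensor_scatter_nd_update takes and returns plain tensors, which removes the need to wrap anything in a Variable inside the while loop. A minimal sketch with made-up shapes:
import tensorflow as tf

my_object = tf.zeros([4, 3])                 # plain tensor, no Variable needed
indices = tf.constant([[1], [3]])            # rows to overwrite
updates = tf.ones([2, 3])                    # new values for those rows

# Returns a new tensor; assign it back, just like the pseudo-code in the question.
my_object = tf.tensor_scatter_nd_update(my_object, indices, updates)
print(my_object)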