I'm following the tutorial on https://storage.googleapis.com/tensorflow_docs/docs/site/en/tutorials/text/word_embeddings.ipynb
By default, TextVectorization splits on whitespace, but I want to implement a custom split: keep punctuation (which I have already handled in custom_standardization) and split between words and punctuation marks.
For instance, "fn(1,2)=1+2=3" should be split into ["fn","(","1",",","2",")","=","1","+","2","=","3"].
def custom_split(input_data: tf.Tensor):
    assert input_data.dtype.name == 'string'
    assert not hasattr(input_data, 'numpy')
    ???
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    split=custom_split,
    output_mode='int',
    output_sequence_length=sequence_length)
I'm confident I could implement such splitting on a standard Python string. However, the input is a tf.Tensor, and following the aforementioned tutorial, input_data does not have a numpy() function.
What's the proper way to do such splitting? Is it possible to retrieve a Python string from a string Tensor?
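For what it's worth, here is a sketch of one graph-mode approach (my own, not from the tutorial): pad every punctuation character with spaces using tf.strings.regex_replace, then split on whitespace as the default splitter would.
def custom_split(input_data: tf.Tensor):
    # Surround every character that is neither a word character nor whitespace
    # with spaces, e.g. "fn(1,2)" -> "fn ( 1 , 2 )".
    spaced = tf.strings.regex_replace(input_data, r"([^\w\s])", r" \1 ")
    # Splitting on whitespace now yields words and punctuation as separate
    # tokens, as a RaggedTensor, which is what TextVectorization expects.
    return tf.strings.split(spaced)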
I'm trying to highlight specific columns in my dataframe using the guidance from this post: https://stackoverflow.com/a/41655055/5158984.
My question is about the use of the subset argument. My guess is that it's part of **kwargs. However, the official Pandas documentation, https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html, explains it only vaguely.
So in general, how can I know which keyword arguments I can use whenever I see **kwargs?
Thanks!
It seems that you are confusing pandas.DataFrame.applymap and df.style.applymap (where df is an instance of pd.DataFrame), for which subset is a named parameter in its own right and not part of **kwargs.
Here is one way to find out (in your terminal or a Jupyter notebook cell) what the named parameters of this method are (this works for any other Pandas method, for that matter):
import pandas as pd
df = pd.DataFrame()
help(df.style.applymap)
# Output
Help on method applymap in module pandas.io.formats.style:
applymap(func: 'Callable', subset: 'Subset | None' = None, **kwargs) -> 'Styler' method of pandas.io.formats.style.Styler instance
Apply a CSS-styling function elementwise.
Updates the HTML representation with the result.
Parameters
----------
func : function
``func`` should take a scalar and return a string.
subset : label, array-like, IndexSlice, optional
A valid 2d input to `DataFrame.loc[<subset>]`, or, in the case of a 1d input
or single key, to `DataFrame.loc[:, <subset>]` where the columns are
prioritised, to limit ``data`` to *before* applying the function.
**kwargs : dict
Pass along to ``func``.
...
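As a quick illustration (my own example, not from the docs): subset restricts where the style is applied, while any extra keyword arguments are forwarded to func.
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# `subset=['A']` applies the style to column 'A' only; `color=...` is passed
# through **kwargs to the styling function.
styled = df.style.applymap(
    lambda v, color: f'background-color: {color}' if v > 1 else '',
    subset=['A'],
    color='yellow')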
I've been struggling with this for a while. I've searched Stack Overflow and checked the TF2 docs a bunch of times. There is one solution indicated below, but I don't understand why my original solution doesn't work.
In my case, I store a binary string (i.e., bytes) in TFRecords. If I iterate over the dataset via as_numpy_iterator() or call numpy() directly on each item, I can get the binary string back, so plain iteration over the dataset does work.
I'm not sure what exactly map() passes to test_callback. The argument it receives has neither a numpy method nor a numpy property, and the same goes for what tf.io.decode_raw returns (it is a Tensor, but it has no numpy() either).
Essentially, I need to take a binary string, parse it via my x = creator.FromString(y), and then pass x to my encoder, which transforms the binary string into a tensor.
def test_callback(example_proto):
    # I tried to figure out whether I can use bytes.decode()
    # directly, and what the most optimal solution is.
    parsed_features = tf.io.decode_raw(example_proto, out_type=tf.uint8)
    # tf.io.decode_raw returns a tensor of N bytes.
    x = creator.FromString(parsed_features.numpy())  # fails: no numpy() inside map()
    encoded_seq = midi_encoder.encode(x)
    return encoded_seq
raw_dataset = tf.data.TFRecordDataset(filenames=["main.tfrecord"])
raw_dataset = raw_dataset.map(test_callback)
Thank you, folks.
I found one solution but I would love to see more suggestions.
def test_callback(example_proto):
    from_string = creator.FromString(example_proto.numpy())
    encoded_seq = encoder.encode(from_string)
    return encoded_seq
raw_dataset = tf.data.TFRecordDataset(filenames=["main.tfrecord"])
raw_dataset = raw_dataset.map(lambda x: tf.py_function(test_callback, [x], [tf.int64]))
My understanding is that tf.py_function incurs a performance penalty.
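One refinement worth noting (my own addition, assuming the encoder yields a 1-D int64 sequence): tf.py_function outputs lose their static shape, so it can help to restore it before further dataset transformations.
def wrapped_callback(example_proto):
    (encoded_seq,) = tf.py_function(test_callback, [example_proto], [tf.int64])
    encoded_seq.set_shape([None])  # py_function outputs have unknown shape
    return encoded_seq

raw_dataset = raw_dataset.map(wrapped_callback)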
Thank you
I am trying to understand how to use the map and apply functions in TensorFlow. The objective is to apply all the basic data pre-processing steps while reading the data into TensorFlow, as map gives an option to parallelize the operation.
a_list = [b"THis is for Testing"]
Converting a_list into a tf.data.Dataset:
a_dataset = tf.data.Dataset.from_tensors(a_list)
print(list(a_dataset.as_numpy_iterator()))
[array([b'THis is for Testing'], dtype=object)]
Applying map to calculate the length of each element:
a_dataset_len = a_dataset.map(lambda x: len(x))
print(list(a_dataset_len.as_numpy_iterator()))
[1]
Applying map to convert all strings to lowercase:
a_dataset_lower = a_dataset.map(lambda x: x.lower())
AttributeError: 'Tensor' object has no attribute 'lower'
Using apply on the dataset:
a_dataset.apply(lambda x:len(x))
TypeError: object of type 'TensorDataset' has no len()
Please help me understand:
1. Why does apply fail with len(x) while map is able to execute it?
2. How do I create my own custom function, and how do I pass it (or a pre-built one) to Dataset.map and Dataset.apply? (See the sketch below.)
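A minimal sketch of the distinction (my own example, assuming TF 1.14+ where tf.strings.lower exists): map transforms each element, so its function receives Tensors; len(x) works there because x is a Tensor with a known first dimension. apply transforms the dataset as a whole, so its function must take a Dataset and return a Dataset, and a Dataset object has no len().
# `map` operates element-wise: x here is a string Tensor.
a_dataset_lower = a_dataset.map(lambda x: tf.strings.lower(x))
print(list(a_dataset_lower.as_numpy_iterator()))
# [array([b'this is for testing'], dtype=object)]

# `apply` operates on the whole dataset: its argument must map
# a Dataset to another Dataset.
a_dataset_batched = a_dataset.apply(lambda ds: ds.batch(1))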
I am trying to create a pipeline to read multiple CSV files using the TensorFlow Dataset API and Pandas. However, using the flat_map method is producing errors, whereas if I use the map method, I am able to build the code and run it in a session. This is the code I am using. I already opened issue #17415 in the TensorFlow GitHub repository, but apparently it is not an error, and they asked me to post here.
folder_name = './data/power_data/'
file_names = os.listdir(folder_name)

def _get_data_for_dataset(file_name, rows=100):
    print(file_name.decode())
    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    X_data = df_input.as_matrix()
    X_data.astype('float32', copy=False)
    return X_data

dataset = tf.data.Dataset.from_tensor_slices(file_names)
dataset = dataset.flat_map(lambda file_name: tf.py_func(_get_data_for_dataset,
                                                        [file_name], tf.float64))
dataset = dataset.batch(2)
iterator = dataset.make_one_shot_iterator()
get_batch = iterator.get_next()
I get the following error: map_func must return a Dataset object. The pipeline works without error when I use map, but it doesn't give the output I want. For example, if Pandas reads N rows from each of my CSV files, I want the pipeline to concatenate data from B files and give me an array with shape (N*B, 2). Instead, it gives me (B, N, 2), where B is the batch size: map adds another axis instead of concatenating along the existing one. From what I understood from the documentation, flat_map is supposed to give a flattened output. In the documentation, both map and flat_map return type Dataset. So how is my code working with map and not with flat_map?
It would also be great if you could point me towards code where the Dataset API has been used with Pandas.
As mikkola points out in the comments, Dataset.map() and Dataset.flat_map() expect functions with different signatures: Dataset.map() takes a function that maps a single element of the input dataset to a single new element, whereas Dataset.flat_map() takes a function that maps a single element of the input dataset to a Dataset of elements.
If you want each row of the array returned by _get_data_for_dataset() to become a separate element, you should use Dataset.flat_map() and convert the output of tf.py_func() to a Dataset using Dataset.from_tensor_slices():
folder_name = './data/power_data/'
file_names = os.listdir(folder_name)

def _get_data_for_dataset(file_name, rows=100):
    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    X_data = df_input.as_matrix()
    return X_data.astype('float32', copy=False)

dataset = tf.data.Dataset.from_tensor_slices(file_names)
# Use `Dataset.from_tensor_slices()` to make a `Dataset` from the output of
# the `tf.py_func()` op.
dataset = dataset.flat_map(lambda file_name: tf.data.Dataset.from_tensor_slices(
    tf.py_func(_get_data_for_dataset, [file_name], tf.float32)))
dataset = dataset.batch(2)
iterator = dataset.make_one_shot_iterator()
get_batch = iterator.get_next()
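An aside that is not part of the original answer: on TF 2.x, tf.py_func and make_one_shot_iterator() are gone. A rough equivalent (the _v2 name is mine) uses tf.py_function, with one caveat: the wrapped function receives an EagerTensor rather than raw bytes, so the file name must be read via .numpy() first.
def _get_data_for_dataset_v2(file_name, rows=100):
    # tf.py_function passes an EagerTensor, not raw bytes.
    name = file_name.numpy().decode()
    df_input = pd.read_csv(os.path.join(folder_name, name),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    return df_input.to_numpy().astype('float32')

dataset = tf.data.Dataset.from_tensor_slices(file_names)
dataset = dataset.flat_map(lambda file_name: tf.data.Dataset.from_tensor_slices(
    tf.py_function(_get_data_for_dataset_v2, [file_name], tf.float32)))
dataset = dataset.batch(2)

for batch in dataset:  # datasets iterate directly in eager mode
    print(batch.shape)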
Is there any way to convert a string tensor to lowercase without evaluating it in the session? Some sort of tf.string_to_lower op?
More specifically, I am reading data from TFRecord files, so my data is made of tensors. I then want to use tf.contrib.lookup.index_table_from_* to look up indices for words in the data, and I need this to be case-insensitive. Lowercasing the data before writing it to TFRecords is not an option, as it needs to be kept in the original format. One option would be to store both the original and the lowered version, but I'd like to avoid this if possible.
Here's an implementation with TensorFlow ops:
def lowercase(s):
    # Uppercase and lowercase ASCII alphabets as string tensors.
    upchars = tf.constant([chr(i) for i in range(65, 91)], dtype=tf.string)
    lchars = tf.constant([chr(i) for i in range(97, 123)], dtype=tf.string)
    # Maps each uppercase character to its index 0-25; any other character
    # falls into the OOV bucket (index 26).
    upcharslut = tf.contrib.lookup.index_table_from_tensor(mapping=upchars, num_oov_buckets=1, default_value=-1)
    splitchars = tf.string_split(tf.reshape(s, [-1]), delimiter="").values
    upcharinds = upcharslut.lookup(splitchars)
    # For uppercase characters (index <= 25), substitute the lowercase
    # counterpart; keep all other characters, then re-join.
    return tf.reduce_join(tf.map_fn(lambda x: tf.cond(x[0] > 25, lambda: x[1], lambda: lchars[x[0]]), (upcharinds, splitchars), dtype=tf.string))
if __name__ == "__main__":
    s = "komoDO DragoN "
    sess = tf.Session()
    x = lowercase(s)
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    print(sess.run([x]))
returns [b'komodo dragon ']
You can use tf.py_func to wrap a Python function that manipulates your string; it is executed within the graph.
You can do something like:
# Suppose your string tensor is tensorA
lower = tf.py_func(lambda x: x.lower(), [tensorA], tf.string, stateful=False)
# Starting from TF 2.0, `tf.py_func` is deprecated, so the correct code is
lower = tf.py_function(lambda x: x.numpy().lower(), [tensorA], tf.string)
Unfortunately, tf.py_func doesn't work in all cases, such as Serving or TFT. The following snippet is a simple in-graph TF solution.
import tensorflow as tf
def to_lower_case(text):
    chars = tf.strings.unicode_decode(text, input_encoding='UTF-8')
    # ASCII 'A'-'Z' are code points 65-90; adding 32 yields 'a'-'z'.
    capital_mask = tf.logical_and(tf.greater_equal(chars, 65), tf.less(chars, 91))
    chars = chars + tf.cast(capital_mask, tf.int32) * 32
    return tf.strings.unicode_encode(chars, output_encoding='UTF-8')

with tf.Session() as sess:
    print(sess.run(to_lower_case('Test')))
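(An aside, not from the original answer: on TF 2.x the same function runs eagerly, without a session.)
print(to_lower_case(tf.constant('Test')).numpy())  # b'test'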
In TensorFlow 1.14, a lower op was added. A short code snippet (in eager execution mode) looks like the following:
astring = tf.constant('A String', dtype=tf.string)
tf.strings.lower(astring)
<tf.Tensor: id=79, shape=(), dtype=string, numpy=b'a string'>
If the characters you are using are limited to ASCII, I have a working (in-graph) solution. The idea is:
1. Create a lookup table whose keys are the characters in [32, 127), and whose values are the same characters, except that those in [65, 91) are replaced with their counterparts in [97, 123). Method: tf.contrib.lookup.HashTable.
2. Split the string into characters. Method: tf.string_split.
3. Use the lookup table to map uppercase characters to lowercase ones. Method: case_table.lookup (if the HashTable was named case_table).
4. Join the characters back into a string. Method: tf.reduce_join.
A concrete example can be found here: https://github.com/bshao001/ChatLearner/blob/master/chatbot/tokenizeddata.py
This approach should extend to other character sets. Note that if you tried to convert only the characters that actually need to change (such as the 26 English uppercase letters), it would be harder (I'm not sure whether it is doable), as you would have to use tf.cond and check whether each character is in the key set, and it would be less efficient too. A minimal sketch of the table-based recipe is below.
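A minimal sketch of that recipe (my own code, assuming the TF 1.x contrib API and ASCII-only input; the linked repository has the full version):
import tensorflow as tf

# Keys: printable ASCII; values: the same characters with A-Z mapped to a-z.
keys = [chr(i) for i in range(32, 127)]
values = [chr(i + 32) if 65 <= i < 91 else chr(i) for i in range(32, 127)]
case_table = tf.contrib.lookup.HashTable(
    tf.contrib.lookup.KeyValueTensorInitializer(keys, values), ' ')

def to_lower(s):
    chars = tf.string_split(tf.reshape(s, [-1]), delimiter="").values
    return tf.reduce_join(case_table.lookup(chars))

with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    print(sess.run(to_lower(tf.constant("Komodo DRAGON"))))  # b'komodo dragon'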